openai / requests-for-research

A living collection of deep learning problems
https://openai.com/requests-for-research

Train a language model on a jokes corpus #37

Open pranoyr opened 6 years ago

pranoyr commented 6 years ago

I have successfully created a character level language model using the joke dataset.
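
For readers who want to reproduce something similar, a minimal character-level setup in Keras might look like the sketch below. The input filename, sequence length, and layer size are assumptions for illustration; only model.h5, mapping.pkl, and the use of Keras are taken from the scripts discussed later in this thread.

# Sketch of a minimal character-level language model in Keras.
# 'char_sequences.txt' is an assumed output of a data-prep step: one
# fixed-length sequence per line (here 11 characters: 10 inputs plus
# the next character to predict). Layer size and epochs are assumptions.
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM

# load the prepared sequences, one per line
raw_text = open('char_sequences.txt', 'r', encoding='utf8').read()
lines = [line for line in raw_text.split('\n') if line]

# map each distinct character to an integer
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = array([[mapping[ch] for ch in line] for line in lines])

# split into input characters and the character to predict
vocab_size = len(mapping)
X, y = sequences[:, :-1], sequences[:, -1]
X = array([to_categorical(x, num_classes=vocab_size) for x in X])
y = to_categorical(y, num_classes=vocab_size)

# a single LSTM layer followed by a softmax over the character vocabulary
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, verbose=2)

# save the model and the character mapping for generation later
model.save('model.h5')
dump(mapping, open('mapping.pkl', 'wb'))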

AlphaGit commented 6 years ago

@pranoyr I’m really interested in your approach, especially because the model seems to be very simple (and therefore efficient) to train. Did you get any good loss metrics? Was it able to generate sentences, and even better, funny sentences on its own?

Please note I’m not affiliated with OpenAI’s research; I’m just genuinely curious about it.

pranoyr commented 6 years ago

When I used a small dataset, the model fit well and the loss decreased gradually. When trained on a large dataset, the text generation was satisfactory. I am trying deeper models that can produce more accurate and funnier results.

AlphaGit commented 6 years ago

@pranoyr I made these very slight modifications in order to run it myself:

+import sys                                                  
 from pickle import load                                     
 from keras.models import load_model                         
 from keras.utils import to_categorical                      
@@ -34,4 +35,6 @@ model = load_model('model.h5')             
 # load the mapping                                          
 mapping = load(open('mapping.pkl', 'rb'))                   
 # test not in original                                      
-print(generate_seq(model, mapping, 10, 'hello worl', 20))   
+seed_text = sys.argv[1]                                     
+print("Seed text: ", seed_text)                             
+print(generate_seq(model, mapping, 10, seed_text, 140))     
diff --git a/prepare_data.py b/prepare_data.py               
index 0221669..bd9a632 100755                                
--- a/prepare_data.py                                        
+++ b/prepare_data.py                                        
@@ -1,7 +1,7 @@                                              
 # load doc into memory                                      
 def load_doc(filename):                                     
        # open the file as read only                         
-       file = open(filename, 'r')                           
+       file = open(filename, 'r', encoding='utf8')          
        # read all text                                      
        text = file.read()                                   
        # close the file                                     
@@ -11,7 +11,7 @@ def load_doc(filename):                    
 # save tokens to file, one dialog per line                  
 def save_doc(lines, filename):                              
        data = '\n'.join(lines)                              
-       file = open(filename, 'w')                           
+       file = open(filename, 'w', encoding='utf8')          
        file.write(data)                                     
        file.close()                                         

diff --git a/train.py b/train.py                             
index b6a571d..f964019 100755                                
--- a/train.py                                               
+++ b/train.py                                               
@@ -10,7 +10,7 @@ from keras.layers import LSTM              
 # load doc into memory                                      
 def load_doc(filename):                                     
     # open the file as read only                            
-    file = open(filename, 'r')                              
+    file = open(filename, 'r', encoding='utf8')             
     # read all text                                         
     text = file.read()                                      
     # close the file                                        
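
For context, the first hunk above belongs to the generation script (its filename is not shown in the diff), which calls a generate_seq helper that is also not shown. A typical implementation consistent with the call generate_seq(model, mapping, 10, seed_text, 140) might look like this sketch; the greedy argmax decoding step is an assumption about the original code.

# Sketch of a generate_seq consistent with the call
# generate_seq(model, mapping, 10, seed_text, 140).
# The greedy argmax decoding is an assumption, not confirmed by the diff.
from numpy import argmax
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    for _ in range(n_chars):
        # encode the current text as integers using the character mapping
        encoded = [mapping[char] for char in in_text]
        # keep only the last seq_length characters, padding if too short
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one-hot encode to match the model's training input
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # greedily pick the most likely next character
        yhat = argmax(model.predict(encoded, verbose=0)[0])
        # map the integer back to a character and append it
        for char, index in mapping.items():
            if index == yhat:
                in_text += char
                break
    return in_text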

I got the following results, which are quite interesting:


Seed text: do you know

do you know you want to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I ha


Seed text: how does a

how does a particular and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the ba


Seed text: what do you call a

what do you call a stated to the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have the bar and said, "I have


While it is definitely working, it seems to hit a very strong plateau where it gets stuck repeating particular sentences. This is especially challenging because repetition of certain phrases or words is an important component of some jokes (to preserve context).
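
One common way to soften this kind of repetition loop, which is not something the scripts in this thread do as far as the diff shows, is to sample the next character from the softmax output with a temperature instead of always taking the argmax. A minimal sketch:

# Sketch of temperature-based sampling as an alternative to greedy argmax;
# this is a suggested mitigation for repetition loops, not part of the
# original scripts in this thread.
import numpy as np

def sample_char(probs, temperature=0.8):
    # probs: 1-D softmax output over the character vocabulary
    probs = np.asarray(probs, dtype='float64')
    # rescale log-probabilities by the temperature and renormalise
    logits = np.log(probs + 1e-8) / temperature
    exp_logits = np.exp(logits)
    probs = exp_logits / np.sum(exp_logits)
    # draw a character index from the adjusted distribution
    return int(np.random.choice(len(probs), p=probs))

In the generate_seq sketch above, the argmax line would become yhat = sample_char(model.predict(encoded, verbose=0)[0]); higher temperatures yield more varied but noisier text.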

pranoyr commented 6 years ago

Yes, this was my issue when training on the large dataset. The model is underfitting, and I am trying to build a network large enough for this dataset.
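
For illustration, a deeper variant could stack LSTM layers with dropout between them; the layer widths and dropout rate below are assumptions, not the configuration pranoyr is using.

# Sketch of a larger stacked-LSTM character model; the layer widths and
# dropout rate are assumptions, not the configuration used in this issue.
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

def build_deeper_model(seq_length, vocab_size):
    model = Sequential()
    # first LSTM layer returns the full sequence so a second LSTM can consume it
    model.add(LSTM(256, return_sequences=True, input_shape=(seq_length, vocab_size)))
    model.add(Dropout(0.2))
    # second LSTM layer returns only its final state
    model.add(LSTM(256))
    model.add(Dropout(0.2))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model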