minimaxir / gpt-2-simple

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

Model returning exact matches to source file #131

Open Engineer-of-Stuff opened 4 years ago

Engineer-of-Stuff commented 4 years ago

I'm training my model, but both the samples printed during training and the final generated outputs always match lines in the input file exactly.

My environment is a Paperspace Python3 GPU notebook.

My code:

!pip3 install tensorflow-gpu==1.14.0
!pip3 install gpt-2-simple

import gpt_2_simple as gpt2
import tensorflow as tf

model_name = "124M"
gpt2.download_gpt2(model_name=model_name)  # downloads the 124M model checkpoint

tf.reset_default_graph()  # can't rerun the cell without clearing the graph first
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'temp.txt',             # plain-text training dataset
              model_name=model_name,
              steps=1000)

gpt2.generate(sess)

This happens with different data sources.

This run started off producing near matches, but then began outputting exact matches too. I ran this a week or so ago and it worked as expected, producing very distinct results from the same data source.

Engineer-of-Stuff commented 4 years ago

The code above isn't my real code; it's a sample I'm running right now, and I'll post its output.

I'm running the same code in Paperspace and Google Colab and I'll see if there's any inconsistency between the two.

Is there anything that can cause the generated outputs to match the source inputs?

Is this something related to my dataset or format?

Engineer-of-Stuff commented 4 years ago

OK, so far it looks like both notebooks are producing output that is either very, very close to the original source or an exact match.

It's even outputting blocks of multiple lines exactly as they appear in the source file. What's going on?

sainimohit23 commented 4 years ago

@Engineer-of-Stuff have you tried to adjust temperature?
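For context on why this suggestion matters: temperature rescales the model's logits before sampling, so a low temperature makes generation nearly greedy, which makes a memorizing model reproduce its training lines verbatim, while a higher temperature adds variety. Below is a toy, stdlib-only sketch of the mechanism (this is not gpt-2-simple's internal code, just an illustration of what the `temperature` knob does):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from `logits` after temperature scaling.

    Lower temperature sharpens the softmax distribution (more
    deterministic, more likely to echo memorized text); higher
    temperature flattens it (more variety). Toy sketch only.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling from the resulting distribution
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

In gpt-2-simple itself, `temperature` is a keyword argument to `gpt2.generate`, e.g. `gpt2.generate(sess, temperature=1.0)` (the library's default is, I believe, 0.7); raising it only helps if the copying comes from sampling, not from outright overfitting.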

Engineer-of-Stuff commented 4 years ago

Didn't fix it; it's still generating complete lines from the input. The run from a few weeks ago generated unique samples, but the current one does not.

minimaxir commented 4 years ago

If it is generating exact matches to the source file, there is a possibility that it's overfitting. How much input data are you training on?

Engineer-of-Stuff commented 4 years ago

My input for the code above is a txt file with 1,000 lines. The input for my main model is a 50,000-row CSV that I run through encode_dataset.

Engineer-of-Stuff commented 4 years ago

I'm going to try version 0.5.4 of gpt-2-simple.

Engineer-of-Stuff commented 4 years ago

bump????

johndpope commented 4 years ago

There's something wrong, or it's user error. Try using a different file. I gave it a Terminator 2 script as a txt file and it spat out this: https://gist.github.com/johndpope/bc24f4f0ef186277aa6cde85a93cbc8a

johndpope commented 4 years ago

There's also a Google Colab notebook I used in the cloud. Some random notes and results are here: https://gist.github.com/johndpope/9a6a813efa58adc674ff191c934625f6

Engineer-of-Stuff commented 4 years ago

What could I be doing wrong? It isn't working with different inputs, even though it worked at first. I'll review your code and compare it to mine.

johndpope commented 4 years ago

Maybe the model is loading from the wrong checkpoint, or something isn't saving? This worked for me: https://colab.research.google.com/drive/1_FR6rY52YVMh710oM9MC3LCW1g3o3YHv

Get some results without the CSV first.

brendanmroche commented 4 years ago

I had the same issue with 600 rows of Don Draper lines. I'm using the Colab notebook, which says the default learning rate is 1e-4 but to drop it to 1e-5 if you have <1MB of input data. That fixed it for me. But I was checking for duplicates manually. Has anyone automated a plagiarism check after the model has generated text (i.e., comparing outputs to the original training data)? I saw @gwern mention grep in another thread but wasn't sure how to implement it.
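One way to automate that check: compare each generated line against the training data for verbatim and near-verbatim matches. Here's a minimal stdlib-only sketch (function names are my own, not part of gpt-2-simple), using a set lookup for exact matches and `difflib` for fuzzy ones:

```python
import difflib

def verbatim_matches(generated_lines, training_lines):
    """Return the generated lines that appear verbatim in the training data."""
    training = {line.strip() for line in training_lines if line.strip()}
    return [g for g in generated_lines if g.strip() in training]

def near_matches(generated_lines, training_lines, threshold=0.9):
    """Return (generated, training, ratio) triples whose similarity
    meets `threshold` (difflib ratio; 1.0 means identical)."""
    hits = []
    for g in generated_lines:
        for t in training_lines:
            ratio = difflib.SequenceMatcher(None, g, t).ratio()
            if ratio >= threshold:
                hits.append((g, t, ratio))
    return hits
```

The fuzzy check is O(n*m) pairwise comparisons, so for a 50,000-row dataset stick to the exact set lookup (or something like `grep -Fxf generated.txt train.txt` on the command line, which is presumably what the grep suggestion referred to) and only run `near_matches` on a sample.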

Engineer-of-Stuff commented 4 years ago

@johndpope that Colab link is set to private. Can you make it public or put it in a gist so I can take a look?