minimaxir / gpt-2-simple

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

About the fine-tuning: I want to train on my new dataset #75

Open wjy979769265 opened 5 years ago

wjy979769265 commented 5 years ago

There is no end marker in my dataset to stop the generation for each sentence, so I added a start_token and end_token to each row. After fine-tuning, it generated rather unreadable text... Is that because of the BPE encoding, or does the model just treat the tokens as normal text during training? [image] After that, to get the same effect, I added a "." to each row instead and got more readable text, but it's still not perfect, and I don't know how to decrease the loss... [image] Anyone who has an idea, please help. Thank you very much.

Overall, it comes down to three questions:

  1. How do I process the dataset, i.e. add a start-of-text and end-of-text token to each row? (See the sketch after this list.)
  2. How do I decrease the loss?
  3. How do I stop generation at the end of a sentence?
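For questions 1 and 3, here is a minimal sketch of the usual workflow with gpt-2-simple (the file name `rows.txt`, model name, and step count are hypothetical examples, not values from this thread): wrap each row in `<|startoftext|>` / `<|endoftext|>` before fine-tuning, then pass `truncate="<|endoftext|>"` to `gpt2.generate()` so each sample stops at the end token instead of running on.

```python
import gpt_2_simple as gpt2

# Hypothetical input: one training example per line in rows.txt.
# Wrap each row with the special tokens so the model can learn
# where an example starts and ends.
with open("rows.txt", encoding="utf-8") as f_in, \
     open("rows_tokenized.txt", "w", encoding="utf-8") as f_out:
    for row in f_in:
        row = row.strip()
        if row:
            f_out.write(f"<|startoftext|>{row}<|endoftext|>\n")

gpt2.download_gpt2(model_name="124M")  # download the base model once

sess = gpt2.start_tf_sess()
gpt2.finetune(sess, dataset="rows_tokenized.txt",
              model_name="124M", steps=1000)

# Truncate generation at the end token so each sample stops cleanly.
gpt2.generate(sess,
              prefix="<|startoftext|>",
              truncate="<|endoftext|>",
              include_prefix=False)
```

If the special tokens are left out during fine-tuning, the model has no signal for where an example ends, which matches the run-on, hard-to-read output described above.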
iedmrc commented 5 years ago

Same here. I have ~40k text files (~500 MB). What would be the best approach for data preprocessing? Should I add <|startoftext|> and <|endoftext|> tokens at the beginning and end of each file? I use the .npz encoder, so what would be the best value for the combine parameter? How do I decrease the loss?
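One way to handle many small files is to merge them into a single wrapped file (and optionally pre-encode it to .npz) before fine-tuning. A rough sketch, assuming the files sit under a hypothetical `corpus/` directory; whether pre-encoding with `gpt2.encode_dataset()` pays off depends on your gpt-2-simple version and corpus size:

```python
import glob
import gpt_2_simple as gpt2

# Hypothetical layout: ~40k small .txt files under corpus/.
# Concatenate them into one file, wrapping each document in the
# special tokens so the document boundaries survive the merge.
with open("corpus_combined.txt", "w", encoding="utf-8") as f_out:
    for path in sorted(glob.glob("corpus/*.txt")):
        with open(path, encoding="utf-8") as f_in:
            text = f_in.read().strip()
        if text:
            f_out.write(f"<|startoftext|>{text}<|endoftext|>\n")

# Optional: pre-encode once to .npz so finetune() doesn't have to
# re-tokenize the ~500 MB corpus on every run.
gpt2.encode_dataset("corpus_combined.txt",
                    out_path="corpus_encoded.npz",
                    model_name="124M")

sess = gpt2.start_tf_sess()
gpt2.finetune(sess, dataset="corpus_encoded.npz",
              model_name="124M", steps=2000)
```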

iedmrc commented 5 years ago

> Same here. I have ~40k text files (~500 MB). What would be the best approach for data preprocessing? Should I add <|startoftext|> and <|endoftext|> tokens at the beginning and end of each file? I use the .npz encoder, so what would be the best value for the combine parameter? How do I decrease the loss?

I added <|startoftext|> and <|endoftext|> tokens to all of the files. It seems to have improved the accuracy (i.e., decreased the loss) a little bit. I didn't change the combine parameter, because I figured the <|startoftext|> and <|endoftext|> tokens already did the job.
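For the remaining question of pushing the loss lower, the usual levers in gpt-2-simple's `finetune()` are training for more steps, lowering `learning_rate` once the loss plateaus, and resuming from the latest checkpoint; the values below are illustrative only, not recommendations from this thread:

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()

# A few knobs that commonly help the loss come down further.
gpt2.finetune(sess,
              dataset="corpus_encoded.npz",  # same pre-encoded corpus as above
              model_name="124M",
              steps=5000,                    # train longer
              learning_rate=1e-5,            # lower LR once the loss plateaus
              restore_from="latest",         # resume from the last checkpoint
              run_name="run1",
              sample_every=500,              # print samples to watch for overfitting
              save_every=1000)
```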