Fine tune this model for text generation on my corpus

stefan-it / german-gpt2

German GPT-2 model

MIT License


Fine tune this model for text generation on my corpus #12

Open SudhanshuBlaze opened 2 years ago

SudhanshuBlaze commented 2 years ago

Our team is building a solution for Bavarian farmers and consultants, and I am trying to integrate German text auto-completion for documenting solutions. I want to fine-tune this model on my own corpus for better text generation and auto-completion. How can I do that? I am new to transfer learning. Please help, @stefan-it.

stefan-it commented 2 years ago

Hi @SudhanshuBlaze ,

sorry for the late reply!

For fine-tuning this GPT-2 model, you can just follow the steps from the official Transformers documentation:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling

You just need to adjust the backbone model (passed via model_name_or_path) and your training and evaluation corpus (which can be a simple plain text file).
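As a rough sketch of what such an invocation could look like, the snippet below assembles an argument list for the run_clm.py script from the linked example directory. The model ID is the one from the model card mentioned in this thread; the file names and output directory are hypothetical placeholders, and the snippet only prints the command rather than running it.

```python
from typing import List


def build_run_clm_command(
    model_name: str, train_file: str, validation_file: str, output_dir: str
) -> List[str]:
    """Assemble an argument list for the run_clm.py example script."""
    return [
        "python", "run_clm.py",
        "--model_name_or_path", model_name,    # backbone model to fine-tune
        "--train_file", train_file,            # plain text training corpus
        "--validation_file", validation_file,  # plain text evaluation corpus
        "--do_train",
        "--do_eval",
        "--output_dir", output_dir,
    ]


command = build_run_clm_command(
    model_name="stefan-it/german-gpt2-larger",
    train_file="train.txt",              # hypothetical file name
    validation_file="valid.txt",         # hypothetical file name
    output_dir="german-gpt2-finetuned",  # hypothetical output directory
)
print(" ".join(command))
```

The printed command can be pasted into a shell from the directory that contains run_clm.py; batch size, epochs and similar settings are left at their defaults here.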

stefan-it commented 2 years ago

For better generation results, you should use the stefan-it/german-gpt2-larger model. Model card is available here: https://huggingface.co/stefan-it/german-gpt2-larger

SudhanshuBlaze commented 2 years ago

For --train_file I'll pass a simple .txt file containing my large German corpus.

But what about the --validation_file parameter? What should I pass there? I'm a total newbie in this field. Please help me.

stefan-it commented 2 years ago

Hi @SudhanshuBlaze , you can use e.g. 90% of your data as train_file and the remaining 10% as validation_file, so that you can monitor the validation loss/accuracy during fine-tuning.
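A minimal sketch of such a 90/10 split, assuming the corpus is a list of lines where each line can be treated as an independent piece of text (if line order matters in your corpus, skip the shuffle). The function name, the seed and the example sentences are made up for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(
    lines: Sequence[str], train_fraction: float = 0.9, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle the lines reproducibly and cut them into train and validation parts."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in corpus; in practice these would be the lines read from your .txt file.
corpus = [f"Beispielsatz {i}" for i in range(100)]
train_lines, valid_lines = split_corpus(corpus)
print(len(train_lines), len(valid_lines))
```

Writing train_lines and valid_lines to two separate text files then gives you the files to pass as train_file and validation_file.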

If you pass no argument for validation_file, the script performs this split automatically:

https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L314-L322

You can control this percentage via the validation_split_percentage argument. By default, 95% of your train_file text will be used for fine-tuning and 5% as validation data.
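To make the percentage concrete: as far as I can tell from the linked lines, the script expresses the automatic split with the percent-slicing syntax of the datasets library, taking the first part of the file as validation data and the rest as training data. The helper below is a hypothetical illustration that only builds those slice expressions; it does not load any data.

```python
from typing import Dict


def split_expressions(validation_split_percentage: int = 5) -> Dict[str, str]:
    """Build datasets-style slice expressions for a percentage-based split."""
    p = validation_split_percentage
    return {
        "validation": f"train[:{p}%]",  # first p percent of the train file
        "train": f"train[{p}%:]",       # remaining part is used for fine-tuning
    }


print(split_expressions())    # default: 5 percent validation
print(split_expressions(10))  # e.g. with --validation_split_percentage 10
```

Under this reading the split is positional and not random, so if your corpus is ordered by topic it may be safer to create the validation file yourself.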