huggingface/transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Finetuning GPT-2 on small datasets #12874

Closed: Elysium1436 closed this issue 3 years ago

Elysium1436 commented 3 years ago

I have a relatively small dataset that I scraped from my Discord server. I wanted to build a GPT-2 chatbot with it, but the data is relatively small (3,782,031 characters, counting the EOS token). Training for a small number of epochs did nothing for any GPT-2-related checkpoint (I tried DistilBERT, GPT-2, DialoGPT-small, and others), and training for a large number of epochs destroyed the model entirely: it could barely generate coherent text at all, producing only special characters, jumbled output, or nothing. I've tested the same script with a much larger dataset and it worked just fine, so I can only assume the problem is the dataset size.

I was trying to find a way to freeze the GPT-2 base model and train just the LMHead, but since the LMHead is tied to the embedding layer, that doesn't seem possible. If there isn't a way to freeze everything but the head of the model, what else should I do? I've been trying to complete this personal project for quite a while now, and I'm out of options at this point. I'm using a custom TF script from the examples folder on TPU, since the PyTorch version makes the memory usage blow up on Colab.
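To illustrate the freezing idea, here is a minimal PyTorch sketch, assuming the standard `transformers` GPT-2 implementation, where `GPT2LMHeadModel` ties `lm_head.weight` to the token embedding `transformer.wte.weight`. Because of the tying, the head can't be trained in isolation, but everything except the tied embedding/head matrix (and, optionally, the last few blocks) can be frozen; a TF version would set `trainable = False` on the equivalent layers instead.

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

# The LM head and the token embedding share one weight matrix,
# which is why the head alone cannot be unfrozen.
assert model.lm_head.weight.data_ptr() == model.transformer.wte.weight.data_ptr()

# Freeze every parameter...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the tied embedding/head matrix, plus (optionally)
# the last couple of transformer blocks for a little extra capacity.
model.transformer.wte.weight.requires_grad = True
for block in model.transformer.h[-2:]:  # last 2 of gpt2's 12 blocks
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```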

Elysium1436 commented 3 years ago

I've finally found this article, and it seems promising. I'm going to try it out and I'll report back on how it goes.

NielsRogge commented 3 years ago

For training-related questions, please refer to the forum. We like to keep GitHub issues for bugs and feature requests.

For example, you can find all fine-tuning GPT-2-related questions here.

Thank you!

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.