openai / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
https://openai.com/blog/better-language-models/

[question] Reasonable and maximum possible input dataset size for the 1.5B (or any other) model? #240

Open miheico opened 4 years ago

miheico commented 4 years ago

What is the largest input dataset that makes sense for fine-tuning GPT-2 1.5B, or any other model (e.g. 774M)? For example, does it make sense to fine-tune on 300 million tokens (a ~1.2 GB txt file, or a ~500 MB npz file)? If so, what is a reasonable upper limit for these models beyond which additional data no longer helps fine-tuning? Or is it better to split the input dataset into several parts and fine-tune the same checkpointed model on them one by one?
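
For context, this is roughly how I would split the corpus if chunking turns out to be the better approach. A minimal sketch only: the file names, chunk count, and bytes-per-token estimate are placeholders, not anything from this repo. Each chunk could then be encoded with whatever preprocessing step you already use to produce the npz file.

```python
# Hypothetical sketch: split a large training corpus into roughly equal chunks
# so each part can be used for a separate fine-tuning pass on the same checkpoint.
# File names, chunk count, and the bytes-per-token heuristic are assumptions.

import os

CORPUS_PATH = "corpus.txt"      # hypothetical ~1.2 GB input file
OUT_DIR = "corpus_chunks"
NUM_CHUNKS = 4                  # arbitrary; pick whatever fits your training schedule
BYTES_PER_TOKEN = 4             # rough rule of thumb for English BPE text

os.makedirs(OUT_DIR, exist_ok=True)

total_bytes = os.path.getsize(CORPUS_PATH)
chunk_bytes = total_bytes // NUM_CHUNKS

with open(CORPUS_PATH, "r", encoding="utf-8") as src:
    for i in range(NUM_CHUNKS):
        out_path = os.path.join(OUT_DIR, f"chunk_{i}.txt")
        written = 0
        with open(out_path, "w", encoding="utf-8") as dst:
            # Read line by line so no document gets cut mid-line;
            # the iterator picks up where the previous chunk stopped.
            for line in src:
                dst.write(line)
                written += len(line.encode("utf-8"))
                # The last chunk keeps reading until end of file.
                if i < NUM_CHUNKS - 1 and written >= chunk_bytes:
                    break
        approx_tokens = written // BYTES_PER_TOKEN
        print(f"{out_path}: ~{written / 1e6:.0f} MB, "
              f"~{approx_tokens / 1e6:.0f}M tokens (rough estimate)")
```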