huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Recommended batch size and epochs for finetuning on large data #660

Closed · okgrammer closed this issue 4 years ago

okgrammer commented 5 years ago

In the original paper, the BERT model is fine-tuned on downstream NLP tasks where the number of instances per task is on the order of thousands to hundreds of thousands. In my case, I have about 5 million samples. Is there a recommended batch size and number of epochs for a training set of this size? I'm fine-tuning bert-base-multilingual on 4 GPUs, and there is a lot of unused GPU memory with the default batch size of 32; even after increasing it to 128 there is still free memory available.
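
For reference, a minimal sketch of what such a run could look like with the current `Trainer` API (not necessarily the script that existed when this issue was opened). The values of `per_device_train_batch_size` and `num_train_epochs` are illustrative, and the toy dataset stands in for the real 5M-sample corpus; with 4 GPUs the effective batch size is the per-device value times the number of devices.

```python
# Minimal sketch, assuming the current Trainer API. The toy dataset below is
# only there to keep the example self-contained and runnable.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


class ToyDataset(torch.utils.data.Dataset):
    """Tiny stand-in for the real tokenized corpus."""

    def __init__(self):
        self.enc = tokenizer(["an example sentence"] * 64, truncation=True,
                             padding="max_length", max_length=128)

    def __len__(self):
        return 64

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(i % 2)
        return item


args = TrainingArguments(
    output_dir="mbert-finetuned",
    per_device_train_batch_size=32,  # multiplied by the GPU count: 32 x 4 GPUs = effective batch of 128
    num_train_epochs=2,              # with millions of samples, 1-2 epochs is usually enough
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=ToyDataset()).train()
```

If the batch size is raised well beyond 32, the learning rate is often raised with it rather than kept fixed at 2e-5.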

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bwindsor22 commented 4 years ago

@okgrammer A larger batch size often means faster epochs but somewhat lower accuracy. You can test this yourself by doing several runs that vary only the batch size while keeping the other hyperparameters constant.

See in particular "Revisiting Small Batch Training for Deep Neural Networks": https://arxiv.org/pdf/1804.07612.pdf
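
For concreteness, a sketch of the kind of sweep suggested above: vary only the batch size, keep every other hyperparameter fixed, and compare validation accuracy against wall-clock time. `run_finetuning` is a hypothetical wrapper around your training loop (e.g. the `Trainer` sketch earlier) that returns validation accuracy; it is not a transformers API.

```python
# Batch-size sweep sketch. Everything except the batch size stays fixed so the
# comparison is apples-to-apples. `run_finetuning` is a hypothetical helper.
import time

FIXED_HPARAMS = {"learning_rate": 2e-5, "num_train_epochs": 2, "seed": 42}

results = []
for batch_size in (16, 32, 64, 128):
    start = time.time()
    val_accuracy = run_finetuning(per_device_train_batch_size=batch_size, **FIXED_HPARAMS)
    results.append((batch_size, val_accuracy, time.time() - start))

for batch_size, acc, seconds in results:
    print(f"batch={batch_size:<4d} val_acc={acc:.4f} time={seconds:.0f}s")
```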

chen2000 commented 4 years ago

> In the original paper, the BERT model is fine-tuned on downstream NLP tasks where the number of instances per task is on the order of thousands to hundreds of thousands. In my case, I have about 5 million samples. Is there a recommended batch size and number of epochs for a training set of this size? I'm fine-tuning bert-base-multilingual on 4 GPUs, and there is a lot of unused GPU memory with the default batch size of 32; even after increasing it to 128 there is still free memory available.

I have exactly the same issue. Can anyone help? Pretraining is really slow with more than 90% of GPU memory still free, and no matter how much I increase the batch size, GPU memory usage stays minimal.
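
One quick sanity check (my assumption, not something confirmed in this thread): verify that peak GPU memory actually grows with the batch size you think you are passing, e.g. with `torch.cuda.max_memory_allocated`. If it does not grow, the value is probably not reaching the DataLoader at all.

```python
# Sketch: confirm that peak GPU memory scales with batch size by running one
# forward/backward pass per setting and printing the peak allocation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda()

for batch_size in (8, 32, 128):
    torch.cuda.reset_peak_memory_stats()
    batch = tokenizer(["a test sentence"] * batch_size, padding="max_length",
                      max_length=128, truncation=True, return_tensors="pt").to("cuda")
    labels = torch.zeros(batch_size, dtype=torch.long, device="cuda")
    loss = model(**batch, labels=labels).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
    print(f"batch={batch_size:<4d} peak_mem={torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

If memory does scale but training is still slow, the bottleneck is more likely the input pipeline (e.g. tokenizing on the fly, or `dataloader_num_workers` left at 0) than the batch size itself.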