okgrammer closed this issue 4 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@okgrammer A larger batch size often means lower accuracy but faster epochs. You can check this yourself by doing several runs with different batch sizes while keeping the other parameters constant.
See, especially, https://arxiv.org/pdf/1804.07612.pdf
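A minimal sketch of such a sweep, in case it helps. It assumes a hypothetical `train_and_eval(batch_size, ...)` callable that you would replace with your own fine-tuning and evaluation run. The `dummy_train_and_eval` stand-in and its `learning_rate` and `epochs` parameters are illustrative only; it returns a placeholder score so the harness runs on its own, and it says nothing about real BERT accuracy.

```python
import time
from typing import Callable, Dict, Iterable, List


def sweep_batch_sizes(
    train_and_eval: Callable[..., float],
    batch_sizes: Iterable[int],
    **fixed_params,
) -> List[Dict[str, float]]:
    """Run one trial per batch size, holding every other parameter fixed."""
    results = []
    for bs in batch_sizes:
        start = time.perf_counter()
        score = train_and_eval(batch_size=bs, **fixed_params)
        elapsed = time.perf_counter() - start
        results.append({"batch_size": bs, "score": score, "seconds": elapsed})
    return results


def dummy_train_and_eval(batch_size: int, learning_rate: float, epochs: int) -> float:
    """Hypothetical placeholder: swap in a real fine-tuning + evaluation run."""
    return 0.0  # placeholder score, not a measurement


if __name__ == "__main__":
    rows = sweep_batch_sizes(
        dummy_train_and_eval, [32, 64, 128], learning_rate=2e-5, epochs=3
    )
    for row in rows:
        print(row)
```

Comparing the `score` and `seconds` columns across rows shows the accuracy versus speed trade-off for your own data, which is more reliable than a general rule.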
In the original paper, the BERT model is fine-tuned on downstream NLP tasks where the number of instances per task is on the order of thousands to hundreds of thousands. In my case, I have about 5 million samples. Are there recommended batch sizes and numbers of epochs for a training set of this size? I'm fine-tuning bert-base-multilingual on 4 GPUs, and a lot of GPU memory goes unused with the default batch size of 32. Even after increasing it to 128, there is still free memory available.
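For a sense of scale, here is a rough sketch of how many optimizer steps one epoch takes at these sizes. The post does not say whether 32 and 128 are per-GPU or total batch sizes, so the function takes both values and the per-GPU reading is an assumption; `steps_per_epoch` is a helper name made up for this example.

```python
def steps_per_epoch(num_samples: int, per_gpu_batch_size: int, num_gpus: int = 1) -> int:
    """Optimizer steps needed to see every sample once (last partial batch kept)."""
    effective_batch = per_gpu_batch_size * num_gpus
    # Integer ceiling division, so a final partial batch counts as one step.
    return -(-num_samples // effective_batch)


if __name__ == "__main__":
    n = 5_000_000
    # Assumption: batch size is per GPU, with 4 GPUs in data parallel.
    print(steps_per_epoch(n, 32, num_gpus=4))   # 39063
    print(steps_per_epoch(n, 128, num_gpus=4))  # 9766
    # Alternative reading: 32 is the total batch size across all GPUs.
    print(steps_per_epoch(n, 32))               # 156250
```

Under either reading, a single epoch over 5 million samples is already tens of thousands of steps, far more than on the datasets used for fine-tuning in the original paper.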
I have exactly the same issue. Can anyone help? Pretraining is really slow even though more than 90% of GPU memory is free. No matter how much I increase the batch size, GPU memory usage stays minimal.