Closed Parskatt closed 2 days ago
Yes, we used 4 GPUs for training (total batch size 64). We also tried a larger batch size of 128, which gives slightly better results. Regarding the learning rate: indeed, a larger batch size should come with a larger learning rate (for example, if the batch size is doubled, the learning rate should be multiplied by \sqrt{2}). However, we did not try larger learning rates in this project, because most experiments were run with limited GPUs and batch settings. If your total batch size is very small, such as 16 or 8, we recommend reducing the learning rate to 5e-4 or 3e-4 and training for more steps instead.
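A minimal sketch of the square-root scaling rule described above, assuming a reference setting of total batch 64; the base learning rate of 1e-3 here is a hypothetical placeholder, not a value stated in this thread:

```python
import math

def scaled_lr(base_lr, base_batch, per_gpu_batch, world_size):
    """Scale the learning rate by sqrt(total_batch / base_batch).

    base_lr / base_batch: the reference configuration the rule scales from.
    per_gpu_batch * world_size: the effective total batch under DDP.
    """
    total_batch = per_gpu_batch * world_size
    return base_lr * math.sqrt(total_batch / base_batch)

# Same total batch as the reference (16 * 4 = 64) -> lr unchanged.
print(scaled_lr(1e-3, 64, 16, 4))
# Doubled total batch (16 * 8 = 128) -> lr multiplied by sqrt(2).
print(scaled_lr(1e-3, 64, 16, 8))
```

Note that DDP gives each rank its own `per_gpu_batch`, so the effective batch grows with `world_size`; this helper just makes that dependence explicit.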
Thanks!
Hiya, great work! From what I can understand, you run with `batch_size = 16`. However, when running under DDP this appears to be the per-GPU batch size, and as far as I can tell the learning rate is not multiplied when world_size > 1. Usually you would want to adjust these parameters based on the total batch size. How many GPUs did you use for the pretrained models? Was it 4, as in the suggestion? In that case I'm assuming I should use a lower learning rate if I'm using fewer GPUs and a higher learning rate if using more. Does that make sense?