Hi, the batch size parameter is per process (effectively, per GPU), so if you use only a single GPU your effective batch size is 512. If you use 8 GPUs this will fix the problem, as your effective batch size will then be the one used in the benchmarks: 4096 (512 * 8).
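For reference, here is a minimal sketch (illustrative only, not code from this repo) of how the per-process batch size multiplies with the number of processes under torch.distributed data-parallel training:

```python
import torch.distributed as dist

per_gpu_batch_size = 512  # the batch size value set in the config

# In a torch.distributed launch each process drives one GPU, so the effective
# batch size is the per-process value times the number of processes.
world_size = dist.get_world_size() if dist.is_initialized() else 1
effective_batch_size = per_gpu_batch_size * world_size

print(f"{world_size} process(es) -> effective batch size {effective_batch_size}")
# 1 GPU  -> 512
# 8 GPUs -> 4096 (the effective batch size used for the benchmark numbers)
```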
You might be able to fix this issue by dividing the learning rate by 8 (applying the linear scaling rule from https://arxiv.org/abs/1706.02677) to keep the ratio between the batch size and the learning rate the same as in the 8-GPU setup. We haven't tried this though, so the resulting accuracy may not be the same.
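A small sketch of that scaling rule, with placeholder values rather than the repo's actual hyperparameters:

```python
# Linear scaling rule (Goyal et al., https://arxiv.org/abs/1706.02677):
# if the effective batch size shrinks by a factor k, shrink the learning rate
# by the same factor so that lr / batch_size stays constant.
# The lr value below is a placeholder, not the value from rn50_40_epochs.yaml.

reference_batch_size = 4096            # 512 per GPU * 8 GPUs, as in the benchmarks
reference_lr = 1.0                     # placeholder learning rate for the 4096 batch

single_gpu_batch_size = 512            # effective batch size on one GPU
scale = single_gpu_batch_size / reference_batch_size   # 1/8
scaled_lr = reference_lr * scale       # i.e. divide the 8-GPU learning rate by 8
print(f"scaled learning rate: {scaled_lr}")
```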
In an attempt to replicate your numbers, we trained for 40 epochs on a single A100 GPU, using the ffcv dataset files generated from the provided bash script and the config specified in rn50_40_epochs.yaml. After training for ~5 hours, we observed top-1 = 0.729 and top-5 = 0.915, in contrast to your quoted numbers of 0.772 and 0.932 from the configuration table in the README. The primary difference is that we used 1x A100 instead of the 8x A100 you used, and our total training time was roughly 8x what you quote (35.6 minutes for 8x A100).
I don't believe that using a single GPU instead of 8 should impact validation accuracy to this extent (5.5% for top-1 and 1.5% for top-5). Could you suggest why this might be happening, or whether it is indeed due to using a single A100 GPU instead of 8?