Reproducing Validation Numbers

libffcv / ffcv-imagenet

Train ImageNet *fast* in 500 lines of code with FFCV

Apache License 2.0

136 stars 34 forks source link

In an attempt to replicate your numbers, we trained for 40 epochs on a single A100 GPU with the ffcv dataset files generated from the bash script provided with the config specified in rn50_40_epochs.yaml .

After training for ~5 hours, we observed top1=0.729 and top5 = 0.915, in contrast to your quoted numbers of 0.772 and 0.932 from the configuration table in the README. The primary difference was we used 1xA100 instead of 8xA100 that you used, and observed a total training roughly 8x of what you quote (35.6 minutes for 8xA100).

I don't believe that using a single GPU instead of 8 should impact validation accuracy to this extent (5.5% for top 1 and 1.5% for top 5). Could you suggest why this might be happening, or if it is indeed due to using a single A100 GPU instead of 8?

libffcv / ffcv-imagenet

Reproducing Validation Numbers #4