Closed fmassa closed 5 years ago
Here are the software versions used: https://github.com/pytorch/vision/pull/818#issuecomment-508428115
```
>>> torch.cuda.nccl.version()
2406
>>> torch.version.cuda
'10.1.168'
>>> torch.backends.cudnn.version()
7601
>>> torch.__version__
'1.2.0a0+ffa15d2'
```
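For reference, the same environment details can be gathered in one go with standard PyTorch APIs (a small sketch, useful when filing issues like this one):

```python
import torch

# Print the library versions relevant to reproducing this issue.
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())
if torch.cuda.is_available():  # NCCL version is only available on CUDA builds
    print("nccl  :", torch.cuda.nccl.version())
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report.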
I would think this is enough to reproduce the error on HEAD. If it works now, then either the error was fixed, or I did something wrong.
@andravin @fmassa I dug into this a little by running some training jobs from scratch on 8 V100's and using different combinations of pytorch and torchvision versions.
The results were as follows:
| pytorch | torchvision | Best top1 acc | Epoch |
|---|---|---|---|
| master | master | 71.806 | 292 |
| master | 0.3 | 71.638 | 279 |
| 1.1 | master | 71.764 | 300 |
| 1.1 | 0.3 | 71.676 | 289 |
| 1.1 | 0.3 | 71.674 | 278 |
| 1.1 | 0.3 | 71.692 | 284 |
| 1.1 | 0.3 | 71.512 | 281 |
| 1.1 | 0.3 | 71.828 | 300 |
| 1.1 | 0.3 | 71.584 | 295 |
| 1.1 | 0.3 | 71.874 | 298 |
There are a few points to note here. First, the run with pytorch master and torchvision master attained 71.806 top1 accuracy.
Next, I ran a number of pytorch 1.1 / torchvision 0.3 jobs for 300 epochs each. Most of them did not reach numbers close to the advertised 71.878, but some of the runs came close, at 71.828 and 71.874. This suggests that there is a lot of variance across training runs, probably due to different random initializations and other sources of non-determinism.
Finally, while waiting for the jobs to finish, I looked through the PyTorch commit history from 1.1 to master. No commits related to the ops used in MobileNetV2 jumped out at me as suspicious, but it's possible that I missed some more subtle changes.
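For anyone wanting to narrow down the variance, a minimal sketch of pinning the usual sources of non-determinism (random initialization, cuDNN autotuning) with standard PyTorch APIs; this reduces, but does not eliminate, run-to-run variance in multi-GPU training:

```python
import random

import torch

def set_seed(seed: int = 0) -> None:
    """Pin common sources of randomness before building the model."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True  # prefer deterministic kernels
    torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner

set_seed(0)
```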
Here were the nccl/cuda/cudnn versions I used:
```
>>> torch.cuda.nccl.version()
2402
>>> torch.version.cuda
'10.0.130'
>>> torch.backends.cudnn.version()
7501
```
Closing based on @zou3519's conclusion: the discrepancy seems to come from run-to-run variance (±0.2%) rather than any other factor. He also verified that master does converge given a good initialization.
Thanks a lot for the investigation @zou3519 !
It might be a good idea to document the expected accuracy. @zou3519 's 7 experiments on pytorch 1.1 and torchvision 0.3 have a mean and standard deviation of 71.691 +/- 0.127.
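The summary statistic above can be reproduced from the seven PyTorch 1.1 / torchvision 0.3 rows in the table:

```python
from statistics import mean, stdev

# Top-1 accuracies of the seven pytorch 1.1 / torchvision 0.3 runs above.
top1 = [71.676, 71.674, 71.692, 71.512, 71.828, 71.584, 71.874]

# Sample standard deviation matches the reported 71.691 +/- 0.127.
print(f"{mean(top1):.3f} +/- {stdev(top1):.3f}")  # → 71.691 +/- 0.127
```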
@andravin this is a very good point. We unfortunately only have point estimates instead of distributions for these numbers. This is the case for most papers to date as well, but there is some work proposing different ways of reporting metrics for evaluating families of models, e.g., https://arxiv.org/abs/1905.13214
@fmassa yeah, I think it is good that you currently report the ImageNet accuracy for the pretrained weights at https://pytorch.org/docs/stable/torchvision/models.html
I was making the point that a user does not know what accuracy to expect if they train the model from scratch.
But apparently there is no documentation about how any of the models were trained, so the user really has no way to reproduce your results.
My advice would be to have a separate page for each model that documents the hyperparameters used for training (i.e., the exact train.py command line used; hopefully that program was used for all the models!). Additionally, it would be great to know the mean accuracy and variance.
I would think that pytorch developers also need this information for regression testing.
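A regression test along these lines could be a simple statistical gate. A hypothetical sketch based on the run-to-run statistics reported in this thread (mean 71.691, stdev 0.127 over seven runs); the 3-sigma threshold is an assumption, not an established policy:

```python
# Statistics from the seven pytorch 1.1 / torchvision 0.3 runs above.
MEAN, STDEV = 71.691, 0.127

def within_expected(top1: float, k: float = 3.0) -> bool:
    """Return True if a run's top-1 accuracy is within k stdevs of the mean."""
    return abs(top1 - MEAN) <= k * STDEV

print(within_expected(71.806))  # → True: a healthy run from the table above
print(within_expected(70.9))    # → False: a clear regression would be flagged
```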
@andravin
> My advice would be to have a separate page for each model that documents the hyperparameters used for training (i.e., the exact train.py command line used; hopefully that program was used for all the models!). Additionally, it would be great to know the mean accuracy and variance.
Totally, you know what, I'll be putting up a README now with the hyperparameters that I used to train the models that we have in the model zoo. Thanks!
It would also be good to know the training time and hardware spec (e.g., 8x V100).
Reported by @andravin in https://github.com/pytorch/vision/pull/818#issuecomment-509337263
With PyTorch 1.1 and torchvision 0.3, we are able to reach 71.878 top1 accuracy on ImageNet for MobileNetV2. The training command is the following:
with best accuracy at epoch 285.
@andravin tried running the same code with a more recent version of PyTorch and torchvision and got 71.536 (@andravin, do you maybe have the specific versions?), a drop that seems too large to be just random variation.
Investigate (and fix) the cause of this.
A few related changes (in torchvision) which I have looked into, but didn't find anything particularly suspicious:
Note: it takes ~35h to train the model on 8-GPU machines.