Closed fmassa closed 5 years ago
Here are the software versions used: https://github.com/pytorch/vision/pull/818#issuecomment-508428115
```
>>> torch.cuda.nccl.version()
2406
>>> torch.version.cuda
'10.1.168'
>>> torch.backends.cudnn.version()
7601
>>> torch.__version__
'1.2.0a0+ffa15d2'
```
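For reference, the same environment details can be gathered in one go with standard PyTorch APIs (a small sketch, useful when filing issues like this one):

```python
import torch

# Print the library versions relevant to reproducing this issue.
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("cudnn :", torch.backends.cudnn.version())
if torch.cuda.is_available():  # NCCL version is only available on CUDA builds
    print("nccl  :", torch.cuda.nccl.version())
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report.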
I would think this is enough to reproduce the error on HEAD. If it works now, then either the error was fixed, or I did something wrong.
@andravin @fmassa I dug into this a little by running some training jobs from scratch on 8 V100's and using different combinations of pytorch and torchvision versions.
The results were as follows:
| pytorch | torchvision | Best top1 acc | Epoch |
|---|---|---|---|
| master | master | 71.806 | 292 |
| master | 0.3 | 71.638 | 279 |
| 1.1 | master | 71.764 | 300 |
| 1.1 | 0.3 | 71.676 | 289 |
| 1.1 | 0.3 | 71.674 | 278 |
| 1.1 | 0.3 | 71.692 | 284 |
| 1.1 | 0.3 | 71.512 | 281 |
| 1.1 | 0.3 | 71.828 | 300 |
| 1.1 | 0.3 | 71.584 | 295 |
| 1.1 | 0.3 | 71.874 | 298 |
There are a few points to note here. First, the run with pytorch master and torchvision master attained 71.806 top1 accuracy.
Next, I ran a number of pytorch 1.1 / torchvision 0.3 jobs for 300 epochs each. Most of them did not reach numbers close to the advertised 71.878, but some of the runs came close, at 71.828 and 71.874. This suggests that there is a lot of variance across training runs, probably due to different random initializations and other sources of non-determinism.
Finally, while waiting for the jobs to finish, I looked through the PyTorch commit history from 1.1 to master. No commits related to the ops used in MobileNetV2 jumped out at me as suspicious, but it's possible that I missed some more subtle changes.
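For anyone wanting to narrow down the variance, a minimal sketch of pinning the usual sources of non-determinism (random initialization, cuDNN autotuning) with standard PyTorch APIs; this reduces, but does not eliminate, run-to-run variance in multi-GPU training:

```python
import random

import torch

def set_seed(seed: int = 0) -> None:
    """Pin common sources of randomness before building the model."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True  # prefer deterministic kernels
    torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner

set_seed(0)
```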
Here were the nccl/cuda/cudnn versions I used:
```
>>> torch.cuda.nccl.version()
2402
>>> torch.version.cuda
'10.0.130'
>>> torch.backends.cudnn.version()
7501
```
Closing based on @zou3519's conclusion: the discrepancy seems to come from run-to-run variance (±0.2%) rather than any other factor. He also verified that master does converge given a good initialization.
Thanks a lot for the investigation @zou3519 !
It might be a good idea to document the expected accuracy. @zou3519 's 7 experiments on pytorch 1.1 and torchvision 0.3 have a mean and standard deviation of 71.691 +/- 0.127.
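The summary statistic above can be reproduced from the seven PyTorch 1.1 / torchvision 0.3 rows in the table:

```python
from statistics import mean, stdev

# Top-1 accuracies of the seven pytorch 1.1 / torchvision 0.3 runs above.
top1 = [71.676, 71.674, 71.692, 71.512, 71.828, 71.584, 71.874]

# Sample standard deviation matches the reported 71.691 +/- 0.127.
print(f"{mean(top1):.3f} +/- {stdev(top1):.3f}")  # → 71.691 +/- 0.127
```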
@andravin this is a very good point. We unfortunately only have point estimates instead of distributions for these numbers. This is the case for most papers to date as well, but there is some work proposing different ways of reporting metrics for evaluating families of models, e.g., https://arxiv.org/abs/1905.13214
@fmassa yeah, I think it is good that you currently report the ImageNet accuracy for the pretrained weights at https://pytorch.org/docs/stable/torchvision/models.html
I was making the point that a user does not know what accuracy to expect if they train the model from scratch.
But apparently there is no documentation about how any of the models were trained, so the user really has no way to reproduce your results.
My advice would be to have a separate page for each model that documents the hyperparameters used for training (i.e., the exact train.py command line used; hopefully that program was used for all the models!). Additionally, it would be great to know the mean accuracy and variance.
I would think that pytorch developers also need this information for regression testing.
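A regression test along these lines could be a simple statistical gate. A hypothetical sketch based on the run-to-run statistics reported in this thread (mean 71.691, stdev 0.127 over seven runs); the 3-sigma threshold is an assumption, not an established policy:

```python
# Statistics from the seven pytorch 1.1 / torchvision 0.3 runs above.
MEAN, STDEV = 71.691, 0.127

def within_expected(top1: float, k: float = 3.0) -> bool:
    """Return True if a run's top-1 accuracy is within k stdevs of the mean."""
    return abs(top1 - MEAN) <= k * STDEV

print(within_expected(71.806))  # → True: a healthy run from the table above
print(within_expected(70.9))    # → False: a clear regression would be flagged
```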
@andravin
> My advice would be to have a separate page for each model that documents the hyperparameters used for training (i.e., the exact train.py command line used; hopefully that program was used for all the models!). Additionally, it would be great to know the mean accuracy and variance.
Totally, you know what, I'll be putting up a README now with the hyperparameters that I used to train the models that we have in the model zoo. Thanks!
It would also be good to know the training time and hardware spec (e.g., 8x V100).
Reported by @andravin in https://github.com/pytorch/vision/pull/818#issuecomment-509337263
With PyTorch 1.1 and torchvision 0.3, we are able to reach 71.878 top1 accuracy on ImageNet for MobileNetV2. The training command is the following:
with best accuracy at epoch 285.
@andravin tried running the same code with a more recent version of PyTorch and torchvision and got 71.536 (@andravin, do you maybe have the specific versions?), a drop that seems too large to be just random variation.
Investigate (and fix) the cause of this.
A few related changes (in torchvision) which I have looked into, but didn't find anything particularly suspicious:
Note: it takes ~35h to train the model on 8-GPU machines.