pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

quantization-aware training in classification #1729

Open robotcator opened 4 years ago

robotcator commented 4 years ago

Hi, I want to reproduce the result of the quantization-aware training of mobilenet_v2 using this script.

1: It seems that the script raises a launch error when using multiple GPUs. Is multi-GPU supported for quantization-aware training?
2: The readme says that 'Training converges at about 10 epochs.', but after 10 epochs the test result cannot reach the 'acc@top1 71.6' of the pretrained model hosted in hub.
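For context, here is a minimal sketch of the eager-mode QAT workflow that the torchvision reference script roughly follows; the backend choice and the use of pretrained weights are assumptions for illustration, not the script's actual configuration:

```python
import torch
import torchvision

# Quantizable MobileNetV2; quantize=False keeps float weights so QAT can fine-tune them.
model = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=False)
model.fuse_model()  # fuse Conv+BN+ReLU blocks before inserting fake-quant modules

# 'qnnpack' is an assumption here; the backend depends on the deployment target.
model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
torch.quantization.prepare_qat(model, inplace=True)

# ... run the usual training loop with fake quantization enabled ...

# After training, convert to an actual quantized model for evaluation/deployment.
model.eval()
quantized_model = torch.quantization.convert(model)
```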

robotcator commented 4 years ago

cc @fmassa

robotcator commented 4 years ago

https://github.com/pytorch/pytorch/issues/32082

jerryzh168 commented 4 years ago

cc @raghuramank100

fmassa commented 4 years ago

@robotcator can you clarify a few things:

robotcator commented 4 years ago

@fmassa Thank you for your response.

1: I use 8 GPUs on a single node.
2: From the code logic, multi-GPU training for the quantized model looks OK, but maybe we need to handle the EMA observer carefully. What do you think?
3: I use the torch.distributed.launch method to launch the training script.
4: About convergence, I am not sure whether the pretrained result was trained with multiple GPUs, so I don't know whether a different training setup leads to different convergence.
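To illustrate the EMA-observer concern in point 2, here is a hedged sketch (not the reference script itself) of wrapping a QAT-prepared model in DistributedDataParallel; how the observer/fake-quant buffers behave across ranks is exactly the open question:

```python
import os
import torch
import torch.distributed as dist
import torchvision

# Assumes launch via `python -m torch.distributed.launch --use_env`, which sets LOCAL_RANK.
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=False)
model.fuse_model()
model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
torch.quantization.prepare_qat(model, inplace=True)
model.cuda(local_rank)

# Observer min/max (EMA) statistics are registered as buffers; with the default
# broadcast_buffers=True, DDP rebroadcasts rank 0's buffers before each forward,
# so per-rank observer statistics are effectively overridden by rank 0's values.
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], broadcast_buffers=True)
```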

Here are some facts about my experiment:

1: I downloaded the imagenet_1k dataset as an example to reproduce the error.
2: I used the following command for normal float training, and the program runs fine in my environment:

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model mobilenet_v2 --epochs 300 --lr 0.045 --wd 0.00004 --lr-step-size 1 --lr-gamma 0.98 --data-path=~/test/imagenet_1k

3: Then I used the following command for the quantized training, and it runs into problems. The log description is here and I updated the reproducible steps:

python -m torch.distributed.launch --nproc_per_node=8 --use_env train_quantization.py --data-path=~/test/imagenet_1k

4: I also have some findings, which I clarify in this issue.
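On the convergence point, the QAT recipe typically stops updating observers and freezes batch-norm statistics after a few epochs so that the quantization ranges settle before accuracy is measured. A hedged sketch of that schedule is below; the epoch thresholds and the train_one_epoch/evaluate helpers are hypothetical placeholders, not the script's actual values:

```python
import torch

# Hypothetical thresholds chosen for illustration only.
num_epochs = 10
num_observer_update_epochs = 4
num_batch_norm_update_epochs = 3

for epoch in range(num_epochs):
    train_one_epoch(model, criterion, optimizer, data_loader, epoch)  # hypothetical helper

    if epoch >= num_observer_update_epochs:
        # Stop updating quantization ranges so scales/zero-points stabilize.
        model.apply(torch.quantization.disable_observer)
    if epoch >= num_batch_norm_update_epochs:
        # Freeze batch-norm statistics in the fused QAT modules.
        model.apply(torch.nn.intrinsic.qat.freeze_bn_stats)

    evaluate(model, criterion, data_loader_test)  # hypothetical helper
```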

LeeHXAlice commented 4 years ago

@robotcator I am hitting the same issue. How did you fix this problem? Is there any advice? Thanks.