Open xiaolonghao opened 2 years ago
I run resnet18_2bit_quantize_downsample_True and get the best result as follows:
acc@1 69.260 acc@5 88.588
I noticed that in the last few epochs the top-1 accuracy increases significantly, which is likely caused by the decrease of the learning rate in the final stage of training. The accuracies of the last few epochs are as follows:
Test set Top1 Accuracy (%) after each epoch:
epoch 116: 68.322
epoch 117: 68.204
epoch 118: 68.230
epoch 119: 68.336
epoch 120: 68.492
epoch 121: 68.832
epoch 122: 68.816
epoch 123: 68.858
epoch 124: 68.668
epoch 125: 69.242
epoch 126: 69.024
epoch 127: 69.260
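To illustrate why the accuracy jumps near the end: if the schedule anneals the learning rate toward zero, the step size in the final epochs is tiny, so the weights settle into a minimum and test accuracy climbs. Below is a minimal sketch assuming a linear decay over 128 epochs; the repo's actual schedule and base learning rate may differ.

```python
def linear_decay_lr(base_lr, epoch, total_epochs=128):
    """Linearly anneal the learning rate from base_lr down to 0.

    Hypothetical schedule for illustration only -- the actual code
    may use a different decay rule and hyperparameters.
    """
    return base_lr * (1 - epoch / total_epochs)

# At epoch 116 the LR is already under 10% of the initial value,
# and by epoch 127 it is less than 1% -- the regime where the
# reported late-epoch accuracy gains typically appear.
for epoch in (0, 116, 127):
    print(epoch, linear_decay_lr(0.01, epoch))
```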
Did you run the script file directly? Did you make any changes? How many GPUs did you use for training? It took me 119 hours to train on eight GPUs. I repeated the experiment twice; the test-set results from epoch 118 to epoch 127 of the first run were:
acc@1 68.140 acc@5 87.682
acc@1 68.090 acc@5 87.774
acc@1 68.188 acc@5 87.944
acc@1 68.228 acc@5 87.974
acc@1 68.428 acc@5 87.998
acc@1 68.444 acc@5 88.230
acc@1 68.460 acc@5 88.132
acc@1 68.724 acc@5 88.152
acc@1 68.850 acc@5 88.194
acc@1 68.992 acc@5 88.360
The results of the second run were:
acc@1 68.068 acc@5 87.886
acc@1 68.202 acc@5 88.044
acc@1 68.132 acc@5 88.082
acc@1 68.272 acc@5 88.044
acc@1 68.394 acc@5 88.200
acc@1 68.552 acc@5 88.228
acc@1 68.564 acc@5 88.128
acc@1 68.572 acc@5 88.434
acc@1 68.896 acc@5 88.428
acc@1 68.948 acc@5 88.382
The two runs are very close to each other, but there is still a gap with your reproduced result. The test-set accuracy keeps increasing over the last few epochs, which is related to the decreasing learning rate. I may try adding more epochs to see whether that helps, but a single training run takes too much time.
(1) I did not change any code except the path to the ImageNet dataset. The command I ran is: cd /userhome/Nonuniform-to-Uniform-Quantization/resnet && bash run.sh resnet18 2 True
(2) I used two V100 GPUs for 5 days and 16 hours, i.e. 272 GPU-hours. The PyTorch version is 1.18. (3) This code uses "nn.DataParallel" for multi-GPU training, which is much slower than "nn.parallel.DistributedDataParallel" when the number of GPUs is large, so using too many GPUs may actually slow down your training.
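The difference between the two wrappers can be sketched as below. This is not the repo's actual code, just a minimal illustration: DataParallel runs in a single process and re-scatters the model to every GPU on each forward pass, while DistributedDataParallel runs one process per GPU and overlaps gradient all-reduce with the backward pass, which scales much better. The `wrap_model` helper and the single-process "gloo" group are hypothetical, used here only to show the API.

```python
import os
import torch.nn as nn
import torch.distributed as dist

def wrap_model(model, use_ddp):
    """Illustrative helper (not from the repo): choose a multi-GPU wrapper."""
    if use_ddp:
        # DDP normally launches one process per GPU (e.g. via torchrun);
        # here we initialize a trivial single-process CPU group with the
        # "gloo" backend purely to demonstrate the call sequence.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("gloo", rank=0, world_size=1)
        return nn.parallel.DistributedDataParallel(model)
    # DataParallel: one process replicates the model across GPUs every
    # forward pass and gathers outputs on GPU 0 -- a common bottleneck
    # as the GPU count grows.
    return nn.DataParallel(model)
```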
Thank you for your reply. I may also try to reproduce the results again with 2 GPUs.
The reproduced result from running the provided code is around 69.2.
Hello, I am very glad to see this latest and best quantization method. I have some questions that came up while reproducing the results, and I hope you can answer them. I used the script you provided for training ResNet to reproduce the 2-bit result, but after repeating it twice my result was 68.9, which is noticeably lower than the 69.4 reported in the paper. However, I could reproduce the paper's result using the model you provided. Could you tell me what I should pay attention to when reproducing the result, and what might be the reason my result is lower? Thank you.