theFoxofSky / ddfnet

The official implementation of the CVPR2021 paper: Decoupled Dynamic Filter Networks

Issue about the training process #18

Open · melohux opened this issue 3 years ago

melohux commented 3 years ago

Thanks for your excellent work. I ran into some problems when training your model following your instructions.

  1. Your paper states that batch size 256 was used for all experimental results, but the command in your instructions, `./distributed_train.sh 8 <path_to_imagenet> --model ddf_mul_resnet50 --lr 0.4 --warmup-epochs 5 --epochs 120 --sched cosine -b 128 -j 6 --amp --dist-bn reduce`, appears to launch training with a total batch size of 128 * 8 = 1024.

  2. When I run that same command, the training process seems to be correct, but the validation has some problems:

```
Train: 129 [   0/1251 (  0%)] Loss: 1.837984 (1.8380) Time: 1.869s, 547.82/s (1.869s, 547.82/s) LR: 1.000e-05 Data: 1.558 (1.558)
Train: 129 [  50/1251 (  4%)] Loss: 1.915305 (1.8766) Time: 0.339s, 3023.06/s (0.370s, 2768.98/s) LR: 1.000e-05 Data: 0.016 (0.047)
Train: 129 [ 100/1251 (  8%)] Loss: 1.936936 (1.8967) Time: 0.338s, 3028.31/s (0.355s, 2885.79/s) LR: 1.000e-05 Data: 0.019 (0.033)
Train: 129 [ 150/1251 ( 12%)] Loss: 1.877319 (1.8919) Time: 0.336s, 3045.61/s (0.349s, 2931.77/s) LR: 1.000e-05 Data: 0.017 (0.028)
Train: 129 [ 200/1251 ( 16%)] Loss: 1.827796 (1.8791) Time: 0.340s, 3011.16/s (0.347s, 2953.92/s) LR: 1.000e-05 Data: 0.022 (0.025)
Train: 129 [ 250/1251 ( 20%)] Loss: 1.865778 (1.8769) Time: 0.338s, 3031.46/s (0.345s, 2966.32/s) LR: 1.000e-05 Data: 0.018 (0.024)
Train: 129 [ 300/1251 ( 24%)] Loss: 1.879160 (1.8772) Time: 0.343s, 2982.10/s (0.344s, 2975.39/s) LR: 1.000e-05 Data: 0.017 (0.023)
Train: 129 [ 350/1251 ( 28%)] Loss: 1.857682 (1.8747) Time: 0.337s, 3039.01/s (0.344s, 2980.71/s) LR: 1.000e-05 Data: 0.018 (0.022)
Train: 129 [ 400/1251 ( 32%)] Loss: 1.845622 (1.8715) Time: 0.339s, 3017.76/s (0.343s, 2984.77/s) LR: 1.000e-05 Data: 0.019 (0.022)
Train: 129 [ 450/1251 ( 36%)] Loss: 1.938300 (1.8782) Time: 0.335s, 3052.68/s (0.343s, 2989.22/s) LR: 1.000e-05 Data: 0.018 (0.021)
Train: 129 [ 500/1251 ( 40%)] Loss: 1.805174 (1.8716) Time: 0.339s, 3018.73/s (0.342s, 2992.69/s) LR: 1.000e-05 Data: 0.021 (0.021)
Train: 129 [ 550/1251 ( 44%)] Loss: 1.859214 (1.8705) Time: 0.340s, 3013.13/s (0.342s, 2994.75/s) LR: 1.000e-05 Data: 0.018 (0.021)
Train: 129 [ 600/1251 ( 48%)] Loss: 1.872183 (1.8707) Time: 0.338s, 3029.93/s (0.342s, 2997.84/s) LR: 1.000e-05 Data: 0.017 (0.020)
Train: 129 [ 650/1251 ( 52%)] Loss: 1.859764 (1.8699) Time: 0.351s, 2916.84/s (0.341s, 2999.01/s) LR: 1.000e-05 Data: 0.016 (0.020)
Train: 129 [ 700/1251 ( 56%)] Loss: 1.845083 (1.8682) Time: 0.339s, 3021.80/s (0.341s, 3000.37/s) LR: 1.000e-05 Data: 0.019 (0.020)
Train: 129 [ 750/1251 ( 60%)] Loss: 1.987917 (1.8757) Time: 0.337s, 3038.19/s (0.341s, 3000.99/s) LR: 1.000e-05 Data: 0.017 (0.020)
Train: 129 [ 800/1251 ( 64%)] Loss: 1.889720 (1.8765) Time: 0.344s, 2979.40/s (0.341s, 3002.14/s) LR: 1.000e-05 Data: 0.018 (0.020)
Train: 129 [ 850/1251 ( 68%)] Loss: 1.952255 (1.8807) Time: 0.341s, 3006.07/s (0.341s, 3003.34/s) LR: 1.000e-05 Data: 0.019 (0.020)
Train: 129 [ 900/1251 ( 72%)] Loss: 1.884332 (1.8809) Time: 0.342s, 2996.99/s (0.341s, 3003.49/s) LR: 1.000e-05 Data: 0.017 (0.019)
Train: 129 [ 950/1251 ( 76%)] Loss: 1.888057 (1.8813) Time: 0.336s, 3045.26/s (0.341s, 3004.35/s) LR: 1.000e-05 Data: 0.019 (0.019)
Train: 129 [1000/1251 ( 80%)] Loss: 1.835092 (1.8791) Time: 0.342s, 2993.21/s (0.341s, 3004.89/s) LR: 1.000e-05 Data: 0.020 (0.019)
Train: 129 [1050/1251 ( 84%)] Loss: 1.847999 (1.8777) Time: 0.336s, 3047.83/s (0.341s, 3005.33/s) LR: 1.000e-05 Data: 0.016 (0.019)
Train: 129 [1100/1251 ( 88%)] Loss: 1.849290 (1.8764) Time: 0.336s, 3048.30/s (0.341s, 3005.95/s) LR: 1.000e-05 Data: 0.018 (0.019)
Train: 129 [1150/1251 ( 92%)] Loss: 1.883289 (1.8767) Time: 0.334s, 3068.20/s (0.341s, 3006.69/s) LR: 1.000e-05 Data: 0.016 (0.019)
Train: 129 [1200/1251 ( 96%)] Loss: 1.855369 (1.8759) Time: 0.335s, 3054.89/s (0.340s, 3007.43/s) LR: 1.000e-05 Data: 0.018 (0.019)
Train: 129 [1250/1251 (100%)] Loss: 1.903906 (1.8769) Time: 0.318s, 3224.92/s (0.340s, 3008.18/s) LR: 1.000e-05 Data: 0.000 (0.019)
Distributing BatchNorm running means and vars
Test: [  0/48] Time: 1.501 (1.501) Loss: 9.5781 (9.5781) Acc@1: 0.0000 ( 0.0000) Acc@5: 0.2930 ( 0.2930)
Test: [ 48/48] Time: 0.068 (0.242) Loss: 9.5078 (9.5792) Acc@1: 0.0000 ( 0.0940) Acc@5: 0.2358 ( 0.3160)
```

Does this training log match your training process? Do you have any idea what is causing the problem in the testing part?

theFoxofSky commented 3 years ago

"The learning rate is set to 0.1 with batch size 256 and decays to 1e-5 following the cosine schedule. " This line in paper means that I set 0.1 for batch size 256, i.e.

lr = 0.1 * (batch_size / 256),

which gives 0.4 for batch size 1024 (4 * 256).

Sorry for the unclear wording.
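
To make the arithmetic concrete, here is a minimal sketch of this linear scaling rule; the variable names are only illustrative and the numbers mirror the command in this issue (8 GPUs with `-b 128` each):

```python
# Minimal sketch of the linear LR scaling rule described above.
num_gpus = 8
batch_per_gpu = 128
base_lr = 0.1      # paper's learning rate for a total batch size of 256
base_batch = 256

effective_batch = num_gpus * batch_per_gpu            # 1024
scaled_lr = base_lr * effective_batch / base_batch    # 0.4, matching --lr 0.4

print(effective_batch, scaled_lr)
```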

Since I am working on the next paper based on DDF, I have updated this repo several times. I will check the validation code. You can also verify it by using the released model parameters.
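
In case it helps while the validation code is being checked, below is a rough sketch of how one might sanity-check the released weights against the ImageNet validation split. It assumes `ddf_mul_resnet50` is registered with timm's model registry (the training script follows timm) and that the checkpoint stores the weights either directly or under a `state_dict` key; the module name `ddf_models` and the checkpoint filename are placeholders, not the repo's actual layout.

```python
# Rough sanity check of the released weights on the ImageNet val split.
# Assumptions (not verified against the repo): ddf_mul_resnet50 is registered
# with timm's model registry, and the checkpoint keeps weights either directly
# or under a 'state_dict' key. 'ddf_models' and the .pth filename are placeholders.
import torch
import timm
from timm.data import create_transform
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

import ddf_models  # hypothetical: whatever module registers the DDF models with timm

model = timm.create_model('ddf_mul_resnet50', pretrained=False)
ckpt = torch.load('ddf_mul_resnet50.pth', map_location='cpu')
model.load_state_dict(ckpt.get('state_dict', ckpt))
model = model.cuda().eval()

val_set = ImageFolder('<path_to_imagenet>/val', transform=create_transform(input_size=224))
loader = DataLoader(val_set, batch_size=128, num_workers=6)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images.cuda()).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

# If this reproduces the reported top-1 accuracy, the released weights and the
# evaluation transform are fine, and the problem is likely in the training run itself.
print(f'top-1: {100.0 * correct / total:.2f}%')
```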

melohux commented 3 years ago

So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss value? In addition, it would be great if you could check the validation code, thanks.

theFoxofSky commented 3 years ago

> So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss value? In addition, it would be great if you could check the validation code, thanks.

I use 1024 for R50, 512 for R101.

xiaoachen98 commented 2 years ago

> So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss value? In addition, it would be great if you could check the validation code, thanks.

> I use 1024 for R50, 512 for R101.

And for R101, what is your data augmentation schedule? I found random erasing, auto-augment, and color jitter in your training code.
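
For reference, those three augmentations map onto timm's transform factory, which the training script in this repo appears to follow; a sketch is below. The parameter values are placeholders for illustration only, not the settings actually used for R101 (note that in timm, color jitter is typically skipped when an auto-augment policy is active).

```python
# Sketch of how random erasing, auto-augment, and color jitter are exposed in
# timm's transform factory. Values are placeholders -- NOT the author's R101 recipe.
from timm.data import create_transform

train_transform = create_transform(
    input_size=224,
    is_training=True,
    color_jitter=0.4,                 # color jitter strength (placeholder)
    auto_augment='rand-m9-mstd0.5',   # auto-augment / RandAugment policy (placeholder)
    re_prob=0.25,                     # random erasing probability (placeholder)
    re_mode='pixel',                  # random erasing fill mode
)
print(train_transform)
```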