microsoft / Swin-Transformer

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
https://arxiv.org/abs/2103.14030
MIT License

The model can not converge #100

Open guozhiyao opened 3 years ago

guozhiyao commented 3 years ago

I am training swin_tiny_patch4_window7_224 on a dataset with one million classes and 100 million images, using softmax loss and AdamW. The batch size is 600 and I train for 400,000 iterations, but the model cannot converge.

guozhiyao commented 3 years ago

I found that the average grad-norm of my model is about 0.7, which is much smaller than in your setup; this makes the parameter updates very slow, and the model cannot converge. Do you know how to fix it?
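For reference, this is roughly how I measure it (a minimal sketch; `model`, `criterion`, and `optimizer` are placeholders for my own training code):

```python
import torch

def train_step(model, criterion, optimizer, images, targets):
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()

    # Global L2 norm over all parameter gradients, i.e. roughly the same
    # quantity that this repo's training log prints as grad_norm.
    total_norm = torch.norm(
        torch.stack([p.grad.detach().norm(2)
                     for p in model.parameters() if p.grad is not None]), 2)

    optimizer.step()
    return loss.item(), total_norm.item()
```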

Starboy-at-earth commented 3 years ago

Emmm, I also ran into this problem!!! The loss does not go down as training progresses. Can we discuss it over QQ? My account is 2667004002.

ancientmooner commented 3 years ago

> I am training swin_tiny_patch4_window7_224 on a dataset with one million classes and 100 million images, using softmax loss and AdamW. The batch size is 600 and I train for 400,000 iterations, but the model cannot converge.

You may check the same code on a dataset of smaller scale, to fix potential bugs.
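For example, a quick sanity check is to overfit a few hundred samples first (sketch only; `full_dataset`, `model`, `criterion`, and `optimizer` are placeholders for your own objects):

```python
from torch.utils.data import Subset, DataLoader

# If the loss does not approach zero on a few hundred samples,
# the bug is in the code, not in the scale of the data.
small_set = Subset(full_dataset, list(range(512)))
loader = DataLoader(small_set, batch_size=64, shuffle=True)

model.train()
for epoch in range(50):
    for images, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```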

JackjackFan commented 3 years ago

Fine... I also ran into this problem. The loss does not go down even with an lr of 1e-7, and I do not know how to solve it. I replaced ResNet with Swin-S as the new backbone in my network, but the loss will not go down.

guozhiyao commented 3 years ago

My model can converge now. I still train with softmax loss; setting the warm-up iterations and using a large batch size lets it converge normally.
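By warm-up I mean a linear warm-up followed by cosine decay, which (as far as I can tell) this repo builds from timm's CosineLRScheduler. A simplified sketch with illustrative numbers (`optimizer` and `steps_per_epoch` are placeholders):

```python
from timm.scheduler.cosine_lr import CosineLRScheduler

# Illustrative numbers only; the real values come from the config.
warmup_epochs, total_epochs = 20, 300
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=total_epochs * steps_per_epoch,  # schedule length in iterations
    lr_min=5e-6,
    warmup_t=warmup_epochs * steps_per_epoch,  # linear warm-up in iterations
    warmup_lr_init=5e-7,
    t_in_epochs=False,                         # step the schedule per iteration
)

for step in range(total_epochs * steps_per_epoch):
    ...  # forward / backward / optimizer.step()
    scheduler.step_update(step)  # per-iteration LR update
```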

hdmjdp commented 2 years ago

What is "warm up iters"? I cannot find it in the config. By the way, I have the same problem; the loss does not go down, as shown below:

[2021-12-05 14:30:04 swin_base_patch4_window7_224](main.py 224): INFO Train: [0/300][80/1895] eta 0:42:20 lr 0.000000194 time 1.4120 (1.3996) loss 9.5960 (9.6270) grad_norm 4.6469 (5.3799) mem 17084MB
[2021-12-05 14:30:18 swin_base_patch4_window7_224](main.py 224): INFO Train: [0/300][90/1895] eta 0:42:04 lr 0.000000211 time 1.3961 (1.3988) loss 9.5745 (9.6272) grad_norm 5.0378 (5.4116) mem 17084MB
[2021-12-05 14:30:32 swin_base_patch4_window7_224](main.py 224): INFO Train: [0/300][100/1895] eta 0:41:48 lr 0.000000227 time 1.3841 (1.3972) loss 9.6274 (9.6305) grad_norm 5.9137 (5.6181) mem 17084MB
[2021-12-05 14:30:46 swin_base_patch4_window7_224](main.py 224): INFO Train: [0/300][110/1895] eta 0:41:41 lr 0.000000244 time 1.3985 (1.4014) loss 9.5727 (9.6291) grad_norm 5.7255 (5.6733) mem 17084MB

@guozhiyao my batch_size=64, and my dataset has 14000 classes.
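Edit: digging through the code, the warm-up settings seem to live in the default config rather than the YAML files (keys like TRAIN.WARMUP_EPOCHS and TRAIN.WARMUP_LR in config.py, if I read it right), and main.py appears to scale the learning rates linearly with the total batch size, so with batch_size=64 the effective LRs are much smaller than the defaults. A rough sketch of what I mean:

```python
# Rough sketch of the linear LR scaling (to the best of my reading of main.py).
# With batch_size=64 on one GPU the effective peak LR is only 64/512 of the
# default, and the warm-up LR shrinks the same way, which would explain the
# tiny lr values in the log above.
base_lr, warmup_lr, min_lr = 5e-4, 5e-7, 5e-6   # assumed config defaults
batch_size, world_size = 64, 1

scale = batch_size * world_size / 512.0
linear_scaled_lr = base_lr * scale              # 6.25e-05
linear_scaled_warmup_lr = warmup_lr * scale     # 6.25e-08
linear_scaled_min_lr = min_lr * scale           # 6.25e-07

print(linear_scaled_lr, linear_scaled_warmup_lr, linear_scaled_min_lr)
```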

ancientmooner commented 2 years ago

Thanks @guozhiyao

ruiyan1995 commented 2 years ago

@hdmjdp How did you solve this issue?

BitCalSaul commented 9 months ago

Mine cannot go down either, on CIFAR-10 with my own framework.

BitCalSaul commented 9 months ago

@guozhiyao Hey, I'm wondering what the grad_norm actually tells us. I have seen several people refer to this metric in their issues about Swin's convergence. Could you please give a hint? Thanks.
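From what I can tell, the grad_norm in the log is just the total gradient norm returned by torch.nn.utils.clip_grad_norm_ before the optimizer step; please correct me if I'm wrong. A minimal sketch (`model` and `optimizer` are placeholders):

```python
import torch

# clip_grad_norm_ returns the total L2 norm of all gradients (computed
# before clipping to the threshold). A value stuck near zero means the
# weights barely move; spikes or NaNs usually signal divergence.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```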