lucidrains / transformer-in-transformer

Implementation of Transformer in Transformer, pixel-level attention paired with patch-level attention for image classification, in PyTorch

Why does the loss become NaN? #6

Open yt7589 opened 3 years ago

yt7589 commented 3 years ago

This is a great project, and I am very interested in the Transformer in Transformer model. I used your model to train on Vehicle-1M, a fine-grained visual classification dataset. With this model, the loss becomes NaN after some number of batch iterations. I decreased the learning rate of the Adam optimizer and clipped the gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2), but the loss still becomes NaN sometimes. It seems the gradients are not large, but they point in the same direction for many iterations. How can I solve this?
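For context, here is a minimal sketch of the training step described above. The model, data, and learning rate are placeholders; only the clip_grad_norm_ call and the idea of lowering the Adam learning rate come from the comment:

```python
import torch
from torch import nn

# Placeholder model; swap in TNT and the Vehicle-1M data loader.
model = nn.Linear(512, 1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lowered learning rate (assumed value)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    # Clip the global gradient norm, as in the issue description.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
    optimizer.step()
    return loss.item()
```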

panda1949 commented 3 years ago

I have trained TNT on ImageNet. With the hyper-parameters from the paper (https://arxiv.org/pdf/2103.00112.pdf) and the DeiT training code (https://github.com/facebookresearch/deit), I reproduced the reported result: top-1 accuracy of 81.3 for TNT-S.

Have you tried the default hyper-parameters from the TNT paper?
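For reference, a sketch of a DeiT-style optimizer setup. The values below are the commonly cited DeiT defaults, not quoted from the TNT paper, so treat them as assumptions and check against the paper and the DeiT repo:

```python
import torch
from torch import nn

model = nn.Linear(512, 1000)  # placeholder for the TNT model
batch_size = 1024             # assumed; DeiT scales the base lr by batch_size / 512

# AdamW with weight decay and a cosine schedule, as in the DeiT recipe.
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=5e-4 * batch_size / 512,
                              weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
```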

Sleepychord commented 2 years ago

@yt7589 Maybe sandwich-LN and PB-relax in CogView (https://arxiv.org/pdf/2105.13290) can help solve your problem.
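A minimal sketch of the sandwich-LN idea from CogView, assuming a generic transformer sublayer (attention or feed-forward). The extra LayerNorm applied to the sublayer output, before it rejoins the residual stream, is what distinguishes it from the usual pre-LN block:

```python
import torch
from torch import nn

class SandwichLNBlock(nn.Module):
    """Residual block with LayerNorm before and after the sublayer (sandwich-LN)."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.pre_norm = nn.LayerNorm(dim)
        self.post_norm = nn.LayerNorm(dim)
        self.sublayer = sublayer  # e.g. an attention or feed-forward module

    def forward(self, x):
        # Pre-LN as usual, plus a second LayerNorm on the sublayer output
        # to keep its scale bounded before it is added back to the residual.
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```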

haooooooqi commented 2 years ago

Have you observed any data points showing that sandwich-LN helps with the NaN issue? Could you kindly share your experience?