Open yt7589 opened 3 years ago
I have trained TNT on ImageNet. With the hyper-parameters from the paper (https://arxiv.org/pdf/2103.00112.pdf) and the DeiT training code (https://github.com/facebookresearch/deit), I reproduced the reported result: 81.3% top-1 accuracy for TNT-S.
Have you tried the default hyper-parameters from the TNT paper?
@yt7589 Maybe sandwich-LN and PB-relax in CogView (https://arxiv.org/pdf/2105.13290) can help solve your problem.
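If it helps, here is a minimal sketch of what Sandwich-LN looks like, as described in the CogView paper: a LayerNorm both before and after each residual sub-layer, which keeps activation scale bounded and can reduce loss spikes. `SandwichBlock` and `sublayer` are illustrative names, not from the TNT codebase:

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """Residual block with Sandwich-LN (CogView-style): LayerNorm is
    applied both before and after the sub-layer. `sublayer` can be any
    residual branch, e.g. an attention or MLP module."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.pre_ln = nn.LayerNorm(dim)
        self.post_ln = nn.LayerNorm(dim)  # the extra "sandwich" LN
        self.sublayer = sublayer

    def forward(self, x):
        # pre-LN -> sub-layer -> post-LN, then residual connection
        return x + self.post_ln(self.sublayer(self.pre_ln(x)))
```

Swapping this block in for plain pre-LN blocks is a structural change, so it may need a short re-warmup of the learning rate.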
Have you observed any evidence that sandwich-LN helps with the NaN issue? If so, could you kindly share your experience?
It is a great project, and I am very interested in the Transformer in Transformer model. I used your model to train on the Vehicle-1M dataset, which is a fine-grained visual classification dataset. When I use this model, the loss becomes NaN after some batch iterations. I decreased the learning rate of the Adam optimizer and clipped the gradients with
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
. But the loss still becomes NaN sometimes. The gradients are not large, but they point in the same direction for many iterations. How can I solve this?
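As a stopgap, one generic workaround (not from the TNT repo) is to guard each optimizer step and skip the batch entirely whenever the loss is non-finite, in addition to the gradient clipping above. `safe_step` is a made-up helper name; a sketch:

```python
import torch

def safe_step(model, loss, optimizer, max_norm=2.0):
    """One guarded training step: if the loss is NaN/Inf, discard any
    gradients and skip the update; otherwise clip gradient norm to
    `max_norm` before stepping. Returns True if an update was applied."""
    if not torch.isfinite(loss):
        optimizer.zero_grad()  # drop this batch entirely
        return False
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(),
                                   max_norm=max_norm, norm_type=2)
    optimizer.step()
    return True
```

Skipping a handful of bad batches this way usually costs little, and logging how often it triggers tells you whether the divergence is rare spikes or a systematic blow-up that needs a lower learning rate or longer warmup.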