simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

NaN error when training the TransformerXL model on the yc2 dataset #42

Open robert1015 opened 2 years ago

robert1015 commented 2 years ago

I ran into this error when training the TransformerXL model on the yc2 dataset.

Traceback (most recent call last):
  File "src/train.py", line 635, in <module>
    main()
  File "src/train.py", line 631, in main
    train(model, train_loader, val_loader, device, opt)
  File "src/train.py", line 329, in train
    model, training_data, optimizer, ema, device, opt, writer, epoch_i)
  File "src/train.py", line 130, in train_epoch
    loss.backward()
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.
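
For context, an error message of the form "Function 'X' returned nan values" is produced by PyTorch's autograd anomaly detection, which stops at the first backward op that emits NaN. A minimal sketch of how that mode is typically enabled (the `model` and `batch` names here are hypothetical placeholders, not the repo's actual training loop):

```python
import torch

# Enable anomaly detection: slows training noticeably,
# so use it only while hunting down NaNs.
torch.autograd.set_detect_anomaly(True)

loss = model(batch)  # hypothetical forward pass producing a scalar loss
loss.backward()      # raises RuntimeError at the first backward op that yields NaN
```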

I noticed that you also added debug code in model.py to check whether NaN appears in the probability tensor. Could you please share the exact cause you found for this error? Thank you very much!
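
For reference, a NaN guard on a probability tensor usually looks something like the sketch below. This is a hypothetical illustration, not the exact debug code from model.py; the helper name and the `prob` tensor are assumptions:

```python
import torch

def assert_no_nan(t: torch.Tensor, name: str = "tensor") -> None:
    # torch.isnan gives an elementwise boolean mask; .any() reduces it to one flag
    if torch.isnan(t).any():
        raise ValueError(f"NaN detected in {name}: min={t.min()}, max={t.max()}")

# Example placement inside a forward pass, right after the softmax:
# prob = torch.softmax(logits, dim=-1)
# assert_no_nan(prob, "prob")
```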

simon-ging commented 2 years ago

Hi, first of all, TransformerXL is not officially supported by this repo and has not been tested thoroughly. That being said, reasons for NaN can be: