simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

NaN error when training the TransformerXL model on the yc2 dataset #42

Open robert1015 opened 2 years ago

robert1015 commented 2 years ago

I ran into this error when training the TransformerXL model on the yc2 dataset.

Traceback (most recent call last):
  File "src/train.py", line 635, in <module>
    main()
  File "src/train.py", line 631, in main
    train(model, train_loader, val_loader, device, opt)
  File "src/train.py", line 329, in train
    model, training_data, optimizer, ema, device, opt, writer, epoch_i)
  File "src/train.py", line 130, in train_epoch
    loss.backward()
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/acb11598pe/anaconda3/envs/MART37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.
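
For context, an error message of the form "Function 'X' returned nan values" is produced by PyTorch's autograd anomaly detection, which stops at the first backward op that emits NaN. A minimal sketch of how that mode is typically enabled (the `model` and `batch` names here are hypothetical placeholders, not the repo's actual training loop):

```python
import torch

# Enable anomaly detection: slows training noticeably,
# so use it only while hunting down NaNs.
torch.autograd.set_detect_anomaly(True)

loss = model(batch)  # hypothetical forward pass producing a scalar loss
loss.backward()      # raises RuntimeError at the first backward op that yields NaN
```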

I noticed that you also added debug code in model.py to check whether NaN appears in the probability tensor. Could you please share the exact cause you found for this error? Thank you very much!
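
For reference, a NaN guard on a probability tensor usually looks something like the sketch below. This is a hypothetical illustration, not the exact debug code from model.py; the helper name and the `prob` tensor are assumptions:

```python
import torch

def assert_no_nan(t: torch.Tensor, name: str = "tensor") -> None:
    # torch.isnan gives an elementwise boolean mask; .any() reduces it to one flag
    if torch.isnan(t).any():
        raise ValueError(f"NaN detected in {name}: min={t.min()}, max={t.max()}")

# Example placement inside a forward pass, right after the softmax:
# prob = torch.softmax(logits, dim=-1)
# assert_no_nan(prob, "prob")
```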

simon-ging commented 2 years ago

Hi, first of all, TransformerXL is not officially supported by this repo and has not been tested thoroughly. That being said, reasons for NaN can be: