lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Decoder randomly outputs NaN tensor. #53

Closed y-rokutan closed 3 years ago

y-rokutan commented 3 years ago

Hi,

I just noticed some misbehavior in the decoder: it seems to output a NaN tensor at random.

Any ideas why this happens?

lucidrains commented 3 years ago

@y-rokutan Hi Yuri! I think there may be something wrong with your training script

If you run the enwik8 example https://github.com/lucidrains/performer-pytorch/tree/main/examples/enwik8_simple, it uses the decoder, and I have never personally run into NaN problems with it.
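
If it helps, here is a minimal sketch for exercising the decoder in isolation and checking its output for NaNs (the hyperparameters below are placeholders, not the enwik8 settings):

```python
import torch
from performer_pytorch import PerformerLM

# causal = True gives the autoregressive (decoder-style) model
model = PerformerLM(
    num_tokens = 256,
    max_seq_len = 1024,
    dim = 512,
    depth = 6,
    heads = 8,
    causal = True
)

x = torch.randint(0, 256, (1, 1024))
logits = model(x)                 # (1, 1024, 256)
print(torch.isnan(logits).any())  # should print tensor(False)
```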

y-rokutan commented 3 years ago

Hi @lucidrains, thanks for your quick reply. I've confirmed the enwik8 example runs perfectly, so I assume the problem comes from my changes. Let me clarify my situation and the problem.

I'll try to modify enwik8_simple for multiple GPUs and check whether the misbehavior reproduces. Please tell me if you have any advice or suggestions.
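
Roughly what I have in mind for the multi-GPU variant is just wrapping the model in `nn.DataParallel`; a sketch of the idea (hyperparameters, batch size, and the loss here are placeholders, not my actual script):

```python
import torch
from torch import nn
from performer_pytorch import PerformerLM

model = PerformerLM(num_tokens = 256, max_seq_len = 1024, dim = 512,
                    depth = 6, heads = 8, causal = True)

# replicate the model across all visible GPUs; each incoming batch is split along dim 0
model = nn.DataParallel(model).cuda()

optim = torch.optim.Adam(model.parameters(), lr = 1e-4)

x = torch.randint(0, 256, (8, 1024)).cuda()
logits = model(x)                                   # (8, 1024, 256)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 256), x[:, 1:].reshape(-1))
loss.backward()
optim.step()
```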

y-rokutan commented 3 years ago

I've tested some possible causes of this error, and I suspect a CUDA driver problem because of the following results:

  1. Assigning a single GPU to the model, the code runs without error.
  2. Assigning two GPUs with an NVLink bridge, it throws `CUDA Runtime Error: illegal memory access`.
  3. Assigning two GPUs without an NVLink bridge, it randomly outputs a NaN tensor. (2. and 3. run the same code.)

I'm going to try other GPUs (perhaps GCP V100s) to check whether this issue comes from CUDA/the GPU or not.
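
In the meantime, to narrow down where the NaNs first show up, I'm attaching a forward hook to every submodule, roughly like this (generic PyTorch; `model` stands for the Performer instance under test, not a variable from the repo):

```python
import torch

def nan_hook(module, inputs, output):
    # flag the first submodule whose forward output contains a NaN
    outs = output if isinstance(output, (tuple, list)) else (output,)
    for o in outs:
        if torch.is_tensor(o) and torch.isnan(o).any():
            raise RuntimeError(f"NaN in output of {module.__class__.__name__}")

for submodule in model.modules():
    submodule.register_forward_hook(nan_hook)

# torch.autograd.set_detect_anomaly(True) can likewise surface NaNs produced in the backward pass
```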

y-rokutan commented 3 years ago

I've tested the performer-encdec model on two V100 GPUs and encountered no error. This issue clearly appears to come from a GPU/driver bug. I think the RTX 3090 is not the best option for deep learning at this time.

Anyway, thanks for your help @lucidrains.

y-rokutan commented 3 years ago

For readers who encounter similar issues: I found that exporting CUDA_LAUNCH_BLOCKING=1 helps when using multiple GPUs (or NVLink). The problem probably comes from GPU synchronization.
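
Concretely, something along these lines at the very top of the training script (or `export CUDA_LAUNCH_BLOCKING=1` in the shell) before any CUDA work happens:

```python
import os

# must be set before the first CUDA call so kernel launches become synchronous
# and errors surface at the failing call instead of a later, unrelated one
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var is set
```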