Closed y-rokutan closed 3 years ago
@y-rokutan Hi Yuri! I think there may be something wrong with your training script
If you run the enwik8 example (https://github.com/lucidrains/performer-pytorch/tree/main/examples/enwik8_simple), it uses the decoder, and I have never personally run into NaN problems with it.
Hi @lucidrains, Thx for your quick reply. I've also checked that the enwik8 example runs perfectly, so I assume this problem comes from my change. Let me clarify my situation and the problem.
The line

```python
logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :]
```

in `autoregressive_wrapper.py` randomly returns a NaN tensor.
As a workaround, I retry the forward pass until it returns a non-NaN tensor:

```python
while True:
    logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    if torch.isnan(logits).all().item():  # got a NaN tensor, wait and retry
        time.sleep(0.1)  # requires `import time` at the top of the file
    else:
        break
```
Interestingly, adding an extra dummy forward pass

```python
self.net(x)
```

inside the while loop drastically reduces the loop count (why...?). I'll try to modify the enwik8-simple example for multiple GPUs and check whether this misbehavior reproduces. Please tell me if you have any advice or suggestions.
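To narrow down which layer first produces NaNs, one option is to register forward hooks on every submodule. This is a hypothetical debugging helper I'm sketching here, not part of performer-pytorch; `add_nan_hooks` and the NaN-corrupted `nn.Linear` below are just for illustration:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    """Raise as soon as any submodule's output contains NaN (debugging aid)."""
    def check(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            raise RuntimeError(f"NaN output from {module.__class__.__name__}")
    for m in model.modules():
        m.register_forward_hook(check)

# Example: a linear layer whose weights were corrupted to NaN
lin = nn.Linear(2, 2)
with torch.no_grad():
    lin.weight.fill_(float("nan"))
add_nan_hooks(lin)
```

Calling `lin(torch.ones(1, 2))` now raises immediately and names the offending module, instead of failing much later during sampling.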
I've tested some possible reasons why this error happens, and I suspect a CUDA driver problem because of the following results:
I'm going to try other GPUs (perhaps GCP V100s) to check if this issue comes from CUDA/GPU or not.
I've tested the performer-encdec model on two V100 GPUs and encountered no error. This issue clearly comes from a GPU / driver bug. I think the RTX 3090 is not the best option for deep learning at this time.
Anyway, thx for your help @lucidrains.
For readers who encounter similar issues: I found that exporting CUDA_LAUNCH_BLOCKING=1 helps when using multiple GPUs (or NVLink). This probably comes from forcing synchronization between the GPUs.
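For reference, the variable just needs to be set in the shell before launching training (the `python train.py` line is a placeholder for your own script):

```shell
# Force synchronous CUDA kernel launches so errors surface at the real call site
export CUDA_LAUNCH_BLOCKING=1
# then launch training as usual, e.g.:
# python train.py
```

Note that synchronous launches slow training down, so this is best used for debugging rather than as a permanent setting.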
Hi,
I just noticed misbehavior of the decoder: it seems to output a NaN tensor randomly.

Problem
AutoregressiveWrapper.generate randomly outputs a NaN tensor and fails with

```
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
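That RuntimeError comes from `torch.multinomial`, which sampling-based generation calls on the softmaxed logits; if the logits are NaN, the probabilities are NaN too. A minimal sketch reproducing the failure on CPU, with no Performer involved:

```python
import torch

# Softmax over a row containing NaN stays NaN, so sampling must fail
probs = torch.tensor([[float("nan"), 0.5]]).softmax(dim=-1)

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(type(e).__name__)  # RuntimeError
```

This confirms the crash is a downstream symptom: the real question is why the decoder's logits become NaN in the first place.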
How to reproduce the bug
Set the decoder to a cuda device:

```python
dec = PerformerLM(**dec_kwargs).to('cuda:1')
```

and repeat decoding inside the AutoregressiveWrapper.

Any ideas why this happens?