Closed y-rokutan closed 3 years ago
@y-rokutan Hi Yuri! I think there may be something wrong with your training script
If you run the enwik8 example (https://github.com/lucidrains/performer-pytorch/tree/main/examples/enwik8_simple), it uses the decoder, and I have never personally run into NaN problems with it.
Hi @lucidrains, Thx for your quick reply. I've also checked that the enwik8 example runs perfectly, so I assume this problem comes from my change. Let me clarify my situation and the problem.
The line

```python
logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :]
```

in `autoregressive_wrapper.py` randomly returns a NaN tensor.
As a workaround, I retry the forward pass until it returns a non-NaN tensor:

```python
while True:
    logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    if torch.isnan(logits).all().item():  # got a NaN tensor, wait and retry
        time.sleep(0.1)  # requires `import time` at the top of the file
    else:
        break
```
Interestingly, adding an extra dummy forward pass

```python
self.net(x)
```

inside the while loop drastically reduces the loop count (why...?). I'll try to modify the enwik8-simple example for multiple GPUs and check whether this misbehavior reproduces. Please tell me if you have any advice or suggestions.
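To narrow down which layer first produces NaNs, one option is to register forward hooks on every submodule. This is a hypothetical debugging helper I'm sketching here, not part of performer-pytorch; `add_nan_hooks` and the NaN-corrupted `nn.Linear` below are just for illustration:

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    """Raise as soon as any submodule's output contains NaN (debugging aid)."""
    def check(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            raise RuntimeError(f"NaN output from {module.__class__.__name__}")
    for m in model.modules():
        m.register_forward_hook(check)

# Example: a linear layer whose weights were corrupted to NaN
lin = nn.Linear(2, 2)
with torch.no_grad():
    lin.weight.fill_(float("nan"))
add_nan_hooks(lin)
```

Calling `lin(torch.ones(1, 2))` now raises immediately and names the offending module, instead of failing much later during sampling.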
I've tested some possible reasons why this error happens, and I suspect a CUDA driver problem because of the following results:
I'm going to try other GPUs (perhaps GCP V100s) to check if this issue comes from CUDA/GPU or not.
I've tested the performer-encdec model on two V100 GPUs and encountered no error. This issue clearly comes from a GPU / driver bug. I think the RTX 3090 is not the best option for deep learning at this time.
Anyway, thx for your help @lucidrains.
For readers who encounter similar issues: I found that exporting CUDA_LAUNCH_BLOCKING=1 helps when using multiple GPUs (or NVLink). This probably comes from forcing synchronization between the GPUs.
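For reference, the variable just needs to be set in the shell before launching training (the `python train.py` line is a placeholder for your own script):

```shell
# Force synchronous CUDA kernel launches so errors surface at the real call site
export CUDA_LAUNCH_BLOCKING=1
# then launch training as usual, e.g.:
# python train.py
```

Note that synchronous launches slow training down, so this is best used for debugging rather than as a permanent setting.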
Hi,
I just noticed misbehavior of the decoder: it seems to output a NaN tensor randomly.

Problem
AutoregressiveWrapper.generate randomly outputs a NaN tensor and fails with

```
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
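That RuntimeError comes from `torch.multinomial`, which sampling-based generation calls on the softmaxed logits; if the logits are NaN, the probabilities are NaN too. A minimal sketch reproducing the failure on CPU, with no Performer involved:

```python
import torch

# Softmax over a row containing NaN stays NaN, so sampling must fail
probs = torch.tensor([[float("nan"), 0.5]]).softmax(dim=-1)

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(type(e).__name__)  # RuntimeError
```

This confirms the crash is a downstream symptom: the real question is why the decoder's logits become NaN in the first place.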
How to reproduce the bug
Set the decoder to a cuda device:

```python
dec = PerformerLM(**dec_kwargs).to('cuda:1')
```

and repeat decoding inside the AutoregressiveWrapper.

Any ideas why this happens?