kenziyuliu / MS-G3D

[CVPR 2020 Oral] PyTorch implementation of "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition"
https://arxiv.org/abs/2003.14111
MIT License
430 stars · 96 forks

Function 'CudnnBatchNormBackward' returned nan values in its 0th output #40

Closed saniazahan closed 3 years ago

saniazahan commented 3 years ago

Hi, thank you so much for sharing your work. I am trying to reproduce the results. I am using the NTU X-Sub dataset you provided, with half-precision AMP level 1. But at epoch 33 I got a NaN out of a batchnorm layer. It originated from the `out = tempconv(x)` call in the ms_tcn.py file. I had autograd anomaly detection on. All the config settings are the same as in your repo. Could you please suggest why this happened?


saniazahan commented 3 years ago

Update: I removed autograd anomaly detection, and it's training smoothly for now. Maybe training in half precision was the culprit: anomaly detection flags the gradient overflow as an error before GradScaler comes into action.
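For context, a minimal sketch of the interaction described above (the model, names, and hyperparameters here are illustrative, not the repo's actual training code): `GradScaler` deliberately lets fp16 gradients overflow to inf/NaN and then skips that optimizer step, but `torch.autograd.set_detect_anomaly(True)` raises on the very same NaNs during backward, before `GradScaler` can absorb them.

```python
import torch
import torch.nn as nn

# Illustrative toy model, not the MS-G3D network.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = nn.Sequential(
    nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 4)
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# GradScaler skips the optimizer step when it finds inf/NaN gradients,
# so transient fp16 overflows are expected and harmless during AMP training.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(32, 8, device=device)
y = torch.randint(0, 4, (32,), device=device)

# Note: torch.autograd.set_detect_anomaly(True) would raise an error on the
# very overflows GradScaler is designed to absorb -- leave it off for AMP runs.
with torch.autocast(device_type=device, enabled=use_cuda):
    loss = nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)  # skipped internally if grads contain inf/NaN
scaler.update()
```

On a CPU-only machine the scaler is disabled and this degrades to a plain fp32 training step, which is why the NaN never appears without AMP.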

MagicFrogSJTU commented 2 years ago

> Update: I removed autograd anomaly detection, and it's training smoothly for now. Maybe training in half precision was the culprit: anomaly detection flags the gradient overflow as an error before GradScaler comes into action.

Dude, you really saved my life.

Youth-yang commented 1 year ago

Thanks