raoyongming / PointGLR

[CVPR 2020] Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds
MIT License

RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp:663 #5

QiangZiBro opened this issue 4 years ago (status: Open)

QiangZiBro commented 4 years ago

PyTorch version: 0.4.1

raoyongming commented 4 years ago

Thanks for your interest in our work.

Could you provide more details about the error so that I can help you? We use PyTorch 0.4.1 on our server and didn't run into this issue.

Hlxwk commented 3 years ago

> PyTorch version: 0.4.1

Set torch.backends.cudnn.benchmark = False.
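
For reference, a minimal sketch of where such a flag would typically be set, before any model or data is moved to the GPU (its exact placement in this repo's train.py is an assumption, not the author's code):

```python
import torch

# Disable cuDNN autotuning. With benchmark=True, cuDNN picks kernels per input
# shape, and some cuDNN/driver combinations raise "invalid argument" on older
# PyTorch (0.4.x) builds; benchmark=False falls back to the default heuristics.
torch.backends.cudnn.benchmark = False
```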

Hlxwk commented 3 years ago

> Thanks for your interest in our work.
>
> Could you provide more details about the error so that I can help you? We use PyTorch 0.4.1 on our server and didn't run into this issue.

The loss becomes NaN when training... can you help me?

curiosity654 commented 3 years ago

> Thanks for your interest in our work. Could you provide more details about the error so that I can help you? We use PyTorch 0.4.1 on our server and didn't run into this issue.
>
> The loss becomes NaN when training... can you help me?

I have the same issue; the loss becomes NaN during the first epoch.

XGQnudt commented 3 years ago

"Input contains NaN, infinity, or a value too large for dtype ('float64')" In the first epoch, loss becomes NaN. The error message is as above. Can you help me, thanks

raoyongming commented 2 years ago

Sorry for the late response.

I didn't see the same issue during my experiments. Could you provide more information, such as your environment and a more detailed error message?

My environment is PyTorch 0.4.1, a 1080Ti GPU, Python 3.7.4, and CUDA 9.2.148. The error may also come from an incorrectly installed PyTorch extension. I think you can try to train a supervised RSCNN based on their code (https://github.com/Yochengliu/Relation-Shape-CNN) to check whether the environment is correctly configured, since PointGLR and RSCNN use a similar environment.

XGQnudt commented 2 years ago

> Sorry for the late response.
>
> I didn't see the same issue during my experiments. Could you provide more information, such as your environment and a more detailed error message?
>
> My environment is PyTorch 0.4.1, a 1080Ti GPU, Python 3.7.4, and CUDA 9.2.148. The error may also come from an incorrectly installed PyTorch extension. I think you can try to train a supervised RSCNN based on their code (https://github.com/Yochengliu/Relation-Shape-CNN) to check whether the environment is correctly configured, since PointGLR and RSCNN use a similar environment.

Thank you for your help! My environment is PyTorch 0.4.1, a 2080Ti GPU, Python 3.6.13, and CUDA 11.4. I've been running RSCNN in this environment until now because my other work is based on it.

The full log is as follows:

[epoch 0: 0/615] metric/chamfer/normal loss: 6.763555/0.549132/1.000000 lr: 0.00143
[epoch 0: 20/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 40/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 60/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 80/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 100/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 120/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 140/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 160/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 180/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 200/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 220/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 240/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 260/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 280/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 300/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 320/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 340/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 360/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 380/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 400/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 420/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 440/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 460/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 480/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 500/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 520/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 540/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 560/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 580/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 600/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
Traceback (most recent call last):
  File "/home/xgq/文档/PointGLR-master8.13/train.py", line 322, in <module>
    main()
  File "/home/xgq/文档/PointGLR-master8.13/train.py", line 169, in main
    train(ss_dataloader, train_dataloader, test_dataloader, encoder, decoer, optimizer, lr_scheduler, bnm_scheduler, args, num_batch, begin_epoch)
  File "/home/xgq/文档/PointGLR-master8.13/train.py", line 230, in train
    svm_acc40 = validate(train_dataloader, test_dataloader, encoder, args)
  File "/home/xgq/文档/PointGLR-master8.13/train.py", line 309, in validate
    svm_acc = evaluate_svm(train_features.data.cpu().numpy(), train_label.data.cpu().numpy(), test_features.data.cpu().numpy(), test_label.data.cpu().numpy())
  File "/home/xgq/文档/PointGLR-master8.13/train.py", line 247, in evaluate_svm
    clf.fit(train_features, train_labels)
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/svm/_classes.py", line 230, in fit
    accept_large_sparse=False)
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/base.py", line 433, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/utils/validation.py", line 878, in check_X_y
    estimator=estimator)
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/utils/validation.py", line 721, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/home/xgq/.conda/envs/pytorch0.4/lib/python3.6/site-packages/sklearn/utils/validation.py", line 106, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
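
A quick way to localize where the NaN first appears is to assert on each loss term right after it is computed, before it reaches the SVM evaluation. This is only a generic sketch, not the repo's actual train.py; the variable names in the comments are assumptions:

```python
import math

def check_finite(name, scalar_loss):
    """Stop early with a clear message the first time a loss goes non-finite."""
    v = float(scalar_loss)  # works for a Python float or a 0-dim tensor
    if not math.isfinite(v):
        raise RuntimeError('{} became non-finite ({}), stopping early'.format(name, v))

# Inside the training loop (illustrative names):
# check_finite('metric_loss', metric_loss)
# check_finite('chamfer_loss', chamfer_loss)
# check_finite('normal_loss', normal_loss)
```

Running a check like this before the backward pass shows which of the metric, Chamfer, or normal losses goes non-finite first, instead of only seeing the NaN later in sklearn's input validation.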

curiosity654 commented 2 years ago

I also ran into the NaN issue with this repo months ago. I simply switched to another machine with a different GPU (in my case, a TITAN Xp failed and a TITAN V worked) and everything ran fine, but I'm still not sure whether the hardware caused this issue.

XGQnudt commented 2 years ago

> I also ran into the NaN issue with this repo months ago. I simply switched to another machine with a different GPU (in my case, a TITAN Xp failed and a TITAN V worked) and everything ran fine, but I'm still not sure whether the hardware caused this issue.

I just tried increasing the batch size, and it seems to help: with batch size 22 only the first line is normal, and the larger the batch size, the more lines stay normal. But my GPU can only fit a batch size of 64, and training still does not run normally.

raoyongming commented 2 years ago

If the error is related to the batch size, the NaN may come from the contrastive learning loss, which is usually less stable than a supervised loss. I think you can try a smaller learning rate, replace the Normalize method in this line with torch.nn.functional.normalize(input, p=2, dim=1, eps=1e-12) using a larger eps, or use a smaller s=64 to avoid float overflow. Since everything works well in my environment, I am not sure whether these tricks will help you.
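
For concreteness, a rough sketch of the kind of change suggested above; the function names and values below are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

def l2_normalize(feat, eps=1e-4):
    # Same call as the suggested torch.nn.functional.normalize, but with a
    # larger eps than the default 1e-12, so a near-zero norm cannot blow up
    # the division and produce inf/NaN.
    return F.normalize(feat, p=2, dim=1, eps=eps)

# Illustrative scale factor for the metric-learning logits ("s" in the thread);
# lowering it shrinks the logits and reduces the risk of float overflow.
S = 32.0

feat = torch.randn(8, 128)       # dummy embedding batch
feat = l2_normalize(feat)
logits = S * feat @ feat.t()     # toy cosine-similarity logits
```

Lowering the learning rate in the training arguments would be tried in the same spirit: all three changes shrink the magnitudes flowing into the contrastive loss.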

XGQnudt commented 2 years ago

> If the error is related to the batch size, the NaN may come from the contrastive learning loss, which is usually less stable than a supervised loss. I think you can try a smaller learning rate, replace the Normalize method in this line with torch.nn.functional.normalize(input, p=2, dim=1, eps=1e-12) using a larger eps, or use a smaller s=64 to avoid float overflow. Since everything works well in my environment, I am not sure whether these tricks will help you.

I tried these methods, but they didn't work. Maybe I should try another device.