lizhaoliu-Lec / CPCM

This is the official repo for Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation (ICCV 23).
MIT License

merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered terminate called after throwing an instance of 'c10::Error' #1

Closed Chongjie-Si closed 1 year ago

Chongjie-Si commented 1 year ago

Thanks for your code! I have encountered an error when training:

Traceback (most recent call last):
  File "ddp_train.py", line 116, in <module>
    main()
  File "ddp_train.py", line 107, in main
    trainer.train()
  File "/home/Point_Cloud/CPCM/trainer/base.py", line 190, in train
    self.train_one_epoch()
  File "/home/Point_Cloud/CPCM/trainer/fully_supervised_trainer.py", line 332, in train_one_epoch
    step_ret = self.step(batch)
  File "/home/Point_Cloud/CPCM/trainer/fully_supervised_trainer.py", line 1232, in step
    return self._step_two_and_mask_stream(batch=batch)
  File "/home/Point_Cloud/CPCM/trainer/fully_supervised_trainer.py", line 1199, in _step_two_and_mask_stream
    loss.backward()
  File "/home/.conda/envs/seg/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/.conda/envs/seg/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'

How can I fix this? I tried the suggestion in https://github.com/taesungp/contrastive-unpaired-translation/issues/83, but it did not help.
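A general first step for errors of this kind, independent of this repository: CUDA kernels are launched asynchronously, so the Python traceback can point at a later call (here loss.backward()) rather than at the operation that actually faulted. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 forces synchronous launches, which usually makes the traceback more precise. The sketch below is only an illustration; the helper names build_debug_env and run_with_blocking_launches are hypothetical, and the training arguments are left to the caller.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until it finishes,
    # so an illegal memory access is reported closer to the op that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(argv):
    """Run a command (for example the training script) under the debug environment."""
    completed = subprocess.run(argv, env=build_debug_env(), check=False)
    return completed.returncode
```

For example, run_with_blocking_launches([sys.executable, "ddp_train.py"]) followed by your usual arguments would rerun training with blocking launches. Training will be slower in this mode, so it is meant for diagnosis only.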

xiaoxunlong commented 1 year ago

May I know the details of your hardware, such as the processor, RAM, storage, and graphics card? Could you also share your conda environment and gcc version?

Chongjie-Si commented 1 year ago

CUDA 11.1, torch 1.10.1, RTX 3090 24GB, python 3.8.16, gcc 9.4.0
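The details listed above can also be gathered programmatically, which makes reports easier to compare. The following is a minimal sketch, assuming gcc is looked up on PATH and that torch may or may not be importable; the function names gcc_version and collect_env are hypothetical. PyTorch also ships its own reporter, python -m torch.utils.collect_env, which prints a more complete summary.

```python
import platform
import shutil
import subprocess


def gcc_version():
    """Return the first line of `gcc --version`, or None if gcc is not on PATH."""
    if shutil.which("gcc") is None:
        return None
    try:
        result = subprocess.run(
            ["gcc", "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def collect_env():
    """Collect Python, gcc, torch, CUDA build and GPU name into a dict."""
    info = {"python": platform.python_version(), "gcc": gcc_version()}
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter; report that and stop.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version that this torch build was compiled against (None for CPU builds).
    info["torch_cuda"] = torch.version.cuda
    info["gpu"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
    return info
```

Printing collect_env() inside the activated conda environment gives a one-line summary that can be pasted into an issue.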

xiaoxunlong commented 1 year ago

I followed the me054 instructions to set up the environment and downloaded the preprocessed S3DIS dataset provided by the authors in README.md. I then ran the experiment with the command given in README.md, and everything went smoothly. Have you followed all the instructions provided by the authors? My system environment is CUDA 11.7, 3090 24GB, gcc 11.3.0.

Chongjie-Si commented 1 year ago

Thank you for your comments. I think something was wrong with my environment. I installed torch 1.9.0 and everything works fine for me now.

Stay-Naive commented 10 months ago

Hi, I ran into the same problem. Did you only change the torch version to make it work? I changed my torch version to 1.9.0 and kept the CUDA and GCC versions the same as yours, but my graphics card is an RTX 2080Ti 12GB. I tried torch 1.9.0, 1.9.1, 1.10.0 and 1.10.1, and none of them worked for me. Do you have any idea what causes this bug?

xiaoxunlong commented 10 months ago

It is very likely a problem with the PyTorch environment. Consider following the me054 instructions step by step to set up the environment.

Stay-Naive commented 10 months ago

Yes, I followed the instructions you pointed to, except for two steps. In my case, python setup.py install --blas=openblas --force_cuda raises the error "cblas.h: No such file or directory", and I could not find a solution. So I used the official Minkowski Engine command instead: python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas. I suspect this is because I do not have sudo access, so I installed openblas through conda. Could that cause the final error? Do you have any insight into this?
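Before rebuilding, it can help to confirm that the header the compiler complains about is actually present where --blas_include_dirs points. The following is a minimal sketch, assuming openblas was installed with conda and that its headers land in ${CONDA_PREFIX}/include; the function name find_cblas_header is hypothetical.

```python
import os
from pathlib import Path


def find_cblas_header(prefix=None):
    """Return the path to cblas.h under <prefix>/include, or None if it is absent.

    When no prefix is given, the CONDA_PREFIX environment variable is used,
    which conda sets for the currently activated environment.
    """
    prefix = prefix if prefix is not None else os.environ.get("CONDA_PREFIX")
    if not prefix:
        return None
    header = Path(prefix) / "include" / "cblas.h"
    return header if header.is_file() else None
```

If find_cblas_header() returns None inside the activated environment, the "cblas.h: No such file or directory" error is expected, and the openblas package (or its include directory) is the first thing to check.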

Stay-Naive commented 10 months ago

By the way, can you describe how large the runtime difference (training and evaluation) between me054 and me043 is? If the difference is acceptable, I will consider switching versions.

xiaoxunlong commented 10 months ago

  1. If you installed Minkowski Engine successfully, the conda version of openblas is probably not the cause of the error. The error seems more likely to come from the CUDA version.
  2. me043 is approximately 2 to 3 times slower than me054.

Stay-Naive commented 10 months ago

Thanks, I will try again.

Stay-Naive commented 10 months ago

Finally, I changed CUDA from 11.1 to 11.3 and torch from 1.9.1 to 1.10.1, and the problem is solved.

Thanks for your great work!
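Both fixes reported in this thread came down to changing the torch and CUDA combination, so a quick consistency check between the local CUDA toolkit and the CUDA version torch was built with may save time. The sketch below is an illustration under stated assumptions: it parses the "release X.Y" text that nvcc --version prints and compares it with a string in the form of torch.version.cuda. The function names are hypothetical, and treating an exact major.minor match as the safe case is an assumption of this sketch, not a documented requirement.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_cuda_version(version):
    """Turn a string such as '11.3' (the form of torch.version.cuda) into (11, 3)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(nvcc_output, torch_cuda):
    """Return True/False for a match, or None when either version is unknown."""
    toolkit = parse_nvcc_release(nvcc_output)
    if toolkit is None or not torch_cuda:
        return None
    return toolkit == parse_cuda_version(torch_cuda)
```

In practice, nvcc_output would be the captured text of nvcc --version and torch_cuda would be torch.version.cuda, both read inside the same conda environment used to build Minkowski Engine.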