Closed Chongjie-Si closed 1 year ago
Could you share the details of your hardware, such as the processor, RAM, storage, and graphics card? Could you also provide your conda environment and gcc version?
CUDA 11.1, torch 1.10.1, RTX 3090 24GB, python 3.8.16, gcc 9.4.0
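For anyone else reporting their setup in this thread, the details above can be gathered with a small script. This is only a sketch: the function name collect_env is made up for illustration, and it assumes gcc is on PATH and that torch may or may not be installed. PyTorch also ships its own report via python -m torch.utils.collect_env, which is more complete.

```python
import platform
import shutil
import subprocess


def collect_env():
    """Gather Python, gcc, torch, CUDA, and GPU details into a dict."""
    info = {"python": platform.python_version()}

    # gcc may not be on PATH; record None rather than failing.
    gcc = shutil.which("gcc")
    if gcc is None:
        info["gcc"] = None
    else:
        out = subprocess.run([gcc, "--version"], capture_output=True, text=True)
        lines = out.stdout.splitlines()
        info["gcc"] = lines[0] if lines else None

    try:
        import torch
    except ImportError:
        # torch is not installed in this environment.
        info["torch"] = None
        info["cuda"] = None
        info["gpu"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version torch was built against
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    else:
        info["gpu"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Note that torch.version.cuda is the CUDA version the torch build targets, which can differ from the toolkit that nvcc reports on the system.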
I followed the me054 instructions to set up the environment, downloaded the preprocessed S3DIS dataset provided by the authors in README.md, and ran the experiment with the command given there. Everything went smoothly. Have you followed all the instructions provided by the authors? My system environment is CUDA 11.7, 3090 24GB, gcc 11.3.0.
Thank you for your comments. I think something was wrong with my environment. I installed torch 1.9.0 and everything works fine for me now.
Hi, I ran into the same problem as you. Did it start working after you changed only the torch version? I changed my torch version to 1.9.0 and kept the same CUDA and GCC versions as yours, but my graphics card is an RTX 2080Ti 12GB. I tried torch 1.9.0, 1.9.1, 1.10.0 and 1.10.1, and none of them worked for me. Do you have any idea what causes this bug?
It is very likely a problem with the PyTorch environment. Consider following the me054 instructions step by step to set up the environment.
Yes, I followed the instructions you provided, except for two steps.
In my setup, the command
python setup.py install --blas=openblas --force_cuda
raises the error "cblas.h: No such file or directory", and I did not find a solution. So I used the official Minkowski Engine command instead:
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas
I guess that is because I do not have sudo access, so I installed openblas through conda.
Could this cause the final error? Do you have any insight into this?
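A quick way to check the "cblas.h: No such file or directory" part is to look for the header where the --blas_include_dirs flag points. The sketch below assumes openblas was installed through conda and that its headers would then sit under ${CONDA_PREFIX}/include; the helper name find_cblas_header is hypothetical.

```python
import os


def find_cblas_header(prefix=None):
    """Return the path to cblas.h under <prefix>/include, or None if absent."""
    if prefix is None:
        # Assumption: an active conda environment sets CONDA_PREFIX.
        prefix = os.environ.get("CONDA_PREFIX")
    if not prefix:
        return None
    candidate = os.path.join(prefix, "include", "cblas.h")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    header = find_cblas_header()
    if header:
        print(f"found {header}")
    else:
        print("cblas.h not found under $CONDA_PREFIX/include")
```

If the header is found there, passing that include directory to the build as in the command above is consistent with what the compiler needs. This only tells you where the header is, not whether the resulting build is behind the runtime error.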
By the way, can you describe exactly how large the runtime difference (training and evaluation) between me054 and me043 is? If the difference is acceptable to me, I will consider changing the version.
Thanks, I will try it again.
Finally, I changed CUDA from 11.1 to 11.3 and torch from 1.9.1 to 1.10.1. The problem has been solved.
Thanks for your great work!
Thanks for your code! I have encountered an error when training:
Traceback (most recent call last):
  File "ddp_train.py", line 116, in <module>
    main()
  File "ddp_train.py", line 107, in main
    trainer.train()
  File "/home/Point_Cloud/CPCM/trainer/base.py", line 190, in train
    self.train_one_epoch()
  File "/home/Point_Cloud/CPCM/trainer/fully_supervised_trainer.py", line 332, in train_one_epoch
    step_ret = self.step(batch)
  File "/home/Point_Cloud/CPCM/trainer/fully_supervised_trainer.py", line 1232, in step
    return self._step_two_and_mask_stream(batch=batch)
  File "/home/Point_Cloud/CPCM/trainer/fully_supervised_trainer.py", line 1199, in _step_two_and_mask_stream
    loss.backward()
  File "/home/.conda/envs/seg/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/.conda/envs/seg/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
How can I fix this? I tried the suggestion in https://github.com/taesungp/contrastive-unpaired-translation/issues/83 but it did not work.
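Because CUDA kernels run asynchronously, the line shown in a traceback like the one above is not necessarily where the illegal access happened. A common first debugging step is to set CUDA_LAUNCH_BLOCKING=1 so that errors are reported at the call that triggered them. The sketch below is not a fix, only a way to narrow down the failing operation; the helper name enable_sync_cuda_errors is made up, and placing the call at the top of ddp_train.py is an assumption about how the script is launched.

```python
import os


def enable_sync_cuda_errors():
    """Ask CUDA to launch kernels synchronously so errors surface at the failing call."""
    # Has to be set before the CUDA context is created,
    # so call this before torch does any CUDA work.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_sync_cuda_errors()
# Import torch and start training only after this point.
```

The same effect can be had by setting the variable in the shell before launching the training command. Training will be slower with this enabled, so it is best used only while debugging.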