microsoft / O-CNN

O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis
MIT License

Met cublas error when running test_all.py after a pytorch build #153

Closed sylyt62 closed 2 years ago

sylyt62 commented 2 years ago

Hi author,

First, thank you for sharing such a nice project. I thought the build was successful, but I still hit a cuBLAS error when running the tests after building the octree extension with PyTorch. I searched around but found no clues about this cuBLAS error in THGpuGemm, so I'm asking for help here.

My environment is:

Ubuntu 18, Python 3.7.13, CUDA 11.2, cuDNN 8.1.1

torch 1.9.1+cu111, torchvision 0.10.1+cu111

Test log:

(ocnn-py37-env) yangtian@yangtian-SS: O-CNN-master/pytorch$ python -W ignore test/test_all.py -v
test_backward (test_octree2col.Octree2ColTest) ... ok
test_forward (test_octree2col.Octree2ColTest) ... ok
test_forwardP1 (test_octree2col.Octree2ColTest) ... ok
test_octree2colP (test_octree2col.Octree2ColTest) ... ok
test_forward_backward1 (test_octree_align.OctreeAlignTest) ... ok
test_forward_backward2 (test_octree_align.OctreeAlignTest) ... ok
test_forward_backward3 (test_octree_align.OctreeAlignTest) ... ok
test_forward_and_backward (test_octree_conv.OctreeConvTest) ... [F octree_conv.cpp:43] Check failed: status == CUBLAS_STATUS_SUCCESS (13 vs. 0) Cublas error in THGpuGemm
Aborted (core dumped)
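For what it's worth, I tried to decode the numeric status by reading the cublasStatus_t enum in cublas_api.h; this is just my own lookup table (not authoritative, and the values should be double-checked against your header), but status 13 appears to be CUBLAS_STATUS_EXECUTION_FAILED:

import sys

# cublasStatus_t values as I read them in cublas_api.h (CUDA 11.x);
# please verify against the header shipped with your toolkit.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

# The status reported by octree_conv.cpp:43 in the log above.
status = int(sys.argv[1]) if len(sys.argv) > 1 else 13
print(status, "->", CUBLAS_STATUS.get(status, "unknown"))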

Some build log:

gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/media/yangtian/SATA3/Workspace/O-CNN-master/octree -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/pythoite-packages/torch/include/torch/csrc/api/include -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/TH -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/include -I/usr/local/pythons/Python-3.7.13/Include -I/usr/local/pythons/Python-3.7.13 -c ./cpp/transform_octree.cpp -o build/temp.linux-x86_64-3.7/./cpp/transform_octree.o -DKEY64 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=nn -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/ATen/Parallel.h:140:0,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:13,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                 from ./cpp/ocnn.h:5,
                 from ./cpp/transform_octree.cpp:3:
/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/ATen/ParallelOpenMP.h:87:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
 #pragma omp parallel for if ((end - begin) >= grain_size)

gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/media/yangtian/SATA3/Workspace/O-CNN-master/octree -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/pythoite-packages/torch/include/torch/csrc/api/include -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/TH -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/include -I/usr/local/pythons/Python-3.7.13/Include -I/usr/local/pythons/Python-3.7.13 -c ./cpp/transform_points.cpp -o build/temp.linux-x86_64-3.7/./cpp/transform_points.o -DKEY64 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=nn -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/ATen/Parallel.h:140:0,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include/torch/all.h:13,
                 from /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/torch/extension.h:4,
                 from ./cpp/ocnn.h:5,
                 from ./cpp/transform_points.cpp:5:
/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/include/ATen/ParallelOpenMP.h:87:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
 #pragma omp parallel for if ((end - begin) >= grain_size)

g++ -pthread -shared build/temp.linux-x86_64-3.7/./cpp/octree2col.o build/temp.linux-x86_64-3.7/./cpp/octree_align.o build/temp.linux-x86_64-3.7/./cpp/octree_batch.o build/temp.linux-x86_64-3.7/./cpp/octree_conv.o build/temp.linux-x86_64-3.7/./cpp/octree_grow.o build/temp.linux-x86_64-3.7/./cpp/octree_key.o build/temp.linux-x86_64-3.7/./cpp/octree_pad.o build/temp.linux-x86_64-3.7/./cpp/octree_pool.o build/temp.linux-x86_64-3.7/./cpp/octree_property.o build/temp.linux-x86_64-3.7/./cpp/octree_samples.o build/temp.linux-x86_64-3.7/./cpp/point2octree.o build/temp.linux-x86_64-3.7/./cpp/points_property.o build/temp.linux-x86_64-3.7/./cpp/pybind.o build/temp.linux-x86_64-3.7/./cpp/transform_octree.o build/temp.linux-x86_64-3.7/./cpp/transform_points.o -L/media/yangtian/SATA3/Workspace/O-CNN-master/octree/build -L/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/libr/local/cuda/lib64 -loctree_lib -lcublas -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.7/ocnn/nn.cpython-37m-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/lenet.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/mlp.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/modules.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/nn.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/octree2col.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/octree2voxel.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/octree_align.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/octree_conv.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/octree_pad.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/octree_pool.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/ounet.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/resnet.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/segnet.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/transforms.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/unet.py -> build/bdist.linux-x86_64/egg/ocnn
copying build/lib.linux-x86_64-3.7/ocnn/__init__.py -> build/bdist.linux-x86_64/egg/ocnn
byte-compiling build/bdist.linux-x86_64/egg/ocnn/lenet.py to lenet.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/mlp.py to mlp.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/modules.py to modules.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/octree2col.py to octree2col.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/octree2voxel.py to octree2voxel.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/octree_align.py to octree_align.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/octree_conv.py to octree_conv.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/octree_pad.py to octree_pad.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/octree_pool.py to octree_pool.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/ounet.py to ounet.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/resnet.py to resnet.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/segnet.py to segnet.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/transforms.py to transforms.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/unet.py to unet.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/ocnn/__init__.py to __init__.cpython-37.pyc
creating stub loader for ocnn/nn.cpython-37m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/ocnn/nn.py to nn.cpython-37.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying ocnn.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ocnn.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ocnn.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ocnn.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ocnn.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
ocnn.__pycache__.nn.cpython-37: module references __file__
creating 'dist/ocnn-1.0-py3.7-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing ocnn-1.0-py3.7-linux-x86_64.egg
creating /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/ocnn-1.0-py3.7-linux-x86_64.egg
Extracting ocnn-1.0-py3.7-linux-x86_64.egg to /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages
Adding ocnn 1.0 to easy-install.pth file

Installed /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/ocnn-1.0-py3.7-linux-x86_64.egg
Processing dependencies for ocnn==1.0
Searching for numpy==1.21.5
Best match: numpy 1.21.5
Adding numpy 1.21.5 to easy-install.pth file
Installing f2py script to /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/bin
Installing f2py3 script to /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/bin
Installing f2py3.7 script to /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/bin
Using /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages
Searching for torch==1.9.1+cu111
Best match: torch 1.9.1+cu111
Adding torch 1.9.1+cu111 to easy-install.pth file
Installing convert-caffe2-to-onnx script to /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/bin
Installing convert-onnx-to-caffe2 script to /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/bin
Using /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages
Searching for typing-extensions==4.1.1
Best match: typing-extensions 4.1.1
Adding typing-extensions 4.1.1 to easy-install.pth file
Using /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages
Finished processing dependencies for ocnn==1.0

Actually, I had tried Python 3.8 with torch 1.10+cu113 before, and I hit the same error with a slightly different status code:

test_forward_and_backward (test_octree_conv.OctreeConvTest) ... [F octree_conv.cpp:43] Check failed: status == CUBLAS_STATUS_SUCCESS (15 vs. 0) Cublas error in THGpuGemm
Aborted (core dumped)

So I downgraded Python and PyTorch, but it still doesn't work. I'm wondering whether the CUDA version is causing this error, even though the difference is only a minor version? Not sure...
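In case it helps with the version question, this is the quick check I'm using to compare the CUDA toolkit that PyTorch was built against with the nvcc that compiled the extension (just a diagnostic sketch; it assumes a CUDA-capable GPU is visible and that nvcc is on PATH):

import subprocess
import torch

# CUDA / cuDNN versions that this PyTorch wheel was built against.
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0),
      "capability:", torch.cuda.get_device_capability(0))

# The nvcc that setup.py used to compile the octree extension.
print(subprocess.run(["nvcc", "--version"],
                     capture_output=True, text=True).stdout)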

sylyt62 commented 2 years ago

I downgraded my CUDA version to 11.1, and then hit different errors while testing:

test_backward (test_octree2col.Octree2ColTest) ... /media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/autograd/gradcheck.py:633: UserWarning: Input #0 requires gradient and is not a double precision floating point or complex. This check will likely fail if all the inputs are not of double precision floating point or complex. f'Input #{idx} requires gradient and '
ok
test_forward (test_octree2col.Octree2ColTest) ... ok
test_forwardP1 (test_octree2col.Octree2ColTest) ... ok
test_octree2colP (test_octree2col.Octree2ColTest) ... ok
test_forward_backward1 (test_octree_align.OctreeAlignTest) ... ok
test_forward_backward2 (test_octree_align.OctreeAlignTest) ... ok
test_forward_backward3 (test_octree_align.OctreeAlignTest) ... ok
test_forward_and_backward (test_octree_conv.OctreeConvTest) ... ERROR
test_forward_and_backward (test_octree_deconv.OctreeDeconvTest) ... ERROR
test_decode_encode_key (test_octree_key.OctreeKeyTest) ... ERROR
test_search_key (test_octree_key.OctreeKeyTest) ... ERROR
test_xyz_key (test_octree_key.OctreeKeyTest) ... ERROR
test_xyz_key_64 (test_octree_key.OctreeKeyTest) ... ERROR
test_forward_and_backward_avg_pool (test_octree_pool.OctreePoolTest) ... ERROR
test_forward_and_backward_max_pool (test_octree_pool.OctreePoolTest) ... ERROR
test_forward_and_backward_max_unpool (test_octree_pool.OctreePoolTest) ... ERROR
test_octree_property (test_octree_property.OctreePropertyTest) ... ERROR
test_forward1 (test_octree_trilinear.OctreeTrilinearTest) ... ERROR
test_points_property (test_points_property.PointsPropertyTest) ... ok

====================================================================== ERROR: test_forward_and_backward (test_octree_conv.OctreeConvTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_conv.py", line 90, in test_forward_and_backward
    self.forward_and_backward(kernel_size[j], stride[i])
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_conv.py", line 61, in forward_and_backward
    out3.backward(pesudo_grad2)
  File "/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/media/yangtian/SATA3/PyEnvs/ubuntu/ocnn-py37-env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

====================================================================== ERROR: test_forward_and_backward (test_octree_deconv.OctreeDeconvTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_deconv.py", line 91, in test_forward_and_backward
    self.forward_and_backward(kernel_size[j], stride[i])
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_deconv.py", line 31, in forward_and_backward
    octree = octree.cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_decode_encode_key (test_octree_key.OctreeKeyTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_key.py", line 10, in test_decode_encode_key
    octree = ocnn.octree_batch(samples).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_search_key (test_octree_key.OctreeKeyTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_key.py", line 33, in test_search_key
    octree = ocnn.octree_batch(samples).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_xyz_key (test_octree_key.OctreeKeyTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_key.py", line 25, in test_xyz_key
    octree = ocnn.octree_batch(samples).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_xyz_key_64 (test_octree_key.OctreeKeyTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_key.py", line 47, in test_xyz_key_64
    xyz = torch.cuda.ShortTensor([[2049, 4095, 8011, 1], [511, 4095, 8011, 0]])
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_forward_and_backward_avg_pool (test_octree_pool.OctreePoolTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_pool.py", line 91, in test_forward_and_backward_avg_pool
    octree = octree.to('cuda')
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_forward_and_backward_max_pool (test_octree_pool.OctreePoolTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_pool.py", line 28, in test_forward_and_backward_max_pool
    octree = octree.to('cuda')
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_forward_and_backward_max_unpool (test_octree_pool.OctreePoolTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_pool.py", line 59, in test_forward_and_backward_max_unpool
    octree = octree.to('cuda')
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_octree_property (test_octree_property.OctreePropertyTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_property.py", line 60, in test_octree_property
    self.octree_property(on_cuda=True)
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_property.py", line 13, in octree_property
    octree = octree.cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

====================================================================== ERROR: test_forward1 (test_octree_trilinear.OctreeTrilinearTest)

Traceback (most recent call last):
  File "/media/yangtian/SATA3/Workspace/O-CNN-master/pytorch/test/test_octree_trilinear.py", line 11, in test_forward1
    octree = ocnn.octree_batch(ocnn.octree_samples(['octree_1', 'octree_1'])).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.


Ran 19 tests in 2.602s

FAILED (errors=11)
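Since the messages mention CUDA_LAUNCH_BLOCKING, my next step will probably be to rerun just the conv test with synchronous kernel launches so the failing op is reported at the right place. A rough sketch of what I have in mind (assuming it is run from the pytorch/test directory so the test module imports):

import os

# Must be set before torch initializes CUDA so kernels launch synchronously
# and the illegal access is attributed to the correct call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import unittest
import test_octree_conv  # the failing test module from the log above

if __name__ == "__main__":
    unittest.main(
        module=test_octree_conv,
        defaultTest="OctreeConvTest.test_forward_and_backward",
        verbosity=2,
    )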

wang-ps commented 2 years ago

Thanks for your interest in our project.

This is a known issue, and I have encountered it before as well. The error is indeed caused by the cuBLAS version; please try building the code with CUDA 10.1 or 10.2. Currently, I have no bandwidth to fix this issue. If you are interested, please feel free to help fix it.

For your reference, the code is tested with the following pytorch versions:

conda install pytorch==1.6.0 torchvision==0.7.0  cudatoolkit=10.1 -c pytorch
conda install pytorch==1.7.0 torchvision==0.8.0  cudatoolkit=10.2 -c pytorch
conda install pytorch==1.7.1 torchvision==0.8.2  cudatoolkit=10.1 -c pytorch
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=10.2 -c pytorch
conda install pytorch==1.9.1 torchvision==0.10.1 cudatoolkit=10.2 -c pytorch
docker pull pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel
docker pull pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel
docker pull pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel

And the unit tests failed with the following pytorch versions in my own experiments (a quick check against the working combinations above is sketched after this list):

conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch
docker pull pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
docker pull pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel
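If it helps, here is a rough, untested sketch of a pre-build sanity check against the combinations above (the version strings are my own simplification and may need adjusting, e.g. for the +cuXXX suffixes on pip wheels):

import torch

# (pytorch, cudatoolkit) pairs from my experiments above that worked.
KNOWN_GOOD = {
    ("1.6.0", "10.1"), ("1.7.0", "10.2"), ("1.7.1", "10.1"),
    ("1.8.1", "10.2"), ("1.9.0", "10.2"), ("1.9.1", "10.2"),
}

torch_ver = torch.__version__.split("+")[0]  # drop any +cuXXX suffix
cuda_ver = torch.version.cuda or "cpu"
pair = (torch_ver, cuda_ver)

if pair in KNOWN_GOOD:
    print(pair, "matches a tested combination; the extension should build cleanly.")
else:
    print(pair, "is untested here; the cublas check in octree_conv.cpp may fail.")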
sylyt62 commented 2 years ago

Appreciate the info :) I'll dig into it a bit when I'm free.