traveller59 / spconv

Spatial Sparse Convolution Library
Apache License 2.0

RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #629

Open WYYAHYT opened 1 year ago

WYYAHYT commented 1 year ago

I tried to run example/mnist/mnist_sparse.py, but it failed with the following error:

[Exception|implicit_gemm_pair]indices=torch.Size([4656, 3]),bs=32,ss=[28, 28],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3],stride=[1, 1],padding=[0, 0],dilation=[1, 1],subm=True,transpose=False
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
Traceback (most recent call last):
  File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/chenhai-fwxz/.vscode-server-insiders/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 235, in <module>
    main()
  File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 226, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 75, in train
    output = model(data)
  File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenhai-fwxz/zch/spconv/example/mnist/mnist_sparse.py", line 54, in forward
    x = self.net(x_sp)
  File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/modules.py", line 138, in forward
    input = module(input)
  File "/home/chenhai-fwxz/anaconda3/envs/spconv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/conv.py", line 755, in forward
    return self._conv_forward(self.training,
  File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/conv.py", line 408, in _conv_forward
    raise e
  File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/conv.py", line 385, in _conv_forward
    res = ops.get_indice_pairs_implicit_gemm(
  File "/home/chenhai-fwxz/zch/spconv/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
    SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Environment information:

spconv=2.23.6  #  built from source
cumm=0.411
cuda=11.3
torch=1.12.1
gpu=NVIDIA TITAN X (Pascal)
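A version list like the one above can be collected programmatically so that it is copied into a report without typos. This is only a minimal sketch: the distribution names `spconv`, `cumm`, and `torch` are assumptions (prebuilt wheels may be published under different names), and `collect_versions` is a hypothetical helper, not part of spconv.

```python
import platform
from importlib import metadata


def collect_versions(packages=("spconv", "cumm", "torch")):
    """Return a dict of installed versions for the given distribution names.

    The default names are assumptions; adjust them to match how the
    packages were installed in your environment.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep the entry so a missing package is visible in the report.
            report[name] = "not installed"
    return report


for key, value in collect_versions().items():
    print(f"{key}={value}")
```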

script:

cd example/mnist 
python mnist_sparse.py

I really hope someone can help.
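The log above says that `SPCONV_DEBUG_SAVE_PATH` can be set to save debug data for attaching to an issue. Separately, `CUDA_LAUNCH_BLOCKING=1` is a standard CUDA setting that makes kernel launches synchronous, so the traceback points closer to the kernel that actually failed. The sketch below builds an environment with both variables set before the script is launched. `build_debug_env` is a hypothetical helper and the save path is a placeholder; whether spconv expects a file or a directory there is not stated in this thread.

```python
import os


def build_debug_env(save_path, base_env=None):
    """Return a copy of the environment with debug variables set.

    save_path is a placeholder chosen by the caller; the format spconv
    expects for SPCONV_DEBUG_SAVE_PATH is an assumption here.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Named in the spconv log message quoted in this report.
    env["SPCONV_DEBUG_SAVE_PATH"] = str(save_path)
    # Standard CUDA variable: forces synchronous kernel launches.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


debug_env = build_debug_env("/tmp/spconv_debug.pkl")
```

The result could then be passed to the failing script, for example with `subprocess.run([sys.executable, "mnist_sparse.py"], cwd="example/mnist", env=debug_env)`. Expect the run to be slower while launches are blocking.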

superpigforever commented 1 year ago

I got this error as well. Did you solve it?

Aiuan commented 11 months ago

Traceback (most recent call last):
  File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 385, in _conv_forward
    res = ops.get_indice_pairs_implicit_gemm(
  File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/spconv/pytorch/ops.py", line 550, in get_indice_pairs_implicit_gemm
    SpconvOps.sort_1d_by_key_allocator(pair_mask_tv[j],
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/spconv/pytorch/conv.py", line 402, in _conv_forward
    msg += f"indices={indices.shape},bs={batch_size},ss={spatial_shape},"
  File "/home/aify/anaconda3/envs/VILNet/lib/python3.9/site-packages/torch/_tensor.py", line 872, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Did you solve it? My code ran fine on cuda:0, but when I switched to cuda:1 this error occurred.
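Since the failure appears only when the device is not cuda:0, one thing that might be worth trying is to expose only the desired physical GPU to the process, so that it is addressed as cuda:0 inside PyTorch. This is a possible workaround sketch, not a confirmed fix for this issue. `pin_visible_gpu` is a hypothetical helper; `CUDA_VISIBLE_DEVICES` is the standard CUDA variable for restricting visible devices.

```python
import os


def pin_visible_gpu(physical_index):
    """Expose a single physical GPU to this process.

    Must be called before torch or spconv initializes CUDA, otherwise
    the setting has no effect. The chosen GPU is then the only visible
    device and is addressed as "cuda:0".
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return "cuda:0"


# Example: run on the second physical GPU while the code still uses cuda:0.
device_name = pin_visible_gpu(1)
```

After this call, the model and tensors would be moved to `device_name` as usual, for example `torch.device(device_name)`. The same effect can be had from the shell by prefixing the command with `CUDA_VISIBLE_DEVICES=1`.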

ColsonZhang commented 6 months ago

I have the same problem. It only works on cuda:0. It also fails when used together with timm.