traveller59 / spconv

Spatial Sparse Convolution Library
Apache License 2.0

Silent integer overflows #607

Open markweberdev opened 1 year ago

markweberdev commented 1 year ago

Hi,

I spent the last couple of days debugging an issue in a project that uses spconv, and it took me a while to figure it out. So I'm filing this issue/feature request in the hope that some things could be added to spconv to improve the user experience with such bugs. More on this below. The original issue can be found here: https://github.com/dvlab-research/SphereFormer/issues/31.

Environment:

Original error message

`RuntimeError: CUDA error: an illegal memory access was encountered` (full message below)

```
[06/02 14:10:24 main-logger]: Epoch: [1/50][10/1758] Data 0.001 (0.864) Batch 1.513 (3.495) Remain 85:18:56 Loss 1.7657 Lr: [0.00599939, 0.00059994] Accuracy 0.4674.
[06/02 14:10:39 main-logger]: Epoch: [1/50][20/1758] Data 0.001 (0.433) Batch 1.465 (2.481) Remain 60:33:22 Loss 1.3113 Lr: [0.00599877, 0.00059988] Accuracy 0.5326.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fcf322542f2 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7fcf3225167b in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7fcf326151f9 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fcf3223c3a4 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x6ea39a (0x7fcfa6cea39a in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x6ea441 (0x7fcfa6cea441 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4cb472]
frame #7: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4b0858]
frame #8: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4c5b50]
frame #9: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4c5b66]
frame #10: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4c5b66]
frame #11: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4c576c]
frame #12: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4a251c]
frame #13: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x553803]
frame #14: _PyEval_EvalFrameDefault + 0x2856 (0x4a9536 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #15: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #16: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4ae8df]
frame #17: _PyEval_EvalFrameDefault + 0xa9e (0x4a777e in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #18: _PyFunction_FastCallKeywords + 0x106 (0x4b9d16 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #19: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4ae8df]
frame #20: _PyEval_EvalFrameDefault + 0x971 (0x4a7651 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0x29c (0x4b9eac in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #23: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x4ae8df]
frame #24: _PyEval_EvalFrameDefault + 0x15d6 (0x4a82b6 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x201 (0x4a5a81 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #26: PyEval_EvalCodeEx + 0x39 (0x4a5879 in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #27: PyEval_EvalCode + 0x1b (0x54a8db in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #28: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x564e03]
frame #29: PyRun_StringFlags + 0x7b (0x561a8b in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #30: PyRun_SimpleStringFlags + 0x3b (0x5618fb in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #31: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x53fc27]
frame #32: _Py_UnixMain + 0x3c (0x53fb3c in /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python)
frame #33: + 0x29d90 (0x7fcfba17ed90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: __libc_start_main + 0x80 (0x7fcfba17ee40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: /usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/bin/python() [0x53f9ee]

Traceback (most recent call last):
  File "train.py", line 902, in
    main()
  File "train.py", line 90, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/storage/user/webermar/external/SphereFormer/train.py", line 410, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, criterion, optimizer, epoch, scaler, scheduler, gpu)
  File "/storage/user/webermar/external/SphereFormer/train.py", line 498, in train
    output = model(sinput, xyz, batch)
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/storage/user/webermar/external/SphereFormer/model/unet_spherical_transformer.py", line 285, in forward
    output = self.unet(output, xyz, batch)
  File "/usr/wiss/webermar/anaconda3/envs/sphere_pip_cuda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/storage/user/webermar/external/SphereFormer/model/unet_spherical_transformer.py", line 195, in forward
    xyz_next, batch_next = get_downsample_info(xyz, batch, indice_pairs)
  File "/storage/user/webermar/external/SphereFormer/model/unet_spherical_transformer.py", line 56, in get_downsample_info
    valid_pair_in, valid_pair_out = pair_in[valid_mask].long(), pair_out[valid_mask].long()
RuntimeError: CUDA error: an illegal memory access was encountered
```

Integer overflow

Unfortunately, the error message is quite misleading, as the error does not actually happen where the message indicates. After some debugging, I found that the error depends on the batch size used (which felt weird). With much more debugging, I found that the undetected error actually happens in spconv's SparseConvolution: when the index pairs are computed, they silently overflow (somewhere in this block). Since these overflows happen not in Python, where we would probably get NaNs, but in C/C++/CUDA, they go undetected. I assume (just guessing) that flattened indices are used somewhere, maybe for hashing, and that these overflow. Hence the different behaviour depending on the batch size.
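To illustrate what I mean (purely a sketch with made-up shapes and a made-up flattening formula, not spconv's actual code), this is how flattening (batch, z, y, x) coordinates into a single linear index can silently wrap around in int32:

```python
import numpy as np

# Assumed example shapes; the flattening below only illustrates the kind of
# index/hash computation that can overflow, not spconv's real kernel.
d, h, w = 2048, 2048, 128

coord = np.array([[7, 2000, 2000, 100]], dtype=np.int32)       # (batch, z, y, x)
strides = np.array([d * h * w, h * w, w, 1], dtype=np.int32)   # d*h*w = 536870912 still fits in int32

flat32 = (coord * strides).sum(axis=1, dtype=np.int32)   # int32 math wraps silently: [-12326812]
flat64 = (coord.astype(np.int64) * strides).sum(axis=1)  # correct linear index:      [4282640484]

print(flat32)  # negative "index" -> later used as an offset on the GPU => illegal memory access
print(flat64)  # does not fit into int32, so 64-bit indices (or an explicit check) would be needed
```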

Potential solutions

Since GPU errors are so tedious to debug and often misleading, it would be super helpful if spconv could avoid silent overflows. Maybe some of these ideas could be used?

You probably have even more ideas about what the best approach would be.
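One of the ideas could be as simple as a guard that fails early with a readable Python error instead of a CUDA-side crash. A minimal sketch (my own illustration, not spconv's actual API, and not how spconv picks its index dtype internally):

```python
INT32_MAX = 2**31 - 1

def check_index_space(batch_size, spatial_shape):
    """Illustrative guard only: refuse configs whose flattened coordinate
    space cannot be addressed with int32 indices."""
    volume = batch_size
    for s in spatial_shape:
        volume *= s          # Python ints do not overflow, so this count is exact
    if volume > INT32_MAX:
        raise ValueError(
            f"batch_size * prod(spatial_shape) = {volume} exceeds INT32_MAX "
            f"({INT32_MAX}); flattened int32 indices would silently overflow."
        )

# Example: an 11-bit coordinate range already fails for batch_size = 1.
check_index_space(batch_size=1, spatial_shape=(2048, 2048, 2048))  # raises ValueError
```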

Thanks a lot! Mark

FindDefinition commented 1 year ago

int32 is enough for the indices; the possible problem is index scalar overflow. That function already has code that switches to int64 if needed to avoid overflow. Please provide more information about this problem, such as the conv params, input spatial shape, number of input points, and batch size.
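For reference, most of these numbers can be read directly off the input tensor. Assuming `x` is an existing spconv.pytorch.SparseConvTensor (the attribute names are spconv's; everything else here is illustrative):

```python
# x: spconv.pytorch.SparseConvTensor built from your data
print("input spatial shape:", x.spatial_shape)     # [D, H, W] passed at construction
print("batch size:         ", x.batch_size)
print("number of points:   ", x.indices.shape[0])  # indices is [N, 4] = (batch_idx, z, y, x)
# conv params are whatever the layer was created with, e.g.
# spconv.pytorch.SparseConv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
```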

JonathanPaul10 commented 1 year ago

> int32 is enough for the indices; the possible problem is index scalar overflow. That function already has code that switches to int64 if needed to avoid overflow. Please provide more information about this problem, such as the conv params, input spatial shape, number of input points, and batch size.

I hit this bug too. I found that my spconv-based network works with 10-bit input (the input sparse tensor coordinates are in [0, 1023]), but errors occur when the input is in [0, 2047] (11-bit). Please check this kind of situation.

And I guess it is because batch_size * 2048^3 > INT_MAX, which overflows. Maybe it is a problem with the indices.
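The arithmetic behind that guess (plain Python, exact integer values):

```python
INT32_MAX = 2**31 - 1           # 2147483647

print(1024 ** 3)                # 1073741824 -> fits in int32 for a single batch element
print(2 * 1024 ** 3)            # 2147483648 -> already one past INT32_MAX at batch_size = 2
print(2048 ** 3)                # 8589934592 -> exceeds int32 even for batch_size = 1
print(2048 ** 3 > INT32_MAX)    # True
```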