sniklaus / softmax-splatting

an implementation of softmax splatting for differentiable forward warping using PyTorch

CUDA memory access error on multiple GPUs #46

Closed · JasonSheng-atp closed this issue 2 years ago

JasonSheng-atp commented 2 years ago

Hello, I am using the average forward warp on multiple GPUs, but I ran into a confusing error:

```
File "xxx/softsplat.py", line 354, in FunctionSoftsplat
    tenNormalize[tenNormalize == 0.0] = 1.0
RuntimeError: CUDA error: an illegal memory access was encountered
```

I am quite confused: that line should just assign a value within a single tensor, yet it triggers an illegal memory access. It happens after several epochs of training. Could you please help me with it?

sniklaus commented 2 years ago

Try running your script with `CUDA_LAUNCH_BLOCKING=1 python yourscript.py` and let me know what happens.
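For reference, the same flag can also be set from inside the script; this is just a sketch, and the only requirement is that the variable is set before the first CUDA call, which is why the import order below is deliberate:

```python
import os

# Setting the variable before any CUDA context exists has the same effect
# as prefixing it on the command line: kernel launches become synchronous,
# so the error surfaces at the line that actually launched the bad kernel.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported after setting the variable on purpose

print(torch.cuda.is_available())
```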

JasonSheng-atp commented 2 years ago

Thank you for your reply. After adding it, the output is:

File "xxx/FlowNet.py", line 93, in forward warped_img0 = FunctionSoftsplat(tenInput=img0, tenFlow=flow[:,:2], tenMetric=None, strType='average') File "xxx/softsplat.py", line 350, in FunctionSoftsplat tenOutput = _FunctionSoftsplat.apply(tenInput, tenFlow) File "xxx/softsplat.py", line 258, in forward cupy_launch('kernel_Softsplat_updateOutput', cupy_kernel('kernel_Softsplat_updateOutput', { File "cupy/cuda/function.pyx", line 201, in cupy.cuda.function.Function.call File "cupy/cuda/function.pyx", line 183, in cupy.cuda.function._launch File "cupy_backends/cuda/api/driver.pyx", line 306, in cupy_backends.cuda.api.driver.launchKernel File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

I found that right before the error, the network experienced a sharp change in the loss, like:

```
epoch:6 705/761 time:0.00+0.55 loss_l1:1.9941e-02
epoch:6 706/761 time:0.00+0.51 loss_l1:1.3551e-02
epoch:6 707/761 time:0.00+0.60 loss_l1:1.9836e-02
epoch:6 708/761 time:0.00+0.52 loss_l1:1.9157e-02
epoch:6 709/761 time:0.00+0.55 loss_l1:9.4212e-02
epoch:6 710/761 time:0.00+0.54 loss_l1:4.6343e-02
epoch:6 711/761 time:0.00+0.53 loss_l1:9.0796e-02
epoch:6 712/761 time:0.00+0.52 loss_l1:2.4395e-01
epoch:6 713/761 time:0.00+0.51 loss_l1:3.0267e-01
epoch:6 714/761 time:0.00+0.52 loss_l1:1.7686e-01
epoch:6 715/761 time:0.00+0.53 loss_l1:1.6426e-01
```

I guess some boundary may be exceeded when the gradient blows up? I tested this on three 2080 Tis, and I am also trying gradient clipping (a sketch of which follows below).
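For concreteness, gradient clipping in PyTorch is typically applied between `backward()` and `step()`; here is a minimal sketch, where the tiny model, the random data, and the max-norm value are placeholders for the actual training loop:

```python
import torch

# Placeholder model and data standing in for the actual FlowNet training loop.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
x, y = torch.randn(4, 8), torch.randn(4, 1)

optimizer.zero_grad()
loss = torch.nn.functional.l1_loss(model(x), y)
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0; this bounds the
# parameter update even when the loss spikes, as in the log above.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```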

sniklaus commented 2 years ago

I just updated softsplat.py, can you try again with the new version?

JasonSheng-atp commented 2 years ago

Sorry for the late reply; the error remains the same after switching to the new version:

File "xxx/FlowNet.py", line 87, in forward warped_img1 = FunctionSoftsplat(tenInput=img1, tenFlow=flow[:,2:], tenMetric=None, strType='average') File "xxx/softsplat.py", line 359, in FunctionSoftsplat tenOutput = _FunctionSoftsplat.apply(tenInput, tenFlow) File "xxx/softsplat.py", line 267, in forward cupy_launch('kernel_Softsplat_updateOutput', cupy_kernel('kernel_Softsplat_updateOutput', { File "cupy/cuda/function.pyx", line 201, in cupy.cuda.function.Function.call File "cupy/cuda/function.pyx", line 183, in cupy.cuda.function._launch File "cupy_backends/cuda/api/driver.pyx", line 306, in cupy_backends.cuda.api.driver.launchKernel File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered Traceback (most recent call last): File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered Exception ignored in: 'cupy.cuda.function.Module.dealloc' Traceback (most recent call last): File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered Traceback (most recent call last): File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered Exception ignored in: 'cupy.cuda.function.Module.dealloc' Traceback (most recent call last): File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Also, four core dump files (core.8, core.9, core.10, core.11) were generated in the working directory, which was not planned, and they are very large. The error always appears after a sharp change in the loss or at the end of an epoch. I will try lowering the learning rate; clipping the gradient did not help. Thank you for your reply.

sniklaus commented 2 years ago

I just pushed some more changes, maybe those make it work. :man_shrugging:

JasonSheng-atp commented 2 years ago

Thank you, problem solved. I switched to the new version and decreased the learning rate from 3e-4 to 1e-4, and the error is gone.

za-cheng commented 2 years ago

I ran into the very same issue today, and the root cause seems to be the float-to-int conversion on the C++ side (e.g. `int intNorthwestX = (int) (floor(fltX))`), which overflows for very large negative fltX. If fltX <= -2^31, intNorthwestX is cast to the minimum int32 value of -2^31, and the boundary condition `intNorthwestX >= 0` can then evaluate to true due to a subsequent signed-to-unsigned integer conversion. In my tests this results in an illegal memory access error on CentOS with CuPy and CUDA 10.2, but it is error-free on Ubuntu with a newer CuPy and CUDA, probably because `intNorthwestX >= 0` is handled differently.
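To make the mechanism concrete, here is a small Python/NumPy illustration of that cast (not the kernel itself); note that converting an out-of-range float to int is formally undefined in C, and the wraparound value shown is what x86 typically produces:

```python
import numpy as np

# Mimic the C-style cast from the kernel: floor() a hugely negative float,
# then convert it to a 32-bit signed integer.
fltX = np.float32(-3.0e9)  # far below the int32 minimum of -2**31
intNorthwestX = np.floor(fltX).astype(np.int32)  # C-style cast; may warn

# On x86 the out-of-range conversion typically yields INT_MIN.
print(int(intNorthwestX))  # -2147483648

# If this value then takes part in a comparison against an unsigned
# quantity, it is promoted to unsigned (2147483648), so a guard like
# `intNorthwestX >= 0` can effectively pass and the subsequent memory
# access lands far outside the tensor.
```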

There are two ways to get around this: clamp the flow to a safe range before splatting (sketched below), or perform the bounds check on the floating-point value before it is cast to int.
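For reference, a minimal sketch of the first workaround in PyTorch; `clamp_flow` is a hypothetical helper, and clamping to the image extent is an assumed (conservative) bound:

```python
import torch

def clamp_flow(tenFlow, intWidth, intHeight):
    # Hypothetical helper: bound the x/y flow components by the image size
    # so the splat target floor(x + flow_x) can never leave the int32 range,
    # even when training diverges and the flow explodes.
    tenFlow = tenFlow.clone()
    tenFlow[:, 0:1].clamp_(min=-intWidth, max=intWidth)    # x component
    tenFlow[:, 1:2].clamp_(min=-intHeight, max=intHeight)  # y component
    return tenFlow

flow = torch.randn(1, 2, 4, 4) * 1.0e10  # simulate a diverged, huge flow
flow = clamp_flow(flow, intWidth=4, intHeight=4)
print(flow.abs().max())  # now bounded by the image size
```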

sniklaus commented 2 years ago

Thanks for sharing your findings, clamping the flow is definitely a good idea! :+1: