Closed JasonSheng-atp closed 2 years ago
Try running your script with CUDA_LAUNCH_BLOCKING=1 python yourscript.py
and let me know what happens.
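CUDA kernel launches are asynchronous, so without this variable an illegal memory access is often reported at a later, unrelated line. If changing the command line is inconvenient, the variable can also be set from inside the script. A minimal sketch, assuming it runs before any GPU library is imported; enable_blocking_launches is a hypothetical helper name:

```python
import os


def enable_blocking_launches() -> None:
    """Ask the CUDA runtime to execute kernel launches synchronously."""
    # This only takes effect if it is set before CUDA is initialized,
    # so call it before importing torch or cupy in the training script.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# import torch  # import GPU libraries only after the variable is set
```

Training runs noticeably slower in this mode, so it is best used only while debugging.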
Thank you for your reply. After adding it, I get:
File "xxx/FlowNet.py", line 93, in forward
    warped_img0 = FunctionSoftsplat(tenInput=img0, tenFlow=flow[:,:2], tenMetric=None, strType='average')
File "xxx/softsplat.py", line 350, in FunctionSoftsplat
    tenOutput = _FunctionSoftsplat.apply(tenInput, tenFlow)
File "xxx/softsplat.py", line 258, in forward
    cupy_launch('kernel_Softsplat_updateOutput', cupy_kernel('kernel_Softsplat_updateOutput', {
File "cupy/cuda/function.pyx", line 201, in cupy.cuda.function.Function.call
File "cupy/cuda/function.pyx", line 183, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 306, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I found that, just before the error, the network showed a sharp change in the loss:
epoch:6 705/761 time:0.00+0.55 loss_l1:1.9941e-02
epoch:6 706/761 time:0.00+0.51 loss_l1:1.3551e-02
epoch:6 707/761 time:0.00+0.60 loss_l1:1.9836e-02
epoch:6 708/761 time:0.00+0.52 loss_l1:1.9157e-02
epoch:6 709/761 time:0.00+0.55 loss_l1:9.4212e-02
epoch:6 710/761 time:0.00+0.54 loss_l1:4.6343e-02
epoch:6 711/761 time:0.00+0.53 loss_l1:9.0796e-02
epoch:6 712/761 time:0.00+0.52 loss_l1:2.4395e-01
epoch:6 713/761 time:0.00+0.51 loss_l1:3.0267e-01
epoch:6 714/761 time:0.00+0.52 loss_l1:1.7686e-01
epoch:6 715/761 time:0.00+0.53 loss_l1:1.6426e-01
I guess there may be some limit on the gradient? I am testing on three 2080Tis, and I am now trying it with gradient clipping.
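The jump from roughly 2e-02 to 9e-02 at step 709 is the kind of event that can be caught automatically, for instance to save the offending batch or flow before the kernel crashes. A minimal sketch, assuming the per-step losses are available as plain floats; first_spike, window, and factor are hypothetical names and thresholds, not part of softsplat.py:

```python
from statistics import median
from typing import Optional, Sequence

# loss_l1 values for steps 705 to 715 of epoch 6, copied from the log above
losses = [1.9941e-02, 1.3551e-02, 1.9836e-02, 1.9157e-02, 9.4212e-02,
          4.6343e-02, 9.0796e-02, 2.4395e-01, 3.0267e-01, 1.7686e-01,
          1.6426e-01]


def first_spike(values: Sequence[float], window: int = 4,
                factor: float = 3.0) -> Optional[int]:
    """Return the index of the first value that exceeds `factor` times
    the median of the `window` values before it, or None if there is none."""
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if values[i] > factor * baseline:
            return i
    return None


idx = first_spike(losses)
print(idx)  # 4, i.e. step 709 in the log
```

The median of the preceding window is used as the baseline so that a single earlier outlier does not mask the next one.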
I just updated softsplat.py, can you try again with the new version?
Sorry for the late reply. The error remains the same with the new version:
File "xxx/FlowNet.py", line 87, in forward
    warped_img1 = FunctionSoftsplat(tenInput=img1, tenFlow=flow[:,2:], tenMetric=None, strType='average')
File "xxx/softsplat.py", line 359, in FunctionSoftsplat
    tenOutput = _FunctionSoftsplat.apply(tenInput, tenFlow)
File "xxx/softsplat.py", line 267, in forward
    cupy_launch('kernel_Softsplat_updateOutput', cupy_kernel('kernel_Softsplat_updateOutput', {
File "cupy/cuda/function.pyx", line 201, in cupy.cuda.function.Function.call
File "cupy/cuda/function.pyx", line 183, in cupy.cuda.function._launch
File "cupy_backends/cuda/api/driver.pyx", line 306, in cupy_backends.cuda.api.driver.launchKernel
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.dealloc'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.dealloc'
Traceback (most recent call last):
File "cupy_backends/cuda/api/driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
File "cupy_backends/cuda/api/driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Four unexpected and very large files (core.8, core.9, core.10, core.11) were also generated in the folder. The error always appears after a sharp change in the loss or at the end of an epoch. I will try lowering the learning rate; gradient clipping did not help either. Thank you for your reply.
I just pushed some more changes, maybe those make it work. :man_shrugging:
Thank you, problem solved. I used the new version and decreased the learning rate from 3e-4 to 1e-4, and the error no longer occurs.
I ran into the very same issue today, and the root cause seems to be that the float-to-int conversion on the C++ side (e.g. int intNorthwestX = (int) (floor(fltX))) overflows for very large negative fltX. If fltX <= -2^31, intNorthwestX is cast to the minimum int32 value of -2^31, and the boundary condition intNorthwestX >= 0 could evaluate to true because of a subsequent signed-to-unsigned integer conversion. In my tests this results in an illegal memory access error on CentOS with CuPy and CUDA 10.2, but is error-free on Ubuntu with a newer CuPy and CUDA, probably because intNorthwestX >= 0 is handled differently.
There are two ways to get around this:
1. Clamp the flow before splatting: tenFlow = tenFlow.clamp(-10000, 10000)
2. Add an explicit cast to the intNorthwestX >= 0 condition, e.g. intNorthwestX >= (int)0
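To illustrate why clamping the flow helps, here is a minimal pure-Python sketch of the range check. It assumes the splatting coordinate is the pixel position plus the flow; floor_fits_int32, clamp, pixel_x, and bad_flow are hypothetical names and values for illustration, not code from softsplat.py:

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def floor_fits_int32(flt_x: float) -> bool:
    """True if floor(flt_x) is representable as a signed 32-bit integer."""
    if not math.isfinite(flt_x):
        return False
    return INT32_MIN <= math.floor(flt_x) <= INT32_MAX


def clamp(value: float, low: float = -10000.0, high: float = 10000.0) -> float:
    """Scalar equivalent of tenFlow.clamp(-10000, 10000)."""
    return max(low, min(high, value))


pixel_x = 128.0    # hypothetical pixel column
bad_flow = -3.0e9  # hypothetical diverged flow value

unclamped = pixel_x + bad_flow
clamped = pixel_x + clamp(bad_flow)

print(floor_fits_int32(unclamped))  # False: the cast to int would overflow
print(floor_fits_int32(clamped))    # True: safely inside the int32 range
```

With the flow limited to +/-10000, the coordinate stays far inside the int32 range for any realistic image size, so the cast in the kernel can no longer overflow.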
Thanks for sharing your findings, clamping the flow is definitely a good idea! :+1:
Hello, I am using the average forward warp on multiple GPUs, but I encountered the following error:
File "xxx/softsplat.py", line 354, in FunctionSoftsplat
    tenNormalize[tenNormalize == 0.0] = 1.0
RuntimeError: CUDA error: an illegal memory access was encountered
I am quite confused, because this line only assigns a value to some elements of one tensor, yet it causes an illegal memory access error. Could you please help me with it? It happens after several epochs.