lliuz / ARFlow

The official PyTorch implementation of the paper "Learning by Analogy: Reliable Supervision from Transformations for Unsupervised Optical Flow Estimation".

Illegal memory access during back propagation unit test #10

Closed · 5had3z closed this 4 years ago

5had3z commented 4 years ago

Hi, I am having issues running correlation_native.py during the backward phase: `RuntimeError: CUDA error: an illegal memory access was encountered`. I first modified your implementation to update it to PyTorch 1.6.0 and ran into this issue. I then tried to use your Dockerfile, but jonathonf removed his Python 3.6 repository for Ubuntu 16.04, so I made the following changes to the Dockerfile:

```dockerfile
# Ubuntu 18.04 base ships Python 3.6, replacing the removed jonathonf PPA
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04
RUN pip3 install https://download.pytorch.org/whl/cu100/torch-1.1.0-cp36-cp36m-linux_x86_64.whl
```

This still resulted in errors during the backpropagation stage, specifically in correlation_backward_input1 and correlation_backward_input2.

I printed the dims in some of the functions to make sure the tensor shapes were correct:

```
(PyTorch backward)
Grad Output torch.Size([4, 81, 120, 120])
Input Dims  torch.Size([4, 256, 128, 128])

(correlation_backward_cuda)
Input      batch: 4  ch: 256  h: 128  w: 128

(correlation_backward_cuda_kernel, after channels_first calls)
rInput     batch: 4  ch: 128  h: 128  w: 256
gradInput1 batch: 4  ch: 256  h: 128  w: 128
gradOutput batch: 4  ch: 81   h: 120  w: 120
```
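For what it's worth, these shapes line up with flownet2-style correlation output formulas evaluated at max_displacement=4 and pad_size=0 (my assumption about the test settings; stride 1, kernel size 1):

```python
# Quick sketch of the assumed flownet2-style output-shape formulas.
def corr_output_shape(b, c, h, w, max_displacement=4, pad_size=0):
    out_c = (2 * max_displacement + 1) ** 2          # 81 for max_displacement=4
    out_h = h + 2 * pad_size - 2 * max_displacement  # 120 for h=128, pad_size=0
    out_w = w + 2 * pad_size - 2 * max_displacement
    return (b, out_c, out_h, out_w)

print(corr_output_shape(4, 256, 128, 128))  # (4, 81, 120, 120), as printed above
```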

Any idea where the issue is arising from? Is there a subtle difference introduced by changing CUDA 9 to CUDA 10 in the Docker image?

5had3z commented 4 years ago

If I change max_displacement to 1 or 2 and set C=H=W=64, it works fine. But with C=H=W=128 it doesn't work (even with max_displacement at 1 or 2).
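A minimal repro along these lines (the import path and argument names are assumptions based on NVIDIA/flownet2-pytorch's Correlation module, so adjust to your checkout):

```python
import torch
from correlation_package.correlation import Correlation  # path may differ

corr = Correlation(pad_size=0, kernel_size=1, max_displacement=2,
                   stride1=1, stride2=1, corr_multiply=1).cuda()
x1 = torch.randn(4, 128, 128, 128, device="cuda", requires_grad=True)
x2 = torch.randn_like(x1, requires_grad=True)
# Forward runs, but backward hits the illegal memory access at this size.
corr(x1, x2).sum().backward()
```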

lliuz commented 4 years ago

This is an issue with the correlation_cuda package. I am not very familiar with CUDA programming, so I may not be able to help you solve this problem.

If you have trouble with this package during training, you can alternatively use my PyTorch implementation (it is correct, although somewhat slower).

Since the correlation_cuda package is widely used in other projects, such as ClementPinard/Pytorch-Correlation-extension and NVIDIA/flownet2-pytorch, you can refer to those repos for help.
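For anyone who wants the fallback without building the extension, a naive pure-PyTorch correlation can be sketched like this (a minimal illustration with stride 1 and a mean over channels; not necessarily identical to correlation_native.py):

```python
import torch
import torch.nn.functional as F

def correlation_pytorch(feat1, feat2, max_disp=4):
    """Naive correlation: dot feat1 with shifted windows of feat2."""
    b, c, h, w = feat1.shape
    feat2 = F.pad(feat2, [max_disp] * 4)  # pad left/right/top/bottom
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = feat2[:, :, dy:dy + h, dx:dx + w]
            out.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(out, dim=1)  # (b, (2*max_disp+1)**2, h, w)
```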

5had3z commented 4 years ago

From some testing I did, there are accesses at index -1 during some operations: in the correlation forward kernel during the element-wise product sum (prod_sum += rInput1 * rInput2), and in correlation_backward_input1 when reading from rInput2.

In my own code I skip these operations with a boundary check, and consequently I no longer have this issue.
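Concretely, the guard looks something like this (a Python rendering of the idea; corr_tap is a hypothetical name mirroring the per-tap read the kernel does on rInput2):

```python
import numpy as np

def corr_tap(rinput1, rinput2, y1, x1, dy, dx):
    """One correlation tap with an explicit bounds check on rinput2.

    rinput1/rinput2 are HWC arrays; without the check, dy/dx can push
    (y2, x2) to -1 and read out of bounds, as described above.
    """
    h, w, c = rinput2.shape
    y2, x2 = y1 + dy, x1 + dx
    if not (0 <= y2 < h and 0 <= x2 < w):  # skip out-of-bounds taps
        return 0.0
    return float(rinput1[y1, x1] @ rinput2[y2, x2]) / c
```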

lliuz commented 4 years ago

Thanks for sharing and I'm glad you could find a workaround in the end!

sun0215 commented 2 years ago

Hi, I have run into the same issue. May I ask how you use the boundary check to skip the operations you mentioned above? @5had3z

5had3z commented 2 years ago

@sun0215 I've got the checks in my re-implementation, but they're commented out: it turns out this issue is due to insufficient padding (the pad_size variable). You won't access out of bounds if that is large enough; just search for the smallest number that works for you.

The bound checks are commented out because CUDA cores are dumb, AFAIK: there's no branch prediction, so you pay the full cost of those checks. That's why they're commented out, but they're still there for future reference.
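In other words (my reading of the fix), the kernel only stays in bounds when the padding covers the displacement window, so a cheap host-side guard is enough:

```python
def check_correlation_args(pad_size, max_displacement):
    # Assumption drawn from this thread: correlation taps reach up to
    # max_displacement pixels past the border, so padding must cover that.
    if pad_size < max_displacement:
        raise ValueError(
            f"pad_size={pad_size} < max_displacement={max_displacement}; "
            "the CUDA kernel may read out of bounds")

check_correlation_args(pad_size=4, max_displacement=4)  # OK
```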

sun0215 commented 2 years ago

I have solved this issue by setting the padding to 20 for 256×256 images. Thank you very much.