sniklaus / pytorch-pwc

a reimplementation of PWC-Net in PyTorch that matches the official Caffe version
GNU General Public License v3.0

Receiving a random error, CUDA_ERROR_ILLEGAL_ADDRESS #55

Closed Etienne66 closed 2 years ago

Etienne66 commented 2 years ago

Let me start by saying that I have already resolved this but it was related to your commit 85b59fa.

The original error which repeated multiple times:

Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy_backends\cuda\api\driver.pyx", line 260, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends\cuda\api\driver.pyx", line 125, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

I copied your change that moved the multiply by 20.0 into the return statement of forward in the Network class. However, I had also implemented torch.utils.checkpoint, and I had missed that there was now a layer call inside the return statement. It seems that having the other layers checkpointed caused a random memory error once that layer's output was manipulated by the multiplication.

My resolution was to put the multiply by 20.0 back where it was, and then I moved the self.netRefiner(objEstimate['tenFeat']) call out of the return statement and placed it on the line above so that I could checkpoint it. This freed up even more GPU memory than I had before, when I was limited to a patch size of 576.
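
In other words, the refiner call now sits on its own line so it can be wrapped in torch.utils.checkpoint. Roughly, the forward ends up looking like this (a simplified sketch of the idea, not the exact code in my fork):

    def forward(self, tenOne, tenTwo):
        tenOne = self.netExtractor(tenOne)
        tenTwo = self.netExtractor(tenTwo)

        objEstimate = self.netSix(tenOne[-1], tenTwo[-1], None)
        objEstimate = self.netFiv(tenOne[-2], tenTwo[-2], objEstimate)
        objEstimate = self.netFou(tenOne[-3], tenTwo[-3], objEstimate)
        objEstimate = self.netThr(tenOne[-4], tenTwo[-4], objEstimate)
        objEstimate = self.netTwo(tenOne[-5], tenTwo[-5], objEstimate)

        # the refiner call is pulled out of the return so it can be checkpointed;
        # assumes 'import torch.utils.checkpoint' at the top of the file
        objEstimate['tenFeat'] = torch.utils.checkpoint.checkpoint(
            self.netRefiner, objEstimate['tenFeat'])

        return (objEstimate['tenFlow'] + objEstimate['tenFeat']) * 20.0
    # end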

It took me 3 weeks to figure this out because an epoch was taking a couple of days to run and sometimes the error didn't pop up until it was almost done. It was very random. I also wasn't sure whether it was this change or one of the other updates I had made, and I just couldn't fathom that this could cause a memory error. I still find it hard to believe, but this was the last thing I changed, and it had worked fine for several epochs with checkpointing before I made that change.

In my repository, which I forked from csbhr/CDVD-TSP, your run.py lives under code/model/flow_pwc.py.

I'm using checkpoint so that I can increase the patch size to 720 pixels; otherwise I can only run at 256 pixels, and I have a feeling the flow calculations from your model in particular need a higher resolution. I'm not saving your model separately, but I might update the code to do that in the future, because it is being trained as part of my model and I have a lot more frame pairs being used for training. I'm not sure it will do a better job of training the pytorch-pwc model, but I am curious what the differences would be in the end result. A lot of the frames also contain blur, so I'm sure that makes the job of calculating flow much more difficult. In total I am using 29,844 frames for training, with random vertical and horizontal flips as well as a random 90-degree rotation; most of the frames are 1280x720, but all are cropped to 720 with a random crop position.
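
For reference, the augmentation I described amounts to something like this (a simplified sketch of my data loader, not the exact code; it is applied identically to both frames of a pair):

import random
import torch

def augment_pair(frame1, frame2, patch=720):
    # frames are C x H x W tensors; pick a random crop position
    _, h, w = frame1.shape
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    frame1 = frame1[:, top:top + patch, left:left + patch]
    frame2 = frame2[:, top:top + patch, left:left + patch]

    if random.random() < 0.5:  # random horizontal flip
        frame1, frame2 = frame1.flip(-1), frame2.flip(-1)
    if random.random() < 0.5:  # random vertical flip
        frame1, frame2 = frame1.flip(-2), frame2.flip(-2)
    if random.random() < 0.5:  # random 90 degree rotation
        frame1 = torch.rot90(frame1, 1, dims=(-2, -1))
        frame2 = torch.rot90(frame2, 1, dims=(-2, -1))

    return frame1, frame2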

Thanks for all of your work @sniklaus. After learning more about all of the work you have done in this model I have to admit that I am very impressed with your work as well as the original authors.

Etienne66 commented 2 years ago

I can't help but wonder if the 196 channels in self.netSix are part of the cause of my memory error, since the correlation in kernel_Correlation_updateOutput steps through the channels 32 at a time. It should stop at 160, but instead it stops at 192, which would suggest it is treating the input as if it had 224 channels while reducing it to 81. I can't imagine what data it is using in those extra channels, but it makes me wonder if that is why the original paper reported a lot of edge problems in their model. I'm just guessing though. I know even less about C at this point than Python, but I am learning.
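
Just to illustrate the arithmetic behind my guess (this is only how I arrive at 192 and 224, not the kernel code itself):

channels = 196
step = 32
starts = list(range(0, channels, step))
print(starts)                      # [0, 32, 64, 96, 128, 160, 192]
print(starts[-1] + step)           # 224, i.e. 28 channels past the actual 196
print(list(range(0, 192, step)))   # with 192 channels the last chunk starts at 160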

I started training from scratch with a corrected self.netSix, but I'm training to deblur rather than training the flow directly. I should probably go back and train the pytorch-pwc model by itself instead, but I'm curious whether this will work.

shengjie-lin commented 2 years ago

For me, I always get this error when I am using any GPU other than gpu:0. I tried my best to make sure everything is on the same GPU device, but the error won't go away. So I ended up mapping whichever device is available to gpu:0 when running Docker.
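
Concretely, the remapping I mean boils down to something like this (only a sketch; I actually do it through Docker's device options, but the effect is the same):

import os

# hide every GPU except the one to use, *before* torch/cupy initialise CUDA,
# so the remaining device shows up as gpu:0 / cuda:0
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import torch
import cupy

print(torch.cuda.device_count())           # 1
print(cupy.cuda.runtime.getDeviceCount())  # 1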

Etienne66 commented 2 years ago

I only have one GPU, so it wasn't because of that. It is because of a memory error caused by having the channel count set to 196 instead of 192 while the CuPy module I mentioned above steps through the channels 32 at a time. I have not had this error since. Instead of training from scratch, I am now just modifying the model after loading the pretrained weights. I'm still running tests, but I think it is going to outperform the original deblur model I'm working on. I haven't had a single error since I changed this code; if I put it back to 196 and restore the multiply by 20.0 in the return, I get the error sometime during an epoch. You know CuPy better than I do, so please take a close look at it. Is it not reading memory locations that don't have data defined?

Etienne66 commented 2 years ago

I think I understand the CuPy code a little better now, and there is no memory problem with having 196 channels after all. The issue seems to have been purely having part of a model layer inside the return statement. I moved the netRefiner layer out of the return statement, as shown below, and the memory errors ceased. My only guess is that PyTorch does not handle layer calls in the return statement exactly the same way as it does in the rest of the forward block.

@sniklaus, I do have one question about the CuPy code though, and I haven't found an answer on the internet. Is the block=tuple([ 32, 1, 1 ]) specifying that there are 32 threads for kernel_Correlation_updateOutput or is it specified somewhere else? I assume that threadIdx.x only ranges from 0 to 31, depending on which thread is running.

    def forward(self, tenOne, tenTwo):
        tenOne = self.netExtractor(tenOne)
        tenTwo = self.netExtractor(tenTwo)

        objEstimate = self.netSix(tenOne[-1], tenTwo[-1], None)
        objEstimate = self.netFiv(tenOne[-2], tenTwo[-2], objEstimate)
        objEstimate = self.netFou(tenOne[-3], tenTwo[-3], objEstimate)
        objEstimate = self.netThr(tenOne[-4], tenTwo[-4], objEstimate)
        objEstimate = self.netTwo(tenOne[-5], tenTwo[-5], objEstimate)
        objEstimate['tenFeat'] = self.netRefiner(objEstimate['tenFeat'])

        return (objEstimate['tenFlow'] + objEstimate['tenFeat']) * 20.0
    # end

Etienne66 commented 2 years ago

I'm starting to think I have a hardware issue. It was working fine for 135 epochs and had gone through millions of iterations, and now I can't make any more progress because of this exact same error. No code changes.

I probably need a new video card with some water cooling. I was running at 80C for many months and I bet that has degraded the GPU somewhat. I can't come up with another explanation for why it would work so well and then stop working altogether.

Etienne66 commented 2 years ago

I think I have found a resolution. I am using MSI Afterburner to under-clock my NVIDIA GeForce RTX 2060 Super, which has a boost clock of 1680 MHz. I reduced the core clock by 100 MHz. I also set the GPU temperature limit to 76C, and since the power limit is linked, it was reduced to 88%. It has been running for a few hours with no more of this error.

sniklaus commented 2 years ago

Oh wow, what a nightmare. I am happy you found the culprit, though. I am hence closing this issue for now, since I am under the impression that it wasn't an issue with the code in the end. Please feel free to reopen it in case I am mistaken. And thanks for keeping us updated throughout all of this!

sniklaus commented 2 years ago

And sorry for forgetting to answer your questions.

Is the block=tuple([ 32, 1, 1 ]) specifying that there are 32 threads for kernel_Correlation_updateOutput or is it specified somewhere else?

It means the kernel is launched with one warp of 32 threads per block, so threadIdx.x indeed runs from 0 to 31. The value is originally defined here: https://github.com/lmb-freiburg/flownet2/blob/b92e198b56b0e52e1ba0a5a98dc0e39fa5ae70cc/src/caffe/layers/correlation_layer.cu#L17
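
As a toy illustration of what the block tuple does (this is not the correlation kernel, just a minimal standalone example):

import cupy

kernel = cupy.RawKernel(r'''
    extern "C" __global__ void write_thread_ids(int* out, int n)
    {
        // blockDim.x is 32 because the launch below passes block=(32, 1, 1),
        // so threadIdx.x runs from 0 to 31 within each block
        int intIndex = blockIdx.x * blockDim.x + threadIdx.x;
        if (intIndex < n) {
            out[intIndex] = threadIdx.x;
        }
    }
''', 'write_thread_ids')

n = 100
out = cupy.empty(n, dtype=cupy.int32)
kernel(tuple([ (n + 31) // 32, 1, 1 ]), tuple([ 32, 1, 1 ]), (out, cupy.int32(n)))
print(out[:34])  # 0 .. 31 from the first block, then 0, 1 from the second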

tsogkas commented 2 years ago

First of all, thank you again @sniklaus for making this code available. I have a workstation with two Nvidia 1080 Ti cards, and I still get the same error whenever this code is running on one GPU and I try to run a separate experiment on the second GPU. As a side note, I've never encountered this issue when running any other code on both GPUs simultaneously. I think my issue is related to what @StArchon94 posted earlier:

For me, I always get this error when I am using any GPU other than gpu:0. I tried my best to make sure everything is on the same GPU device, but the error won't go away. So I ended up mapping whichever device is available to gpu:0 when running Docker.

This is a very weird issue and can be quite problematic, since I cannot debug while, say, training a model, which can take a couple of days. @StArchon94, did you end up finding a solution after all?