mys007 / ecc

Edge-Conditioned Convolutions on Graphs

Issues executing examples. CUDA_ERROR_ILLEGAL_ADDRESS and torch.bmm received an invalid combination of arguments #1

Open dhorka opened 6 years ago

dhorka commented 6 years ago

Hi,

I have some issues executing your code. First, I tried to run your ModelNet10 example using the command provided. It seemed to work, but at an advanced epoch the code crashed with this error:

Traceback (most recent call last):
  File "main.py", line 315, in <module>
    main()
  File "main.py", line 217, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 155, in train
    loss_meter.add(loss.data[0])
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/torch/lib/THC/generic/THCStorage.c:32

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[the same cupy moduleUnload traceback and "Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'" lines repeat several more times]

I executed the code several times and the error appears randomly: it is not always in the same epoch, and it does not always appear in the same part of the code. Here is another example of the error:

File "main.py", line 315, in <module>
    main()
  File "main.py", line 217, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 152, in train
    loss.backward()
  File "/projects/env/ecc/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/projects/env/ecc/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cublas runtime error : an internal operation failed at /pytorch/torch/lib/THC/THCBlas.cu:247

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[the same cupy moduleUnload traceback and "Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'" lines repeat several more times]

I tried different versions of PyTorch: 0.2, 0.3 and 0.4. All three were installed with pip, and I also tried a version compiled from source (0.2); the same error appears. I am using a machine with 60 GB of RAM, an Intel Xeon and a Titan X with 12 GB of memory. Moreover, I tried different versions of open3d (0.2.0 and 0.3.0). Finally, I modified your sample command and added edge_mem_limit to limit the memory used on the GPU, without success.

I also tested the code using the Sydney Urban Objects example, but in this case the following error appears at the beginning of the execution:

File "main.py", line 315, in <module>
    main()
  File "main.py", line 217, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 148, in train
    outputs = model(inputs)
  File "/project/env/ecc/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/project/env/ecc/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/code/ecc/ecc/GraphConvModule.py", line 171, in forward
    return GraphConvFunction(self._in_channels, self._out_channels, idxn, idxe, degs, degs_gpu, self._edge_mem_limit)(input, weights)
  File "/project/code/ecc/ecc/GraphConvModule.py", line 63, in forward
    self._multiply(sel_input, sel_weights, products, lambda a: a.unsqueeze(1))
  File "/project/code/ecc/ecc/GraphConvModule.py", line 36, in _multiply
    torch.bmm(f_a(a) if f_a else a, f_b(b) if f_b else b, out=out)
TypeError: torch.bmm received an invalid combination of arguments - got (torch.DoubleTensor, torch.FloatTensor, out=torch.DoubleTensor), but expected (torch.DoubleTensor source, torch.DoubleTensor mat2, *, torch.DoubleTensor out)

Could you please give me some hints to solve these issues?

Thanks,

mys007 commented 5 years ago

Hi, first of all I'm sorry about the delay; somehow I haven't received any notification from GitHub.

dhorka commented 5 years ago

Hi, sure, during this week I will try to reproduce the error. I will run several experiments with CUDA_LAUNCH_BLOCKING=1 and get back to you.

EDIT:

This is the log with CUDA_LAUNCH_BLOCKING=1:

File "main.py", line 315, in <module>
    main()
  File "main.py", line 217, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 148, in train
    outputs = model(inputs)
  File "/work/env/ecc/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/ecc/GraphConvModule.py", line 171, in forward
    return GraphConvFunction(self._in_channels, self._out_channels, idxn, idxe, degs, degs_gpu, self._edge_mem_limit)(input, weights)
  File "/work/code/ecc/ecc/GraphConvModule.py", line 67, in forward
    cuda_kernels.conv_aggregate_fw(output.narrow(0,startd,numd), products.view(-1,self._out_channels), self._degs_gpu.narrow(0,startd,numd))
  File "/work/code/ecc/ecc/cuda_kernels.py", line 122, in conv_aggregate_fw
    block=(CUDA_NUM_THREADS,1,1), grid=(GET_BLOCKS(w),n//blockDimY+1,1), stream=stream)
  File "cupy/cuda/function.pyx", line 147, in cupy.cuda.function.Function.__call__
  File "cupy/cuda/function.pyx", line 129, in cupy.cuda.function._launch
  File "cupy/cuda/driver.pyx", line 195, in cupy.cuda.driver.launchKernel
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[the same cupy moduleUnload traceback and "Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'" lines repeat several more times]
mys007 commented 5 years ago

Thanks a lot! It's weird that it happens in the forward pass; there should still be enough memory available no matter what, especially since you've said you have the full 12 GB available. Just to check:

dhorka commented 5 years ago

Hi,

python main.py \
  --dataset modelnet10 --test_nth_epoch 25 --lr 0.1 --lr_steps '[50,100,150]' --epochs 175 --batch_size 64 --batch_parts 4 \
  --model_config 'i_1_2, c_16,b,r, c_32,b,r, m_2.5_7.5, c_32,b,r, c_32,b,r, m_7.5_22.5, c_64,b,r, m_1e10_1e10, f_64,b,r,d_0.2,f_10' \
  --fnet_llbias 0 --fnet_widths '[16,32]' --pc_augm_scale 1.2 --pc_augm_mirror_prob 0.2 --pc_augm_input_dropout 0.1 \
  --nworkers 3 --edgecompaction 1 --odir results/modelnet10

mys007 commented 5 years ago

Thanks! I've upgraded to your latest version of cupy (by the way, it seems there is now a cleaner way to define custom kernels with https://docs-cupy.chainer.org/en/latest/reference/generated/cupy.RawKernel.html), so my setup should be the same as yours. But I'm sorry, I can't reproduce it; I haven't received the error during training. I have no idea, sorry :(. Perhaps using RawKernel and rewriting the PyTorch functions in the modern way with the ctx parameter could help, but it's just a guess...
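
For illustration, a minimal sketch of the RawKernel API linked above, with a toy elementwise kernel rather than the actual ecc aggregation kernels (assumes a reasonably recent cupy):

import cupy as cp

# Toy kernel: elementwise addition of two float arrays.
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
''', 'add')

n = 1 << 20
a = cp.arange(n, dtype=cp.float32)
b = cp.ones(n, dtype=cp.float32)
c = cp.empty(n, dtype=cp.float32)
# Launch signature: kernel(grid, block, args)
add_kernel(((n + 255) // 256,), (256,), (a, b, c, cp.int32(n)))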

dhorka commented 5 years ago

Thanks! The error is completely random; it does not always appear. Are you also using CUDA 8 and cuDNN 6? Can you tell me which NVIDIA driver you are using?

And my last question is just out of curiosity: why did you choose cupy instead of PyTorch's mechanism for custom CUDA extensions? Is there a technical reason?

I suspect that the error is related to PyTorch's management of GPU memory. As you know, PyTorch uses a caching memory allocator to speed up memory allocations; this allows fast deallocation without device synchronizations, but the unused memory managed by the allocator still shows as used to other applications. Maybe that conflicts with the code executed by cupy. What do you think? Still, I do not understand why it only happens in my setup and not in yours...
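
A hedged sketch of the caching-allocator behaviour described above (PyTorch 0.4-era torch.cuda API; this only illustrates the caching, it is not claimed to fix the crash):

import torch

x = torch.randn(4096, 4096, device='cuda')  # roughly a 64 MB allocation
del x                                        # freed from PyTorch's point of view...
# ...but the block stays in PyTorch's cache, so nvidia-smi (and any other CUDA
# client, e.g. cupy) still sees that memory as occupied.
torch.cuda.empty_cache()                     # hand cached blocks back to the driver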

mys007 commented 5 years ago

My driver is 384.130, CUDA 8.0.61, cuDNN 6021. But I'm unable to shuffle around drivers and CUDA versions, as I'm using a shared computer.

Thanks for the tip, the support for extensions in PyTorch seems to have improved a lot; in particular, there is even a JIT solution! The reason I went the cupy way about 1.5 years ago was that the PyTorch way was more rudimentary and supported explicit compilation only. I think my current code could surely be rewritten to use just PyTorch/JIT. I might give it a quick shot in the next few days...
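
For reference, a hedged sketch of the JIT route mentioned above (torch.utils.cpp_extension.load; the file and function names here are illustrative, not the actual ecc sources):

from torch.utils.cpp_extension import load

# Compiles and loads the extension on first use; requires nvcc on the PATH.
ecc_cuda = load(name='ecc_cuda',
                sources=['ecc_conv.cpp', 'ecc_conv_kernel.cu'],
                verbose=True)
# The module then exposes whatever functions are bound in ecc_conv.cpp,
# e.g. ecc_cuda.conv_aggregate_fw(output, products, degs).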

Regarding the interactions between PyTorch and cupy, who knows... but I would assume they both use standard CUDA allocation calls in the end, so memory should not get assigned twice. But removing the dependency on cupy would sort it out anyway. Another explanation would be that a physical part of your GPU memory is somehow corrupted - but I assume you otherwise have no problems training other large networks, do you?

dhorka commented 5 years ago

I tested with different GPUs to rule out a hardware error and I had the same issue. Regarding the adaptation to a PyTorch extension, if I can help in some way, tell me. By the way, another interesting thing about this adaptation is that it would allow multi-GPU training with a bigger batch_size :) Thanks for your time!

dhorka commented 5 years ago

Today I tried to adapt your code to use a PyTorch extension. Here you can find the modified files from my first try: https://github.com/dhorka/ecc_cuda_extension. I am not using JIT; you need to compile the kernel using the provided setup.py. At the moment the code fails at runtime with a segmentation fault and I was not able to figure out what is going on, but maybe the skeleton can help you. I will check it again later. Thanks.
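
For context, a hedged sketch of what such a setup.py typically looks like (the module and source names are illustrative, not necessarily those in the linked repository):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='ecc_cuda',
    ext_modules=[
        CUDAExtension('ecc_cuda', ['ecc_conv.cpp', 'ecc_conv_kernel.cu']),
    ],
    cmdclass={'build_ext': BuildExtension})

# Built and installed with: python setup.py install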

mys007 commented 5 years ago

Thanks for your hard work on this, that was definitely a great starting point! I've fixed your code (the main problems were that not all types were supposed to be floats and that the grid-block parameters were not right), written it as JIT, added backward aggregation and ported the code to PyTorch 0.4. It's in the branch https://github.com/mys007/ecc/tree/pytorch4_cuda_extensions . Could you perhaps try to run it on your machine with PyTorch 0.4.1 and see if it works now? If CUDA_ERROR_ILLEGAL_ADDRESS appears only with other kernels, we might be on the right track. One weird thing is that the training now takes all GPU memory (12 GB), instead of about 8 GB with PyTorch 0.3 and cupy, but whatever :).

dhorka commented 5 years ago

Hi mys, thanks also for your work on this issue!! I have just launched 2 training processes to make sure the issue has disappeared :) On Monday I will tell you the results. Regarding the GPU consumption, if you are looking at the consumption in nvidia-smi, we cannot be sure it is the real consumption, because PyTorch uses a caching memory allocator to speed up allocations, and the unused memory managed by the allocator still shows as used in nvidia-smi. To check the real memory used, we need to use one of the methods provided by PyTorch such as max_memory_allocated(). If I do not hit the illegal address issue by Monday, I can check the memory used.
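
For reference, a hedged sketch of reporting the real peak usage (function names as in the PyTorch 0.4-era torch.cuda API; the figures shown by nvidia-smi additionally include the allocator's cache):

import torch

# ... after a training epoch ...
print('peak allocated: %.1f MB' % (torch.cuda.max_memory_allocated() / 1024**2))
print('peak cached:    %.1f MB' % (torch.cuda.max_memory_cached() / 1024**2))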

mys007 commented 5 years ago

I have just launched 2 training processes to make sure the issue has disappeared

Thanks. But it may still crash in the other kernels (pooling); perhaps I should have ported all of them while I was at it... Can you please run the processes with CUDA_LAUNCH_BLOCKING=1 so that one can get the right stack trace?

nvidia-smi, we cannot be sure it is the real consumption,

Indeed, but I thought this had been a feature of PyTorch from the beginning; maybe they have changed the laziness of deallocations...

dhorka commented 5 years ago

Thanks. But it may still crash in the other kernels (pooling); perhaps I should have ported all of them while I was at it... Can you please run the processes with CUDA_LAUNCH_BLOCKING=1 so that one can get the right stack trace?

Sure, I re-executed the experiment with CUDA_LAUNCH_BLOCKING=1. In parallel, I will try to port the other kernels, using your ported kernels as an example.

Indeed, but I thought this had been a feature of PyTorch from the beginning; maybe they have changed the laziness of deallocations...

I am not sure what is happening with the GPU memory. As far as I saw, when I launch the experiments (with cupy) the memory consumption at the beginning of training is more or less 8 GB, but in later epochs I can see that sometimes the consumption is 4 GB and other times 12 GB...

dhorka commented 5 years ago

All kernels are ported; you can find them at https://github.com/dhorka/ecc_cuda_extension. I was not able to test whether all the kernels work properly at runtime (at the moment I do not have any GPU available), but at least the compilation is working.

mys007 commented 5 years ago

Wow, what a great effort! Let's wait for the result of your jobs and if it's good, I can merge & clean up everything.

amosella commented 5 years ago

Hi Mys, I'm Dhorka; this is my main account. (I was not able to post with this account because it was flagged several times; GitHub's automated security mechanisms were triggered incorrectly, but that now seems to be solved.) I have some results. First of all, the kernels that I ported yesterday are working; at least at the moment some experiments are running without errors. On the other hand, the experiments that I ran yesterday with only the convolution kernels ported to PyTorch 0.4 crashed with the following error:

check FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu line=21 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 314, in <module>
    main()
  File "main.py", line 216, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 147, in train
    outputs = model(inputs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 57, in forward
    self.num_batches_tracked += 1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:21

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 159, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 75, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[the same cupy moduleUnload traceback and "Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'" lines repeat several more times]

It seems like the error is cupy related, right? At the moment I have two more experiments running with all the kernels ported. I will let you know the results when the experiments finish.

amosella commented 5 years ago

Hi,

It seems it is not related to cupy... Below you can see the error output of one of the experiments with all the kernels ported to PyTorch 0.4.1:

Epoch 166/175 (results/modelnet10_all_cuda_kernels):
 48%|█████████████████████████████                                | 119/250 [02:00<02:33,  1.17s/it]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu line=21 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 314, in <module>
    main()
  File "main.py", line 216, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 147, in train
    outputs = model(inputs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 57, in forward
    self.num_batches_tracked += 1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:21
mys007 commented 5 years ago

Damn, that's really frustrating. I guess it must be some bug in the kernels manifesting itself only under some rare condition of the input data. Could you perhaps run the training with --nworkers 0, which will be slow but should be deterministic? I will run the same (for now with half cupy, half extension).

amosella commented 5 years ago

Sure! I launched one experiment with 0 workers. Tomorrow I will come back with the results.

amosella commented 5 years ago

Well, I got some results... to be honest, this is starting to get weird... I got the error in epoch 7, as you can see in the following trace:

Epoch 7/175 (results/modelnet10_all_cuda_kernels_w0):
 66%|████████████████████████████████████████                     | 164/250 [05:09<03:21,  2.35s/it]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu line=21 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 314, in <module>
    main()
  File "main.py", line 216, in main
    acc_train, loss, t_loader, t_trainer = train(epoch)
  File "main.py", line 147, in train
    outputs = model(inputs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/code/ecc/models.py", line 103, in forward
    input = module(input)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/work/env/ecc_torch0.4.1_py36/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 57, in forward
    self.num_batches_tracked += 1
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/generic/THCTensorMathPairwise.cu:21

This is the result of this command:

python main.py --dataset modelnet10 --test_nth_epoch 25 --lr 0.1 --lr_steps '[50,100,150]' --epochs 175 --batch_size 64 --batch_parts 4 --model_config 'i_1_2, c_16,b,r, c_32,b,r, m_2.5_7.5, c_32,b,r, c_32,b,r, m_7.5_22.5, c_64,b,r, m_1e10_1e10, f_64,b,r,d_0.2,f_10' --fnet_llbias 0 --fnet_widths '[16,32]' --pc_augm_scale 1.2 --pc_augm_mirror_prob 0.2 --pc_augm_input_dropout 0.1 --nworkers 0 --edgecompaction 1 --odir results/modelnet10_all_cuda_kernels_w0

After that, I tried to resume the experiment to check whether I could reproduce the error, but... after resuming, the training continued without problems... This is the command I used to resume the experiment:

python main.py --dataset modelnet10 --test_nth_epoch 25 --lr 0.1 --lr_steps '[50,100,150]' --epochs 175 --batch_size 64 --batch_parts 4 --resume results/modelnet10_all_cuda_kernels_w0/model.pth.tar --fnet_llbias 0 --fnet_widths '[16,32]' --pc_augm_scale 1.2 --pc_augm_mirror_prob 0.2 --pc_augm_input_dropout 0.1 --nworkers 0 --edgecompaction 1 --odir results/modelnet10_all_cuda_kernels_w0_resume

Now I am thinking of running an experiment forcing the seed of the data loader and also setting cuDNN to deterministic mode.
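
For reference, a hedged sketch of what forcing determinism usually involves (PyTorch 0.4-era API; as noted below, non-determinism in the graph convolution's aggregation functions is not affected by this):

import random
import numpy as np
import torch

seed = 1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable cuDNN autotuning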

EDIT: I didn't realize that the DataLoader in your code is declared inside the epoch loop. Is the reason for that to shuffle in each epoch, or something like that? Normally, in the code I have seen, this declaration is outside the epoch loop, because as far as I know the per-epoch shuffle is done by the DataLoader itself without needing to be re-initialized each epoch.

mys007 commented 5 years ago

Thanks for your report. Is the crash reproducible on your side, meaning that if you rerun the training from scratch (the first command line above), it will break during epoch 7 again? In my case, I got no crash :(.

Resuming will not produce the same results as training straight through without resuming, because the states of the random generators are not saved/restored (too complicated). But data loading should basically start again and crash in epoch 14 then; weird that it didn't.

Now I am thinking to run an experiment forcing the seed of the data loader and also setting CUDNN in deterministic mode.

If nworkers=0, the worker runs in the same thread as the main function, which is seeded in the seed() call. Any non-determinism in activations/weight updates should not matter because the weights don't affect the control flow. Besides, graph convolution is also not deterministic due to the aggregation functions.

I didn't realize that the declaration of the dataloader in your code is inside the epoch loop. The reason for that is to do a shuffle in each epoch?

I think it's because DataLoaders are not infinite (they raise StopIteration), so I have to restart them - isn't that the case? But anyway, with nworkers=0 it shouldn't matter...

amosella commented 5 years ago

Hi Mys,

Thanks for your report. Is the crash reproducible on your side, meaning that if you rerun the training from scratch (the first command line above), it will break during epoch 7 again? In my case, I got no crash :(.

I launched two experiments and it always crashes at iteration 164 of epoch 7. As far as I can see, it is reproducible on my side.

Resuming will not produce the same results as training straight through without resuming, because the states of the random generators are not saved/restored (too complicated). But data loading should basically start again and crash in epoch 14 then; weird that it didn't.

Yep, it is weird... I do not understand what is different after the resume...

If nworkers=0, the worker runs in the same thread as the main function, which is seeded in the seed() call. Any non-determinism in activations/weight updates should not matter because the weights don't affect the control flow. Besides, graph convolution is also not deterministic due to the aggregation functions.

I understand. Thanks for the explanation!

I think it's because DataLoaders are not infinite (they return StopIteration), so I have to restart them - is it not the case? But anyway, with nworkers=0 it shouldn't matter...

As far as I know (and I also tested it), you do not need to restart them, because DataLoaders are able to handle successive epochs on their own.
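
As a hedged illustration of this point (a self-contained toy dataset, not the ecc loader): a single DataLoader can be constructed once and iterated every epoch, and with shuffle=True it reshuffles itself at the start of each epoch.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)

for epoch in range(3):
    for inputs, targets in loader:  # a fresh iterator (and a new shuffle) each epoch
        pass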

To sum up, I think I can reproduce the error. Maybe I can debug on my side, following your instructions.

mys007 commented 5 years ago

I launched two experiments and it always crashes at iteration 164 of epoch 7.

Great news! Could you please pickle (inputs, targets, GIs, PIs) in https://github.com/mys007/ecc/blob/8fbc9019ca8bc2a620617d7477ac5d17ba65e4bf/main.py#L136 and make it available for me to download? Then I can look at the particular case:). I think the easiest implementation is just to keep pickling and when it crashes, the violating batch will have been stored.

amosella commented 5 years ago

Hi Mys,

Done! Here you can find the file.

The code used to pickle (inputs, targets, GIs, PIs), at line 136 of the main file, is this one: torch.save({'inputs': inputs, 'targets': targets, 'GIs': GIs, 'PIs': PIs}, os.path.join(args.odir, 'inputs_targets_GIs_PIs.pth.tar'))

I ran the experiment with cuDNN in deterministic mode (I forgot to disable it), but it doesn't matter; the error is the same without it.

EDIT: Here you can find another file, generated with w=3. Also, I tried to reproduce the error with the Sydney dataset, but with Sydney I do not get this error...

mys007 commented 5 years ago

Hi, thanks a lot... but when I load the batch on my computer (from either of your files) so that each training iteration runs on it, I get no crash (with PyTorch 0.4.1). I'm sorry, but I just think that resolving this issue is beyond my powers :(.

mys007 commented 5 years ago

Hi, I was wondering: if you're in a very experimental mood, could you try to run https://github.com/mys007/ecc/tree/crazy_fix with pytorch 0.3? There is just one extra line which touches dest. I remembered some other project where a similar hack helped to "resolve" a crashing kernel by probably reordering something...;)

amosella commented 5 years ago

Hi Mys,

Sorry for not answering your last comment; I have a cold and was not able to check my e-mail. Yes, of course I will try it. I would also like to check whether I can reproduce the error on my setup with the files that I sent you, because I sent them to you but did not try to reproduce the error myself; I was thinking that maybe it is something related to the state of the RNG... But anyway, this weekend I will test your fix and also try to reproduce the error again. Thanks for your dedication!

amosella commented 5 years ago

Hi Mys, I tried your fix and it doesn't work :(. On the other hand, you are right: if I resume using the files that I sent you, I get no errors... It is weird... I also tried to save all the RNG states in order to reproduce the error, but... I get no errors when I resume... I don't know, but I think that after all the things we have done we can assume the error is due to something in my setup, maybe the NVIDIA drivers or something like that.

mys007 commented 5 years ago

Damn, but thanks a lot. Well, actually, there has been one other user who has contacted me by email with the same issue in the meantime (though on Sydney; CUDA 9.1, TITAN X Pascal). It's so difficult to debug. Perhaps I could rewrite the whole aggregation with sparse matrix operations, but I need to have a look at the current support in PyTorch.

amosella commented 5 years ago

If I can do anything else, do not hesitate to ask :) By the way, on my side I ran several experiments using the Sydney dataset without errors...

ShengyuH commented 5 years ago

Hi all, I have the same issue, quite randomly. I'm now using SPG to benchmark on ScanNet; I think it just adopts the code from your master branch. I'm now trying your pytorch4_cuda_extensions branch. I use CUDA 10.2, driver version 430.26, a 1080 Ti, cupy-cuda100 6.3.0 and PyTorch 1.2.0. I will get back to you later.

update:

  File "train.py", line 132, in train
    loss.backward()
  File "/scratch/shengyu/anaconda/envs/venv_spg/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/scratch/shengyu/anaconda/envs/venv_spg/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 9.98 GiB (GPU 0; 10.91 GiB total capacity; 185.49 MiB already allocated; 9.98 GiB free; 6.51 MiB cached)

The error occurred again. Orz

mys007 commented 5 years ago

@HenrryBryant Thanks for your report and for trying out the experimental branch. I'm sorry that the problem has not been solved. Though the new error message is "CUDA out of memory" rather than "CUDA_ERROR_ILLEGAL_ADDRESS"?

ShengyuH commented 5 years ago

@HenrryBryant Thanks for your report and for trying out the experimental branch. I'm sorry that the problem has not been solved. Though the new error message is "CUDA out of memory" rather than "CUDA_ERROR_ILLEGAL_ADDRESS"?

Hi Martin, thank you for your quick reply. Actually, these two errors just randomly take turns occurring. I will try with CUDA_LAUNCH_BLOCKING; otherwise I will just have to turn to PyTorch Geometric.

By the way, I really like several works from you and Loic; they are really beautiful.

mys007 commented 5 years ago

Well, what I meant is that "CUDA out of memory" might not be a bug but rather indeed running out of memory. Is the GPU completely free before running the code? Otherwise, you can try decreasing the batch size just in case. And yeah, I wish I could really do a rewrite in PyTorch Geometric one day!

ShengyuH commented 5 years ago

Hi, if you have the same problem dhorka mentioned and you are using DataLoader, please consider replacing your DataLoader object with a for loop (though that makes it difficult to run on multiple GPUs); this has worked magically for me so far. By the way, you may also replace tensor with Tensor in lines 225 and 227 of GraphConvModule.py. I'm not sure whether this also contributes to finally fixing this random bug, but my supervisor told me it could also cause weird memory-leak problems. I use CUDA 10.2, driver version 430.26, and PyTorch 1.2.0 installed with pip.

mys007 commented 5 years ago

@HenrryBryant Thanks for the investigations.

That's an interesting note about DataLoader, but I believe this is just a random workaround that changes and spreads out the timing of kernel runs (since there is no parallel loading, which also means training must be very slow). I guess one could achieve the same effect by setting the number of workers to 0 or 1?

The tensor vs Tensor syntax was introduced in PyTorch 0.4, so it shouldn't matter for the original setting (with PyTorch 0.3). But frankly, I'm really amazed the code works with 1.2.0!

Nevertheless, I think the out of memory issue and the illegal address crash are two different things.