vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

CUDA_ERROR_ILLEGAL_ADDRESS #65

Open guosheng opened 9 years ago

guosheng commented 9 years ago

When I try to train a model with many layers (e.g., >=20), an error will come up after some iterations during training:

Warning: An unexpected error occurred during CUDA execution. The CUDA error was: CUDA_ERROR_ILLEGAL_ADDRESS

I only update some layers in the later part of the network and keep the starting layers fixed. I don't keep the responses from the fixed layers, in order to save GPU memory. I have checked that there is more than enough GPU memory for the run.
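
For reference, this kind of setup is usually expressed along the following lines in MatConvNet. This is only a minimal sketch assuming a simplenn-style network and the conserveMemory option of the stock training loop; the layer indices and field names are illustrative, not the exact code used here:

    % Freeze the early layers by zeroing their learning rates
    % ([filters biases] rates per convolution layer; indices are illustrative).
    for i = 1:10
      if strcmp(net.layers{i}.type, 'conv')
        net.layers{i}.learningRate = [0 0] ;
      end
    end

    % Ask the training loop to discard intermediate responses it no longer
    % needs, trading a little recomputation for lower GPU memory use.
    opts.conserveMemory = true ;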

I tried on a K40, a Titan, and a 780, running in MATLAB 2014b. I also tried MATLAB 2014a, where the error changes to: CUDA_ERROR_LAUNCH_FAILED

I added wait(gpuDevice) to the places that involve GPU computation, but it doesn't help. Any suggestions for solving this?
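
For context, the wait call above is used roughly like this (a sketch only; the forward/backward call and variable names are illustrative):

    res = vl_simplenn(net, im, dzdy) ;   % forward + backward pass on the GPU
    wait(gpuDevice) ;                    % block until all queued kernels finish,
                                         % so an asynchronous error surfaces here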

lenck commented 9 years ago

Hmm, this should not happen, especially after a few successful iterations... Can you try to run it with verbose mode on (e.g. by adding the verbose flag to the vl_nnconv and vl_nnpool calls in vl_simplenn) and post the output on e.g. pastebin? It's just to check that the size of the internal buffers remained unchanged, so that an issue with memory allocation can be ruled out...
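
Concretely, something like the following inside vl_simplenn is what is meant (a sketch only; the weight field names differ between MatConvNet versions, so adapt to your copy):

    % e.g. for a convolution layer inside vl_simplenn:
    res(i+1).x = vl_nnconv(res(i).x, l.weights{1}, l.weights{2}, ...
                           'pad', l.pad, 'stride', l.stride, 'verbose') ;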

guosheng commented 9 years ago

Thanks Karel, I will check this later and let you know.

AruniRC commented 8 years ago

I have the same problem

An unexpected error occurred during CUDA execution. The CUDA error was: CUDA_ERROR_ILLEGAL_ADDRESS

At other times: nnfullyconnected_forward: nnfullyconnected_forward_impl<>: : gemm<>: [cublas error] [cublas:cublasSgemm (CUBLAS_STATUS_EXECUTION_FAILED)]

This occurs when I use the multi-GPU code (on two Titan X GPUs, CUDA 7.0) to train VGG-16 on ImageNet. The error appears after two iterations.

Here's the pastebin of error: http://pastebin.com/gVckG44r

jrruijli commented 8 years ago

Hi,

Same problem here on a GTX Titan, only for the large VGG-16 network, not for the small ones. I also found a possible solution.

The problem was there when compiling without cuDNN support, and it was resolved after enabling cuDNN support.

Jasper

vedaldi commented 8 years ago

I have been training VGG VD (and now Inception) on two Titan X GPUs with cuDNN without any problems. However, I first had to switch to cuDNN 4.0.4 (the latest version available now, I believe), as I was running an older preview that had some issues. Perhaps upgrading could cure it.

kevjshih commented 8 years ago

Hi, I started running into this problem after upgrading from NVIDIA driver 361.x to 364.x (probably not many people are using this except Arch Linux users). The only workaround was to reduce the batch size down to about 50, which is approximately 4.7 GB for this particular model with a custom layer. I recently switched from cuDNN 4 to 5 and ended up having to further reduce to a batch size of about 40. I also experienced the same issue without cuDNN.
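
For reference, reducing the batch size looks roughly like this with the stock cnn_train script (a sketch only; a custom training loop may expose the option under a different name):

    [net, info] = cnn_train(net, imdb, @getBatch, ...
                            'batchSize', 40, ...   % smaller batches lower peak GPU memory use
                            'gpus', 1) ;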

I've tested this on MATLAB R2015a, R2015b, and R2016a with MatConvNet beta 12, 18, and 19, and experienced the same issue in all cases. I'm currently running CUDA 7.5.18 on a GTX Titan X. The output of gpuDevice(1) is:

    CUDADevice with properties:

                      Name: 'GeForce GTX TITAN X'
                     Index: 1
         ComputeCapability: '5.2'
            SupportsDouble: 1
             DriverVersion: 8
            ToolkitVersion: 7.5000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 1.2881e+10
           AvailableMemory: 1.2685e+10
       MultiprocessorCount: 24
              ClockRateKHz: 1076000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

I'm guessing this problem may be resolved by CUDA 8, based on the DriverVersion field, but for now I haven't found a solution with the current setup. It's probably more of a driver issue than a bug in MatConvNet, but I'm curious whether anyone has more insight into what's happening here.

jvlmdr commented 8 years ago

@kevjshih Thanks! I ran into the same problem and rolling back from 364 to 361 resolved it! (I'm using the driver from https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa and CUDA installed from the .run file cuda_7.5.18_linux.run on Ubuntu 14.04)

ZiangYan commented 8 years ago

@guosheng @jrruijli I have the same problem on Ubuntu 14.04 with CUDA 7.5, a Tesla K80, and cuDNN 5.

This problem appears if I compile MatConvNet without cuDNN support, and it goes away if I compile MatConvNet with cuDNN 5.

I'm not sure of the reason behind this, but you may temporarily fix it by re-compiling MatConvNet with cuDNN support.
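
For reference, recompiling with cuDNN support goes through vl_compilenn; the paths below are illustrative and must point at your own CUDA and cuDNN installations:

    % Rebuild MatConvNet with GPU and cuDNN support (adjust the paths to your system).
    vl_compilenn('enableGpu', true, ...
                 'cudaRoot', '/usr/local/cuda-7.5', ...
                 'enableCudnn', true, ...
                 'cudnnRoot', 'local/cudnn-rt-v5') ;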

DIPRECXY commented 7 years ago

I solved the problem by upgrading cuDNN to v5.1. You can give it a try. cuDNN download: https://developer.nvidia.com/rdp/cudnn-download

RomainBUAA commented 7 years ago

@AruniRC how did you solve the CUBLAS_STATUS_EXECUTION_FAILED error? I compiled with cuDNN, but it is not solved.

ngonthier commented 6 years ago

I got the same error message on a machine with a Tesla P100, CUDA 8.0, and cuDNN 5.1.5.

It seems to be correlated with the size of the input image of my network. If I use a small input image I don't get any problem, but with a bigger image I get this error message: "Error using vl_nnconv An error occurred during PTX compilation of . The information log was:

The error log was:

The CUDA error code was: CUDA_ERROR_ILLEGAL_ADDRESS."

It could be a memory problem.
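
One way to check is to watch the free GPU memory around the failing call (a sketch only; the network and input variable names are illustrative):

    g = gpuDevice ;
    fprintf('free GPU memory before: %.2f GB\n', g.AvailableMemory / 1e9) ;
    res = vl_simplenn(net, gpuArray(single(im))) ;   % forward pass on the large input
    wait(g) ;
    fprintf('free GPU memory after:  %.2f GB\n', g.AvailableMemory / 1e9) ;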

Do you have any solution?