torch / cutorch

A CUDA backend for Torch7

cublas runtime error : the GPU program failed to execute #818

Closed tastyminerals closed 6 years ago

tastyminerals commented 6 years ago

Just a few days ago I was able to train and run my model on the GPU. After a recent update I am getting the following error:

Epoch #1    
training... 
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/extra/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [71,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/home/pavel/torch/install/bin/luajit: /home/pavel/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 2 module of nn.ParallelTable:
In 2 module of nn.Sequential:
In 1 module of nn.MapTable:
/home/pavel/torch/install/share/lua/5.1/nn/Linear.lua:66: cublas runtime error : the GPU program failed to execute at /home/pavel/torch/extra/cutorch/lib/THC/THCBlas.cu:246
stack traceback:
    [C]: in function 'addmm'
    /home/pavel/torch/install/share/lua/5.1/nn/Linear.lua:66: in function </home/pavel/torch/install/share/lua/5.1/nn/Linear.lua:53>
    [C]: in function 'xpcall'
    /home/pavel/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/pavel/torch/install/share/lua/5.1/nn/MapTable.lua:47: in function </home/pavel/torch/install/share/lua/5.1/nn/MapTable.lua:43>
    [C]: in function 'xpcall'
    /home/pavel/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/pavel/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/pavel/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    /home/pavel/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    ...e/pavel/torch/install/share/lua/5.1/nn/ParallelTable.lua:12: in function <...e/pavel/torch/install/share/lua/5.1/nn/ParallelTable.lua:10>
    [C]: in function 'xpcall'
    /home/pavel/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/pavel/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    main.lua:278: in function 'opfunc'
    /home/pavel/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    main.lua:299: in main chunk
    [C]: in function 'dofile'
    ...avel/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405c90

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /home/pavel/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /home/pavel/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    main.lua:278: in function 'opfunc'
    /home/pavel/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
    main.lua:299: in main chunk
    [C]: in function 'dofile'
    ...avel/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405c90

I am running a Linux machine with a GTX 1070, CUDA 9.0 and the linux414-nvidia 1:390.25-9 package. An identical model in TensorFlow runs just fine on the GPU:

2018-03-15 15:02:56.178412: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-03-15 15:02:56.291383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-15 15:02:56.291661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.695
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 7.60GiB
2018-03-15 15:02:56.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)

>>> Start test loss: 320.5631351470947

Epoch: 1
> lr update: 0.0497500005
> Train loss: 54.85749292001128
> Valid loss: 7.513679893687367
> Best valid loss so far: 320.5631351470947
> Stopping in (35) epochs if no new minima!
! New local minima found, saving the model...

tastyminerals commented 6 years ago

Figured it out. While debugging the model I inserted some padding into the DataLoader class and then promptly forgot about it :man_facepalming:
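For anyone landing here with the same trace: the `srcIndex < srcSelectDimSize` assertions are printed before the cublas error, so the out-of-range index in the index-select kernel is most likely the real fault. The `addmm` failure is probably just where the error surfaced, since CUDA kernels run asynchronously. A cheap way to catch this kind of mistake is to validate the index batch on the CPU before it reaches the lookup. Below is a minimal, framework-free sketch in Python. The helper name `find_out_of_range`, the sample batch and the table size are all made up for illustration, and it assumes Torch7's 1-based indexing, so a padding value of 0 counts as out of range.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(indices: Iterable[int], size: int, base: int = 1) -> List[Tuple[int, int]]:
    """Return (position, value) pairs for indices outside [base, base + size - 1].

    `base` is the smallest valid index: 1 for Torch7-style indexing (assumed
    here), 0 for zero-based frameworks. Positions are 0-based Python positions.
    """
    low, high = base, base + size - 1
    return [(pos, idx) for pos, idx in enumerate(indices) if not (low <= idx <= high)]


# Hypothetical batch: word ids for a 5-entry lookup table, padded with 0.
batch = [3, 1, 5, 0, 0]
bad = find_out_of_range(batch, size=5)
print(bad)  # [(3, 0), (4, 0)]
```

Running the same batch on the CPU, or with the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` set, usually makes the error appear at the call that actually caused it rather than at a later one.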