torch / cutorch

A CUDA backend for Torch7

cutorch always reports error at wrong place #755

Closed zsrkmyn closed 7 years ago

zsrkmyn commented 7 years ago

I am not sure whether this issue belongs in cutorch or cunn.

When I use LookupTable with CUDA and a given index is out of the range of the lookup table, torch doesn't report the error immediately; instead, the error surfaces when the following layer begins its forward pass.

Here is an example:

require 'cunn'

lut = nn.LookupTable(10, 20):cuda()
lin = nn.Linear(20, 1):cuda()

ii = torch.Tensor{1, 2, 3, 0}:cuda() -- contains 0, out of range
oo = lut:forward(ii)
print(oo:size())
oo = lin:forward(oo)

the output is:

$ luajit test.lua
  4
 20
[torch.LongStorage of size 2]

/tmp/makepkg/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]:
block: [0,0,0], thread: [0,0,0] Assertion  failed.
/tmp/makepkg/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]:
block: [0,0,0], thread: [1,0,0] Assertion  failed.
/tmp/makepkg/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]:
block: [0,0,0], thread: [2,0,0] Assertion  failed.
/tmp/makepkg/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]:
block: [0,0,0], thread: [3,0,0] Assertion  failed.
...(many identical lines omitted)
/tmp/makepkg/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]:
block: [0,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
luajit: /usr/share/lua/5.1/nn/Linear.lua:66: cublas runtime error : library not initialized at /tmp/makepkg/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/THCGeneral.c:394
stack traceback:
        [C]: in function 'addmm'
        /usr/share/lua/5.1/nn/Linear.lua:66: in function 'forward'
        test.lua:9: in main chunk
        [C]: at 0x00404750

As the output shows, we can even print the size of the result from the lut; the error only occurs when the linear layer begins its forward pass. If I switch from GPU to CPU, there is no such problem. This is not a big issue, but it can be really confusing when debugging the code.

I am using CUDA 8.0, torch7 built from torch/torch7@7c26baf, cunn built from torch/cunn@b9ab0f7, cutorch built from torch/cutorch@181a869, nn built from torch/nn@22ffc4f.

I am happy to provide more details if needed :-)

albanD commented 7 years ago

Hi,

This is expected behavior because of how device-side asserts work on CUDA: kernel launches are asynchronous, so a failed assert is only detected at a later CUDA API call. If you want the error to be raised at the correct place, you can set the environment variable CUDA_LAUNCH_BLOCKING=1 to make the CUDA API synchronous and raise the error where it occurs, but that will slow down your code.
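For reference, enabling this for a single debugging session might look like the following (assuming the repro above is saved as test.lua, as in the original report):

```shell
# Make CUDA kernel launches synchronous so device-side asserts are raised
# at the offending call (debug only -- this slows execution):
export CUDA_LAUNCH_BLOCKING=1
# Then run the repro as usual, e.g.:
#   luajit test.lua
# The index assert now fires inside lut:forward instead of at the
# later addmm call in nn.Linear.
```

Unsetting the variable (or starting a fresh shell) restores the normal asynchronous launch behavior for production runs.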

zsrkmyn commented 7 years ago

@albanD thanks a lot! I think I will set CUDA_LAUNCH_BLOCKING=1 when I am debugging and unset it when running.