torch / cunn

void cunn_ClassNLLCriterion_updateOutput_kernel Assertion failed #448

Closed zsrkmyn closed 7 years ago

zsrkmyn commented 7 years ago

Hi,

I built cunn from the latest code on GitHub and ran the code here with the command luajit train.lua -gpuid 0, and then the following error occurred:

$ luajit train.lua -gpuid 0
table: 0x435d5dd8
DataLoader loading h5 image file:       data/vqa_data_img_vgg_train.h5
DataLoader loading h5 image file:       data/vqa_data_img_vgg_test.h5
DataLoader loading h5 question file:    data/vqa_data_prepro.h5
DataLoader loading json file:   data/vqa_data_prepro.json
assigned 215375 images to split 0
assigned 121512 images to split 2
Building the model...
total number of parameters in word_level:       8031747
total number of parameters in phrase_level:     2889219
total number of parameters in ques_level:       5517315
constructing clones inside the ques_level
total number of parameters in recursive_attention:      2862056
iter 0: 6.958066, 0.011597, 0.000400, 3.811794
/build/torch7-cunn-git/src/torch7-cunn-git/lib/THCUNN/ClassNLLCriterion.cu:52: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/build/torch7-cunn-git/src/torch7-cunn-git/lib/THCUNN/ClassNLLCriterion.cu:52: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/build/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
luajit: cuda runtime error (59) : device-side assert triggered at /build/torch7-cutorch-git/src/torch7-cutorch-git/lib/THC/generic/THCStorage.c:32
stack traceback:
        [C]: at 0x7f645a0f1ac0
        [C]: in function '__index'
        /usr/share/lua/5.1/nn/ClassNLLCriterion.lua:52: in function 'updateOutput'
        /usr/share/lua/5.1/nn/CrossEntropyCriterion.lua:20: in function 'forward'
        train.lua:204: in function 'eval_split'
        train.lua:330: in main chunk
        [C]: at 0x00404750

When I (randomly) checked out cunn at commit 27479c372040b8cab4e53e9338e8ce840bdb67dd and rebuilt the package, the error disappeared.

I am really sorry that I cannot isolate the problem. I am new to both Lua and Torch, and for now I am just running other people's code to get some results.

I am happy to provide more details if needed.

Thanks for building such great software!

soumith commented 7 years ago

The targets given to your loss function contain bad values: some of them are not between 1 and nClasses. We added assertions for this case recently. Your older build does not raise an error, but its results will still be buggy.
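
To locate the offending labels, a range check along the following lines could be run on each batch before it reaches the criterion. This is only a minimal sketch in plain Python for illustration, not code from train.lua or cunn: the function name `find_bad_targets` and the sample batch are hypothetical, and it assumes the targets are available as a flat list of integers using the 1-based convention described above.

```python
def find_bad_targets(targets, n_classes):
    """Return (position, value) pairs for labels outside 1..n_classes.

    Assumes 1-based class labels, so both 0 and n_classes + 1 are invalid.
    Positions are 0-based Python list indexes.
    """
    return [(i, t) for i, t in enumerate(targets) if not 1 <= t <= n_classes]


# Hypothetical batch for a 3-class problem: the labels 0 and 4 are out of range.
bad = find_bad_targets([1, 3, 0, 2, 4], n_classes=3)
print(bad)  # [(2, 0), (4, 4)]
```

An empty result means every label is in range. A label of 0 usually points to 0-based labels in the preprocessed data, while a label above nClasses suggests the output layer is smaller than the label vocabulary.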