mahshad92 opened this issue 5 years ago
Hmm weird, can you figure out which minibatch/image triggered the error? E.g., by printing image names and trial and error.
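Something along these lines might work as a quick hack. Note this is only a sketch: `loader:nextBatch()`, `batch.image_paths`, and `model:step` are hypothetical names, not the actual identifiers in this repo.

```lua
-- Hypothetical sketch, not actual repo code: log the images in each
-- minibatch so the crashing batch can be identified from the output.
local batch_id = 0
while true do
    local batch = loader:nextBatch()  -- placeholder loader API
    if batch == nil then break end
    batch_id = batch_id + 1
    io.write(string.format('batch %d:', batch_id))
    for _, path in ipairs(batch.image_paths) do  -- placeholder field name
        io.write(' ' .. path)
    end
    io.write('\n')
    io.flush()  -- flush so the names survive a hard CUDA crash
    model:step(batch)  -- placeholder for the actual forward/backward call
end
```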
@da03 So I printed the image names in each batch; it looks like "75dc30bf82.png" is an empty image and causes acc=NaN.
Also, from the log file I attached, you can see the batch that causes the error (I checked the token sizes right before it crashes). I also ran with CUDA_LAUNCH_BLOCKING=1 to get a better description of the error:
{
1 : "2b80174519.png"
2 : "5712d3adfe.png"
3 : "1aad846709.png"
4 : "1380c58267.png" ---> token length 391
5 : "7d032bac62.png" ---> token length 249
}
/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [3,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
I checked the two samples that might have caused this, but I'm still confused.
Hmm, acc is not a big issue here, since we only use val ppl to select models and use a separate evaluation script afterwards to calculate image accuracy. For blank images I suspect there's a zero-divided-by-zero issue, which would produce NaNs.
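A minimal sketch of the kind of guard I mean; the `num_correct`/`num_total` accumulator names are hypothetical, not the actual variables in src/train.lua:

```lua
-- Hypothetical guard, not the actual code in src/train.lua: if a blank
-- image contributes zero target tokens, num_total stays 0 and
-- num_correct / num_total evaluates to 0/0 = nan in Lua.
local acc
if num_total > 0 then
    acc = num_correct / num_total
else
    acc = 0.0  -- or simply skip blank samples when accumulating
end
print(string.format('Accuracy = %f', acc))
```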
The more serious error looks like a problem with LookupTable. There are two places where we use LookupTable: the positional embedding and the decoder word embedding. Can you check the code to pinpoint whether the issue occurs during the encoder forward pass or during the decoder forward pass?
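A sketch of the kind of check that could pinpoint it; `pos_indices`, `target_tokens`, and the size variables are placeholders you would have to map onto the actual tensors fed to each LookupTable:

```lua
-- Hypothetical diagnostic, not code from this repo: nn.LookupTable only
-- accepts indices in [1, nIndex]; the CUDA assert
-- (srcIndex < srcSelectDimSize) fires when that range is violated.
local function check_indices(indices, table_size, label)
    local lo, hi = indices:min(), indices:max()
    if lo < 1 or hi > table_size then
        print(string.format('%s: index out of range (min %d, max %d, table size %d)',
                            label, lo, hi, table_size))
    end
end

-- Call on the input tensors before each forward pass; whichever check
-- trips first tells you which embedding is at fault.
check_indices(pos_indices, num_positions, 'encoder positional embedding')
check_indices(target_tokens, target_vocab_size, 'decoder word embedding')
```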
BTW, how did you get the test dataset? It's weird, since I've fully tested this code with it. Here is my processed dataset: http://lstm.seas.harvard.edu/latex/data/ (the processed section)
@da03 I downloaded the data from the link provided in the repo: https://zenodo.org/record/56198#.XFhyiXXwYph
Thanks for providing the processed data; I get the same issue on some of those inputs as well (same pattern). I should be able to finish after some cleaning.
Hi,
@mahshad92, did you find a way to overcome this issue? I'm facing the same problem. @da03, can you help me with this? It's a bit weird, because you said you had tested the code with this dataset, yet it raises an error for me! (At first I get NaN for acc, and after a few more steps it raises an error like the one @mahshad92 mentioned above.)
I resolved the problem by retraining the model on my own device.
Hi,
I am trying to replicate your results. Although I have no issues loading the trained model and testing it on the toy test samples (100), when I try to use the same model to get the accuracy on all test samples in the 100K dataset (10,355), the test accuracy becomes NaN after some time and I get an error after 2,000 samples. I do not understand this behavior. I changed the token length to get rid of warnings, but that did not help. Please let me know if you have faced the same issue. log.txt
[01/27/19 17:43:52] 1.046239
[01/27/19 17:43:52] Number of samples 2000 - Accuracy = nan
[01/27/19 17:43:54] 1.082996
[01/27/19 17:43:58] 1.228099
[01/27/19 17:44:00] 1.140648
[01/27/19 17:44:03] 1.131666
[01/27/19 17:44:06] 1.043551
[01/27/19 17:44:09] 1.162436
[01/27/19 17:44:11] 1.087319
[01/27/19 17:44:14] 1.575318
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=163 error=59 : device-side assert triggered
/home/mxm7832/torch/install/bin/luajit: /home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:163
stack traceback:
[C]: in function 'v'
/home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Sigmoid_updateOutput'
/home/mxm7832/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: in function 'func'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
src/model/model.lua:360: in function 'feval'
src/model/model.lua:885: in function 'step'
src/train.lua:111: in function 'train'
src/train.lua:289: in function 'main'
src/train.lua:295: in main chunk
[C]: in function 'dofile'
...7832/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk