simonwsw / deep-soli

Gesture Recognition Using Neural Networks with Google's Project Soli Sensor
MIT License

Error after second batch is evaluated #6

Closed graulef closed 7 years ago

graulef commented 7 years ago

When going through evaluation, the code crashes after the second batch. I get the following error:

imi@imi-All-Series:~/graulef/deep-soli$ th net/main.lua --file ../datapre --list config/file_half.json --load ../uni_image_np_50.t7 --inputsize 32 --inputch 4 --label 11 --datasize 32 --datach 4 --batch 16 --maxseq 40 --cuda --cudnn
[eval] data with 1364 seq
[net] loading model ../uni_image_np_50.t7
nn.Sequencer @ nn.Recursor @ nn.MaskZero @ nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> output]
  (1): cudnn.SpatialConvolution(4 -> 32, 3x3, 2,2)
  (2): nn.SpatialBatchNormalization (4D) (32)
  (3): cudnn.ReLU
  (4): cudnn.SpatialConvolution(32 -> 64, 3x3, 2,2)
  (5): nn.SpatialBatchNormalization (4D) (64)
  (6): cudnn.ReLU
  (7): nn.SpatialDropout(0.400000)
  (8): cudnn.SpatialConvolution(64 -> 128, 3x3, 2,2)
  (9): nn.SpatialBatchNormalization (4D) (128)
  (10): cudnn.ReLU
  (11): nn.SpatialDropout(0.400000)
  (12): nn.Reshape(1152)
  (13): nn.Linear(1152 -> 512)
  (14): nn.BatchNormalization (2D) (512)
  (15): cudnn.ReLU
  (16): nn.Dropout(0.5, busy)
  (17): nn.Linear(512 -> 512)
  (18): nn.LSTM(512 -> 512)
  (19): nn.Dropout(0.5, busy)
  (20): nn.Linear(512 -> 13)
  (21): cudnn.LogSoftMax
}
/home/imi/graulef/datapre/0_12_20/label.json
/home/imi/graulef/datapre/0_10_18/label.json
/home/imi/graulef/datapre/0_6_8/label.json
/home/imi/graulef/datapre/0_3_0/label.json
/home/imi/graulef/datapre/0_12_22/label.json
/home/imi/graulef/datapre/0_13_10/label.json
/home/imi/graulef/datapre/0_12_3/label.json
/home/imi/graulef/datapre/0_5_14/label.json
/home/imi/graulef/datapre/0_10_15/label.json
/home/imi/graulef/datapre/0_2_5/label.json
/home/imi/graulef/datapre/0_3_5/label.json
/home/imi/graulef/datapre/0_3_18/label.json
/home/imi/graulef/datapre/0_12_21/label.json
/home/imi/graulef/datapre/0_5_21/label.json
/home/imi/graulef/datapre/0_10_20/label.json
/home/imi/graulef/datapre/0_13_6/label.json
Evaluation passed
/home/imi/graulef/datapre/0_8_22/label.json
/home/imi/graulef/datapre/0_8_11/label.json
/home/imi/graulef/datapre/0_9_0/label.json
/home/imi/graulef/datapre/0_3_7/label.json
/home/imi/graulef/datapre/0_8_8/label.json
/home/imi/graulef/datapre/0_5_10/label.json
/home/imi/graulef/datapre/0_10_6/label.json
/home/imi/graulef/datapre/0_9_20/label.json
/home/imi/graulef/datapre/0_6_20/label.json
/home/imi/graulef/datapre/0_13_15/label.json
/home/imi/graulef/datapre/0_6_2/label.json
/home/imi/graulef/datapre/0_9_13/label.json
/home/imi/graulef/datapre/0_13_17/label.json
/home/imi/graulef/datapre/0_12_23/label.json
/home/imi/graulef/datapre/0_2_4/label.json
/home/imi/graulef/datapre/0_12_10/label.json
Evaluation passed
/home/imi/torch/install/bin/luajit: /home/imi/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/imi/torch/install/share/lua/5.1/cudnn/init.lua:91: attempt to index a nil value
stack traceback:
        /home/imi/torch/install/share/lua/5.1/cudnn/init.lua:91: in function 'scalar'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:195: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
        [C]: in function 'xpcall'
        /home/imi/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/imi/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/MaskZero.lua:97: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Recursor.lua:27: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Sequencer.lua:94: in function 'forward'
        ./net/rnntrain.lua:31: in function 'batchEval'
        ./net/train.lua:25: in function 'epochEval'
        ./net/train.lua:47: in function 'train'
        net/main.lua:45: in main chunk
        [C]: in function 'dofile'
        .../imi/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004065d0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /home/imi/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /home/imi/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/MaskZero.lua:97: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Recursor.lua:27: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Sequencer.lua:94: in function 'forward'
        ./net/rnntrain.lua:31: in function 'batchEval'
        ./net/train.lua:25: in function 'epochEval'
        ./net/train.lua:47: in function 'train'
        net/main.lua:45: in main chunk
        [C]: in function 'dofile'
        .../imi/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004065d0

The error persists when the batch size is changed, for example to 4:

imi@imi-All-Series:~/graulef/deep-soli$ th net/main.lua --file ../datapre --list config/file_half.json --load ../uni_image_np_50.t7 --inputsize 32 --inputch 4 --label 11 --datasize 32 --datach 4 --batch 4 --maxseq 40 --cuda --cudnn
[eval] data with 1364 seq
[net] loading model ../uni_image_np_50.t7
nn.Sequencer @ nn.Recursor @ nn.MaskZero @ nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> output]
  (1): cudnn.SpatialConvolution(4 -> 32, 3x3, 2,2)
  (2): nn.SpatialBatchNormalization (4D) (32)
  (3): cudnn.ReLU
  (4): cudnn.SpatialConvolution(32 -> 64, 3x3, 2,2)
  (5): nn.SpatialBatchNormalization (4D) (64)
  (6): cudnn.ReLU
  (7): nn.SpatialDropout(0.400000)
  (8): cudnn.SpatialConvolution(64 -> 128, 3x3, 2,2)
  (9): nn.SpatialBatchNormalization (4D) (128)
  (10): cudnn.ReLU
  (11): nn.SpatialDropout(0.400000)
  (12): nn.Reshape(1152)
  (13): nn.Linear(1152 -> 512)
  (14): nn.BatchNormalization (2D) (512)
  (15): cudnn.ReLU
  (16): nn.Dropout(0.5, busy)
  (17): nn.Linear(512 -> 512)
  (18): nn.LSTM(512 -> 512)
  (19): nn.Dropout(0.5, busy)
  (20): nn.Linear(512 -> 13)
  (21): cudnn.LogSoftMax
}
/home/imi/graulef/datapre/0_12_20/label.json
/home/imi/graulef/datapre/0_10_18/label.json
/home/imi/graulef/datapre/0_6_8/label.json
/home/imi/graulef/datapre/0_3_0/label.json
Evaluation passed
/home/imi/graulef/datapre/0_12_22/label.json
/home/imi/graulef/datapre/0_13_10/label.json
/home/imi/graulef/datapre/0_12_3/label.json
/home/imi/graulef/datapre/0_5_14/label.json
Evaluation passed
/home/imi/torch/install/bin/luajit: /home/imi/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/imi/torch/install/share/lua/5.1/cudnn/init.lua:91: attempt to index a nil value
stack traceback:
        /home/imi/torch/install/share/lua/5.1/cudnn/init.lua:91: in function 'scalar'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:195: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
        [C]: in function 'xpcall'
        /home/imi/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/imi/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/MaskZero.lua:97: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Recursor.lua:27: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Sequencer.lua:94: in function 'forward'
        ./net/rnntrain.lua:31: in function 'batchEval'
        ./net/train.lua:25: in function 'epochEval'
        ./net/train.lua:47: in function 'train'
        net/main.lua:45: in main chunk
        [C]: in function 'dofile'
        .../imi/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004065d0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /home/imi/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /home/imi/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/MaskZero.lua:97: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Recursor.lua:27: in function 'updateOutput'
        /home/imi/torch/install/share/lua/5.1/rnn/Sequencer.lua:94: in function 'forward'
        ./net/rnntrain.lua:31: in function 'batchEval'
        ./net/train.lua:25: in function 'epochEval'
        ./net/train.lua:47: in function 'train'
        net/main.lua:45: in main chunk
        [C]: in function 'dofile'
        .../imi/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x004065d0

The code that triggers the issue is again in MaskZero.lua, which is odd. The failing line is line 70 of rnn/MaskZero.lua (https://github.com/Element-Research/rnn/blob/master/MaskZero.lua); the trace above shows line 97 because my local copy has extra comments. Since the error comes from indexing a nil value and only appears after the second batch has been evaluated, I suspect some kind of memory issue.

Has anyone had a similar issue, or can anyone reproduce it? If not, which versions of the packages were you running?

graulef commented 7 years ago

I was able to fix this by manually setting the boolean useCuda to true. For some reason, the argument was not passed correctly from main to the RnnTrain constructor. I am still investigating why.
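
For anyone hitting the same thing, here is a minimal plain-Lua sketch of one way a flag can get lost between an option table and a constructor, and how an explicit check makes that fail loudly. It is not taken from the deep-soli source and is not a diagnosis of the actual cause: the names `RnnTrainSketch`, `cmdOpt`, and the `cuda` key are hypothetical, and only `useCuda` comes from the comment above. The point is that reading a missing table key in Lua yields nil, which then behaves like false without raising an error.

```lua
-- Hypothetical sketch: RnnTrainSketch, cmdOpt and the 'cuda' key are
-- illustrative names, not identifiers from the deep-soli code base.
local RnnTrainSketch = {}
RnnTrainSketch.__index = RnnTrainSketch

-- Constructor that rejects a missing flag instead of silently treating
-- nil as false.
function RnnTrainSketch.new(opt)
  assert(type(opt) == 'table', 'opt must be a table')
  assert(type(opt.useCuda) == 'boolean',
    'useCuda must be true or false, got ' .. tostring(opt.useCuda))
  local self = setmetatable({}, RnnTrainSketch)
  self.useCuda = opt.useCuda
  return self
end

-- Simulated parsed command-line options: the flag is stored under 'cuda'.
local cmdOpt = { cuda = true, batch = 16 }

-- Mismatched key: cmdOpt.useCuda does not exist, so nil is passed on.
-- pcall captures the assertion failure instead of aborting the script.
local ok, err = pcall(RnnTrainSketch.new, { useCuda = cmdOpt.useCuda })
print(ok, err)

-- Matching key: the flag reaches the constructor as intended.
local trainer = RnnTrainSketch.new({ useCuda = cmdOpt.cuda })
print(trainer.useCuda)
```

With a check like this in the constructor, a dropped flag would surface at start-up rather than as a nil index deep inside a cudnn call several batches into evaluation.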