limbee / NTIRE2017

Torch implementation of "Enhanced Deep Residual Networks for Single Image Super-Resolution"
652 stars 146 forks

Segmentation fault when training baseline model #27

Open fragileness opened 6 years ago

fragileness commented 6 years ago

I got message below when training baseline model:

... [Iter: 299.1k / lr: 5.00e-5] Time: 66.29 (Data: 61.42) Err: 3.234126
[Iter: 299.2k / lr: 5.00e-5] Time: 65.32 (Data: 60.11) Err: 3.496183
[Iter: 299.3k / lr: 5.00e-5] Time: 66.40 (Data: 61.23) Err: 3.399313
[Iter: 299.4k / lr: 5.00e-5] Time: 64.99 (Data: 60.01) Err: 3.379927
[Iter: 299.5k / lr: 5.00e-5] Time: 65.95 (Data: 60.72) Err: 3.503887
[Iter: 299.6k / lr: 5.00e-5] Time: 66.23 (Data: 61.05) Err: 3.338660
[Iter: 299.7k / lr: 5.00e-5] Time: 65.30 (Data: 59.97) Err: 3.448611
[Iter: 299.8k / lr: 5.00e-5] Time: 65.69 (Data: 60.95) Err: 3.330575
[Iter: 299.9k / lr: 5.00e-5] Time: 66.04 (Data: 61.20) Err: 3.350167
[Iter: 300.0k / lr: 5.00e-5] Time: 65.34 (Data: 59.59) Err: 3.413485
[Epoch 300 (iter/epoch: 1000)] Test time: 25.48 (scale 2) Average PSNR: 35.5833 (Highest ever: 35.5902 at epoch = 288)

Segmentation fault (core dumped)

I'm not sure whether the training process completed successfully or not. If it did, where is the trained model saved?

limbee commented 6 years ago

I'm not sure why it prints the segmentation fault message, but the experiment finished successfully. Trained models are saved under experiment/.

fragileness commented 6 years ago

I'm trying the next step of training (item 1 in training.sh): th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3, but I'm seeing an out-of-memory error as shown below. I've tried other chopSize values, such as th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3 -chopSize 16e0, but the situation remains the same. How small can chopSize be set? Or are there any other options I can try?

loading model and criterion...
Creating model from file: models/baseline.lua
Creating data loader... loading data... Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/onegin/torch/install/bin/luajit: /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 22 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 1 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
  [C]: in function 'resizeAs'
  ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
  /home/onegin/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/onegin/torch/install/share/lua/5.1/nn/Module.lua:29>
  [C]: in function 'xpcall'
  /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
  [C]: in function 'xpcall'
  /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function </home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
  [C]: in function 'xpcall'
  ...
  /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
  [C]: in function 'xpcall'
  /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
  /home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
  ./train.lua:89: in function 'train'
  main.lua:33: in main chunk
  [C]: in function 'dofile'
  ...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
  [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
  [C]: in function 'error'
  /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
  /home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
  ./train.lua:89: in function 'train'
  main.lua:33: in main chunk
  [C]: in function 'dofile'
  ...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
  [C]: at 0x00405d50
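Note that the crash happens inside backward() during training, while a chop-style split is typically applied only to the evaluation forward pass in SR codebases, so shrinking chopSize may not help here. For intuition only, here is a minimal Python sketch (illustrative names, not this repository's actual code) of what a chopped forward does: recursively split the input into four overlapping quadrants until each piece is at most chop_size pixels, so the network never sees the full image at once.

```python
# Sketch of chop-style splitting: recursively quarter an H x W image into
# overlapping pieces until each piece has at most chop_size pixels.
# The overlap lets the stitched outputs avoid seam artifacts.
# Hypothetical helper, not the repository's actual API.

def chop_pieces(h, w, chop_size, overlap=10):
    """Return the (h, w) sizes of the pieces a chopped forward would process."""
    if h * w <= chop_size:
        return [(h, w)]
    # Each half extends `overlap` pixels past the midpoint.
    half_h = h // 2 + overlap
    half_w = w // 2 + overlap
    pieces = []
    for _ in range(4):  # four overlapping quadrants
        pieces.extend(chop_pieces(half_h, half_w, chop_size, overlap))
    return pieces

if __name__ == "__main__":
    pieces = chop_pieces(1024, 2048, chop_size=16e4)
    print(len(pieces), "pieces, largest:", max(h * w for h, w in pieces), "px")
```

Each halving roughly quarters the peak activation memory of the forward pass, at the cost of more network invocations; a very small chopSize such as 16e0 would recurse to tiny fragments without addressing training-time memory at all.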

limbee commented 6 years ago

Try nResBlock=32 instead of 36 if you're using a Titan X. We used 32 residual blocks when writing the paper, since 12GB of GPU memory is sometimes not enough for 36 residual blocks.
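For a rough sense of why those four extra blocks matter, here is a back-of-envelope sketch (my own estimate, not from the paper): each EDSR residual block contains two nFeat-to-nFeat 3x3 convolutions, so at nFeat=256 each block holds about 1.18M parameters, and every parameter also carries a gradient plus optimizer state. Activation memory, which this sketch ignores, likewise grows linearly with the block count and usually dominates during training.

```python
# Back-of-envelope parameter count for the EDSR residual body.
# Ignores the head/tail convolutions and all activation memory.

def resblock_params(n_feat, kernel=3):
    """Two n_feat -> n_feat convolutions per residual block (no batch norm)."""
    conv = n_feat * n_feat * kernel * kernel + n_feat  # weights + bias
    return 2 * conv

def body_params(n_feat, n_blocks):
    return n_blocks * resblock_params(n_feat)

if __name__ == "__main__":
    for blocks in (32, 36):
        p = body_params(256, blocks)
        # params + gradients + two ADAM moment buffers, 4 bytes each float
        mem_gb = p * 4 * 4 / 1024**3
        print(f"{blocks} blocks: {p / 1e6:.1f}M params, ~{mem_gb:.2f} GB weights+optimizer state")
```

The weight-side difference between 32 and 36 blocks is only a few hundred MB, which is why the intermediate activations for 96x96 training patches are what actually tip a 12GB card over the edge.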