Open fragileness opened 6 years ago
I'm not sure why it prints the segmentation fault message, but the experiment completed successfully. The trained models are saved under experiment/.
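To confirm that the run left its outputs behind despite the segmentation fault, it may help to list what is under experiment/, newest first. This is only a minimal sketch. It assumes the script is run from the repository root, the helper name `list_saved_files` is hypothetical, and the actual file names depend on how the experiment was configured.

```python
import os
import time


def list_saved_files(root="experiment"):
    """Return (path, size_bytes, mtime) for every file under root, newest first.

    'root' defaults to the experiment/ directory mentioned above; the layout
    inside it is not assumed, so every file is listed.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            entries.append((path, stat.st_size, stat.st_mtime))
    # Most recently modified files first, so the last checkpoint is at the top.
    entries.sort(key=lambda entry: entry[2], reverse=True)
    return entries


if __name__ == "__main__":
    for path, size, mtime in list_saved_files():
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(f"{stamp}  {size:>12d}  {path}")
```

If the most recent files carry a timestamp close to the end of training, the models were written before the crash.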
loading model and criterion...
Creating model from file: models/baseline.lua
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/onegin/torch/install/bin/luajit: /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 22 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 1 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
/home/onegin/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/onegin/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function </home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
[C]: in function 'xpcall'
...
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Try nResBlock=32 instead of 36 if you're using a TitanX. We used 32 residual blocks when writing the paper, because 12GB of GPU memory is sometimes not enough for 36 resblocks.
I got the message below when training the baseline model:
... [Iter: 299.1k / lr: 5.00e-5] Time: 66.29 (Data: 61.42) Err: 3.234126
[Iter: 299.2k / lr: 5.00e-5] Time: 65.32 (Data: 60.11) Err: 3.496183
[Iter: 299.3k / lr: 5.00e-5] Time: 66.40 (Data: 61.23) Err: 3.399313
[Iter: 299.4k / lr: 5.00e-5] Time: 64.99 (Data: 60.01) Err: 3.379927
[Iter: 299.5k / lr: 5.00e-5] Time: 65.95 (Data: 60.72) Err: 3.503887
[Iter: 299.6k / lr: 5.00e-5] Time: 66.23 (Data: 61.05) Err: 3.338660
[Iter: 299.7k / lr: 5.00e-5] Time: 65.30 (Data: 59.97) Err: 3.448611
[Iter: 299.8k / lr: 5.00e-5] Time: 65.69 (Data: 60.95) Err: 3.330575
[Iter: 299.9k / lr: 5.00e-5] Time: 66.04 (Data: 61.20) Err: 3.350167
[Iter: 300.0k / lr: 5.00e-5] Time: 65.34 (Data: 59.59) Err: 3.413485
[Epoch 300 (iter/epoch: 1000)] Test time: 25.48 (scale 2) Average PSNR: 35.5833 (Highest ever: 35.5902 at epoch = 288)
Segmentation fault (core dumped)
I'm not sure whether the training process completed successfully or not. If it did, where is the trained model?