wangyujiajia opened this issue 8 years ago
I changed CudaTensor to FloatTensor and it seems to start training, but after 10 iterations the loss values become nan. Is this normal?
iter 0: mid_box_reg_loss: 0.001, captioning_loss: 55.520, end_objectness_loss: 0.089, mid_objectness_loss: 0.149, end_box_reg_loss: 0.003, [total: 111.522]
iter 1: mid_box_reg_loss: 0.002, captioning_loss: 55.364, end_objectness_loss: 14.402, mid_objectness_loss: 0.137, end_box_reg_loss: 9.152, [total: 148.963]
iter 2: mid_box_reg_loss: 0.004, captioning_loss: 40.627, end_objectness_loss: 0.336, mid_objectness_loss: 0.149, end_box_reg_loss: 0.094, [total: 82.328]
iter 3: mid_box_reg_loss: 0.002, captioning_loss: 39.576, end_objectness_loss: 0.073, mid_objectness_loss: 0.136, end_box_reg_loss: 0.121, [total: 79.694]
iter 4: mid_box_reg_loss: 0.001, captioning_loss: 35.889, end_objectness_loss: 0.325, mid_objectness_loss: 0.183, end_box_reg_loss: 0.924, [total: 73.721]
iter 5: mid_box_reg_loss: 0.002, captioning_loss: 29.476, end_objectness_loss: 0.439, mid_objectness_loss: 0.130, end_box_reg_loss: 0.248, [total: 60.344]
iter 6: mid_box_reg_loss: 0.003, captioning_loss: 24.685, end_objectness_loss: 0.257, mid_objectness_loss: 0.143, end_box_reg_loss: 0.320, [total: 50.496]
iter 7: mid_box_reg_loss: 0.002, captioning_loss: 32.415, end_objectness_loss: 0.196, mid_objectness_loss: 0.147, end_box_reg_loss: 0.157, [total: 65.678]
iter 8: mid_box_reg_loss: 0.002, captioning_loss: 45.235, end_objectness_loss: 0.152, mid_objectness_loss: 0.141, end_box_reg_loss: 0.095, [total: 91.155]
iter 9: mid_box_reg_loss: 0.001, captioning_loss: 35.458, end_objectness_loss: 0.268, mid_objectness_loss: 0.138, end_box_reg_loss: 0.944, [total: 72.674]
iter 10: mid_box_reg_loss: nan, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
iter 11: mid_box_reg_loss: nan, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
iter 12: mid_box_reg_loss: nan, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
WARNING: Masking out 1 boxes in LocalizationLayer
iter 13: mid_box_reg_loss: 0.000, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
Thanks.
Your initial learning rate seems too high - that might be causing it to blow up. Training already takes days on a GPU so I didn't expect anyone to want to train with CPU, but replacing CudaTensor with FloatTensor should make it possible.
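For anyone else attempting CPU training, the substitution is roughly the sketch below. This is not the actual code in train.lua; the names are illustrative, and the point is simply that the model and all input tensors must be cast to the same CPU tensor type instead of torch.CudaTensor:

-- Illustrative sketch only, not taken from train.lua.
require 'torch'
require 'nn'

-- Use a CPU tensor type instead of 'torch.CudaTensor', and skip
-- require 'cutorch' / require 'cunn' entirely.
local dtype = 'torch.FloatTensor'

local model = nn.Sequential()
model:add(nn.Linear(10, 10))
model:type(dtype)                          -- cast all parameters and buffers to CPU floats

local x = torch.randn(1, 10):type(dtype)   -- inputs must match the model's tensor type
local y = model:forward(x)
print(y:size())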
What learning rate do you suggest? Thanks.
How much slower is CPU compared to GPU? Thanks.
I used 1e-6 for the initial learning rate. It's hard to say exactly how much slower CPU will be than GPU, but it could be as much as 10x slower.
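With the command line quoted later in this thread, that just means passing a much smaller value to the existing -learning_rate flag, for example:

th train.lua -learning_rate 1e-6 -data_json data/training_data/training_json -data_h5 data/training_data/training_h5 -gpu -1 -checkpoint_path data/training/cp.v1 -id densecapv1 -backend 'nn'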
Is there any visualization during training so I can monitor performance on both the training and test sets?
The .json checkpoint files saved during training contain the training loss history as well as validation-set mAP. You can use those to monitor training progress.
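If you prefer to inspect those numbers programmatically, a minimal sketch like the one below should work, assuming lua-cjson is available (it usually ships with a Torch install). The field name loss_history and the checkpoint filename are assumptions, so check the actual keys in your own .json file:

-- Minimal sketch for inspecting a .json training checkpoint.
-- The field name 'loss_history' is an assumption; print the decoded table
-- first if you are unsure what keys your checkpoint contains.
local cjson = require 'cjson'

-- Adjust the path to whatever file train.lua actually wrote under -checkpoint_path.
local path = 'data/training/cp.v1.json'
local f = assert(io.open(path, 'r'))
local checkpoint = cjson.decode(f:read('*a'))
f:close()

-- Dump the recorded losses so divergence (or nan) is easy to spot.
for i, loss in ipairs(checkpoint.loss_history or {}) do
  print(i, loss)
end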
Since it uses torch.CudaTensor, it requires a GPU, right? Is there any way I can train it using only the CPU?
BTW, here is my command line for training:

th train.lua -learning_rate 0.003 -data_json data/training_data/training_json -data_h5 data/training_data/training_h5 -gpu -1 -checkpoint_path data/training/cp.v1 -id densecapv1 -backend 'nn'
And here is the error I got:

/usr/local/google/home/wangyu/torch/install/bin/luajit: ...gle/home/wangyu/.luarocks/share/lua/5.1/torch/Tensor.lua:238: attempt to index a nil value
stack traceback:
    ...gle/home/wangyu/.luarocks/share/lua/5.1/torch/Tensor.lua:238: in function 'type'
    .../google/home/wangyu/.luarocks/share/lua/5.1/nn/utils.lua:52: in function 'recursiveType'
    ...google/home/wangyu/.luarocks/share/lua/5.1/nn/Module.lua:126: in function 'type'
    .../google/home/wangyu/.luarocks/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
    ...google/home/wangyu/.luarocks/share/lua/5.1/nn/Module.lua:126: in function 'type'
    train.lua:48: in main chunk
    [C]: in function 'dofile'
    ...ngyu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
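In case it helps anyone hitting the same traceback: this failure is consistent with a cast to torch.CudaTensor being attempted on a CPU-only install. Without require 'cutorch' the CudaTensor type is not registered, so the type conversion inside nn.Module:type() runs into a nil value, which is exactly what the trace above shows. A tiny illustration, not taken from train.lua:

require 'nn'

local m = nn.Linear(4, 4)
-- m:type('torch.CudaTensor')   -- fails with "attempt to index a nil value"
--                              -- unless cutorch/cunn are installed and loaded
m:type('torch.FloatTensor')     -- works on a CPU-only install; this is the
                                -- CudaTensor -> FloatTensor substitution
                                -- mentioned earlier in the thread
print(m:forward(torch.FloatTensor(4):zero()))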