jcjohnson / densecap

Dense image captioning in Torch
MIT License

Does training require GPU? #15

Open wangyujiajia opened 8 years ago

wangyujiajia commented 8 years ago

Since it uses torch.CudaTensor, it requires a GPU, right? Is there any way I can train it using only the CPU?

BTW, here is my command line for training:

    th train.lua -learning_rate 0.003 -data_json data/training_data/training_json -data_h5 data/training_data/training_h5 -gpu -1 -checkpoint_path data/training/cp.v1 -id densecapv1 -backend 'nn'

And here is the error I got:

    /usr/local/google/home/wangyu/torch/install/bin/luajit: ...gle/home/wangyu/.luarocks/share/lua/5.1/torch/Tensor.lua:238: attempt to index a nil value
    stack traceback:
        ...gle/home/wangyu/.luarocks/share/lua/5.1/torch/Tensor.lua:238: in function 'type'
        .../google/home/wangyu/.luarocks/share/lua/5.1/nn/utils.lua:52: in function 'recursiveType'
        ...google/home/wangyu/.luarocks/share/lua/5.1/nn/Module.lua:126: in function 'type'
        .../google/home/wangyu/.luarocks/share/lua/5.1/nn/utils.lua:45: in function 'recursiveType'
        ...google/home/wangyu/.luarocks/share/lua/5.1/nn/Module.lua:126: in function 'type'
        train.lua:48: in main chunk
        [C]: in function 'dofile'
        ...ngyu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670
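The nil-index failure at Tensor.lua:238 in function 'type' is usually what you see when the code tries to cast to torch.CudaTensor while cutorch has not been loaded, in which case torch.CudaTensor is nil. Below is a minimal sketch of the usual Torch guard for this, assuming generic option names (opt.gpu); it is an illustration of the pattern, not densecap's actual train.lua code:

    -- Generic CPU/GPU tensor-type guard in Torch; the option names are
    -- assumptions for illustration, not densecap's actual code.
    require 'torch'
    require 'nn'

    local opt = { gpu = -1 }  -- -1 means CPU-only

    local dtype = 'torch.FloatTensor'
    if opt.gpu >= 0 then
      -- Only load the CUDA packages when a GPU is requested. Without them,
      -- torch.CudaTensor is nil, and casting to it produces exactly this
      -- "attempt to index a nil value" error.
      require 'cutorch'
      require 'cunn'
      cutorch.setDevice(opt.gpu + 1)
      dtype = 'torch.CudaTensor'
    end

    print('training with dtype ' .. dtype)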

wangyujiajia commented 8 years ago

I changed CudaTensor to FloatTensor and it seems to start training. But after 10 iterations, the losses become nan. Is this normal?

    iter 0: mid_box_reg_loss: 0.001, captioning_loss: 55.520, end_objectness_loss: 0.089, mid_objectness_loss: 0.149, end_box_reg_loss: 0.003, [total: 111.522]
    iter 1: mid_box_reg_loss: 0.002, captioning_loss: 55.364, end_objectness_loss: 14.402, mid_objectness_loss: 0.137, end_box_reg_loss: 9.152, [total: 148.963]
    iter 2: mid_box_reg_loss: 0.004, captioning_loss: 40.627, end_objectness_loss: 0.336, mid_objectness_loss: 0.149, end_box_reg_loss: 0.094, [total: 82.328]
    iter 3: mid_box_reg_loss: 0.002, captioning_loss: 39.576, end_objectness_loss: 0.073, mid_objectness_loss: 0.136, end_box_reg_loss: 0.121, [total: 79.694]
    iter 4: mid_box_reg_loss: 0.001, captioning_loss: 35.889, end_objectness_loss: 0.325, mid_objectness_loss: 0.183, end_box_reg_loss: 0.924, [total: 73.721]
    iter 5: mid_box_reg_loss: 0.002, captioning_loss: 29.476, end_objectness_loss: 0.439, mid_objectness_loss: 0.130, end_box_reg_loss: 0.248, [total: 60.344]
    iter 6: mid_box_reg_loss: 0.003, captioning_loss: 24.685, end_objectness_loss: 0.257, mid_objectness_loss: 0.143, end_box_reg_loss: 0.320, [total: 50.496]
    iter 7: mid_box_reg_loss: 0.002, captioning_loss: 32.415, end_objectness_loss: 0.196, mid_objectness_loss: 0.147, end_box_reg_loss: 0.157, [total: 65.678]
    iter 8: mid_box_reg_loss: 0.002, captioning_loss: 45.235, end_objectness_loss: 0.152, mid_objectness_loss: 0.141, end_box_reg_loss: 0.095, [total: 91.155]
    iter 9: mid_box_reg_loss: 0.001, captioning_loss: 35.458, end_objectness_loss: 0.268, mid_objectness_loss: 0.138, end_box_reg_loss: 0.944, [total: 72.674]
    iter 10: mid_box_reg_loss: nan, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
    iter 11: mid_box_reg_loss: nan, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
    iter 12: mid_box_reg_loss: nan, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]
    WARNING: Masking out 1 boxes in LocalizationLayer
    iter 13: mid_box_reg_loss: 0.000, captioning_loss: nan, end_objectness_loss: nan, mid_objectness_loss: nan, end_box_reg_loss: nan, [total: nan]

Thanks.

jcjohnson commented 8 years ago

Your initial learning rate seems too high - that might be causing it to blow up. Training already takes days on a GPU so I didn't expect anyone to want to train with CPU, but replacing CudaTensor with FloatTensor should make it possible.
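For a generic nn model, the swap amounts to casting the network, the criterion, and the input batches to FloatTensor instead of CudaTensor. A minimal sketch with placeholder modules (not densecap's actual model code):

    -- CudaTensor -> FloatTensor swap for CPU-only training; 'model' and 'crit'
    -- are placeholder modules, not densecap's actual network.
    require 'nn'

    local model = nn.Sequential():add(nn.Linear(10, 2))
    local crit  = nn.MSECriterion()

    -- On GPU you would call model:cuda(); on CPU cast everything to float.
    model:float()
    crit:float()

    -- Input batches must use the same tensor type as the model.
    local x = torch.randn(4, 10):float()
    local y = model:forward(x)
    print(y:size())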

wangyujiajia commented 8 years ago

What learning rate do you suggest? Thanks.

wangyujiajia commented 8 years ago

How much slower is CPU compared to GPU? Thanks.

jcjohnson commented 8 years ago

I used 1e-6 for the initial learning rate. It's hard to say exactly how much slower CPU will be than GPU, but it could be as much as 10x slower.
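For reference, plugging that into the command from the top of the thread gives something like the line below; all other flags are unchanged from the original command, and whether 1e-6 is also the right value for the FloatTensor/CPU setup is untested:

    th train.lua -learning_rate 1e-6 -data_json data/training_data/training_json -data_h5 data/training_data/training_h5 -gpu -1 -checkpoint_path data/training/cp.v1 -id densecapv1 -backend 'nn'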

wangyujiajia commented 8 years ago

Is there any visualization during training so I can track performance on both the training and test sets?

jcjohnson commented 8 years ago

The .json checkpoint files saved during training contain training loss history as well as validation set mAP. You can use those to watch the training process.
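If you want to inspect those curves programmatically, something like the sketch below works, assuming the lua-cjson rock is installed. The checkpoint path and the field names (loss_history, etc.) are guesses, so print the top-level keys first to see what your checkpoint actually contains:

    -- Sketch: dump training curves from a densecap .json checkpoint.
    -- The path and key names below are guesses; list the keys to find the real ones.
    local cjson = require 'cjson'

    local path = 'data/training/cp.v1_densecapv1.json'  -- hypothetical path
    local f = assert(io.open(path, 'r'))
    local checkpoint = cjson.decode(f:read('*a'))
    f:close()

    for k, _ in pairs(checkpoint) do print(k) end  -- list available fields

    local loss = checkpoint.loss_history or {}     -- assumed field name
    for i, l in ipairs(loss) do
      print(string.format('iter %d: total loss %.3f', i, l))
    end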