Training loss 0 - Githubissues

rajiv235 commented 8 years ago

Hey @sanghoon I tried running the training script on VOC07-12 trainval data with original imagenet pretrained model. However I am seeing:

I1007 00:45:18.758940 7782 solver.cpp:228] Iteration 40, loss = 0.41927
I1007 00:45:18.759007 7782 solver.cpp:244] Train net output #0: cls_loss = 0 (* 1 = 0 loss)
I1007 00:45:18.759021 7782 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I1007 00:45:18.759030 7782 solver.cpp:244] Train net output #2: rpn_cls_loss = 0.136419 (* 1 = 0.136419 loss)
I1007 00:45:18.759040 7782 solver.cpp:244] Train net output #3: rpn_loss_bbox = 0.083394 (* 1 = 0.083394 loss)
I1007 00:45:18.759050 7782 sgd_solver.cpp:138] Iteration 40, lr = 0.001
I1007 00:45:29.026012 7782 solver.cpp:228] Iteration 60, loss = 0.324707
I1007 00:45:29.026077 7782 solver.cpp:244] Train net output #0: cls_loss = 0 (* 1 = 0 loss)
I1007 00:45:29.026090 7782 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I1007 00:45:29.026103 7782 solver.cpp:244] Train net output #2: rpn_cls_loss = 0.161693 (* 1 = 0.161693 loss)
I1007 00:45:29.026111 7782 solver.cpp:244] Train net output #3: rpn_loss_bbox = 0.288804 (* 1 = 0.288804 loss)

cls_loss and loss_bbox are both 0. Do you have any insight into why that would be?

Thanks

shuye-cheung commented 8 years ago

Hi, @sanghoon I ran into a case similar to @rajiv235's when I tried training a new model on the VOC07 dataset.

The script I used is

./tools/train_net.py --gpu 0 --solver models/pvanet/example_train_384/solver.prototxt --imdb voc_2007_trainval --iters 70000 --cfg models/pvanet/cfgs/train.yml

The Caffe log is as follows:

I1008 21:44:28.736819 30222 solver.cpp:228] Iteration 0, loss = nan
I1008 21:44:28.736874 30222 solver.cpp:244] Train net output #0: cls_loss = 87.3365 (* 1 = 87.3365 loss)
I1008 21:44:28.736889 30222 solver.cpp:244] Train net output #1: loss_bbox = nan (* 1 = nan loss)
I1008 21:44:28.736901 30222 solver.cpp:244] Train net output #2: rpn_cls_loss = 87.3365 (* 1 = 87.3365 loss)
I1008 21:44:28.736912 30222 solver.cpp:244] Train net output #3: rpn_loss_bbox = nan (* 1 = nan loss)
I1008 21:44:28.736927 30222 sgd_solver.cpp:138] Iteration 0, lr = 0.001
I1008 21:45:10.809685 30222 solver.cpp:228] Iteration 20, loss = nan
I1008 21:45:10.809734 30222 solver.cpp:244] Train net output #0: cls_loss = 87.3365 (* 1 = 87.3365 loss)
I1008 21:45:10.809749 30222 solver.cpp:244] Train net output #1: loss_bbox = nan (* 1 = nan loss)
I1008 21:45:10.809761 30222 solver.cpp:244] Train net output #2: rpn_cls_loss = 87.3365 (* 1 = 87.3365 loss)
I1008 21:45:10.809800 30222 solver.cpp:244] Train net output #3: rpn_loss_bbox = nan (* 1 = nan loss)
I1008 21:45:10.809811 30222 sgd_solver.cpp:138] Iteration 20, lr = 0.001
I1008 21:45:50.867269 30222 solver.cpp:228] Iteration 40, loss = nan
I1008 21:45:50.867318 30222 solver.cpp:244] Train net output #0: cls_loss = 87.3365 (* 1 = 87.3365 loss)
I1008 21:45:50.867331 30222 solver.cpp:244] Train net output #1: loss_bbox = nan (* 1 = nan loss)
I1008 21:45:50.867341 30222 solver.cpp:244] Train net output #2: rpn_cls_loss = 87.3365 (* 1 = 87.3365 loss)
I1008 21:45:50.867352 30222 solver.cpp:244] Train net output #3: rpn_loss_bbox = nan (* 1 = nan loss)
I1008 21:45:50.867362 30222 sgd_solver.cpp:138] Iteration 40, lr = 0.001

As shown, the loss is nan. Why is the loss nan? Am I missing something? Any idea how to fix this problem?

Thanks

sanghoon commented 8 years ago

Hi @shuye-cheung ,

You should load a pre-trained model. The reason is that randomly initialized weights make the outputs explode (e.g. overflow in exp()).
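
A note on the 87.3365 in the log above: that constant is what comes out when the predicted probability of the true class has collapsed to zero and the loss clamps it to the smallest positive normalized float32 before taking the log (as far as I recall, Caffe's softmax loss applies this clamp). Below is a minimal Python sketch of that arithmetic. The scores are made up, and `stable_softmax` and `clamped_log_loss` are hypothetical helper names, not part of the pva-faster-rcnn code.

```python
import math

# Smallest positive normalized float32 (FLT_MIN in C); hard-coded for illustration.
FLT_MIN = 1.17549435e-38


def stable_softmax(scores):
    """Softmax with max-subtraction, so exp() never overflows."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clamped_log_loss(prob_of_true_class):
    """Cross-entropy for one sample, clamping the probability to FLT_MIN."""
    return -math.log(max(prob_of_true_class, FLT_MIN))


# Made-up scores: an exploded activation on the wrong class (index 0).
probs = stable_softmax([1000.0, 0.0])
loss = clamped_log_loss(probs[1])  # the true class is index 1
print(probs, round(loss, 4))
```

The true-class probability underflows to 0.0, so the loss saturates at about 87.3365, the same value that repeats in the log. The sketch does not model the nan in the bbox losses.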

sanghoon commented 8 years ago

Hi @rajiv235, cls_loss and loss_bbox both being 0 seems really odd. Could you share your command?

rajiv235 commented 8 years ago

Hey @sanghoon sure. I use

python tools/train_net.py --gpu 0 --solver models/pvanet/example_train_384/solver.prototxt --iters 250000 --weights data/pvanet.model --cfg models/pvanet/cfgs/train.yml --imdb voc_2012_trainval

where pvanet.model is pretrained imagenet model.

Thanks

shuye-cheung commented 8 years ago

Hi, @sanghoon Now I use a pre-trained model as follows:

./tools/train_net.py --gpu 1 --solver models/pvanet/example_finetune/solver.prototxt --weights models/pvanet/imagenet/test.model --imdb voc_2007_trainval --iters 70000 --cfg models/pvanet/cfgs/train.yml

(Every "fc6" in example_finetune/train.prototxt is replaced by "fc6_".)

Now the caffe log is like this:

I1009 11:33:33.015254 24113 solver.cpp:228] Iteration 0, loss = 4.89822
I1009 11:33:33.015303 24113 solver.cpp:244] Train net output #0: cls_loss = 3.92276 (* 1 = 3.92276 loss)
I1009 11:33:33.015314 24113 solver.cpp:244] Train net output #1: loss_bbox = 0.000491955 (* 1 = 0.000491955 loss)
I1009 11:33:33.015323 24113 solver.cpp:244] Train net output #2: rpn_cls_loss = 0.69589 (* 1 = 0.69589 loss)
I1009 11:33:33.015331 24113 solver.cpp:244] Train net output #3: rpn_loss_bbox = 0.595581 (* 1 = 0.595581 loss)
I1009 11:33:33.015339 24113 sgd_solver.cpp:138] Iteration 0, lr = 0.0001
I1009 11:33:38.601795 24113 solver.cpp:228] Iteration 20, loss = 3.80439
I1009 11:33:38.601855 24113 solver.cpp:244] Train net output #0: cls_loss = 1.60729 (* 1 = 1.60729 loss)
I1009 11:33:38.601866 24113 solver.cpp:244] Train net output #1: loss_bbox = 3.31022e-05 (* 1 = 3.31022e-05 loss)
I1009 11:33:38.601874 24113 solver.cpp:244] Train net output #2: rpn_cls_loss = 0.210614 (* 1 = 0.210614 loss)
I1009 11:33:38.601884 24113 solver.cpp:244] Train net output #3: rpn_loss_bbox = 0.0397171 (* 1 = 0.0397171 loss)
I1009 11:33:38.601892 24113 sgd_solver.cpp:138] Iteration 20, lr = 0.0001
I1009 11:33:44.266188 24113 solver.cpp:228] Iteration 40, loss = 1.28547
I1009 11:33:44.266242 24113 solver.cpp:244] Train net output #0: cls_loss = 0.316124 (* 1 = 0.316124 loss)
I1009 11:33:44.266253 24113 solver.cpp:244] Train net output #1: loss_bbox = 0.000229506 (* 1 = 0.000229506 loss)
I1009 11:33:44.266261 24113 solver.cpp:244] Train net output #2: rpn_cls_loss = 0.486822 (* 1 = 0.486822 loss)
I1009 11:33:44.266271 24113 solver.cpp:244] Train net output #3: rpn_loss_bbox = 0.0158842 (* 1 = 0.0158842 loss)
I1009 11:33:44.266279 24113 sgd_solver.cpp:138] Iteration 40, lr = 0.0001

It seems to work.

Thanks for your help

SeaOfOcean commented 8 years ago

Why does replacing every "fc6" in example_finetune/train.prototxt with "fc6_" solve the problem? @shuye-cheung

shuye-cheung commented 8 years ago

If I don't replace "fc6" with "fc6_", I run into a shape mismatch problem. @SeaOfOcean More details have been discussed in #1
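
For context on the rename: as far as I understand, Caffe copies pretrained weights into a net by matching layer names. A layer whose name also exists in the .caffemodel must have the same blob shapes, while a layer whose name is absent there is simply left at its fresh initialization. The sketch below mimics that rule with plain Python dicts. `copy_weights_by_name` and all shapes are hypothetical, chosen for illustration; they are not Caffe's API or PVANet's real dimensions.

```python
def copy_weights_by_name(target_shapes, source_shapes):
    """Return the target layers that would be loaded from the source model.

    Both arguments map layer name -> weight shape (a tuple).
    Raises ValueError when a shared name has different shapes.
    """
    loaded = []
    for name, shape in target_shapes.items():
        if name not in source_shapes:
            continue  # new layer name: keeps its fresh initialization
        if source_shapes[name] != shape:
            raise ValueError("shape mismatch for layer %r" % name)
        loaded.append(name)
    return loaded


# Hypothetical shapes for illustration only.
pretrained = {"conv1": (16, 3, 7, 7), "fc6": (4096, 1000)}
same_name = {"conv1": (16, 3, 7, 7), "fc6": (4096, 512)}
renamed = {"conv1": (16, 3, 7, 7), "fc6_": (4096, 512)}

print(copy_weights_by_name(renamed, pretrained))  # fc6_ is skipped, conv1 is loaded
```

Passing `same_name` instead of `renamed` raises the ValueError, which is the toy analogue of the shape mismatch described above.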

SeaOfOcean commented 7 years ago

Thanks @shuye-cheung

I trained the model in this way but got very bad test results

AP for aeroplane = 0.0000
AP for bicycle = 0.0000
AP for bird = 0.0000
AP for boat = 0.0000
AP for bottle = 0.0011
AP for bus = 0.0000
AP for car = 0.0002
AP for cat = 0.0002
AP for chair = 0.0015
AP for cow = 0.0000
AP for diningtable = 0.0000
AP for dog = 0.0000
AP for horse = 0.0000
AP for motorbike = 0.0000
AP for person = 0.0102
AP for pottedplant = 0.0001
AP for sheep = 0.0000
AP for sofa = 0.0193
AP for train = 0.0000
AP for tvmonitor = 0.0006
Mean AP = 0.0017

Have you tested your trained model?

The test results with the pretrained model are pretty nice~ Can you provide the solver.prototxt, train.prototxt, test.prototxt, and the command used for your pretrained model? @sanghoon

shuye-cheung commented 7 years ago

Hi, @SeaOfOcean I got a similar result. The command I used is

./tools/test_net.py --gpu 0 --def models/pvanet/full/test.pt --net output/faster_rcnn_pvanet/voc_2007_trainval/pvanet_frcnn_iter_30000.caffemodel --cfg models/pvanet/cfgs/submit_160715.yml

and the output is

AP for aeroplane = 0.0000
AP for bicycle = 0.0005
AP for bird = 0.0000
AP for boat = 0.0002
AP for bottle = 0.0010
Traceback (most recent call last):
  File "./tools/test_net.py", line 90, in <module>
    test_net(net, imdb, max_per_image=args.max_per_image, vis=args.vis)
  File "../lib/fast_rcnn/test.py", line 319, in test_net
    imdb.evaluate_detections(all_boxes, output_dir)
  File "../lib/datasets/pascal_voc.py", line 322, in evaluate_detections
    self._do_python_eval(output_dir)
  File "../lib/datasets/pascal_voc.py", line 285, in _do_python_eval
    use_07_metric=use_07_metric)
  File "../lib/datasets/voc_eval.py", line 148, in voc_eval
    BB = BB[sorted_ind, :]
IndexError: too many indices for array

SeaOfOcean commented 7 years ago

@shuye-cheung IndexError: too many indices for array can be solved by using

# Only reorder the boxes when at least one detection exists.
if len(BB) > 0:
    BB = BB[sorted_ind, :]

But I have no idea why the trained model performs so poorly. @sanghoon can you provide any help?

shuye-cheung commented 7 years ago

@SeaOfOcean You are right. Now the result is:

AP for aeroplane = 0.0000
AP for bicycle = 0.0020
AP for bird = 0.0000
AP for boat = 0.0000
AP for bottle = 0.0000
AP for bus = 0.0000
AP for car = 0.0000
AP for cat = 0.0001
AP for chair = 0.0000
AP for cow = 0.0000
AP for diningtable = 0.0000
AP for dog = 0.0000
AP for horse = 0.0000
AP for motorbike = 0.0000
AP for person = 0.0027
AP for pottedplant = 0.0000
AP for sheep = 0.0000
AP for sofa = 0.0000
AP for train = 0.0000
AP for tvmonitor = 0.0000
Mean AP = 0.0002

In addition, I tried to train a new model on a custom dataset by fine-tuning full/test.model, but the loss doesn't decrease. My command is like

./tools/train_net.py --gpu 0 --solver models/pvanet/custom_dataset/solver.prototxt --weights models/pvanet/full/test.model --imdb custom_dataset_train --iters 70000 --cfg models/pvanet/cfgs/train.yml

I have no idea how to fix either problem. Any ideas? @sanghoon

Thanks in advance.

sanghoon commented 7 years ago

Hi all, I've checked the prototxts and found a bug in them.

I'm currently training a model so that I can be sure the bug is fixed. I'm going to update the prototxts after I check the results. It won't take more than a couple of days.

Sorry for this inconvenience.

sanghoon commented 7 years ago

Hi all, I've updated the training examples. Please refer to 7be72509ff0ce0510643082d5bcfb4cdd0dc620b

I bet that this will resolve the issue with 0-ish APs.

As for @rajiv235's report, I couldn't reproduce the zero cls_loss and loss_bbox. However, I guess the update is going to resolve that issue, too.

P.S.: One more comment. I recommend you start with 'example_finetune' but with a higher base_lr (e.g. 0.001 or 0.003). I guess you'll get a decent detector in a shorter time, with far fewer iterations.

SeaOfOcean commented 7 years ago

Thanks @sanghoon I trained with your updated network prototxt. When I tried to test on the voc2007 dataset, I got this error:

F1011 17:35:31.303407 5734 net.cpp:767] Check failed: target_blobs.size() == source_layer.blobs_size() (1 vs. 2) Incompatible number of blobs for layer conv1_1/conv

sanghoon commented 7 years ago

@SeaOfOcean

Did you use the test prototxt in the same directory? Would you please let me know whether you 'trained' or 'fine-tuned' a model?

rajiv235 commented 7 years ago

Thanks @sanghoon . I am not getting 0 loss now. I am closing the issue.

sanghoon / pva-faster-rcnn

Training loss 0 #8