lucasjinreal opened 5 years ago
It seems there is NaN output from the first conv layer:
I0812 13:33:12.153046 24802 solver.cpp:231] Iteration 500 (2.45156 iter/s, 40.7904s/100 iters), loss = 7.33115 [ 500 / 140000 ] -> [ 14:53 (H:M) ]
I0812 13:33:12.153055 24802 solver.cpp:257] Train net output #0: bbox_accuracy = 0.771429
I0812 13:33:12.153061 24802 solver.cpp:257] Train net output #1: loss_bbox = 0.604116 (* 1 = 0.604116 loss)
I0812 13:33:12.153067 24802 solver.cpp:257] Train net output #2: loss_cls = 0.6076 (* 1 = 0.6076 loss)
I0812 13:33:12.153072 24802 solver.cpp:257] Train net output #3: rpn_cls_loss = 0.111379 (* 1 = 0.111379 loss)
I0812 13:33:12.153079 24802 solver.cpp:257] Train net output #4: rpn_loss_bbox = 0.0530575 (* 1 = 0.0530575 loss)
I0812 13:33:12.153090 24802 sgd_solver.cpp:148] Iteration 500, lr = 0.001
I0812 13:33:47.280968 24802 net.cpp:592] [Forward] Layer input-data, top blob data data: 33.4105
I0812 13:33:47.281054 24802 net.cpp:592] [Forward] Layer input-data, top blob im_info data: 467.2
I0812 13:33:47.281081 24802 net.cpp:592] [Forward] Layer input-data, top blob gt_boxes data: 263.84
I0812 13:33:47.281111 24802 net.cpp:592] [Forward] Layer im_info_input-data_1_split, top blob im_info_input-data_1_split_0 data: 467.2
I0812 13:33:47.281136 24802 net.cpp:592] [Forward] Layer im_info_input-data_1_split, top blob im_info_input-data_1_split_1 data: 467.2
I0812 13:33:47.281164 24802 net.cpp:592] [Forward] Layer gt_boxes_input-data_2_split, top blob gt_boxes_input-data_2_split_0 data: 263.84
I0812 13:33:47.281189 24802 net.cpp:592] [Forward] Layer gt_boxes_input-data_2_split, top blob gt_boxes_input-data_2_split_1 data: 263.84
I0812 13:33:47.282073 24802 net.cpp:592] [Forward] Layer conv1, top blob conv1 data: nan
I0812 13:33:47.282413 24802 net.cpp:604] [Forward] Layer conv1, param blob 0 data: 0.0668743
I0812 13:33:47.292757 24802 net.cpp:604] [Forward] Layer conv1, param blob 1 data: nan
I0812 13:33:47.293643 24802 net.cpp:592] [Forward] Layer bn_conv1, top blob conv1 data: nan
I0812 13:33:47.293735 24802 net.cpp:604] [Forward] Layer bn_conv1, param blob 0 data: 0.578535
I0812 13:33:47.293767 24802 net.cpp:604] [Forward] Layer bn_conv1, param blob 1 data: 8680.63
I0812 13:33:47.293798 24802 net.cpp:604] [Forward] Layer bn_conv1, param blob 2 data: 1
I0812 13:33:47.294286 24802 net.cpp:592] [Forward] Layer scale_conv1, top blob conv1 data: nan
I0812 13:33:47.294315 24802 net.cpp:604] [Forward] Layer scale_conv1, param blob 0 data: nan
I0812 13:33:47.294343 24802 net.cpp:604] [Forward] Layer scale_conv1, param blob 1 data: nan
I0812 13:33:47.294653 24802 net.cpp:592] [Forward] Layer conv1_relu, top blob conv1 data: nan
I0812 13:33:47.294883 24802 net.cpp:592] [Forward] Layer pool1, top blob pool1 data: inf
What could be the reason?
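For context, the [Forward] lines above are Caffe's debug_info output (presumably enabled via debug_info: true in the solver; it appears commented out in the solver posted below). The reported data value is the mean absolute value of the blob, so a nan for conv1's param blob 1 means the bias parameter itself is non-finite, not just the activations. Below is a small log-scanning sketch, assuming only the log format shown above and a log file path passed on the command line, that locates the first non-finite blob:

import re
import sys

# matches lines like:
#   [Forward] Layer conv1, top blob conv1 data: nan
#   [Forward] Layer conv1, param blob 1 data: nan
pattern = re.compile(
    r"\[(Forward|Backward)\] Layer ([^,]+), "
    r"(top|bottom|param) blob (\S+) (data|diff): (\S+)")

with open(sys.argv[1]) as log:   # e.g. python find_first_nan.py train.log
    for lineno, line in enumerate(log, 1):
        m = pattern.search(line)
        if m and m.group(6).lower().strip("-") in ("nan", "inf"):
            print("first non-finite blob at log line %d:" % lineno)
            print(line.rstrip())
            break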
Check whether the object labels are all correct, and whether the initial weight file matches the proto file (for example, the proto may have the BN layers merged into the conv layers).
@makefile Thanks for your reply. I have trained on the same data with a ZF backbone, and it was fine for at least 120000 iterations.
From what I can see, the proto still has BatchNorm layers; the pretrained model is simply the ResNet-50 from Kaiming He's repo.
Also, I found that this could cause NaN output from the BatchNorm layer. I tried the fix, but no luck; it still gets NaN after several iterations.
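For what it's worth, one way to rule out bad stored BatchNorm statistics in the pretrained model is to read them straight out of the caffemodel. This is only a sketch; it assumes pycaffe's compiled protobuf (caffe.proto.caffe_pb2) is importable and that the BatchNorm layers follow the bn_* naming used in the ResNet-50 release:

import numpy as np
from caffe.proto import caffe_pb2

net = caffe_pb2.NetParameter()
with open('examples/imagenet_models/ResNet-50-model.caffemodel', 'rb') as f:
    net.ParseFromString(f.read())

# Caffe's BatchNorm layer stores: blob 0 = running mean, blob 1 = running variance,
# blob 2 = moving-average factor (mean/variance are divided by it when use_global_stats is true)
for layer in (net.layer or net.layers):   # new-style or legacy layer list
    if layer.name.startswith('bn') and len(layer.blobs) == 3:
        mean = np.asarray(layer.blobs[0].data)
        var = np.asarray(layer.blobs[1].data)
        factor = layer.blobs[2].data[0] if len(layer.blobs[2].data) else 0.0
        ok = np.isfinite(mean).all() and np.isfinite(var).all() and (var >= 0).all()
        print('%-16s factor=%-8g stats finite/non-negative: %s' % (layer.name, factor, ok))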
The model starts like this:
name: "ResNet-50"
# fyk: layer names are same as ResNet-50
# notice that some of ResNet-50 layer names are different from ResNet-101
#=========Frcnn-RoiData============
layer {
  name: "input-data"
  type: "FrcnnRoiData"
  top: "data"
  top: "im_info"
  top: "gt_boxes"
  include {
    phase: TRAIN
  }
  window_data_param {
    source: "examples/FRCNN/dataset/voc2007.trainval"
    config: "examples/FRCNN/config/voc_config.json"
    root_folder: "VOCdevkit/VOC2007/JPEGImages/"
    cache_images: true
  }
}
#========= conv1-conv4f ============
layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    kernel_size: 7
    pad: 3
    stride: 2
    # bias_term: false # a little different from ResNet101, which sets bias_term: false
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: true
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "scale_conv1"
  type: "Scale"
  scale_param {
    bias_term: true
  }
}
layer {
  top: "conv1"
  bottom: "conv1"
  name: "conv1_relu"
  type: "ReLU"
}
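Following up on the earlier suggestion to check that the weight file matches the proto: the layer names and the number of parameter blobs per layer can be listed straight from the caffemodel and compared against the layer names above. In particular, it may be worth confirming whether conv1 in ResNet-50-model.caffemodel actually ships a second (bias) blob, since this proto leaves bias_term at its default of true. A sketch under those assumptions (same caffe_pb2 approach as before):

from caffe.proto import caffe_pb2

weights_path = 'examples/imagenet_models/ResNet-50-model.caffemodel'
net = caffe_pb2.NetParameter()
with open(weights_path, 'rb') as f:
    net.ParseFromString(f.read())

# list each layer that carries learned parameters and how many blobs it provides;
# a conv layer provides 1 blob without bias and 2 with bias
for layer in (net.layer or net.layers):
    if len(layer.blobs):
        shapes = [tuple(b.shape.dim) if b.shape.dim else (b.num, b.channels, b.height, b.width)
                  for b in layer.blobs]
        print(layer.name, len(layer.blobs), shapes)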
It's hard to figure out the exact problem from this information. Please check the parameter settings, the proto definition, and anything else again.
I am training only with VOC. What I mean is: is rfcn-res50 or any other proto with a ResNet backbone OK to train? Have you tested those configurations?
It is OK to train with the rfcn-* protos; I have tested most of the protos that I uploaded.
I just got NaN loss on Faster R-CNN with ResNet-50 on VOC data, and I am not sure of the reason.
Can you paste more snippets here (or in a pastebin) for analysis, such as the config, data labels, and proto?
Sure, starting with train.sh:
#!/usr/bin/env sh
# This script test four voc images using faster rcnn end-to-end trained model (ZF-Model)
if [ ! -n "$1" ] ;then
  echo "$1 is empty, default is 0"
  gpu=0
else
  echo "use $1-th gpu"
  gpu=$1
fi
export PYTHONPATH=/z/users/detection_model_furnace/vendor/frcnn/python
CAFFE=build/tools/caffe
$CAFFE train \
--gpu $gpu \
--solver examples/FRCNN/res50/solver.proto \
--weights examples/imagenet_models/ResNet-50-model.caffemodel
# --weights models/FRCNN/Res101.v2.caffemodel
echo 'remember to convert_model'
exit 0
time python3 examples/FRCNN/convert_model.py \
--model models/FRCNN/res50/test.proto \
--weights models/FRCNN/snapshot/res50_faster_rcnn_iter_180000.caffemodel \
--config examples/FRCNN/config/voc_config.json \
--net_out models/FRCNN/res50_faster_rcnn_final.caffemodel
Solver.proto:
# ResNet-101: 72.x+% with VOC 07, 79.x+% with 07+12 (180k iterations)
# fyk: res50 *.proto files are copied from res101
train_net: "examples/FRCNN/res50/train_val.proto"
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 50000
max_iter: 140000
display: 100
average_loss: 100
momentum: 0.9
weight_decay: 0.0001
# function
snapshot: 10000
# We still use the snapshot prefix, though
snapshot_prefix: "examples/FRCNN/snapshot/res50-voc_faster_rcnn"
iter_size: 2
# debug_info: true
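As a side note, the lr = 0.001 printed at iteration 500 in the log is consistent with this solver: the multistep policy keeps base_lr until a stepvalue is passed, then multiplies by gamma. A tiny sketch of the schedule defined above, just to make the numbers concrete:

# multistep schedule from the solver above: base_lr 0.001, gamma 0.1, step at 50000
base_lr, gamma, stepvalues = 0.001, 0.1, [50000]

def lr_at(iteration):
    passed = sum(iteration >= s for s in stepvalues)
    return base_lr * gamma ** passed

for it in (500, 49999, 50000, 139999):
    print(it, lr_at(it))   # 0.001 until iteration 50000, then 0.0001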
The model is ResNet-50 with Faster R-CNN and I haven't changed anything. voc_config.json:
{"scales": "600",
"max_size": "1000",
"batch_size": "150",
"fg_fraction": "0.25",
"fg_thresh": "0.5",
"bg_thresh_hi": "0.5",
"bg_thresh_lo": "0",
"use_flipped": "1",
"bbox_reg": "1",
"bbox_thresh": "0.5",
"snapshot_infix": "",
"bbox_normalize_targets": "1",
"bbox_inside_weights": "1.0, 1.0, 1.0, 1.0",
"bbox_normalize_means": "0.0, 0.0, 0.0, 0.0",
"bbox_normalize_stds": "0.1, 0.1, 0.2, 0.2",
"rpn_positive_overlap": "0.7",
"rpn_negative_overlap": "0.3",
"rpn_clobber_positives": "0",
"rpn_fg_fraction": "0.5",
"rpn_batchsize": "256",
"rpn_nms_thresh": "0.7",
"rpn_pre_nms_top_n": "12000",
"rpn_post_nms_top_n": "2000",
"rpn_min_size": "16",
"rpn_bbox_inside_weights": "1.0, 1.0, 1.0, 1.0",
"rpn_positive_weight": "-1.0",
"rpn_allowed_border": "0",
"test_scales": "600",
"test_max_size": "1000",
"test_nms": "0.2",
"test_bbox_reg": "1",
"test_rpn_nms_thresh": "0.7",
"test_rpn_pre_nms_top_n": "6000",
"test_rpn_post_nms_top_n": "300",
"test_rpn_min_size": "16",
"pixel_means": "102.9801, 115.9465, 122.7717",
"rng_seed": "3",
"eps": "0.00000000000001",
"inf": "100000000",
"feat_stride": "16",
"anchors": "-84, -40, 99, 55,
-176, -88, 191, 103,
-360, -184, 375, 100,
-56, -56, 71, 72,
-120, -120, 135, 135,
-248, -248, 263, 263,
-36, -80, 51, 95,
-80, -168, 95, 183,
-168, -344, 183, 359",
"test_score_thresh": "0.5",
"n_classes": "21",
"iter_test": "-1"
}
Also, I haven't changed much here.
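For completeness, the anchors string in voc_config.json can be cross-checked against the standard py-faster-rcnn anchor recipe (ratios 0.5/1/2, scales 8/16/32 on a 16-pixel base stride); the sketch below reimplements that recipe, so any entry that differs from its output is worth a second look. This assumes the config was produced that way, which is not confirmed in the thread:

import numpy as np

def _whctrs(anchor):
    # width, height, and center of an (x1, y1, x2, y2) anchor
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    return w, h, anchor[0] + 0.5 * (w - 1), anchor[1] + 0.5 * (h - 1)

def _mkanchors(ws, hs, x_ctr, y_ctr):
    ws, hs = ws[:, np.newaxis], hs[:, np.newaxis]
    return np.hstack((x_ctr - 0.5 * (ws - 1), y_ctr - 0.5 * (hs - 1),
                      x_ctr + 0.5 * (ws - 1), y_ctr + 0.5 * (hs - 1)))

def generate_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    base = np.array([0, 0, base_size - 1, base_size - 1], dtype=np.float64)
    w, h, x_ctr, y_ctr = _whctrs(base)
    size = w * h
    out = []
    for r in ratios:
        ws = np.round(np.sqrt(size / r))   # ratio-adjusted width
        hs = np.round(ws * r)              # ratio-adjusted height
        ratio_anchor = _mkanchors(np.array([ws]), np.array([hs]), x_ctr, y_ctr)[0]
        w2, h2, xc, yc = _whctrs(ratio_anchor)
        scales_arr = np.asarray(scales, dtype=np.float64)
        out.append(_mkanchors(w2 * scales_arr, h2 * scales_arr, xc, yc))
    return np.vstack(out)

print(generate_anchors())   # 9 rows of (x1, y1, x2, y2), same ordering as the config string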
There seems to be no problem; I cannot figure it out either. You could also try D-X-Y's repo, since my repo is based on his with some code changes.
Trying to train ResNet-50 Faster R-CNN on VOC, I got NaN loss at the beginning of the training process: