lucasjinreal opened 5 years ago
It seems there is NaN output from the first conv layer:
I0812 13:33:12.153046 24802 solver.cpp:231] Iteration 500 (2.45156 iter/s, 40.7904s/100 iters), loss = 7.33115 [ 500 / 140000 ] -> [ 14:53 (H:M) ]
I0812 13:33:12.153055 24802 solver.cpp:257] Train net output #0: bbox_accuracy = 0.771429
I0812 13:33:12.153061 24802 solver.cpp:257] Train net output #1: loss_bbox = 0.604116 (* 1 = 0.604116 loss)
I0812 13:33:12.153067 24802 solver.cpp:257] Train net output #2: loss_cls = 0.6076 (* 1 = 0.6076 loss)
I0812 13:33:12.153072 24802 solver.cpp:257] Train net output #3: rpn_cls_loss = 0.111379 (* 1 = 0.111379 loss)
I0812 13:33:12.153079 24802 solver.cpp:257] Train net output #4: rpn_loss_bbox = 0.0530575 (* 1 = 0.0530575 loss)
I0812 13:33:12.153090 24802 sgd_solver.cpp:148] Iteration 500, lr = 0.001
I0812 13:33:47.280968 24802 net.cpp:592] [Forward] Layer input-data, top blob data data: 33.4105
I0812 13:33:47.281054 24802 net.cpp:592] [Forward] Layer input-data, top blob im_info data: 467.2
I0812 13:33:47.281081 24802 net.cpp:592] [Forward] Layer input-data, top blob gt_boxes data: 263.84
I0812 13:33:47.281111 24802 net.cpp:592] [Forward] Layer im_info_input-data_1_split, top blob im_info_input-data_1_split_0 data: 467.2
I0812 13:33:47.281136 24802 net.cpp:592] [Forward] Layer im_info_input-data_1_split, top blob im_info_input-data_1_split_1 data: 467.2
I0812 13:33:47.281164 24802 net.cpp:592] [Forward] Layer gt_boxes_input-data_2_split, top blob gt_boxes_input-data_2_split_0 data: 263.84
I0812 13:33:47.281189 24802 net.cpp:592] [Forward] Layer gt_boxes_input-data_2_split, top blob gt_boxes_input-data_2_split_1 data: 263.84
I0812 13:33:47.282073 24802 net.cpp:592] [Forward] Layer conv1, top blob conv1 data: nan
I0812 13:33:47.282413 24802 net.cpp:604] [Forward] Layer conv1, param blob 0 data: 0.0668743
I0812 13:33:47.292757 24802 net.cpp:604] [Forward] Layer conv1, param blob 1 data: nan
I0812 13:33:47.293643 24802 net.cpp:592] [Forward] Layer bn_conv1, top blob conv1 data: nan
I0812 13:33:47.293735 24802 net.cpp:604] [Forward] Layer bn_conv1, param blob 0 data: 0.578535
I0812 13:33:47.293767 24802 net.cpp:604] [Forward] Layer bn_conv1, param blob 1 data: 8680.63
I0812 13:33:47.293798 24802 net.cpp:604] [Forward] Layer bn_conv1, param blob 2 data: 1
I0812 13:33:47.294286 24802 net.cpp:592] [Forward] Layer scale_conv1, top blob conv1 data: nan
I0812 13:33:47.294315 24802 net.cpp:604] [Forward] Layer scale_conv1, param blob 0 data: nan
I0812 13:33:47.294343 24802 net.cpp:604] [Forward] Layer scale_conv1, param blob 1 data: nan
I0812 13:33:47.294653 24802 net.cpp:592] [Forward] Layer conv1_relu, top blob conv1 data: nan
I0812 13:33:47.294883 24802 net.cpp:592] [Forward] Layer pool1, top blob pool1 data: inf
What could be the reason?
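For context, the [Forward] lines above are Caffe's debug_info output (presumably enabled via debug_info: true in the solver; it appears commented out in the solver posted below). The reported data value is the mean absolute value of the blob, so a nan for conv1's param blob 1 means the bias parameter itself is non-finite, not just the activations. Below is a small log-scanning sketch, assuming only the log format shown above and a log file path passed on the command line, that locates the first non-finite blob:

import re
import sys

# matches lines like:
#   [Forward] Layer conv1, top blob conv1 data: nan
#   [Forward] Layer conv1, param blob 1 data: nan
pattern = re.compile(
    r"\[(Forward|Backward)\] Layer ([^,]+), "
    r"(top|bottom|param) blob (\S+) (data|diff): (\S+)")

with open(sys.argv[1]) as log:   # e.g. python find_first_nan.py train.log
    for lineno, line in enumerate(log, 1):
        m = pattern.search(line)
        if m and m.group(6).lower().strip("-") in ("nan", "inf"):
            print("first non-finite blob at log line %d:" % lineno)
            print(line.rstrip())
            break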
Check whether the object labels are all correct, and whether the initial weight file matches the proto file (for example, the proto may have the BN layers merged into the conv layers).
@makefile Thanks for your reply. I have trained on the same data with a ZF backbone, and it was fine for at least 120000 iterations.
From what I can see, the proto still has BatchNorm layers; the pretrained model is simply the ResNet-50 from Kaiming He's repo.
Also, I found that this could cause NaN output from the BatchNorm layer. I tried the fix, but no luck; it still gets NaN after several iterations.
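For what it's worth, one way to rule out bad stored BatchNorm statistics in the pretrained model is to read them straight out of the caffemodel. This is only a sketch; it assumes pycaffe's compiled protobuf (caffe.proto.caffe_pb2) is importable and that the BatchNorm layers follow the bn_* naming used in the ResNet-50 release:

import numpy as np
from caffe.proto import caffe_pb2

net = caffe_pb2.NetParameter()
with open('examples/imagenet_models/ResNet-50-model.caffemodel', 'rb') as f:
    net.ParseFromString(f.read())

# Caffe's BatchNorm layer stores: blob 0 = running mean, blob 1 = running variance,
# blob 2 = moving-average factor (mean/variance are divided by it when use_global_stats is true)
for layer in (net.layer or net.layers):   # new-style or legacy layer list
    if layer.name.startswith('bn') and len(layer.blobs) == 3:
        mean = np.asarray(layer.blobs[0].data)
        var = np.asarray(layer.blobs[1].data)
        factor = layer.blobs[2].data[0] if len(layer.blobs[2].data) else 0.0
        ok = np.isfinite(mean).all() and np.isfinite(var).all() and (var >= 0).all()
        print('%-16s factor=%-8g stats finite/non-negative: %s' % (layer.name, factor, ok))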
The model starts like this:
name: "ResNet-50"
# fyk: layer names are same as ResNet-50
# notice that some of ResNet-50 layer names are different from ResNet-101
#=========Frcnn-RoiData============
layer {
  name: "input-data"
  type: "FrcnnRoiData"
  top: "data"
  top: "im_info"
  top: "gt_boxes"
  include {
    phase: TRAIN
  }
  window_data_param {
    source: "examples/FRCNN/dataset/voc2007.trainval"
    config: "examples/FRCNN/config/voc_config.json"
    root_folder: "VOCdevkit/VOC2007/JPEGImages/"
    cache_images: true
  }
}
#========= conv1-conv4f ============
layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    kernel_size: 7
    pad: 3
    stride: 2
    # bias_term: false # a little different from ResNet101, which sets bias_term: false
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: true
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "scale_conv1"
  type: "Scale"
  scale_param {
    bias_term: true
  }
}
layer {
  top: "conv1"
  bottom: "conv1"
  name: "conv1_relu"
  type: "ReLU"
}
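Following up on the earlier suggestion to check that the weight file matches the proto: the layer names and the number of parameter blobs per layer can be listed straight from the caffemodel and compared against the layer names above. In particular, it may be worth confirming whether conv1 in ResNet-50-model.caffemodel actually ships a second (bias) blob, since this proto leaves bias_term at its default of true. A sketch under those assumptions (same caffe_pb2 approach as before):

from caffe.proto import caffe_pb2

weights_path = 'examples/imagenet_models/ResNet-50-model.caffemodel'
net = caffe_pb2.NetParameter()
with open(weights_path, 'rb') as f:
    net.ParseFromString(f.read())

# list each layer that carries learned parameters and how many blobs it provides;
# a conv layer provides 1 blob without bias and 2 with bias
for layer in (net.layer or net.layers):
    if len(layer.blobs):
        shapes = [tuple(b.shape.dim) if b.shape.dim else (b.num, b.channels, b.height, b.width)
                  for b in layer.blobs]
        print(layer.name, len(layer.blobs), shapes)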
It's hard to figure out the exact problem from this information. Please check the parameter settings, the proto definition, and anything else again.
I am training only with VOC. What I mean is: is rfcn-res50 or any other proto with a ResNet backbone OK to train? Have you tested those configurations?
It is OK to train with the rfcn-* protos; I have tested most of the protos that I uploaded.
I just got NaN loss on Faster R-CNN with ResNet-50 on VOC data, and I am not sure of the reason.
Can you paste more snippets here (or in a pastebin) for analysis, such as the config, data labels, and proto?
Sure, starting with train.sh:
#!/usr/bin/env sh
# This script test four voc images using faster rcnn end-to-end trained model (ZF-Model)
if [ ! -n "$1" ] ;then
  echo "$1 is empty, default is 0"
  gpu=0
else
  echo "use $1-th gpu"
  gpu=$1
fi
export PYTHONPATH=/z/users/detection_model_furnace/vendor/frcnn/python
CAFFE=build/tools/caffe
$CAFFE train \
--gpu $gpu \
--solver examples/FRCNN/res50/solver.proto \
--weights examples/imagenet_models/ResNet-50-model.caffemodel
# --weights models/FRCNN/Res101.v2.caffemodel
echo 'remember to convert_model'
exit 0
time python3 examples/FRCNN/convert_model.py \
--model models/FRCNN/res50/test.proto \
--weights models/FRCNN/snapshot/res50_faster_rcnn_iter_180000.caffemodel \
--config examples/FRCNN/config/voc_config.json \
--net_out models/FRCNN/res50_faster_rcnn_final.caffemodel
Solver.proto:
# ResNet-101: 72.x+% with VOC 07, 79.x+% with 07+12 (180k iterations)
# fyk: res50 *.proto files are copied from res101
train_net: "examples/FRCNN/res50/train_val.proto"
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 50000
max_iter: 140000
display: 100
average_loss: 100
momentum: 0.9
weight_decay: 0.0001
# function
snapshot: 10000
# We still use the snapshot prefix, though
snapshot_prefix: "examples/FRCNN/snapshot/res50-voc_faster_rcnn"
iter_size: 2
# debug_info: true
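As a side note, the lr = 0.001 printed at iteration 500 in the log is consistent with this solver: the multistep policy keeps base_lr until a stepvalue is passed, then multiplies by gamma. A tiny sketch of the schedule defined above, just to make the numbers concrete:

# multistep schedule from the solver above: base_lr 0.001, gamma 0.1, step at 50000
base_lr, gamma, stepvalues = 0.001, 0.1, [50000]

def lr_at(iteration):
    passed = sum(iteration >= s for s in stepvalues)
    return base_lr * gamma ** passed

for it in (500, 49999, 50000, 139999):
    print(it, lr_at(it))   # 0.001 until iteration 50000, then 0.0001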
The model is ResNet-50 with Faster R-CNN and I haven't changed anything. voc_config.json:
{"scales": "600",
"max_size": "1000",
"batch_size": "150",
"fg_fraction": "0.25",
"fg_thresh": "0.5",
"bg_thresh_hi": "0.5",
"bg_thresh_lo": "0",
"use_flipped": "1",
"bbox_reg": "1",
"bbox_thresh": "0.5",
"snapshot_infix": "",
"bbox_normalize_targets": "1",
"bbox_inside_weights": "1.0, 1.0, 1.0, 1.0",
"bbox_normalize_means": "0.0, 0.0, 0.0, 0.0",
"bbox_normalize_stds": "0.1, 0.1, 0.2, 0.2",
"rpn_positive_overlap": "0.7",
"rpn_negative_overlap": "0.3",
"rpn_clobber_positives": "0",
"rpn_fg_fraction": "0.5",
"rpn_batchsize": "256",
"rpn_nms_thresh": "0.7",
"rpn_pre_nms_top_n": "12000",
"rpn_post_nms_top_n": "2000",
"rpn_min_size": "16",
"rpn_bbox_inside_weights": "1.0, 1.0, 1.0, 1.0",
"rpn_positive_weight": "-1.0",
"rpn_allowed_border": "0",
"test_scales": "600",
"test_max_size": "1000",
"test_nms": "0.2",
"test_bbox_reg": "1",
"test_rpn_nms_thresh": "0.7",
"test_rpn_pre_nms_top_n": "6000",
"test_rpn_post_nms_top_n": "300",
"test_rpn_min_size": "16",
"pixel_means": "102.9801, 115.9465, 122.7717",
"rng_seed": "3",
"eps": "0.00000000000001",
"inf": "100000000",
"feat_stride": "16",
"anchors": "-84, -40, 99, 55,
-176, -88, 191, 103,
-360, -184, 375, 100,
-56, -56, 71, 72,
-120, -120, 135, 135,
-248, -248, 263, 263,
-36, -80, 51, 95,
-80, -168, 95, 183,
-168, -344, 183, 359",
"test_score_thresh": "0.5",
"n_classes": "21",
"iter_test": "-1"
}
Also, I haven't changed much here.
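For completeness, the anchors string in voc_config.json can be cross-checked against the standard py-faster-rcnn anchor recipe (ratios 0.5/1/2, scales 8/16/32 on a 16-pixel base stride); the sketch below reimplements that recipe, so any entry that differs from its output is worth a second look. This assumes the config was produced that way, which is not confirmed in the thread:

import numpy as np

def _whctrs(anchor):
    # width, height, and center of an (x1, y1, x2, y2) anchor
    w = anchor[2] - anchor[0] + 1
    h = anchor[3] - anchor[1] + 1
    return w, h, anchor[0] + 0.5 * (w - 1), anchor[1] + 0.5 * (h - 1)

def _mkanchors(ws, hs, x_ctr, y_ctr):
    ws, hs = ws[:, np.newaxis], hs[:, np.newaxis]
    return np.hstack((x_ctr - 0.5 * (ws - 1), y_ctr - 0.5 * (hs - 1),
                      x_ctr + 0.5 * (ws - 1), y_ctr + 0.5 * (hs - 1)))

def generate_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    base = np.array([0, 0, base_size - 1, base_size - 1], dtype=np.float64)
    w, h, x_ctr, y_ctr = _whctrs(base)
    size = w * h
    out = []
    for r in ratios:
        ws = np.round(np.sqrt(size / r))   # ratio-adjusted width
        hs = np.round(ws * r)              # ratio-adjusted height
        ratio_anchor = _mkanchors(np.array([ws]), np.array([hs]), x_ctr, y_ctr)[0]
        w2, h2, xc, yc = _whctrs(ratio_anchor)
        scales_arr = np.asarray(scales, dtype=np.float64)
        out.append(_mkanchors(w2 * scales_arr, h2 * scales_arr, xc, yc))
    return np.vstack(out)

print(generate_anchors())   # 9 rows of (x1, y1, x2, y2), same ordering as the config string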
There seems to be no problem; I cannot figure it out either. You could also try D-X-Y's repo, since my repo is based on his with some code changes.
Trying to train ResNet-50 Faster R-CNN on VOC, I got NaN loss at the beginning of the training process: