tensorflow / models

Models and examples built with TensorFlow

Training bug in mobilenet v1 extractor for Faster r-cnn #4259

Closed yryun closed 4 years ago

yryun commented 6 years ago

System information

Describe the problem

This looks like a training bug in Faster R-CNN with the MobileNet feature extractor. I trained Faster R-CNN with the MobileNet feature extractor for 400k iterations with batch size 8 and learning rate 3e-03 (the value mentioned in Huang, Murphy et al. 2017, "Speed/accuracy trade-offs for modern convolutional object detectors"), but its mAP is zero. I'm using my own dataset, and training goes well with InceptionV2, ResNet-50, and ResNet-101: their mAP is 0.7. So this is not a hyperparameter tuning problem.

Source code / logs

Here is the loss of Inception V2, which has a good mAP. (image)

Here is the loss of MobileNet. The second-stage loss is strangely low (it is zero almost all of the time). (image)

Here is the mAP of MobileNet. How can it be zero? (image)

Here are my TensorBoard distributions for MobileNet. Compared to Inception V2, MobileNet shows no change in the moving_mean and moving_variance of the second-stage layers conv2d_12 and conv2d_13. I think this could be a clue to the cause. (image)
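The check described above (batch-norm statistics that never move) can also be done numerically instead of by eye in TensorBoard. The following is only a minimal sketch: it assumes the variable values from two checkpoints have already been loaded and flattened into plain Python lists (for example with a TensorFlow checkpoint reader), and the variable names and numbers are made up for illustration, not taken from the issue.

```python
from typing import Dict, List, Sequence


def frozen_variables(
    before: Dict[str, Sequence[float]],
    after: Dict[str, Sequence[float]],
    tol: float = 1e-8,
) -> List[str]:
    """Return names whose values are (almost) identical in both snapshots."""
    frozen = []
    for name in sorted(before):
        if name not in after or len(before[name]) != len(after[name]):
            continue  # only compare variables present in both, with equal size
        max_delta = max(
            (abs(x - y) for x, y in zip(before[name], after[name])),
            default=0.0,
        )
        if max_delta <= tol:
            frozen.append(name)
    return frozen


# Hypothetical values standing in for tensors read from an early and a late
# checkpoint. The names only imitate the MobilenetV1 naming seen in this thread.
early = {
    "Conv2d_11/BatchNorm/moving_mean": [0.0, 0.0],
    "Conv2d_12/BatchNorm/moving_mean": [0.0, 0.0],
    "Conv2d_13/BatchNorm/moving_variance": [1.0, 1.0],
}
late = {
    "Conv2d_11/BatchNorm/moving_mean": [0.31, -0.12],
    "Conv2d_12/BatchNorm/moving_mean": [0.0, 0.0],
    "Conv2d_13/BatchNorm/moving_variance": [1.0, 1.0],
}

print(frozen_variables(early, late))
```

With these toy inputs, only the Conv2d_12 and Conv2d_13 entries are reported, which is the pattern the TensorBoard distributions suggest. Moving statistics that stay at their initial values would be consistent with those layers never receiving training-mode batch-norm updates, though the thread does not confirm the cause.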

pkulzc commented 6 years ago

Thanks for letting us know, we'll look into this. Could you please also share your config file?

yryun commented 6 years ago

Thanks for replying !

Here is my config file. The image size looks very small, but it works well with Inception V2, ResNet-50, and ResNet-101. I tried stride 8 and stride 16, but the same thing happened. MobileNet with Faster R-CNN is really important for my research, so please solve this problem.

model {
  faster_rcnn {
    #mobile16 + size_+ +512depth+ crop10 + anchor box2 + proposal 100
    num_classes: 2
    image_resizer {
      fixed_shape_resizer {
        height: 288
        width: 960
      }
    }
    feature_extractor {
      type: "faster_rcnn_mobilenet"
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        height_stride: 16
        width_stride: 16

        scales: 0.125
        scales: 0.25
        scales: 0.5
        scales: 1.0
        scales: 1.5
        scales: 2.0

        aspect_ratios: 0.48
        aspect_ratios: 0.65
        aspect_ratios: 1.09
        aspect_ratios: 1.28
        aspect_ratios: 1.48
        aspect_ratios: 1.70

      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.00999999977648
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.699999988079
    first_stage_max_proposals: 100
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0

    first_stage_box_predictor_depth: 512

    initial_crop_size: 10
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
        #use_dropout: true
        #dropout_keep_probability: 0.8
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config {
  batch_size: 8
  data_augmentation_options {
    random_horizontal_flip {
    }
  }

  data_augmentation_options {
    random_crop_pad_image {
    }
  }
  optimizer {
    momentum_optimizer {
      learning_rate {
        manual_step_learning_rate {
          initial_learning_rate: 3e-03

          schedule {
            step: 400000
            learning_rate: 3e-04
          }
          schedule {
            step: 1200000
            learning_rate: 3e-05
          }
          schedule {
            step: 1500000
            learning_rate: 3e-06
          }
        }
      }
      momentum_optimizer_value: 0.899999976158
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  from_detection_checkpoint: false
  num_steps: 1600000
}
train_input_reader {
  label_map_path: "/home/yryun/fasterRCNN/models/research/object_detection/data/kitti_label_map_v3.pbtxt"
  tf_record_input_reader {
    input_path: "/home/yryun/fasterRCNN/models/research/object_detection/yeongro/blackbox_train_v3.record_train.tfrecord"
  }
  shuffle: true
}
eval_config {
  num_examples: 592
  metrics_set: "pascal_voc_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "/home/yryun/fasterRCNN/models/research/object_detection/data/kitti_label_map_v3.pbtxt"
  tf_record_input_reader {
    input_path: "/home/yryun/fasterRCNN/models/research/object_detection/yeongro/blackbox_test_v3.record_train.tfrecord"
  }
  shuffle: false
}
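
As a reading aid for the `manual_step_learning_rate` block in the config above, the sketch below evaluates the schedule at a few steps. It assumes each `schedule` entry takes effect at its own `step` and stays in force until the next entry; the function name `learning_rate_at` is hypothetical and not part of the Object Detection API.

```python
from bisect import bisect_right
from typing import List, Tuple

# Values copied from the train_config above.
INITIAL_RATE = 3e-3
SCHEDULE: List[Tuple[int, float]] = [
    (400_000, 3e-4),
    (1_200_000, 3e-5),
    (1_500_000, 3e-6),
]


def learning_rate_at(
    step: int,
    initial_rate: float = INITIAL_RATE,
    schedule: List[Tuple[int, float]] = SCHEDULE,
) -> float:
    """Piecewise-constant rate (assumption: a boundary applies from its step on)."""
    boundaries = [boundary for boundary, _ in schedule]
    rates = [initial_rate] + [rate for _, rate in schedule]
    # bisect_right counts how many boundaries are <= step.
    return rates[bisect_right(boundaries, step)]


for step in (0, 399_999, 400_000, 1_200_000, 1_599_999):
    print(step, learning_rate_at(step))
```

Under that assumption, the whole 400k-iteration run described in the report used the initial rate of 3e-03, since the first drop only happens at step 400000.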

pkulzc commented 6 years ago

We didn't release faster rcnn mobilenet model in the past, so I guess you added the mapping here by yourself?

yryun commented 6 years ago

Yes, I added the mapping. I am using the latest faster rcnn mobilenet code.

twangnh commented 6 years ago

@yryun Hi yryun, is there any update? Have you solved the problem?

pkulzc commented 6 years ago

The code for the Faster R-CNN MobileNet model is almost complete, but it's not officially ready yet. Contributions are welcome if anyone is interested in debugging it.

yryun commented 6 years ago

Let me try MobileNet with the official KITTI dataset. I will post the result later!

yryun commented 6 years ago

Unfortunately, the MobileNet feature extractor has the same problem with the official KITTI dataset. So the problem is not specific to my dataset.

ddzhangjie commented 6 years ago

May I ask about the runtime and accuracy of Faster R-CNN MobileNet?

zhimengfan1990 commented 5 years ago

@yryun I have the same issue. Have you solved this problem? Thanks!

UmarSpa commented 5 years ago

Any update on this?

I am trying to train Faster R-CNN with a pretrained mobilenet_v1 as the backbone. Training on a single GPU works fine. But when I try to use two GPUs, the replicas of the variables created on the second GPU fail to load the pretrained weights of mobilenet_v1.

For all variables (Conv2d_0 to Conv2d_13) I get output similar to the following:

Variable [MobilenetV1/Conv2d_0/BatchNorm/beta/replica_1] is not available in checkpoint
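
The message above suggests that the per-GPU copies carry a `/replica_N` suffix that does not exist in the pretrained checkpoint. One possible workaround, not confirmed anywhere in this thread, is to strip that suffix when building the name mapping used for restoring. The sketch below only does the name handling in plain Python; the helper names are hypothetical, and it assumes the suffix always has the form seen in the log.

```python
import re
from typing import Dict, Iterable

# Assumption: replica copies are named "<original name>/replica_<N>".
_REPLICA_SUFFIX = re.compile(r"/replica_\d+$")


def checkpoint_name(variable_name: str) -> str:
    """Map a (possibly replicated) variable name to its checkpoint name."""
    return _REPLICA_SUFFIX.sub("", variable_name)


def build_name_map(variable_names: Iterable[str]) -> Dict[str, str]:
    """Return {graph variable name: name to look up in the checkpoint}."""
    return {name: checkpoint_name(name) for name in variable_names}


names = [
    "MobilenetV1/Conv2d_0/BatchNorm/beta",
    "MobilenetV1/Conv2d_0/BatchNorm/beta/replica_1",
]
for graph_name, ckpt_name in build_name_map(names).items():
    print(graph_name, "->", ckpt_name)
```

Both the original variable and its replica resolve to the same checkpoint entry here. How such a mapping is passed to the restore step depends on the training script, so treat this as a starting point for debugging, not a verified fix.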

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.