tensorflow / models

Models and examples built with TensorFlow

object detection fail when I train mask_rcnn_resnet_101_pets #3837

Closed mxmxlwlw closed 4 years ago

mxmxlwlw commented 6 years ago

Here's the error info:

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]
     [[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_153, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_155, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/packed/_145)]]
     [[Node: FirstStageFeatureExtractor/resnet_v1_101/block3/unit_16/bottleneck_v1/conv2/BatchNorm/gamma/read/_919 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2186_...gamma/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

I run it with:

python object_detection/train.py \
    --logtostderr \
    --train_dir=experiment/mask_rcnn_resnet101_pets/train \
    --pipeline_config_path=experiment/mask_rcnn_resnet101_pets/mask_rcnn_resnet101_pets.config

The pipeline config is:

# Mask R-CNN with Resnet-101 (v1) configured for the Oxford-IIIT Pet Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    number_of_stages: 3
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        predict_instance_masks: true
        conv_hyperparams {
          op: CONV
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.01
            }
          }
        }
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0007
          schedule {
            step: 15000
            learning_rate: 0.00007
          }
          schedule {
            step: 30000
            learning_rate: 0.000007
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "experiment/mask_rcnn_resnet101_pets/pretrain/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "experiment/mask_rcnn_resnet101_pets/data/pet_train.record"
  }
  label_map_path: "experiment/data/pet_label_map.pbtxt"
  load_instance_masks: true
}

eval_config: {
  num_examples: 2000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "experiment/mask_rcnn_resnet101_pets/data/pet_val.record"
  }
  label_map_path: "experiment/data/pet_label_map.pbtxt"
  load_instance_masks: true
  shuffle: false
  num_readers: 1
}

The pretrained model is faster_rcnn_resnet101_coco_11_06_2017.

samhsieh commented 6 years ago

I also ran into a similar problem. I referred to "object_detection/samples/configs/faster_rcnn_resnet50_coco.config", modified only the "PATH_TO_BE_CONFIGURED" entries, and saved it in the object_detection/model/train folder, as follows:

faster_rcnn_resnet50_coco.config

At the same time, I downloaded the trained model (faster_rcnn_resnet50_coco) from the TensorFlow detection model zoo to use as the model checkpoint in the "object_detection/model/train" folder, then ran the training job locally with:

python object_detection/train.py --logtostderr \
    --pipeline_config_path=./object_detection/model/train/faster_rcnn_resnet50_coco.config \
    --train_dir=./object_detection/model/train 2>&1 | tee log.txt

Here is the error info:

NotFoundError (see above for traceback): Key Conv/biases/Momentum not found in checkpoint
     [[Node: save_1/RestoreV2_1 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_1/tensor_names, save_1/RestoreV2_1/shape_and_slices)]]

For details, refer to the attached log.txt.

Could you share some experience with getting local training to work? Thanks.
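
One likely cause of this particular NotFoundError, for reference: extracting the zoo checkpoint files directly into --train_dir makes the trainer try to resume from that checkpoint, so it expects optimizer slot variables such as Conv/biases/Momentum, which a zoo export does not contain. A sketch of the usual layout (the paths below are illustrative, not the exact ones used above):

# Keep the downloaded checkpoint outside of --train_dir, e.g.
#   object_detection/model/pretrained/model.ckpt.*   <- extracted zoo files
#   object_detection/model/train/                    <- empty; train.py writes here
# and reference the pretrained files only from the pipeline config:
fine_tune_checkpoint: "object_detection/model/pretrained/model.ckpt"
from_detection_checkpoint: true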

sjwhhhi commented 6 years ago

I also ran into the same problem when I tried to train the Mask R-CNN model.

InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [1]

Could anyone help me? Thanks.

Abduoit commented 6 years ago

+1

Abduoit commented 6 years ago

I had this problem and solved it as follows:

The TFRecord files should be named pet_train.record / pet_val.record. I fixed my records by changing the faces_only flag from True to False.

Check the line here: https://github.com/tensorflow/models/blob/master/research/object_detection/dataset_tools/create_pet_tf_record.py#L49
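
For reference, the switch is the faces_only flag near the top of create_pet_tf_record.py. After the edit it looks roughly like this (paraphrased, not an exact quote of the source; see the linked line for the real definition):

# In create_pet_tf_record.py, with the default flipped from True to False so the
# script emits full-body segmentation masks instead of face-only boxes:
flags.DEFINE_boolean('faces_only', False,
                     'If True, generate face bounding boxes only; if False, '
                     'also generate segmentation masks for full pet bodies.')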

Then I regenerated the TFRecord files with:

python object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=object_detection/data/two_label_map.pbtxt \
    --data_dir=`pwd` \
    --output_dir=`pwd` \
    --include_masks=True

That gave me two TFRecord files named pet_train.record and pet_val.record, which I then used for the training process with mask_rcnn_inception_v2_coco.

Hope this helps
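
A quick way to confirm that masks actually made it into the regenerated record is to iterate over a few examples and count the mask features. A minimal TF1-era sketch, assuming the masks were written as PNG bytes under the 'image/object/mask' key (the record path is just an example; point it at your own file):

import tensorflow as tf

record_path = 'pet_train.record'  # adjust to your own TFRecord file

for i, serialized in enumerate(tf.python_io.tf_record_iterator(record_path)):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    feature = example.features.feature
    num_boxes = len(feature['image/object/bbox/xmin'].float_list.value)
    num_masks = len(feature['image/object/mask'].bytes_list.value)
    print('example %d: %d boxes, %d masks' % (i, num_boxes, num_masks))
    if i >= 4:  # the first few examples are enough to spot missing masks
        break

If every example reports 0 masks, the Loss/BoxClassifierLoss assertion at training time is the expected outcome.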

Abduoit commented 6 years ago

@mxmxlwlw did you solve your issue? I have this issue with pascal_train/val.record only; I don't have it with pet_train/val.record.

wxianfeng commented 6 years ago

When I train, I hit this too:

INFO:tensorflow:Error reported to Coordinator: assertion failed: [] [Condition x == y did not hold element-wise:] [x (Loss/BoxClassifierLoss/assert_equal_2/x:0) = ] [0] [y (Loss/BoxClassifierLoss/assert_equal_2/y:0) = ] [2]
     [[Node: Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT32, DT_STRING, DT_INT32], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss/BoxClassifierLoss/assert_equal_2/All/_139, Loss/RPNLoss/assert_equal/Assert/Assert/data_0, Loss/RPNLoss/assert_equal/Assert/Assert/data_1, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_2, Loss/BoxClassifierLoss/assert_equal_2/x/_141, Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert/data_4, Loss/RPNLoss/ones_1/packed/_143)]]

Caused by op u'Loss/BoxClassifierLoss/assert_equal_2/Assert/Assert', defined at:
  File "train.py", line 167, in <module>
    tf.app.run()

Has anyone solved this?

SpiralBeing commented 6 years ago

I think you need to check your data. I did so and the issue was solved.

epratheeban commented 6 years ago

@wxianfeng The problem is definitely in the data: your tf.record doesn't have the mask information. Below are the three mistakes that I corrected (a quick check for the trimaps follows the list).

  1. As @Abduoit said, when generating the tf.record, change faces_only from True to False.
  2. The mask information might be encoded or decoded in a different format.
  3. Make sure you have the right mask files in the trimaps folder.
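
A small sanity check for point 3, assuming the standard Oxford-IIIT Pet layout (images/*.jpg and annotations/trimaps/*.png; adjust the paths as needed): every image should have a trimap PNG of matching dimensions.

import os
from PIL import Image

images_dir = 'images'
trimaps_dir = os.path.join('annotations', 'trimaps')

for name in sorted(os.listdir(images_dir)):
    stem, ext = os.path.splitext(name)
    if ext.lower() != '.jpg':
        continue
    trimap_path = os.path.join(trimaps_dir, stem + '.png')
    if not os.path.exists(trimap_path):
        print('missing trimap for %s' % name)
        continue
    image = Image.open(os.path.join(images_dir, name))
    trimap = Image.open(trimap_path)
    if image.size != trimap.size:
        print('%s: image %s vs trimap %s' % (name, image.size, trimap.size))
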
wxianfeng commented 6 years ago

@epratheeban Yeah, when I set faces_only to False the tf record file is larger than with True.

Training now succeeds, but prediction doesn't. My training data is only 200 examples; is that not enough?

tensorflowbutler commented 4 years ago

Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.