tensorflow / models

Models and examples built with TensorFlow

Retrain failed due to fail in text_format.Merge(proto_str, pipeline_config) #7389

Open tomruarol opened 5 years ago

tomruarol commented 5 years ago

I am trying to train a whole model on the COCO dataset using the provided scripts, but reducing the number of classes to only 6.

I run the download_and_preprocess_coco.sh script, which downloads the dataset and calls the create_coco_tf_record.py script to create the TFRecords from the downloaded dataset. After those steps (completed successfully) I try to run retrain_detection_model.sh as described in the tutorial, but modifying the labels .pbtxt file to take into account only 6 classes, and modifying the pipeline.config file to match (with a v2 net and the option to train the whole model).

The first error that came out was:

RuntimeError: Did not find any input files matching the glob pattern [u'/tensorflow/models/research/tmp/mscoco/coco_train.record-00001-of-00010']

However, I do have files under /tensorflow/models/research/tmp/mscoco/ of the following format:

coco_testdev.record-00000-of-00100
coco_train.record-00024-of-00100
coco_val.record-00001-of-00010

where the first set of 5 digits after the record part goes from 00000 to 00099.

So I do have the files that the error reports as missing, and the path is specified in the pipeline.config file.
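For what it's worth, this kind of "no input files matching the glob pattern" situation can be reproduced with plain Python glob when the shard count in the pattern disagrees with the files on disk. A minimal sketch using a temporary directory (not the real dataset paths):

```python
import glob
import os
import tempfile

# Illustrative sketch: a file sharded "-of-00100" (100 shards) on disk...
tmpdir = tempfile.mkdtemp()
open(os.path.join(tmpdir, "coco_train.record-00024-of-00100"), "w").close()

# ...is not matched by a pattern written for 10 shards ("-of-00010"),
no_match = glob.glob(os.path.join(tmpdir, "coco_train.record-?????-of-00010"))
# while the same pattern with the matching shard count finds it.
match = glob.glob(os.path.join(tmpdir, "coco_train.record-?????-of-00100"))
```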

I have narrowed down the problem to this part of the /research/object_detection/utils/config_util.py script:

pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile(pipeline_config_path, "r") as f:
  proto_str = f.read()
  text_format.Merge(proto_str, pipeline_config)
if config_override:
  text_format.Merge(config_override, pipeline_config)
return create_configs_from_pipeline_proto(pipeline_config)

It crashes in the Merge call, and print(pipeline_config) returns nothing. My guess is that the object is empty, or that Merge fails and crashes when it does not find num_layers in the pipeline_config object.
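As a sanity check on that guess: text_format.Merge does raise ParseError when the text contains a field the message type does not define, aborting before anything can be printed. A minimal sketch using the well-known Duration proto instead of TrainEvalPipelineConfig (an assumption for illustration only):

```python
from google.protobuf import text_format
from google.protobuf import duration_pb2

msg = duration_pb2.Duration()
text_format.Merge("seconds: 5", msg)  # known field: merges fine

# A field not defined on the message type makes Merge raise ParseError.
raised = False
try:
    text_format.Merge("num_layers: 6", msg)
except text_format.ParseError:
    raised = True
```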

Here is my pipeline.config file:

model {
  ssd {
    num_classes: 2
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.99999989895e-05
          }
        }
        initializer {
          random_normal_initializer {
            mean: 0.0
            stddev: 0.00999999977648
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.97000002861
          center: true
          scale: true
          epsilon: 0.0010000000475
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.99999989895e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.00999999977648
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.97000002861
            center: true
            scale: true
            epsilon: 0.0010000000475
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011921
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: -4.59999990463
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298
        max_scale: 0.949999988079
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.333299994469
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.300000011921
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.75
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 128
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.20000000298
          total_steps: 50000
          warmup_learning_rate: 0.0599999986589
          warmup_steps: 2000
        }
      }
      momentum_optimizer_value: 0.899999976158
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "/tensorflow/models/research/learn_human_car/ckpt/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  num_steps: 50000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
}
train_input_reader {
  label_map_path: "/tensorflow/models/research/object_detection/data/mscoco_label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/tensorflow/models/research/tmp/mscoco/coco_train.record-00001-of-00010"
  }
}
eval_config {
  num_examples: 8000
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "/tensorflow/models/research/object_detection/data/mscoco_label_map.pbtxt"
  shuffle: false
  num_readers: 1
  tf_record_input_reader {
    input_path: "/tensorflow/models/research/tmp/mscoco/coco_val.record-?????-of-00010"
  }
}
graph_rewriter {
  quantization {
    delay: 48000
    weight_bits: 8
    activation_bits: 8
  }
}
tensorflowbutler commented 5 years ago

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.

What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

tomruarol commented 5 years ago

OK, I managed to move on a bit by skipping the use of the glob library in the dataset_builder.py script under research/object_detection/builders/. The glob matching is not working as it should, so by removing it the script runs a bit further, but it still throws an error:

NotFoundError (see above for traceback): /tensorflow/models/research/tmp/mscoco/coco_train.record-00001-of-00010; No such file or directory
         [[node IteratorGetNext (defined at object_detection/model_main.py:105)  = IteratorGetNext[output_shapes=[[128], [128,300,300,3], [128,2], [128,3], [128,100], [128,100,4], [128,100,2], [128,100,2], [128,100], [128,100], [128,100], [128]],
 output_types=[DT_INT32, DT_FLOAT, DT_INT32, DT_INT32, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT32, DT_BOOL, DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](IteratorV2)]]

I have not figured out how to move on from here.
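In case it helps anyone else debugging this, the configured input_path could be checked against the shards actually on disk before training. A hypothetical sketch (the temporary directory and the derivation logic are mine, not from the repo's scripts):

```python
import os
import re
import tempfile

# Hypothetical sketch: rebuild the glob-style pattern for pipeline.config
# from the shard filenames on disk, instead of trusting a hard-coded path.
tmpdir = tempfile.mkdtemp()
for i in (0, 24, 99):
    open(os.path.join(tmpdir, "coco_train.record-%05d-of-00100" % i), "w").close()

shard = sorted(os.listdir(tmpdir))[0]
m = re.match(r"(coco_train\.record)-\d{5}-of-(\d{5})$", shard)
# Pattern in the same "-?????-of-NNNNN" style the eval reader already uses.
pattern = "%s-?????-of-%s" % (m.group(1), m.group(2))
```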