Issues retraining faster_rcnn_resnet101_coco

alexmagsam commented 6 years ago

System information

What is the top-level directory of the model you are using: C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, to generate TFRecords. I used generate_tfrecord.py from https://github.com/datitran/raccoon_dataset.git
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.9
Bazel Version: N/A
CUDA/cuDNN version: 9.0 and 7.2.
GPU model and memory: GTX 1080 8GB
Exact command to reproduce: py model_main.py --pipeline_config_path =models\faster_rcnn_resnet101_coco_2018_01_28\faster_rcnn_resnet101_coco.config \ --model_dir=training/ --num_train_steps=50000 --num_eval_steps=1000 --alsologtostderr

Issue

The training session fails to start. The process terminates without printing any helpful errors. I have opened TensorBoard on the training/ directory, but no training is recorded there. This information can also be found at https://stackoverflow.com/questions/51754386/tensorflow-object-detection-training-issue, but I feel this issue is better suited for this forum.

Background info

I have cloned the object detection API into my site-packages/tensorflow
I have compiled the .proto files.
My training set: 32 RGB images 3312x3312x3
My test set: 13 RGB images 3312x3312x3
I created annotations using labelImg, an open-source program for drawing bounding boxes. I used another script to convert the XML annotations to a CSV file. These CSV files were then converted to the TFRecords train.record and eval.record. Each record file is approximately the same size as the folder containing the corresponding train or test images.
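The XML-to-CSV step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: it assumes labelImg saved Pascal VOC style XML, the column order is an assumption modelled on the raccoon_dataset scripts, and the function names, file name, and class name `spheroid` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column order; adjust to whatever generate_tfrecord.py expects.
COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def voc_xml_to_rows(xml_text):
    """Return one row per <object> element in a Pascal VOC style annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical annotation for illustration only; not taken from the real dataset.
EXAMPLE_XML = """<annotation>
  <filename>example_01.png</filename>
  <size><width>3312</width><height>3312</height><depth>3</depth></size>
  <object>
    <name>spheroid</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>500</ymax></bndbox>
  </object>
</annotation>"""

if __name__ == "__main__":
    print(rows_to_csv(voc_xml_to_rows(EXAMPLE_XML)))
```

Inspecting the intermediate CSV this way makes it easier to confirm that every image has at least one box before the TFRecords are generated.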

File structure

object_detection/data/train.record
object_detection/data/eval.record
object_detection/data/spheroid_label_map.pbtxt
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt.data-00000-of-00001
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt.index
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt.meta
object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/faster_rcnn_resnet101_coco.config
object_detection/training/

Output

C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\utils\visualization_utils.py:25: UserWarning:
    This call to matplotlib.use() has no effect because the backend has already
    been chosen; matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
    or matplotlib.backends is imported for the first time.

    The backend was *originally* set to 'TkAgg' by the following code:
    File "model_main.py", line 26, in <module>
        from object_detection import model_lib
    File "C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\model_lib.py", line 26, in <module>
        from object_detection import eval_util
    File "C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\eval_util.py", line 28, in <module>
        from object_detection.metrics import coco_evaluation
    File "C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\metrics\coco_evaluation.py", line 20, in <module>
        from object_detection.metrics import coco_tools
    File "C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\metrics\coco_tools.py", line 47, in <module>
        from pycocotools import coco
    File "C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\pycocotools\coco.py", line 49, in <module>
        import matplotlib.pyplot as plt
    File "C:\Users\Alexm\my-venv\lib\site-packages\matplotlib\pyplot.py", line 71, in <module>
        from matplotlib.backends import pylab_setup
    File "C:\Users\Alexm\my-venv\lib\site-packages\matplotlib\backends\__init__.py", line 16, in <module>
        line for line in traceback.format_stack()

    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
    WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x0000020D98AF79D8>) includes params argument, but params are not passed to Estimator.
    WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
    WARNING:tensorflow:From C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\predictors\mask_rcnn_heads\box_head.py:76: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
    Instructions for updating:
    keep_dims is deprecated, use keepdims instead
    WARNING:tensorflow:From C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\meta_architectures\faster_rcnn_meta_arch.py:2070: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please switch to tf.train.get_or_create_global_step
    WARNING:tensorflow:From C:\Users\Alexm\my-venv\Lib\site-packages\tensorflow\models\research\object_detection\core\losses.py:317: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
    Instructions for updating:

    Future major versions of TensorFlow will allow gradients to flow
    into the labels input on backprop by default.

    See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

    C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\ops\gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
    2018-08-08 10:46:02.440678: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    2018-08-08 10:46:02.717546: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
    name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8475
    pciBusID: 0000:02:00.0
    totalMemory: 8.00GiB freeMemory: 6.59GiB
    2018-08-08 10:46:02.722863: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
    2018-08-08 10:46:03.603798: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
    2018-08-08 10:46:03.606244: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958]      0
    2018-08-08 10:46:03.608054: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0:   N
    2018-08-08 10:46:03.610095: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6364 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
    2018-08-08 10:46:28.351472: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 345 of 2048
    2018-08-08 10:46:38.337869: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 537 of 2048
    2018-08-08 10:46:48.592470: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 780 of 2048
    2018-08-08 10:46:58.358157: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 967 of 2048
    2018-08-08 10:47:08.412867: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 1157 of 2048
    2018-08-08 10:47:18.415555: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 1364 of 2048
    2018-08-08 10:47:28.401811: I T:\src\github\tensorflow\tensorflow\core\kernels\data\shuffle_dataset_op.cc:95] Filling up shuffle buffer (this may take a while): 1559 of 2048

Config file

# Faster R-CNN with Resnet-101 (v1), configuration for MSCOCO Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 4000
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "models/faster_rcnn_resnet101_coco_2018_01_28/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "data/train.record"
  }
  label_map_path: "data/spheroid_label_map.pbtxt"
}

eval_config: {
  num_examples: 1000
  metrics_set: "coco_detection_metrics"
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "data/eval.record"
  }
  label_map_path: "data/spheroid_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
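
For reference, the effect of the image_resizer block in the config above can be sketched as below. This is an approximation of how a keep-aspect-ratio resizer is generally described (scale the shorter side to min_dimension unless that pushes the longer side past max_dimension); the function name is hypothetical and the real implementation in the Object Detection API may differ in rounding details.

```python
def keep_aspect_ratio_size(height, width, min_dimension, max_dimension):
    """Sketch of the output size chosen by a keep-aspect-ratio resizer."""
    # First try scaling so that the shorter side equals min_dimension.
    large_scale = min_dimension / min(height, width)
    large = (round(height * large_scale), round(width * large_scale))
    if max(large) <= max_dimension:
        return large
    # Otherwise scale so that the longer side equals max_dimension.
    small_scale = max_dimension / max(height, width)
    return (round(height * small_scale), round(width * small_scale))


if __name__ == "__main__":
    # Image size from the issue, resizer limits from the config.
    print(keep_aspect_ratio_size(3312, 3312, 600, 4000))
```

Under these assumptions the square 3312x3312 inputs are scaled down by the resizer before they reach the feature extractor, so max_dimension: 4000 never comes into play for this dataset.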

tensorflowbutler commented 6 years ago

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Bazel version

alexmagsam commented 6 years ago

Bazel Version N/A

k-w-w commented 6 years ago

Does the program terminate without throwing any errors or system messages? Since it seems to end while filling up the shuffle buffer, perhaps try lowering the shuffle_buffer_size value.
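
A rough way to see why the buffer size could matter is to estimate its memory footprint. The sketch below is only arithmetic under stated assumptions: the 2048 figure comes from the log output, the image dimensions from the issue description, and the function name is hypothetical. It computes a worst case in which every buffered element is a fully decoded uint8 image; if the buffer holds compressed serialized records instead, the real figure would be smaller.

```python
def decoded_buffer_bytes(height, width, channels, buffer_size, bytes_per_value=1):
    """Upper-bound estimate of shuffle buffer memory if decoded images are buffered."""
    return height * width * channels * bytes_per_value * buffer_size


if __name__ == "__main__":
    # 3312x3312x3 images (from the issue) and the 2048-element buffer seen in the log.
    for size in (2048, 1024, 64):
        gib = decoded_buffer_bytes(3312, 3312, 3, size) / 2**30
        print(f"shuffle_buffer_size={size}: up to {gib:.1f} GiB")
```

With only 32 training images, a buffer far smaller than 2048 would still cover the whole dataset.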

alexmagsam commented 6 years ago

Yes, the output I posted was copied and pasted directly. Where can I set shuffle_buffer_size? Can you refer me to the documentation or the file where it is set? This setting is not listed in my configuration file.

k-w-w commented 6 years ago

The configuration instructions are here: https://github.com/tensorflow/models/blob/b9ca525f88cd942882ca541ec5ac9d27bb87a24f/research/object_detection/g3doc/configuring_jobs.md

Taking a look at the InputReader proto, there's an optional field called "shuffle_buffer_size".

So, for example, you can set the train_input_reader field in the config file like this:

train_input_reader: {
  shuffle_buffer_size: 1024
}

If you have further questions about configuring the pipeline, Stack Overflow would be the best place to ask. If you are still encountering an issue that you believe is a bug or a feature request, feel free to open another issue.

kulsemig commented 6 years ago

Hi @alexmagsam, did you solve your problem with "Filling up shuffle buffer"?

alexmagsam commented 6 years ago

Partially, yes. I added shuffle: false to the train_input_reader field, like so:

train_input_reader: {
   shuffle: false
}

But now I receive a different error after training begins.

  File "model_main.py", line 101, in <module>
    tf.app.run()
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "model_main.py", line 97, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\training.py", line 447, in train_and_evaluate
    return executor.run()
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\training.py", line 531, in run
    return self.run_local()
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\training.py", line 669, in run_local
    hooks=train_hooks)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1135, in _train_model_default
    saving_listeners)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1336, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 577, in run
    run_metadata=run_metadata)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1053, in run
    run_metadata=run_metadata)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1144, in run
    raise six.reraise(*original_exc_info)
  File "C:\Users\Alexm\my-venv\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1129, in run
    return self._sess.run(*args, **kwargs)
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1209, in run
    run_metadata=run_metadata))
  File "C:\Users\Alexm\my-venv\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
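
The traceback does not say what produced the NaN. One thing that can be ruled out cheaply is malformed annotations, for example boxes with zero area or coordinates outside the image. The sketch below is a hedged illustration and not a confirmed cause for this issue: it assumes the intermediate CSV uses the column names `width`, `height`, `xmin`, `ymin`, `xmax`, `ymax`, and the function name and sample rows are hypothetical.

```python
import csv
import io


def find_bad_boxes(csv_text):
    """Return (line_number, reason) pairs for rows with suspicious boxes."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        width, height = int(row["width"]), int(row["height"])
        xmin, ymin = int(row["xmin"]), int(row["ymin"])
        xmax, ymax = int(row["xmax"]), int(row["ymax"])
        if xmin >= xmax or ymin >= ymax:
            problems.append((line_number, "zero or negative box size"))
        elif xmin < 0 or ymin < 0 or xmax > width or ymax > height:
            problems.append((line_number, "box extends outside the image"))
    return problems


# Hypothetical rows for illustration only; not taken from the real dataset.
EXAMPLE_CSV = """filename,width,height,class,xmin,ymin,xmax,ymax
a.png,100,100,spheroid,10,10,50,50
b.png,100,100,spheroid,60,10,60,50
c.png,100,100,spheroid,10,10,150,50
"""

if __name__ == "__main__":
    for line_number, reason in find_bad_boxes(EXAMPLE_CSV):
        print(f"line {line_number}: {reason}")
```

If the annotations pass a check like this, the learning rate and the input image scaling would be the next settings to look at.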

JanithT-Lboro commented 5 years ago

Hello @alexmagsam, did you ever find a solution to your problem? I am currently experiencing the same thing.
