tensorflow / models

Models and examples built with TensorFlow

Retraining on COCO from scratch using given code, got stuck when saving the checkpoint 0 #7555

Open BrownZong opened 5 years ago

BrownZong commented 5 years ago

What is the top-level directory of the model you are using: models/research/object_detection/
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): pip3 install tensorflow-gpu
TensorFlow version (use command below): 1.14.0
Bazel version (if compiling from source): N/A
CUDA/cuDNN version: CUDA 10.0, cuDNN 7.3.1
GPU model and memory: GeForce RTX 2080 Ti, 11 GB
Exact command to reproduce:

python3 object_detection/model_main.py \
    --pipeline_config_path="/data/code/vision_ori/models/research/object_detection/samples/configs/faster_rcnn_resnet50_coco.config" \
    --model_dir="/data/code/vision_ori/my_checkpoints" \
    --num_train_steps=200000 \
    --sample_1_of_n_eval_examples=1 \
    --alsologtostderr

Describe the problem

I want to retrain Faster R-CNN on the MS COCO dataset from scratch with model_main.py. First I generated tfrecord files using create_coco_tf_record.py with the COCO 2017 Detection data, which gave me train/val files like coco_train.record-00000-of-00100. After that I ran model_main.py, and the command window printed many warning logs. Then it got stuck at "Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt". I checked carefully and found that the process was stuck while building a new MonitoredSession object.
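To confirm that the generated records themselves are readable, one shard can be checked with a quick one-liner like the one below (the shard path is just an example from my setup; TF 1.x API):

# Count the records in one train shard to verify the tfrecord file is readable
python3 -c "import tensorflow as tf; print(sum(1 for _ in tf.python_io.tf_record_iterator('cocodata/coco_train.record-00000-of-00100')))"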

Source code / logs

logs:

2019-09-12 14:36:27.183614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1b:00.0
2019-09-12 14:36:27.185145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1c:00.0
2019-09-12 14:36:27.186642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1d:00.0
2019-09-12 14:36:27.188176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1e:00.0
2019-09-12 14:36:27.190511: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 14:36:27.242760: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 14:36:27.293652: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 14:36:27.324475: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 14:36:27.345063: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 14:36:27.355112: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/data/cuda/cuda-10.0/cuda/lib64:/data/cuda/cuda-10.0/cuda/lib64
2019-09-12 14:36:27.355137: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-09-12 14:36:27.355518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-12 14:36:27.355536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 
2019-09-12 14:36:27.355553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N N N N 
2019-09-12 14:36:27.355569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   N N N N 
2019-09-12 14:36:27.355584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N N 
2019-09-12 14:36:27.355600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N N N 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0912 14:36:27.357658 140603477600000 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /data/code/vision_ori/my_checkpoints/model.ckpt-0
I0912 14:36:27.359247 140603477600000 saver.py:1280] Restoring parameters from /data/code/vision_ori/my_checkpoints/model.ckpt-0
2019-09-12 14:36:28.141141: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
W0912 14:36:28.481194 140603477600000 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0912 14:36:29.053177 140603477600000 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0912 14:36:29.196373 140603477600000 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt.
I0912 14:36:33.405228 140603477600000 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt.

It has been stuck here for days and just won't go on.

building tf-record:

python3 create_coco_tf_record.py --logtostderr \
--train_image_dir="/data/code/vision_ori/dataset/train2017" \
--val_image_dir="/data/code/vision_ori/dataset/val2017" \
--test_image_dir="/data/code/vision_ori/dataset/test2017" \
--train_annotations_file="/data/code/vision_ori/dataset/anno/instances_train2017.json" \
--val_annotations_file="/data/code/vision_ori/dataset/anno/annotations/instances_val2017.json" \
--testdev_annotations_file="/data/code/vision_ori/dataset/anno/annotations/image_info_test-dev2017.json" \
--output_dir="cocodata"

trying to train:

python3 object_detection/model_main.py \
    --pipeline_config_path="/data/code/vision_ori/models/research/object_detection/samples/configs/faster_rcnn_inception_resnet_v2_atrous_coco.config" \
    --model_dir="/data/code/vision_ori/my_checkpoints" \
    --num_train_steps=200000 \
    --sample_1_of_n_eval_examples=1 \
    --alsologtostderr

config file:

I want to train from scratch, so I removed these two lines:

fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
model {
  faster_rcnn {
    num_classes: 90
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/code/vision_ori/models/research/object_detection/dataset_tools/cocodata/coco_train.record-00000-of-00100"
  }
  label_map_path: "/data/code/vision_ori/models/research/object_detection/data/mscoco_label_map.pbtxt"
}

eval_config: {
  num_examples: 5000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/code/vision_ori/models/research/object_detection/dataset_tools/cocodata/coco_val.record-00000-of-00010"
  }
  label_map_path: "/data/code/vision_ori/models/research/object_detection/data/mscoco_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
BraginIvan commented 5 years ago

In your logs

No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/data/cuda/cuda-10.0/cuda/lib64:/data/cuda/cuda-10.0/cuda/lib64

So you are training on the CPU, and TF is not compiled for your CPU:

Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.

So it is not stuck; it is just very slow. The next log line will appear after 100 steps by default.
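You can double-check this from the same environment; for example (just a quick sanity check, adjust to your setup):

# Is cuDNN visible to the dynamic loader?
ldconfig -p | grep libcudnn
# Does this TF build register any GPU? (TF 1.x API)
python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"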

If you have issues with your CUDA installation, just use the Docker images provided by TensorFlow.
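For example (assuming nvidia-docker is set up; the tag below is just an example matching your TF version, and the /data mount is illustrative):

# Pull the GPU image for TF 1.14 and start a shell in it with GPU access
docker pull tensorflow/tensorflow:1.14.0-gpu-py3
docker run --runtime=nvidia -it -v /data:/data tensorflow/tensorflow:1.14.0-gpu-py3 bash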