tensorflow / models

Models and examples built with TensorFlow

Retraining on COCO from scratch using given code, got stuck when saving the checkpoint 0 #7555

Open BrownZong opened 5 years ago

BrownZong commented 5 years ago

What is the top-level directory of the model you are using: models/research/object_detection/
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): pip3 install tensorflow-gpu
TensorFlow version (use command below): 1.14.0
Bazel version (if compiling from source): N/A
CUDA/cuDNN version: CUDA 10.0, cuDNN 7.3.1
GPU model and memory: GeForce RTX 2080 Ti, 11 GB
Exact command to reproduce:

python3 object_detection/model_main.py \
    --pipeline_config_path="/data/code/vision_ori/models/research/object_detection/samples/configs/faster_rcnn_resnet50_coco.config" \
    --model_dir="/data/code/vision_ori/my_checkpoints" \
    --num_train_steps=200000 \
    --sample_1_of_n_eval_examples=1 \
    --alsologtostderr

Describe the problem

I want to retrain Faster R-CNN on the MS COCO dataset from scratch with model_main.py. First I generated tfrecord files using create_coco_tf_record.py with the COCO 2017 Detection data, which gave me train/val files like coco_train.record-00000-of-00100. After that I ran model_main.py, and the command window printed many warning logs. Then it got stuck at "Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt". I checked carefully and found that the process was stuck while building a new MonitoredSession object.
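To confirm that the generated records themselves are readable, one shard can be checked with a quick one-liner like the one below (the shard path is just an example from my setup; TF 1.x API):

# Count the records in one train shard to verify the tfrecord file is readable
python3 -c "import tensorflow as tf; print(sum(1 for _ in tf.python_io.tf_record_iterator('cocodata/coco_train.record-00000-of-00100')))"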

Source code / logs

logs:

2019-09-12 14:36:27.183614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1b:00.0
2019-09-12 14:36:27.185145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1c:00.0
2019-09-12 14:36:27.186642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1d:00.0
2019-09-12 14:36:27.188176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:1e:00.0
2019-09-12 14:36:27.190511: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-12 14:36:27.242760: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-12 14:36:27.293652: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-12 14:36:27.324475: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-12 14:36:27.345063: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-12 14:36:27.355112: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/data/cuda/cuda-10.0/cuda/lib64:/data/cuda/cuda-10.0/cuda/lib64
2019-09-12 14:36:27.355137: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-09-12 14:36:27.355518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-12 14:36:27.355536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 
2019-09-12 14:36:27.355553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N N N N 
2019-09-12 14:36:27.355569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   N N N N 
2019-09-12 14:36:27.355584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N N 
2019-09-12 14:36:27.355600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N N N 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
W0912 14:36:27.357658 140603477600000 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /data/code/vision_ori/my_checkpoints/model.ckpt-0
I0912 14:36:27.359247 140603477600000 saver.py:1280] Restoring parameters from /data/code/vision_ori/my_checkpoints/model.ckpt-0
2019-09-12 14:36:28.141141: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
W0912 14:36:28.481194 140603477600000 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0912 14:36:29.053177 140603477600000 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0912 14:36:29.196373 140603477600000 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt.
I0912 14:36:33.405228 140603477600000 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /data/code/vision_ori/my_checkpoints/model.ckpt.

It has been stuck here for days and just won't go on.

building tf-record:

python3 create_coco_tf_record.py --logtostderr \
--train_image_dir="/data/code/vision_ori/dataset/train2017" \
--val_image_dir="/data/code/vision_ori/dataset/val2017" \
--test_image_dir="/data/code/vision_ori/dataset/test2017" \
--train_annotations_file="/data/code/vision_ori/dataset/anno/instances_train2017.json" \
--val_annotations_file="/data/code/vision_ori/dataset/anno/annotations/instances_val2017.json" \
--testdev_annotations_file="/data/code/vision_ori/dataset/anno/annotations/image_info_test-dev2017.json" \
--output_dir="cocodata"

trying to train:

python3 object_detection/model_main.py \
    --pipeline_config_path="/data/code/vision_ori/models/research/object_detection/samples/configs/faster_rcnn_inception_resnet_v2_atrous_coco.config" \
    --model_dir="/data/code/vision_ori/my_checkpoints" \
    --num_train_steps=200000 \
    --sample_1_of_n_eval_examples=1 \
    --alsologtostderr

config file:

I want to train from scratch, so I removed these two lines:

fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
model {
  faster_rcnn {
    num_classes: 90
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/data/code/vision_ori/models/research/object_detection/dataset_tools/cocodata/coco_train.record-00000-of-00100"
  }
  label_map_path: "/data/code/vision_ori/models/research/object_detection/data/mscoco_label_map.pbtxt"
}

eval_config: {
  num_examples: 5000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/data/code/vision_ori/models/research/object_detection/dataset_tools/cocodata/coco_val.record-00000-of-00010"
  }
  label_map_path: "/data/code/vision_ori/models/research/object_detection/data/mscoco_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
BraginIvan commented 5 years ago

In your logs

No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/data/cuda/cuda-10.0/cuda/lib64:/data/cuda/cuda-10.0/cuda/lib64

So you are training on the CPU, and TF is not compiled for your CPU:

Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.

So it is not stuck; it is just very slow. The next log line will appear after 100 steps by default.
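You can double-check this from the same environment; for example (just a quick sanity check, adjust to your setup):

# Is cuDNN visible to the dynamic loader?
ldconfig -p | grep libcudnn
# Does this TF build register any GPU? (TF 1.x API)
python3 -c "import tensorflow as tf; print(tf.test.is_gpu_available())"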

If you have issues with your CUDA installation, just use the Docker images provided by TensorFlow.
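For example (assuming nvidia-docker is set up; the tag below is just an example matching your TF version, and the /data mount is illustrative):

# Pull the GPU image for TF 1.14 and start a shell in it with GPU access
docker pull tensorflow/tensorflow:1.14.0-gpu-py3
docker run --runtime=nvidia -it -v /data:/data tensorflow/tensorflow:1.14.0-gpu-py3 bash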