Hi, I managed to run the faster_rcnn KITTI model with transfer learning on my own dataset up to ~80k steps by disabling the workers, leaving only the standard_gpu master and two large_model parameter servers. I suspect the issue is caused by an OOM error, as the GPU's (Tesla) memory is at ~40%.
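For reference, the scaling configuration I used looked roughly like this (a sketch of a Cloud ML Engine config.yaml from memory; the tier names and counts reflect my setup, adjust as needed):

trainingInput:
  runtimeVersion: "1.6"
  scaleTier: CUSTOM
  masterType: standard_gpu        # single GPU master does all the training
  workerCount: 0                  # workers disabled
  parameterServerType: large_model
  parameterServerCount: 2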
Hi @MoMo-Tech, how can removing the workers alleviate GPU OOM issues?
Hi @pkulzc, if you repeat the same experiment with Cloud ML Engine TF 1.5 instead of 1.6, you'll reproduce #3757. I believe they are the same issue, manifesting differently in TF 1.5 and TF 1.6.
On my initial runs I noticed that the workers were assigned to the same machine ID. With this in mind I changed the GPU type to the next Tesla tier and saw a few more steps computed, but it still failed after ~10K epochs. A few other tickets also suggest the OOM may be handled by the OS itself, which would explain the OS error.
Hi @pkulzc, today I pulled the HEAD of the models repository and tried again, and the problem still persists. However, I got more informative error messages than before:
master-replica-0
Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 =
_Recv[client_terminated=false,
recv_device="/job:ps/replica:0/task:2/device:CPU:0",
send_device="/job:master/replica:0/task:0/device:CPU:0",
send_device_incarnation=4406029107278666217,
tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv",
tensor_type=DT_FLOAT,
_device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 =
_Recv[client_terminated=false,
recv_device="/job:master/replica:0/task:0/device:CPU:0",
send_device="/job:ps/replica:0/task:2/device:CPU:0",
send_device_incarnation=7295333245552102225,
tensor_name="edge_32739_Momentum/update/NoOp_2",
tensor_type=DT_FLOAT,
_device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 =
_Recv[client_terminated=false,
recv_device="/job:ps/replica:0/task:3/device:CPU:0",
send_device="/job:master/replica:0/task:0/device:CPU:0",
send_device_incarnation=4406029107278666217,
tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv",
tensor_type=DT_FLOAT,
_device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.
worker-replica-0
Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
2018-04-04 16:19:47.928690: E tensorflow/core/distributed_runtime/master_session.cc:1663] Cleanup partition error: Unavailable: OS Error
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.
ml-engine
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
Hi @pkulzc, thank you for investigating this issue!
Could you please share any updates on its status?
Thanks 👍
Sorry for the delay; the Cloud team and the TensorFlow team are still investigating.
Great!
Thanks for the update
Hi @pkulzc, I think I may have a lead:
On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession instead, as documented in the tf.train.Supervisor docs.
This migration is already requested in tensorflow/tensorflow#15793 and is reported as a solution to tensorflow/tensorflow#17852 in the last comment of yahoo/TensorFlowOnSpark#245.
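For illustration, here is a minimal sketch of what that migration could look like (hypothetical names, not the actual trainer code):

import tensorflow as tf

def train_loop(train_op, target, is_chief, train_dir):
    # Hypothetical sketch: replacing slim's Supervisor-based loop with
    # tf.train.MonitoredTrainingSession, which handles initialization,
    # checkpointing, and recovery from transient PS/worker failures.
    with tf.train.MonitoredTrainingSession(
        master=target,            # session target of this task's server
        is_chief=is_chief,        # only the chief initializes/restores variables
        checkpoint_dir=train_dir  # checkpoints and summaries written here
    ) as sess:
        while not sess.should_stop():
            sess.run(train_op)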
@MoMo-Tech I also managed to run transfer learning based on the ssd_mobilenet_v1_coco_2017_11_17 model up to ~80k steps, and it stopped with the following error.
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 410, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: OS Error
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1466216932&resource=ml_job%2Fjob_id%2Ffreshturf_object_detection_1530117506&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22freshturf_object_detection_1530117506%22
The master's memory usage was around 38.8–49.7%.
I had the same error when the TensorFlow instances were trying to contact each other. I found that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. To do so, set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md. Maybe this can help.
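A minimal way to apply this from Python, assuming you launch via a Python entry point (the variable must be set before TensorFlow, and hence gRPC, is first imported):

import os

# Force gRPC's portable "poll" engine instead of the native "epoll" engine.
# This must happen before tensorflow (and thus gRPC) is first imported.
os.environ["GRPC_POLL_STRATEGY"] = "poll"

import tensorflow as tf  # imported only after the variable is set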
I believe this issue is gone since our switch to the tf.estimator framework. Closing this.
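(For anyone syncing: training now goes through the estimator-based object_detection/model_main.py instead of train.py, roughly as below; please verify the flag names against your checkout.)

python -m object_detection.model_main \
  --pipeline_config_path=gs://PATH_TO_BE_CONFIGURED/pipeline.config \
  --model_dir=gs://PATH_TO_BE_CONFIGURED/model_dir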
@pkulzc Can you please clarify how this issue was solved? I am still facing the same problem.
@moussas1 Have you synced to the latest code?
Describe the problem
While running a training job on Cloud ML Engine (Runtime Version 1.6), after running for a while (~5900 steps), the job fails and worker-replica-0 reports the following message:
Finally, worker-replica-0 returns:
Please see logs/Traceback below.
System information
train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 3e-4
          schedule {
            step: 900000
            learning_rate: 3e-5
          }
          schedule {
            step: 1200000
            learning_rate: 3e-6
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 30000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_train.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
}
eval_config: {
  num_examples: 25
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 30
  visualization_export_dir: "gs://PATH_TO_BE_CONFIGURED/visualization"
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_val.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error