Hi, I managed to run the faster_rcnn KITTI model with transfer learning on my own dataset up to ~80k steps by disabling the workers, leaving only the standard_gpu master and two large_model parameter servers. I suspect the issue is caused by an OOM error, as the GPU's (Tesla) memory is at ~40%.
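For reference, the scaling configuration I used looked roughly like this (a sketch of a Cloud ML Engine config.yaml from memory; the tier names and counts reflect my setup, adjust as needed):

trainingInput:
  runtimeVersion: "1.6"
  scaleTier: CUSTOM
  masterType: standard_gpu        # single GPU master does all the training
  workerCount: 0                  # workers disabled
  parameterServerType: large_model
  parameterServerCount: 2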
Hi @MoMo-Tech, how can removing the workers alleviate GPU OOM issues?
Hi @pkulzc, if you repeat the same experiment with Cloud ML Engine TF 1.5 instead of 1.6, you'll reproduce #3757. I believe they are the same issue, manifesting differently in TF 1.5 and TF 1.6.
On my initial runs I noticed that the workers were assigned to the same machine ID. With this in mind I changed the GPU type to the next Tesla tier and saw a few more steps computed, but it still failed after ~10K epochs. A few other tickets also suggest the OOM may be handled by the OS itself, which would explain the OS error.
Hi @pkulzc, today I pulled the HEAD of the models repository and tried again, and the problem still persists. However, I got more informative error messages than before:
master-replica-0
Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 =
_Recv[client_terminated=false,
recv_device="/job:ps/replica:0/task:2/device:CPU:0",
send_device="/job:master/replica:0/task:0/device:CPU:0",
send_device_incarnation=4406029107278666217,
tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv",
tensor_type=DT_FLOAT,
_device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 =
_Recv[client_terminated=false,
recv_device="/job:master/replica:0/task:0/device:CPU:0",
send_device="/job:ps/replica:0/task:2/device:CPU:0",
send_device_incarnation=7295333245552102225,
tensor_name="edge_32739_Momentum/update/NoOp_2",
tensor_type=DT_FLOAT,
_device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 =
_Recv[client_terminated=false,
recv_device="/job:ps/replica:0/task:3/device:CPU:0",
send_device="/job:master/replica:0/task:0/device:CPU:0",
send_device_incarnation=4406029107278666217,
tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv",
tensor_type=DT_FLOAT,
_device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.
worker-replica-0
Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
2018-04-04 16:19:47.928690: E tensorflow/core/distributed_runtime/master_session.cc:1663] Cleanup partition error: Unavailable: OS Error
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:98: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Converting sparse IndexedSlices to a dense Tensor of unknown shape.
ml-engine
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
[[Node: clip_grads/clip_by_norm_74/truediv_S10171 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_32391_clip_grads/clip_by_norm_74/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S10192 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=7295333245552102225, tensor_name="edge_32739_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_213/truediv_S9737 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:3/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=4406029107278666217, tensor_name="edge_26466_clip_grads/clip_by_norm_213/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:3/device:CPU:0"]()]]
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 503, in train_step
if sess.run(train_step_kwargs['should_log']):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
Hi @pkulzc, thank you for investigating this issue!
Could you please share any updates on its status?
Thanks 👍
Sorry for the delay; the Cloud team and the TensorFlow team are still investigating.
Great!
Thanks for the update
Hi @pkulzc, I think I may have a lead:
On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession instead, as documented in the tf.train.Supervisor docs.
This migration is already requested in tensorflow/tensorflow#15793 and is reported as a solution to tensorflow/tensorflow#17852 in the last comment of yahoo/TensorFlowOnSpark#245.
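For illustration, here is a minimal sketch of what that migration could look like (hypothetical names, not the actual trainer code):

import tensorflow as tf

def train_loop(train_op, target, is_chief, train_dir):
    # Hypothetical sketch: replacing slim's Supervisor-based loop with
    # tf.train.MonitoredTrainingSession, which handles initialization,
    # checkpointing, and recovery from transient PS/worker failures.
    with tf.train.MonitoredTrainingSession(
        master=target,            # session target of this task's server
        is_chief=is_chief,        # only the chief initializes/restores variables
        checkpoint_dir=train_dir  # checkpoints and summaries written here
    ) as sess:
        while not sess.should_stop():
            sess.run(train_op)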
@MoMo-Tech I also managed to run transfer learning based on the ssd_mobilenet_v1_coco_2017_11_17 model up to ~80k steps, and it stopped with the following error.
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
graph_hook_fn=graph_rewriter_fn)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 410, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: OS Error
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1466216932&resource=ml_job%2Fjob_id%2Ffreshturf_object_detection_1530117506&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22freshturf_object_detection_1530117506%22
The master's memory usage was around 38.8–49.7%.
I had the same error when the TensorFlow instances were trying to contact each other. I found that the problem is that gRPC uses the native "epoll" polling engine for communication; switching to a portable polling engine solved the issue for me. To do so, set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md. Maybe this can help.
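A minimal way to apply this from Python, assuming you launch via a Python entry point (the variable must be set before TensorFlow, and hence gRPC, is first imported):

import os

# Force gRPC's portable "poll" engine instead of the native "epoll" engine.
# This must happen before tensorflow (and thus gRPC) is first imported.
os.environ["GRPC_POLL_STRATEGY"] = "poll"

import tensorflow as tf  # imported only after the variable is set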
I believe this issue is gone since our switch to the tf.estimator framework. Closing this.
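(For anyone syncing: training now goes through the estimator-based object_detection/model_main.py instead of train.py, roughly as below; please verify the flag names against your checkout.)

python -m object_detection.model_main \
  --pipeline_config_path=gs://PATH_TO_BE_CONFIGURED/pipeline.config \
  --model_dir=gs://PATH_TO_BE_CONFIGURED/model_dir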
@pkulzc Can you please clarify how this issue was solved? I am still facing the same problem.
@moussas1 Have you synced to the latest code?
Describe the problem
While running a training job on Cloud ML Engine (Runtime Version 1.6), after running for a while (~5900 steps), the job fails and worker-replica-0 reports the following message:
Finally, worker-replica-0 returns:
Please see logs/Traceback below.
System information
train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 3e-4
          schedule {
            step: 900000
            learning_rate: 3e-5
          }
          schedule {
            step: 1200000
            learning_rate: 3e-6
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
  num_steps: 30000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_train.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
}
eval_config: {
  num_examples: 25
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 30
  visualization_export_dir: "gs://PATH_TO_BE_CONFIGURED/visualization"
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://PATH_TO_BE_CONFIGURED/agroset_val.record"
  }
  label_map_path: "gs://PATH_TO_BE_CONFIGURED/agroset_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 370, in train
saver=saver)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 768, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error