Closed Frank1993 closed 1 year ago
In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than 2 years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen. Even better, please come to the working group meeting to discuss your issue.
When training with tf.contrib.distribute.CollectiveAllReduceStrategy on TF 1.13, my job always fails with either an "OS Error" or a "Socket closed" error after training for around 1000 steps.
Job configuration: 16 workers with 2 P100 GPUs each.
Can anyone help investigate this problem?
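For readers trying to reproduce this setup, here is a minimal sketch of how the per-worker `TF_CONFIG` for a 16-worker job might be built. It assumes a worker-only cluster; the host names, the port, and the helper name `make_tf_config` are made up for illustration and are not taken from the actual job.

```python
import json


def make_tf_config(worker_hosts, task_index):
    """Return a TF_CONFIG JSON string for one worker of a worker-only cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index out of range: {}".format(task_index))
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    })


# Hypothetical addresses: 16 workers, matching the worker count in this report.
WORKER_HOSTS = ["worker-{}.example.com:2222".format(i) for i in range(16)]

# Each worker process would export its own value before starting training,
# e.g. os.environ["TF_CONFIG"] = make_tf_config(WORKER_HOSTS, 14) for worker 14.
print(make_tf_config(WORKER_HOSTS, 14))
```

Every worker needs the same `cluster` section and a distinct `task.index`; a mismatch between workers is one thing worth ruling out when collective ops abort.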
And the error looks like this:
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.093 sec)
INFO:tensorflow:global_step/sec: 1.75156
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.101 sec)
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.096 sec)
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.129 sec)
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.107 sec)
2019-05-20 12:05:41.289749: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.289831: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.291604: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.291686: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.290341: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.290434: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.290454: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.290464: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.290522: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.290601: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.293137: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.293230: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.293249: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.293258: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.296025: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.296058: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.294761: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.294858: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.295484: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.295580: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.295600: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.295608: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.298684: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.298772: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.300189: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.300261: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.300279: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.300289: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.298252: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.298290: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.298330: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.298343: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.303394: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.303428: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.303456: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.303468: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.304374: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.304457: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.304475: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.304485: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.308031: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.308170: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.307822: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.307908: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.308408: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.308509: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.308903: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.308987: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.309370: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.309451: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.309470: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.309480: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312410: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312443: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312471: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312483: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312514: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312525: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312554: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312568: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312619: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312638: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312660: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312671: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
Segmentation fault (core dumped)
2019-05-20T12:05:44.327Z: [1,1]:[2019-05-20 12:05:44,324] ERROR: worker_14 failed, status=139
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,324] INFO: App final status on Node_1_container-e559-1557431881457-14103-01-000003_0:
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,324] INFO: worker_14, failed, status=139
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,325] INFO: Succeed=0, Failed=1, Killed=0
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,325] ERROR: Launch app failed
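A note on `status=139`: assuming the launcher reports exit statuses using the usual shell convention (128 plus the signal number for a process killed by a signal), 139 corresponds to signal 11, which matches the "Segmentation fault (core dumped)" line. A small sketch for decoding such statuses (the helper name `describe_exit_status` is made up for illustration):

```python
import signal


def describe_exit_status(status):
    """Map a shell-style exit status to a signal name when it encodes one."""
    if status > 128:
        signum = status - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return "signal {}".format(signum)
    return "exit code {}".format(status)


# 139 - 128 = 11, which is SIGSEGV on Linux.
print(describe_exit_status(139))
```

So the worker process itself crashed; the "OS Error" aborts seen by the other workers may simply be a consequence of that peer going away.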
For the "Socket closed" failure, the log looks like this:
`
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.581 sec)
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.594 sec)
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.597 sec)
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.608 sec)
INFO:tensorflow:global_step/sec: 1.97598
2019-05-19 15:08:13.128241: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.128323: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.128342: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.128351: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.126946: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.127033: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.129537: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.129566: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.129590: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.129601: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130017: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130097: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130117: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130126: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130641: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130729: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130747: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130756: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.131987: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.132049: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.132163: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
2019-05-19 15:08:13.132238: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed [[{{node allreduce_7/CollectiveReduce}}]] [[{{node Adam/update_1_207/ReadVariableOp}}]]
2019-05-19 15:08:13.132276: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed [[{{node allreduce_7/CollectiveReduce}}]] [[{{node allreduce_7/CollectiveReduce_1}}]]
2019-05-19 15:08:13.132390: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed [[{{node allreduce_7/CollectiveReduce}}]]
2019-05-19 15:08:13.133888: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134009: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134030: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134039: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134861: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134937: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134955: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134965: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134475: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134551: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.135917: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.136015: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.136034: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.136042: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134570: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134580: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134897: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134940: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134969: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134980: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.135004: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.135014: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.135074: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
2019-05-19 15:08:13.135142: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed [[{{node scoped_allocator_280_CollectiveReduce}}]] [[{{node GroupCrossDeviceControlEdges_0/Adam/update_0_207/Const}}]]
2019-05-19 15:08:13.135569: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
2019-05-19 15:08:13.137926: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.138013: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.138169: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Cancelled: Cancelled
2019-05-19 15:08:13.138199: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Cancelled: Cancelled
2019-05-19 15:08:13.138353: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Cancelled: Cancelled
2019-05-19 15:08:13.138429: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Cancelled: Cancelled [[{{node scoped_allocator_100_CollectiveReduce}}]] [[{{node Adam/update_0_207/ReadVariableOp}}]]
2019-05-19 15:08:13.138600: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Cancelled: Cancelled [[{{node scoped_allocator_100_CollectiveReduce}}]]
2019-05-19 15:08:13.139053: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
Traceback (most recent call last):
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled
[[{{node scoped_allocator_100_CollectiveReduce}}]]
[[{{node Adam/update_0_207/ReadVariableOp}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/apprunner/.working/runtime/app/tensorflow_estimator_bminist/run_pretraining.py", line 471, in
tf.app.run()
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/tmp/apprunner/.working/runtime/app/tensorflow_estimator_bminist/run_pretraining.py", line 445, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/training.py", line 462, in train_and_evaluate
estimator, train_spec, eval_spec, _TrainingExecutor)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/estimator_training.py", line 289, in train_and_evaluate
session_config=run_config.session_config)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 823, in run_distribute_coordinator
task_id, session_config, rpc_layer)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 359, in _run_single_worker
return worker_fn(strategy)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/estimator_training.py", line 251, in _worker_fn
hooks=hooks)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1287, in _actual_train_model_distributed
saving_listeners)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimatorspec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/six.py", line 693, in reraise
raise value
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
return self._sess.run(*args, **kwargs)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled
[[{{node scoped_allocator_100_CollectiveReduce}}]]
[[node Adam/update_0_207/ReadVariableOp (defined at /tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py:1254) ]]
Segmentation fault (core dumped)
`
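Since both logs consist mostly of repeated abort messages, it can help to tally them before comparing runs. Below is a rough sketch that counts the `Aborting RingReduce with ...` entries by status and message. The function name and the regular expression are my own and only assume the message shape visible in the logs above; it is meant to work whether the log is one entry per line or flattened onto a single line.

```python
import re
from collections import Counter

# Captures the status (e.g. "Unavailable") and the message (e.g. "Socket closed"),
# stopping at the next "YYYY-MM-DD " timestamp or at the end of the line.
ABORT_RE = re.compile(
    r"Aborting RingReduce with (\w+): (.*?)(?=\s+\d{4}-\d{2}-\d{2} |\s*$)",
    re.MULTILINE,
)


def tally_ring_reduce_aborts(log_text):
    """Count RingReduce abort entries, keyed by 'Status: message'."""
    return Counter(
        "{}: {}".format(status, message)
        for status, message in ABORT_RE.findall(log_text)
    )


sample = (
    "2019-05-19 15:08:13.137926: E tensorflow/core/common_runtime/ring_reducer.cc:369] "
    "Aborting RingReduce with Unavailable: Socket closed "
    "2019-05-19 15:08:13.138013: W tensorflow/core/common_runtime/base_collective_executor.cc:203] "
    "BaseCollectiveExecutor::StartAbort Unavailable: Socket closed\n"
    "2019-05-19 15:08:13.138169: E tensorflow/core/common_runtime/ring_reducer.cc:369] "
    "Aborting RingReduce with Cancelled: Cancelled"
)
for key, count in sorted(tally_ring_reduce_aborts(sample).items()):
    print(key, count)
```

Running this on each worker's log separately might show which worker reported the first non-`Cancelled` abort, which is usually a better starting point than the last traceback.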