mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

CollectiveAllReduceStrategy always failed with OS error or socket closed #272

Closed Frank1993 closed 1 year ago

Frank1993 commented 5 years ago

When training with tf.contrib.distribute.CollectiveAllReduceStrategy on TF 1.13, my job always fails with either "OS Error" or "Socket closed" after training for around 1000 steps.

Job configuration: 16 workers with 2 P100 GPUs each
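For readers trying to reproduce this: a multi-worker job of this kind is typically described to TensorFlow through the `TF_CONFIG` environment variable, with one value per worker process. The sketch below only builds that JSON for a 16-worker cluster. The host names, the port, and the helper name `make_tf_config` are hypothetical placeholders and are not taken from the failing job.

```python
import json


def make_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG value for one worker of a multi-worker job."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index out of range")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    })


# Hypothetical addresses (assumption): 16 workers, one process per machine.
WORKER_HOSTS = ["worker-%d.example.com:2222" % i for i in range(16)]

if __name__ == "__main__":
    # Each worker process would export its own value before starting training.
    print(make_tf_config(WORKER_HOSTS, task_index=14))
```

Every worker must see the same `cluster` list and a distinct `index`; a mismatch between workers is one thing worth ruling out before looking at the network.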

Can anyone help investigate this problem?

And the error looks like this:

INFO:tensorflow:loss = 5.7242765, step = 3400 (57.093 sec)
INFO:tensorflow:global_step/sec: 1.75156
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.101 sec)
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.096 sec)
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.129 sec)
INFO:tensorflow:loss = 5.7242765, step = 3400 (57.107 sec)
2019-05-20 12:05:41.289749: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.289831: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.291604: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.291686: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.290341: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.290434: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.290454: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.290464: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.290522: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.290601: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.293137: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.293230: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.293249: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.293258: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.296025: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.296058: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.294761: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.294858: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.295484: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.295580: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.295600: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.295608: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.298684: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.298772: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.300189: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.300261: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.300279: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.300289: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.298252: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.298290: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.298330: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.298343: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.303394: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.303428: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.303456: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.303468: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.304374: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.304457: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.304475: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.304485: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.308031: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.308170: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.307822: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.307908: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.308408: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.308509: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.308903: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.308987: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.309370: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.309451: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.309470: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.309480: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312410: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312443: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312471: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312483: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312514: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312525: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312554: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312568: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312619: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312638: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
2019-05-20 12:05:41.312660: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: OS Error
2019-05-20 12:05:41.312671: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
Segmentation fault (core dumped)
2019-05-20T12:05:44.327Z: [1,1]:[2019-05-20 12:05:44,324] ERROR: worker_14 failed, status=139
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,324] INFO: App final status on Node_1_container-e559-1557431881457-14103-01-000003_0:
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,324] INFO: worker_14, failed, status=139
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,325] INFO: Succeed=0, Failed=1, Killed=0
2019-05-20T12:05:44.331Z: [1,1]:[2019-05-20 12:05:44,325] ERROR: Launch app failed
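A note on `status=139` in the launcher output: under the usual shell convention, an exit status above 128 means the process was killed by signal `status - 128`, and 139 - 128 = 11 is SIGSEGV, which matches the `Segmentation fault (core dumped)` line. The small helper below sketches that decoding; the function name is made up for illustration, and it assumes the launcher reports shell-style statuses.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status (assumed convention: 128 + signal number)."""
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            return "killed by unknown signal %d" % signum
    return "exited with code %d" % status


if __name__ == "__main__":
    print(describe_exit_status(139))
```

So worker_14 did not exit through a Python exception here; the process itself crashed after the collective aborts.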

And for the "Socket closed" case, it looks like this:

` INFO:tensorflow:loss = 3.7297182, step = 10300 (50.581 sec)
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.594 sec)
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.597 sec)
INFO:tensorflow:loss = 3.7297182, step = 10300 (50.608 sec)
INFO:tensorflow:global_step/sec: 1.97598
2019-05-19 15:08:13.128241: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.128323: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.128342: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.128351: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.126946: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.127033: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.129537: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.129566: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.129590: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.129601: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130017: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130097: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130117: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130126: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130641: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130729: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.130747: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.130756: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.131987: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.132049: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.132163: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
2019-05-19 15:08:13.132238: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
    [[{{node allreduce_7/CollectiveReduce}}]]
    [[{{node Adam/update_1_207/ReadVariableOp}}]]
2019-05-19 15:08:13.132276: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
    [[{{node allreduce_7/CollectiveReduce}}]]
    [[{{node allreduce_7/CollectiveReduce_1}}]]
2019-05-19 15:08:13.132390: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
    [[{{node allreduce_7/CollectiveReduce}}]]
2019-05-19 15:08:13.133888: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134009: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134030: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134039: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134861: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134937: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134955: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134965: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134475: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134551: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.135917: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.136015: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.136034: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.136042: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134570: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134580: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134897: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134940: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.134969: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.134980: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.135004: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.135014: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.135074: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
2019-05-19 15:08:13.135142: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
    [[{{node scoped_allocator_280_CollectiveReduce}}]]
    [[{{node GroupCrossDeviceControlEdges_0/Adam/update_0_207/Const}}]]
2019-05-19 15:08:13.135569: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
2019-05-19 15:08:13.137926: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Unavailable: Socket closed
2019-05-19 15:08:13.138013: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: Socket closed
2019-05-19 15:08:13.138169: E tensorflow/core/common_runtime/ring_reducer.cc:369] Aborting RingReduce with Cancelled: Cancelled
2019-05-19 15:08:13.138199: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Cancelled: Cancelled
2019-05-19 15:08:13.138353: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Cancelled: Cancelled
2019-05-19 15:08:13.138429: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Cancelled: Cancelled
    [[{{node scoped_allocator_100_CollectiveReduce}}]]
    [[{{node Adam/update_0_207/ReadVariableOp}}]]
2019-05-19 15:08:13.138600: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Cancelled: Cancelled
    [[{{node scoped_allocator_100_CollectiveReduce}}]]
2019-05-19 15:08:13.139053: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at collective_ops.cc:150 : Unavailable: Socket closed
Traceback (most recent call last):
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled
    [[{{node scoped_allocator_100_CollectiveReduce}}]]
    [[{{node Adam/update_0_207/ReadVariableOp}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/apprunner/.working/runtime/app/tensorflow_estimator_bminist/run_pretraining.py", line 471, in <module>
    tf.app.run()
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tmp/apprunner/.working/runtime/app/tensorflow_estimator_bminist/run_pretraining.py", line 445, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/training.py", line 462, in train_and_evaluate
    estimator, train_spec, eval_spec, _TrainingExecutor)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/estimator_training.py", line 289, in train_and_evaluate
    session_config=run_config.session_config)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 823, in run_distribute_coordinator
    task_id, session_config, rpc_layer)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 359, in _run_single_worker
    return worker_fn(strategy)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/distribute/estimator_training.py", line 251, in _worker_fn
    hooks=hooks)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1287, in _actual_train_model_distributed
    saving_listeners)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimatorspec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled
    [[{{node scoped_allocator_100_CollectiveReduce}}]]
    [[node Adam/update_0_207/ReadVariableOp (defined at /tmp/apprunner/.working/runtime/env/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/estimator.py:1254) ]]
Segmentation fault (core dumped) `
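Since entries from several processes are interleaved in these logs, the first abort printed is not necessarily the earliest one (in the "Socket closed" log, 15:08:13.126946 appears after 15:08:13.128241). A rough triage sketch that counts the reported statuses and finds the earliest abort timestamp is below. The regex and the helper name `summarize_aborts` are illustrative assumptions that match only the three status strings seen in this issue; the sample is a short excerpt of the log above.

```python
import re
from collections import Counter

# One TensorFlow C++ log entry plus the status string it reports.
# \s+ is used between fields so that wrapped or flattened logs still match.
ENTRY = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}): [EW]\s+"
    r"\S+\]\s+.*?"
    r"(?P<status>Unavailable: OS Error|Unavailable: Socket closed|Cancelled: Cancelled)",
    re.DOTALL,
)


def summarize_aborts(log_text):
    """Return (status counts, earliest timestamp or None) for abort entries."""
    counts = Counter()
    timestamps = []
    for match in ENTRY.finditer(log_text):
        counts[match.group("status")] += 1
        timestamps.append(match.group("date") + " " + match.group("time"))
    # Fixed-width timestamps sort correctly as plain strings.
    return counts, (min(timestamps) if timestamps else None)


# Excerpt of the "Socket closed" log from this issue, flattened as posted.
SAMPLE = (
    "2019-05-19 15:08:13.128241: E tensorflow/core/common_runtime/ring_reducer.cc:369] "
    "Aborting RingReduce with Unavailable: Socket closed "
    "2019-05-19 15:08:13.126946: E tensorflow/core/common_runtime/ring_reducer.cc:369] "
    "Aborting RingReduce with Unavailable: Socket closed "
    "2019-05-19 15:08:13.138169: E tensorflow/core/common_runtime/ring_reducer.cc:369] "
    "Aborting RingReduce with Cancelled: Cancelled"
)

if __name__ == "__main__":
    counts, first = summarize_aborts(SAMPLE)
    print(dict(counts), first)
```

Running this over the logs of each worker separately and comparing the earliest timestamps may help identify which worker saw the failure first; the `Cancelled` entries are likely follow-on aborts rather than the root cause.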

peladodigital commented 1 year ago

In an effort to clean up the git repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than 2 years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen. Even better, please come to the working group meeting to discuss your issue.