Closed by bjourne 4 years ago
This is known and we will fix it.
https://github.com/pytorch/xla/issues/2190
OTOH, that happens on the last rendezvous, if any.
Sorry, I didn't see that report.
Let's keep this open so we do not forget.
Discussion also on https://github.com/pytorch/xla/issues/2190
Let's see if https://github.com/pytorch/xla/pull/2269 helps...
This seems fixed now.
@dlibenzi this is still a problem in pytorch-xla 1.7 on the Colab PRO Standard TPU runtime, on the last epoch of the DCGAN tutorial notebook:
Finished training epoch 21: D_error:0.9385375380516052, G_error: 0.6658371686935425
Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.do_on_ordinals': Socket closed (14)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "<ipython-input-13-a50ba3d9aab8>", line 6, in _mp_fn
    train_gan(rank)
  File "<ipython-input-12-a72e0519d202>", line 108, in train_gan
    xm.do_on_ordinals(plot_results, generator(test_noise).detach(), (0,))
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 888, in do_on_ordinals
    rendezvous('torch_xla.core.xla_model.do_on_ordinals')
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.do_on_ordinals': Socket closed (14)
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-13-a50ba3d9aab8> in <module>()
7
8 xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=FLAGS['num_cores'],
----> 9 start_method='fork')
2 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
110 raise Exception(
111 "process %d terminated with exit code %d" %
--> 112 (error_index, exitcode)
113 )
114
Exception: process 4 terminated with exit code 17
I just ran the Colab a few times and it always worked fine for me. Does this happen for you every time, @denfromufa? It could be a one-off failure to meet the rendezvous.
Confirmed - one off. Not repeatable!
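For anyone landing here from a search: a rendezvous is a sync point that every replica has to reach, so if one peer goes away first, the others fail while waiting. The sketch below is only a rough analogy built on the standard library's `threading.Barrier`. It does not use torch_xla and is not how `mesh_service` is implemented; `run_workers` and `early_exit_ordinal` are hypothetical names made up for illustration.

```python
import threading


def run_workers(num_workers, early_exit_ordinal=None, timeout=5.0):
    """Run num_workers threads that all try to meet at one barrier.

    If early_exit_ordinal is set, that worker aborts the barrier instead of
    waiting, standing in for a peer that disappears before the last sync.
    """
    barrier = threading.Barrier(num_workers)
    results = {}

    def worker(ordinal):
        if ordinal == early_exit_ordinal:
            # Hypothetical stand-in for a peer whose connection closes early.
            barrier.abort()
            results[ordinal] = "exited early"
            return
        try:
            barrier.wait(timeout=timeout)
            results[ordinal] = "met rendezvous"
        except threading.BrokenBarrierError:
            # Active and future waits fail once the barrier is broken.
            results[ordinal] = "failed to meet rendezvous"

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_workers(4))
    print(run_workers(4, early_exit_ordinal=2))
```

In the second call every remaining worker reports a failure even though only one peer left, which is roughly the shape of the error in the traceback above.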
This happens on torch_xla nightly. This code
Produces lots of errors like