pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Failed to meet rendezvous 'init': Socket closed (14) #2246

Closed bjourne closed 4 years ago

bjourne commented 4 years ago

This happens on the torch_xla nightly build. This code:

from torch_xla.core.xla_model import rendezvous
from torch_xla.distributed.xla_multiprocessing import spawn

def fn(*args):
    # Block until every spawned process has reached this point.
    rendezvous('init')

spawn(fn, args=({},), nprocs=8, start_method='fork')

produces lots of errors like:

2020-06-21 06:57:22.600246: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'init': Socket closed (14)
2020-06-21 06:57:22.600666: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'init': Socket closed (14)
Exception in device=TPU:6: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'init': Socket closed (14)
Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'init': Socket closed (14)
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
2020-06-21 06:57:22.603999: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'init': Socket closed (14)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "tut1.py", line 5, in fn
    rendezvous('init')
  File "tut1.py", line 5, in fn
    rendezvous('init')
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 679, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 679, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'init': Socket closed (14)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'init': Socket closed (14)
Exception in device=TPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'init': Socket closed (14)
2020-06-21 06:57:22.605231: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Failed to meet rendezvous 'init': Socket closed (14)
Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:294 : Failed to meet rendezvous 'init': Socket closed (14)

dlibenzi commented 4 years ago

This is known and we will fix it.

https://github.com/pytorch/xla/issues/2190

OTOH, this happens at the last rendezvous, if the program has one.
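
For readers unfamiliar with the term: a rendezvous is a barrier that every process must reach before any of them continues. The sketch below is only a standard-library analogy, not torch_xla's mesh service. The function name `simulate_rendezvous` and the "coordinator" role are hypothetical, and the idea that a peer going away is what breaks the final rendezvous is an assumption made for illustration. It shows how waiters at a barrier all fail together when the party they depend on disappears.

```python
import threading


def simulate_rendezvous(n_workers, coordinator_exits_early):
    """Return the sorted outcomes ('ok' or 'broken') seen by the workers."""
    outcomes = []
    lock = threading.Lock()

    if coordinator_exits_early:
        # One extra party (the hypothetical coordinator) never arrives,
        # so the workers cannot release the barrier on their own.
        barrier = threading.Barrier(n_workers + 1)
    else:
        barrier = threading.Barrier(n_workers)

    def worker():
        try:
            barrier.wait(timeout=5)
            outcome = "ok"
        except threading.BrokenBarrierError:
            # Loosely analogous to "Failed to meet rendezvous".
            outcome = "broken"
        with lock:
            outcomes.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    if coordinator_exits_early:
        # Breaks the barrier for current and future waiters.
        barrier.abort()
    for t in threads:
        t.join()
    return sorted(outcomes)


if __name__ == "__main__":
    print(simulate_rendezvous(4, coordinator_exits_early=False))
    print(simulate_rendezvous(4, coordinator_exits_early=True))
```

In the first call every worker passes the barrier. In the second, every worker sees the barrier break, much like every TPU process in the log above reports the same failure at once.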

bjourne commented 4 years ago

Sorry, I didn't see that report.

dlibenzi commented 4 years ago

Let's keep this open so we do not forget.

dlibenzi commented 4 years ago

Discussion also on https://github.com/pytorch/xla/issues/2190

dlibenzi commented 4 years ago

Let's see if https://github.com/pytorch/xla/pull/2269 helps...

dlibenzi commented 4 years ago

This seems fixed now.

den-run-ai commented 3 years ago

@dlibenzi this is still a problem in pytorch-xla 1.7 on the Colab Pro standard TPU runtime, on the last epoch of the DCGAN tutorial notebook:

Finished training epoch 21: D_error:0.9385375380516052, G_error: 0.6658371686935425

Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.do_on_ordinals': Socket closed (14)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "<ipython-input-13-a50ba3d9aab8>", line 6, in _mp_fn
    train_gan(rank)
  File "<ipython-input-12-a72e0519d202>", line 108, in train_gan
    xm.do_on_ordinals(plot_results, generator(test_noise).detach(), (0,))
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 888, in do_on_ordinals
    rendezvous('torch_xla.core.xla_model.do_on_ordinals')
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
    return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'torch_xla.core.xla_model.do_on_ordinals': Socket closed (14)

---------------------------------------------------------------------------

Exception                                 Traceback (most recent call last)

<ipython-input-13-a50ba3d9aab8> in <module>()
      7 
      8 xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=FLAGS['num_cores'],
----> 9           start_method='fork')

2 frames

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    110                 raise Exception(
    111                     "process %d terminated with exit code %d" %
--> 112                     (error_index, exitcode)
    113                 )
    114 

Exception: process 4 terminated with exit code 17

taylanbil commented 3 years ago

I just ran the Colab a few times and it always worked fine for me. Does this happen for you every time, @denfromufa? It could be a one-off failure to meet the rendezvous.

den-run-ai commented 3 years ago

Confirmed: a one-off. Not repeatable!
