pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Exception: process 4 terminated with exit code 17 #2119

Closed · andrewng88 closed this 4 years ago

andrewng88 commented 4 years ago

❓ Questions and Help

I'm trying to initialize the TPU cores before training.

I was able to initialize a single TPU core using this notebook: https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb

But it failed when I ran the first two code cells of this notebook, which uses all 8 free TPU cores on Colab: https://github.com/pytorch/xla/blob/master/contrib/colab/multi-core-alexnet-fashion-mnist.ipynb
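
For context, the failing cell reduces to roughly the following (a sketch reconstructed from the traceback below; the function body is approximate, not the exact notebook code):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def simple_map_fn(index, flags):
  # Each of the 8 forked processes grabs its own TPU core.
  device = xm.xla_device()
  print('Process {} is using {}'.format(index, xm.xla_real_devices([str(device)])[0]))
  # Barrier: every process must reach this point; if one dies,
  # the others fail with "Failed to meet rendezvous 'init'".
  xm.rendezvous('init')

flags = {}
# Note: Colab only supports start_method='fork'
xmp.spawn(simple_map_fn, args=(flags,), nprocs=8, start_method='fork')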

The error is as follows. Thanks.

I notice the failure is random: on each run, the rendezvous error shows up on different TPU core numbers.

Exception in device=TPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'init': Socket closed (14)
Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'init': Socket closed (14)
Process 0 is using TPU:0
Process 7 is using TPU:7
Process 1 is using TPU:1
Process 5 is using TPU:5
Process 6 is using TPU:6
Process 3 is using TPU:3
Process 2 is using TPU:2
Process 4 is using TPU:4
Exception in device=TPU:3: tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'init': Socket closed (14)
Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'init': Socket closed (14)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
Traceback (most recent call last):
  File "<ipython-input-4-3767ef48d64e>", line 21, in simple_map_fn
    xm.rendezvous('init')
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 614, in rendezvous
    file objects must point to different destinations as otherwise all the
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:243 : Failed to meet rendezvous 'init': Socket closed (14)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 119, in _start_fn
  File "<ipython-input-4-3767ef48d64e>", line 21, in simple_map_fn
    xm.rendezvous('init')
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 614, in rendezvous
    file objects must point to different destinations as otherwise all the
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-4-3767ef48d64e> in <module>()
     25 flags = {}
     26 # Note: Colab only supports start_method='fork'
---> 27 xmp.spawn(simple_map_fn, args=(flags,), nprocs=8, start_method='fork')

2 frames
/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )
    115 

Exception: process 4 terminated with exit code 17
dlibenzi commented 4 years ago

I have tried the whole notebook and it ran fine for me. Did you by any chance run the commented-out "do not run these" code? Can you try again with the nightly build?
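
(For reference, the setup cell in those notebooks selected the build via an environment-setup script; switching to nightly looked roughly like the sketch below. The script path and flags are recalled from that era's notebooks and may have changed since.)

VERSION = "nightly"
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION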

andrewng88 commented 4 years ago

@dlibenzi Thanks, it's working now; maybe it was the 'nightly' setting.

I'm a beginner here. Do I need to do anything special aside from sending the tensors to the TPU for training? Will it work with batches like we normally do?

Thanks.

dlibenzi commented 4 years ago

I suggest you read the comments in that Colab, as they explain the minor differences relative to PyTorch CPU/GPU training.
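
In short, the per-core training loop looks roughly like this (a minimal sketch using the public torch_xla APIs; MyModel and train_loader are placeholders you'd define yourself):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def train_fn(index, flags):
  device = xm.xla_device()       # this process's TPU core
  model = MyModel().to(device)   # placeholder model
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
  loss_fn = torch.nn.CrossEntropyLoss()
  # ParallelLoader feeds each core its shard of batches
  # (the notebook also shards the dataset with a DistributedSampler).
  loader = pl.ParallelLoader(train_loader, [device]).per_device_loader(device)
  for data, target in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    # Instead of optimizer.step(): all-reduces gradients across cores first.
    xm.optimizer_step(optimizer)

Batching works as usual; the main differences are acquiring the device via xm.xla_device(), wrapping the DataLoader, and calling xm.optimizer_step(optimizer) so gradients are synchronized across the 8 cores.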

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.