mgrankin / ru_transformers

TPU hanging with message "Waiting to connect to client mesh master (300 seconds) localhost:57343" #21

Closed. nikhilno1 closed this issue 4 years ago.

nikhilno1 commented 4 years ago

Thanks to your detailed instructions, I am able to run the training loop, but after some time it gets stuck. When I press Ctrl-C, the traceback shows it is waiting in a socket poll. Previously I was able to run the MNIST example successfully, but once this error appears, even that example fails the same way. Did you ever face this? Any idea how I can get around it? I have tried restarting the TPU, but the problem persists.

Iteration: 100%|############################################################################################################################################################| 32/32 [01:47<00:00,  3.37s/it]
100%|#########################################################################################################################################################################| 1/1 [00:00<00:00, 39.03it/s]
Evaluating: 36it [00:23,  1.55it/s]                                                                                                                                                   | 0/1 [00:00<?, ?it/s]
100%|#########################################################################################################################################################################| 1/1 [00:00<00:00,  4.96it/s]
Epoch: 100%|#################################################################################################################################################################| 1/1 [02:15<00:00, 135.51s/it]
2020-03-08 09:46:32.542284: I      68 tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:57343
^CTraceback (most recent call last):
  File "tpu_lm_finetuning.py", line 697, in <module>
    xmp.spawn(main)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 182, in spawn
    start_method=start_method)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 78, in join
    timeout=timeout,
  File "/root/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
nikhilno1 commented 4 years ago

Could it be that training is actually completing, but the command just never returns the way I'm used to seeing?
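
For context, here is a minimal sketch of the general shape of a torch_xla multiprocessing entry point (this is not the actual tpu_lm_finetuning.py code; the function names and the explicit xm.rendezvous call are assumptions). The final rendezvous is the kind of barrier sometimes used so that all spawned workers reach the same point before shutting down, rather than one worker exiting while the others keep waiting on the mesh master:

```python
# Sketch only -- illustrative, not taken from the repository's script.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()  # one TPU core per spawned process
    # ... build the model and dataloader, run the training/eval loops here ...
    xm.rendezvous('training_finished')  # barrier: wait for every worker


if __name__ == '__main__':
    # nprocs should match the number of TPU cores (8 on a v2-8 / v3-8).
    xmp.spawn(_mp_fn, nprocs=8)
```

If the workers do exit at different times, the remaining ones can sit in exactly this "Waiting to connect to client mesh master" state until the 300-second timeout, which would match the behaviour in the log above.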

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.