pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Need of firewall rule for ssh/tcp:22 in GCP #3129

Closed kakashiUc closed 2 years ago

kakashiUc commented 3 years ago

❓ Questions and Help

I am trying to run TPU pod training on GCP, following https://cloud.google.com/tpu/docs/tutorials/pytorch-pod. I keep getting ssh-related errors, such as

[5] Connection to 10.142.0.57 closed by remote host.
[5] ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

and

packet_write_wait: Connection to 120.192.110.58 port 22: Broken pipe
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

Also, it sometimes shows rendezvous-related errors, such as

Exception in device=TPU:24: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'test_loss': failed to connect to all addresses (14)

The GCP account has a policy that prevents setting up a firewall rule for tcp:22 or ssh connections. I suspect all of these errors are due to the absence of that firewall rule, since I have already checked other possible causes, such as a memory overflow, by testing with different data.

Can anyone explain the possible reasons for these errors, or confirm whether they are actually due to the VMs being unable to communicate over ssh?
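One way to narrow this down is to test plain TCP reachability on port 22 between the VMs, independently of gcloud and the training job. The following is a minimal sketch, not part of the tutorial: `can_reach` is a hypothetical helper name, and it assumes only the Python standard library is available on the VM.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, timed out, unreachable) when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (run from one VM of the instance group against another VM's
# internal IP, e.g. the address from the first error message above):
#   can_reach("10.142.0.57")
```

A `True` result only shows that the port accepts TCP connections at that moment; it does not prove that ssh authentication works or that the connection will survive for the whole training run.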

JackCaoG commented 3 years ago

Hi @kakashiUc, based on the tutorial, I am guessing you are using a TPU Node, where you create a TPU Node and a corresponding instance group with (# cores / 8) VMs.

Did you see these ssh error messages once training started, or when you tried to ssh into the instance group?

Also, we now have TPU VM in public preview, which lets you ssh directly to the TPU without needing to create an instance group. You can check it out here.

kakashiUc commented 3 years ago

Actually both. The error occurs when I try to ssh into any of the group VMs; for that I can set a temporary ssh firewall rule or ssh into the VM through the browser option. It also occurs during training.

But this question is about the error occurring during training.

Yeah, I can try that. One question @JackCaoG: is there a limit on TPU size with TPU VM? I mean, if I use v3s, is there a limit for v3-?

JackCaoG commented 3 years ago

I have only encountered similar behavior in a Google-internal project, where firewall rules were constantly being deleted. I had to keep adding default-allow-ssh while launching training. I don't think this is a global setting for all GCP projects, though.

kakashiUc commented 2 years ago

Was your training short (<10 mins), or did it run for a considerable amount of time? How long did it take for the rule to get deleted? I want to understand whether you had to add the rule only at the start of training or throughout the whole training period.

JackCaoG commented 2 years ago

Training takes hours. The rule is deleted after about 1 minute. I only need to add the rule manually once, at the start of training, to make sure the ssh connections are established among the instance group VMs; in my case the ssh connections are not killed when the firewall rule is deleted. @jysohn23 Do you have more insight into this firewall issue?
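For reference, adding the rule at the start of training could look roughly like the sketch below. It only builds the `gcloud compute firewall-rules create` command and prints it, so nothing is changed until you run the command yourself. The helper name `ssh_rule_command`, the `default` network, and the `10.142.0.0/20` source range are assumptions for illustration; substitute the values for your own project, and note that an organization policy like the one described earlier in the thread may still reject the command.

```python
import shlex


def ssh_rule_command(rule_name="default-allow-ssh",
                     network="default",
                     source_ranges="10.142.0.0/20"):
    """Build (but do not run) a gcloud command that allows tcp:22.

    network and source_ranges are placeholder values, not taken from
    the thread; adjust them to the VPC your instance group uses.
    """
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        "--network", network,
        "--allow", "tcp:22",
        "--source-ranges", source_ranges,
    ]


# Print a shell-ready command line for review before running it by hand.
print(shlex.join(ssh_rule_command()))
```

Keeping the command as an argument list makes it easy to pass to `subprocess.run` later if you decide to script the launch, while printing it first lets you check the network and source range before anything is created.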

kakashiUc commented 2 years ago

Let me try the TPU VM, and adding the ssh rule at the beginning.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.