Closed kakashiUc closed 2 years ago
Hi @kakashiUc, based on the tutorial, I am guessing you are using a TPU Node, where you create a TPU Node and a corresponding instance group with (# cores / 8) VMs.
Did you see these ssh error messages once training started, or when you tried to ssh into the instance group?
Also, we now have TPU VM in public preview, which lets you ssh directly into the TPU without needing to create an instance group. You can check it out here.
Actually both. I see this error when I try to ssh into any of the group VMs, so I can either set a temporary ssh firewall rule or ssh into the VM through the browser option. It also occurs during training.
But this question is about the error occurring during training.
Yeah, I can try that. One question @JackCaoG: is there a limit on TPU size on TPU VM? I mean, if I use v3s, is there a limit for v3-?
I have only encountered similar behavior in a Google-internal project, where firewall rules are constantly deleted. I have to keep adding default-allow-ssh while launching training. I don't think this is a global setting for all GCP projects, though.
Was your training run short (<10 minutes) or considerably longer? After how long was the rule deleted? I want to understand whether you had to add the rule only at the start of training or throughout the whole training period.
Training takes hours. The rule is deleted after ~1 minute. I only need to add the rule manually once at the start of training to make sure the ssh connections are established among the instance group VMs; in my case, the ssh connections aren't killed when the firewall rule is deleted. @jysohn23 Do you have more insight into this firewall issue?
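For reference, the one-time workaround described above (re-adding the ssh rule right before launching training) could be scripted roughly as follows. This is only a sketch under assumptions: it assumes the `gcloud` CLI is installed and authenticated, that the VMs sit on the `default` network, and that the project allows creating the rule at all. The helper names `build_allow_ssh_command` and `add_allow_ssh_rule` are made up for illustration.

```python
import shlex
import subprocess


def build_allow_ssh_command(rule_name="default-allow-ssh", network="default"):
    """Return the gcloud argument list that creates an ingress rule for tcp:22.

    The rule name and network are assumptions; adjust them to your project.
    """
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        "--network", network,
        "--direction", "INGRESS",
        "--allow", "tcp:22",
    ]


def add_allow_ssh_rule(dry_run=True):
    """Print the command (dry run) or run it once before launching training."""
    cmd = build_allow_ssh_command()
    if dry_run:
        # Only show what would be executed; nothing is changed in the project.
        print(shlex.join(cmd))
        return None
    # check=False: the command fails if the rule already exists or if an
    # organization policy rejects it, so inspect returncode instead of raising.
    return subprocess.run(cmd, check=False)
```

Calling `add_allow_ssh_rule(dry_run=False)` just before starting the training launcher mirrors the manual step. If the rule is removed again a minute later, connections that are already established may survive, as reported above, but that behavior is not guaranteed for every project.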
Let me try the TPU VM and adding the ssh rule at the beginning.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
❓ Questions and Help
I am trying to run TPU pod training on GCP, following https://cloud.google.com/tpu/docs/tutorials/pytorch-pod. I keep getting ssh-related errors, such as
and
Also, it sometimes shows errors related to rendezvous, such as
The GCP account has a policy that prevents setting up a firewall rule for tcp:22 or ssh connections. I suspect all of these errors are due to the absence of the firewall rule, as I have already ruled out other causes, such as a possible memory overflow, by testing with different data.
Can anyone give information on the possible cause of these errors, or confirm whether they are actually due to the VMs being unable to communicate through ssh?
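One way to narrow this down is to test, from one VM in the instance group, whether TCP port 22 on the other VMs accepts connections at all. The snippet below is a minimal sketch using only the Python standard library. The function names are made up for illustration, and a successful TCP connection only shows that the port is reachable, not that ssh authentication will succeed.

```python
import socket


def can_reach_ssh(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_peers(hosts, port=22, timeout=3.0):
    """Return the subset of hosts whose ssh port could not be reached."""
    return [h for h in hosts if not can_reach_ssh(h, port, timeout)]
```

Running `unreachable_peers([...])` with the internal IPs of the other group VMs, once with the firewall rule in place and once without, should indicate whether a blocked port 22 matches the timing of the errors.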