skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

TPU VM: guard against spurious and transient failures #962

Closed concretevitamin closed 1 year ago

concretevitamin commented 2 years ago

Encountered this before, and saw it again in the test_tpu_vm smoke test:

...
+ sky logs test-tpu-vm-zongheng-4edc-eb 1 --status
Job 1 SUCCEEDED
+ sky stop -y test-tpu-vm-zongheng-4edc-eb

Stopping cluster test-tpu-vm-zongheng-4edc-eb...done.
  To restart the cluster, run: sky start test-tpu-vm-zongheng-4edc-eb
Stopping 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
+ sky status --refresh | grep test-tpu-vm-zongheng-4edc-eb | grep STOPPED
test-tpu-vm-zongheng-4edc-eb          5 mins ago   1x GCP(TPU-VM, {'tpu-v2-8': 1}, accelerator_args={'runtime_version': '...  STOPPED  -         sky launch -y -c test-tpu...  
+ sky start -y test-tpu-vm-zongheng-4edc-eb
I 07-10 10:42:21 cloud_vm_ray_backend.py:862] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2022-07-10-10-42-21-027993/provision.log
I 07-10 10:42:21 cloud_vm_ray_backend.py:709] Cluster 'test-tpu-vm-zongheng-4edc-eb' (status: STOPPED) was previously launched in GCP (us-central1). Relaunching in that region.
I 07-10 10:42:26 cloud_vm_ray_backend.py:1049] Launching on GCP us-central1 (us-central1-c)
I 07-10 10:43:48 cloud_vm_ray_backend.py:506] Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://tpu.googleapis.com/v2alpha1/projects/intercloud-320520/locations/us-central1-c/nodes/ray-test-tpu-vm-zongheng-4edc-eb-head-779c3415-tpu?updateMask=labels&alt=json returned "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x5993ae50c674c78]". Details: "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x5993ae50c674c78]">
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] Failed to terminate test-tpu-vm-zongheng-4edc-eb. If you want to ignore this error and remove the cluster from from Sky's status table, use `sky down --purge`.
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] **** STDOUT ****
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] Did not find any active Ray processes.
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:43:49,316  VWARN commands.py:267 -- Loaded cached provider configuration from /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/ray-config-99d97c1e6b19b1fe3b9780752131c8a122d9d308
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:43:49,316  WARN commands.py:273 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:43:49,316  INFO commands.py:340 -- Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:44:00,886  INFO command_runner.py:361 -- Waiting for IP
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:44:01,593  INFO command_runner.py:369 -- Not yet available, retrying in 5 seconds
...
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:48:15,307  INFO command_runner.py:369 -- Not yet available, retrying in 5 seconds
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:48:20,624  INFO command_runner.py:366 -- Received: 34.135.73.1
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:48:20,625  VINFO command_runner.py:509 -- Running `ray stop`
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:48:20,625  VVINFO command_runner.py:512 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_3b1d3af515/b18c78ace6/%C -o ControlPersist=10s -o ConnectTimeout=120s gcpuser@34.135.73.1 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'`
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:49:21,191  INFO node_provider.py:207 -- Stopping instance projects/intercloud-320520/locations/us-central1-c/nodes/ray-test-tpu-vm-zongheng-4edc-eb-head-779c3415-tpu (to terminate instead, set `cache_stopped_nodes: False` under `provider` in the cluster configuration)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] **** STDERR ****
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:43:49,313  INFO util.py:282 -- setting max workers for head node type to 0
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:43:49,327  INFO util.py:282 -- setting max workers for head node type to 0
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] Warning: Permanently added '34.135.73.1' (ECDSA) to the list of known hosts.

E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] Shared connection to 34.135.73.1 closed.

E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 2022-07-10 10:49:21,190  INFO node_provider.py:159 -- NodeProvider: projects/intercloud-320520/locations/us-central1-c/nodes/ray-test-tpu-vm-zongheng-4edc-eb-head-779c3415-tpu: Terminating node
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] Traceback (most recent call last):
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/bin/ray", line 8, in <module>
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     sys.exit(main())
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1989, in main
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return cli()
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return self.main(*args, **kwargs)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/click/core.py", line 1053, in main
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     rv = self.invoke(ctx)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return _process_result(sub_ctx.command.invoke(sub_ctx))
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return ctx.invoke(self.callback, **ctx.params)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/click/core.py", line 754, in invoke
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return __callback(*args, **kwargs)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1015, in down
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     keep_min_workers)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 443, in teardown_cluster
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     provider.terminate_nodes(A)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/ray/autoscaler/node_provider.py", line 160, in terminate_nodes
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     self.terminate_node(node_id)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/skylet/providers/gcp/node_provider.py", line 41, in method_with_retries
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return method(self, *args, **kwargs)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/skylet/providers/gcp/node_provider.py", line 212, in terminate_node
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     result = resource.stop_instance(node_id=node_id)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/skylet/providers/gcp/node.py", line 703, in stop_instance
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     name=node_id).execute()
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     return wrapped(*args, **kwargs)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]   File "/Users/zongheng/anaconda/envs/py37/lib/python3.7/site-packages/googleapiclient/http.py", line 937, in execute
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202]     raise HttpError(resp, content, uri=self.uri)
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] googleapiclient.errors.HttpError: <HttpError 409 when requesting https://tpu.googleapis.com/v2alpha1/projects/intercloud-320520/locations/us-central1-c/nodes/ray-test-tpu-vm-zongheng-4edc-eb-head-779c3415-tpu:stop?alt=json returned "unable to queue the operation". Details: "[{'@type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/intercloud-320520/locations/us-central1-c/nodes/ray-test-tpu-vm-zongheng-4edc-eb-head-779c3415-tpu'}]">
E 07-10 10:49:21 cloud_vm_ray_backend.py:2202] 
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources to restart the stopped cluster test-tpu-vm-zongheng-4edc-eb on Region(name='us-central1'). Please retry again later.
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
Failed.
Reason: sky start -y test-tpu-vm-zongheng-4edc-eb
Log: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/tpu_vm_app-lxb9er8x.log

We should guard against

I 07-10 10:43:48 cloud_vm_ray_backend.py:506] Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://tpu.googleapis.com/v2alpha1/projects/intercloud-320520/locations/us-central1-c/nodes/ray-test-tpu-vm-zongheng-4edc-eb-head-779c3415-tpu?updateMask=labels&alt=json returned "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x5993ae50c674c78]". Details: "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x5993ae50c674c78]">

by retrying and not prematurely entering the failover loop. This error has also been observed outside of smoke tests.
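A minimal sketch of the retry idea (names and thresholds are hypothetical, not SkyPilot's actual API): wrap the TPU API call and retry only when the error message matches a known-transient pattern such as the `STARTING`-state rejection above, so genuinely fatal errors still reach the failover logic immediately.

```python
import time

# Transient TPU VM error markers seen in the logs above. Any other error
# should propagate so failover can handle it. (Hypothetical helper, not
# SkyPilot's actual implementation.)
TRANSIENT_MARKERS = (
    'not supported while in state STARTING',
    'unable to queue the operation',
)

def call_with_retries(fn, max_retries=5, backoff_s=1.0):
    """Call fn(), retrying with exponential backoff on known-transient errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:  # e.g. googleapiclient.errors.HttpError
            if not any(m in str(e) for m in TRANSIENT_MARKERS):
                raise  # Non-transient: surface to the caller / failover loop.
            if attempt == max_retries - 1:
                raise  # Exhausted retries; give up.
            time.sleep(backoff_s * (2 ** attempt))
```

With this, a request that fails twice with the `STARTING` error and then succeeds would complete without triggering failover.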

Michaelvll commented 2 years ago

Got a similar issue during run_smoke_tests.sh.

Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://tpu.googleapis.com/v2alpha1/projects/intercloud-320520/locations/us-central1-b/nodes/ray-test-tpu-vm-zhwu-164b-15-head-822bdd52-tpu?updateMask=labels&alt=json returned "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x59a3b65ce30f480c]". Details: "Cloud TPU received a bad request. update is not supported while in state STARTING [EID: 0x59a3b65ce30f480c]">
infwinston commented 2 years ago

@concretevitamin @Michaelvll I'm still trying to reproduce the error. If it happens to you again, could you send me the entire logs? Thanks!

concretevitamin commented 2 years ago

Will do. Maybe try running an offending test (or command) in a loop? Alternatively, we can code defensively against the error even if the repro fails.

infwinston commented 2 years ago

I ran the script below, which stops/starts the TPU VM 100 times, yesterday, but unfortunately haven't encountered the error :( Not sure if it's also related to the gcloud version. I'm using the latest (Google Cloud SDK 395.0.0).

#!/bin/bash

sky launch examples/tpu/tpuvm_mnist.yaml -c tpuvm -y
for i in {1..100}
do
  sleep 60
  sky stop -y tpuvm
  sky status --refresh | grep tpuvm | grep STOPPED
  sky start --retry-until-up -y tpuvm
done

I investigated a bit, and we may be able to add some try/except handling in node.py or node_provider.py for this particular error. Will continue debugging.
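One possible shape for that try/except, sketched here with hypothetical names: instead of (or in addition to) retrying the failed request, poll the node's state and wait until it leaves `STARTING` before issuing the update/stop call, since the TPU API rejects mutations while the node is in a transitional state.

```python
import time

# Hypothetical sketch for node_provider.py: block until the TPU node leaves
# a transitional state before mutating it. get_state is assumed to be a
# callable returning the node's current state string (e.g. from the TPU API).
TRANSITIONAL_STATES = ('STARTING', 'CREATING', 'RESTARTING')

def wait_until_settled(get_state, timeout_s=300.0, poll_s=5.0):
    """Poll get_state() until the node is out of a transitional state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state not in TRANSITIONAL_STATES:
            return state  # Safe to issue update/stop requests now.
        time.sleep(poll_s)
    raise TimeoutError('TPU node stuck in a transitional state')
```

The exact state names and timeout are assumptions; the point is that `terminate_node` / `stop_instance` could call something like this before hitting the API, avoiding both the 400 (`update ... while in state STARTING`) and the 409 (`unable to queue the operation`) seen in the trace above.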

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 10 days with no activity.