skypilot-org / skypilot

[Provisioner] Multi-node provisioning for large clusters fails on GCP #1260

Closed: romilbhardwaj closed this issue 11 months ago

romilbhardwaj commented 2 years ago

Provisioning a multi-node cluster with `num_nodes` ~50 fails on GCP. It works on AWS, and it works on GCP with 2 nodes.

Example YAML:

resources:
  cloud: gcp

num_nodes: 50

run: |
  echo $HOSTNAME
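
(Presumably launched with something like `sky launch -c tut provision_tutorial_docker.yaml`, given the cluster name `tut` in the log below.)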
Task from YAML spec: provision_tutorial_docker.yaml
I 10-16 16:21:49 optimizer.py:605] == Optimizer ==
I 10-16 16:21:49 optimizer.py:628] Estimated cost: $23.7 / hour
I 10-16 16:21:49 optimizer.py:628] 
I 10-16 16:21:49 optimizer.py:685] Considered resources (50 nodes):
I 10-16 16:21:49 optimizer.py:713] -------------------------------------------------------------------
I 10-16 16:21:49 optimizer.py:713]  CLOUD   INSTANCE       vCPUs   ACCELERATORS   COST ($)   CHOSEN   
I 10-16 16:21:49 optimizer.py:713] -------------------------------------------------------------------
I 10-16 16:21:49 optimizer.py:713]  GCP     n1-highmem-8   8       -              23.66         ✔     
I 10-16 16:21:49 optimizer.py:713] -------------------------------------------------------------------
I 10-16 16:21:49 optimizer.py:713] 
Launching a new cluster 'tut'. Proceed? [Y/n]: Y
I 10-16 16:21:56 cloud_vm_ray_backend.py:2705] Creating a new cluster: "tut" [50x GCP(n1-highmem-8)].
I 10-16 16:21:56 cloud_vm_ray_backend.py:2705] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-16 16:21:56 cloud_vm_ray_backend.py:903] To view detailed progress: tail -n100 -f /home/romilb/sky_logs/sky-2022-10-16-16-21-48-994226/provision.log
I 10-16 16:22:00 cloud_vm_ray_backend.py:1153] Launching on GCP us-central1 (us-central1-a)
I 10-16 16:23:08 log_utils.py:45] Head node is up.
I 10-16 16:24:14 cloud_vm_ray_backend.py:1260] Successfully provisioned or found existing head VM. Waiting for workers.
E 10-16 16:25:32 backend_utils.py:914] Timed out: waited for more than 60 seconds for new workers to be provisioned, but no progress.
E 10-16 16:25:32 cloud_vm_ray_backend.py:1031] *** Failed provisioning the cluster. ***
E 10-16 16:25:32 cloud_vm_ray_backend.py:1033] *** Terminating the failed cluster. ***
I 10-16 16:38:37 cloud_vm_ray_backend.py:1153] Launching on GCP us-west1 (us-west1-a)
I 10-16 16:39:49 log_utils.py:45] Head node is up.
I 10-16 16:40:51 cloud_vm_ray_backend.py:1260] Successfully provisioned or found existing head VM. Waiting for workers.
E 10-16 16:42:09 backend_utils.py:914] Timed out: waited for more than 60 seconds for new workers to be provisioned, but no progress.
E 10-16 16:42:09 cloud_vm_ray_backend.py:1031] *** Failed provisioning the cluster. ***
E 10-16 16:42:09 cloud_vm_ray_backend.py:1033] *** Terminating the failed cluster. ***

I can see the workers ready in the Google Cloud console, but the terminal shows 0 of 49 workers ready. `ray status` also shows waiting for workers (provisioning logs).

Might be related to too short a timeout here? https://github.com/skypilot-org/skypilot/blob/2d4fee3a30ad0a7c8a5a7a7e0f5ff62f332a9708/sky/backends/cloud_vm_ray_backend.py#L63-L65
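
For reference, the failure mode matches a "no progress within N seconds" check. A minimal sketch of that pattern (illustrative names and structure, not the exact SkyPilot source):

```python
import time

_NODES_LAUNCHING_PROGRESS_TIMEOUT = 60  # seconds; the constant under discussion

def wait_for_workers(get_ready_count, target: int) -> bool:
    """Fails over if the ready-worker count makes no progress within the timeout."""
    last_count, last_progress = get_ready_count(), time.time()
    while last_count < target:
        time.sleep(10)  # poll interval, illustrative
        count = get_ready_count()
        if count > last_count:
            last_count, last_progress = count, time.time()  # progress: reset timer
        elif time.time() - last_progress > _NODES_LAUNCHING_PROGRESS_TIMEOUT:
            return False  # "Timed out: ... no progress" -> terminate and fail over
    return True
```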

Michaelvll commented 2 years ago

Great finding @romilbhardwaj! I just tried it out with an infinite `_NODES_LAUNCHING_PROGRESS_TIMEOUT`, and it seems that `ray status` takes 6 minutes to discover that the first 10 workers have been successfully added to the ray cluster here.
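
The slow part is how the worker count is observed. A rough sketch of counting ready workers by parsing `ray status` output (the "Healthy:" section format varies across Ray versions, so treat the parsing as approximate):

```python
import re
import subprocess

def ray_ready_worker_count() -> int:
    out = subprocess.run(['ray', 'status'], capture_output=True,
                         text=True).stdout
    count, in_healthy = 0, False
    for line in out.splitlines():
        if line.strip() == 'Healthy:':
            in_healthy = True
        elif in_healthy:
            m = re.match(r'\s*(\d+)\s+(\S+)', line)
            if m is None:
                break  # end of the Healthy section
            if 'head' not in m.group(2):  # skip the head node type
                count += int(m.group(1))
    return count
```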

It is a bit tricky to increase the number, as that will increase the overhead of failing over to another region. For example, suppose we increase `_NODES_LAUNCHING_PROGRESS_TIMEOUT` to 10 minutes, a user tries to launch 5 nodes, each worker node becomes available just before the timeout, but the last one fails due to a quota issue (at the moment we have no way to distinguish a quota issue from resource unavailability for multi-node). Then it will take head_provision_time + 10 * 4 > 40 minutes before our system starts to fail over to another region, i.e. it can take more than 10 * 40 min > 6 hours to go through the US regions on GCP.
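
The worst case, spelled out with the numbers above (illustrative, assuming roughly 10 US regions):

```python
timeout_min = 10       # proposed _NODES_LAUNCHING_PROGRESS_TIMEOUT, in minutes
workers = 4            # 5 nodes = 1 head + 4 workers
us_regions = 10        # rough count of GCP US regions to fail over through

per_region = timeout_min * workers        # 40 minutes burned before failover
total_hr = per_region * us_regions / 60   # ~6.7 hours to exhaust the US regions
print(per_region, total_hr)
```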

romilbhardwaj commented 2 years ago

Good point! I'm tempted to suggest using the Google Cloud Python API to check worker status (how many have been provisioned), since those calls are fairly fast, and making an early decision based on that instead of waiting for ray to report the ready worker count. However, that might be a little brittle, since we'd be side-stepping the ray autoscaler...
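
A hedged sketch of that idea, counting running workers directly via the Compute Engine API (project/zone/cluster are placeholders; the `ray-<cluster>-worker` name prefix follows the autoscaler naming visible in the termination log below; pagination omitted for brevity):

```python
import googleapiclient.discovery

def gcp_running_worker_count(project: str, zone: str, cluster: str) -> int:
    compute = googleapiclient.discovery.build('compute', 'v1')
    instances = compute.instances().list(
        project=project, zone=zone).execute().get('items', [])
    prefix = f'ray-{cluster}-worker'
    return sum(1 for inst in instances
               if inst['name'].startswith(prefix)
               and inst['status'] == 'RUNNING')
```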

Michaelvll commented 2 years ago

Another problem found while trying out this issue: our `sky down cluster_name` for the 50-node cluster does not work well, producing the following error.

E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Failed to terminate large-scale-50. If you want to ignore this error and remove the cluster from the status table, use `sky down --purge`.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] **** STDOUT ****
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:36,235  INFO commands.py:348 -- Checking External environment settings
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,638  INFO commands.py:403 -- Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,650  VWARN commands.py:317 -- Loaded cached provider configuration from /var/folders/bx/cnr_jmsj4n79c9p0q4v68xvw0000gp/T/ray-config-8b1e38e306017a2bfb0f13baf6f80f8b9d6c358b
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,651  WARN commands.py:323 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,175  INFO command_runner.py:394 -- Fetched IP: 34.172.138.58
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,177  VINFO command_runner.py:552 -- Running `ray stop`
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,177  VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_638d9623ae/12b6a76eef/%C -o ControlPersist=10s -o ConnectTimeout=120s gcpuser@34.172.138.58 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'`
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Stopped all 7 Ray processes.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] **** STDERR ****
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:35,625  INFO util.py:335 -- setting max workers for head node type to 0
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,650  INFO util.py:335 -- setting max workers for head node type to 0
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Warning: Permanently added '34.172.138.58' (ED25519) to the list of known hosts.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Shared connection to 34.172.138.58 closed.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:52,246  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-head-4cb2c524-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:51:08,761  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-00ff3de2-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:52:12,209  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-028f54f7-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:53:16,490  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-0ddbb16e-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:54:21,075  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-0eeba840-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:55:25,889  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-1ba846d1-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:56:29,988  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-3927d51f-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:57:34,198  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-3eb35fc8-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:58:38,301  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-4d9e523b-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:59:41,991  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-4edccf2f-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 22:00:45,790  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-5b4b9871-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Traceback (most recent call last):
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/bin/ray", line 8, in <module>
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     sys.exit(main())
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2339, in main
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return cli()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self.main(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     rv = self.invoke(ctx)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return _process_result(sub_ctx.command.invoke(sub_ctx))
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return ctx.invoke(self.callback, **ctx.params)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return __callback(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return f(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1223, in down
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     teardown_cluster(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 510, in teardown_cluster
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     provider.terminate_nodes(A)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/node_provider.py", line 163, in terminate_nodes
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     self.terminate_node(node_id)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 54, in method_with_retries
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return method(self, *args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 257, in terminate_node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     result = resource.delete_instance(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node.py", line 598, in delete_instance
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     self.resource.instances()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return wrapped(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 923, in execute
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     resp, content = _retry_request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 222, in _retry_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     raise exception
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 191, in _retry_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     resp, content = http.request(uri, method, *args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/google_auth_httplib2.py", line 218, in request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     response, content = self.http.request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1725, in request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     (response, content) = self._request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1441, in _request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     (response, content) = self._conn_request(conn, request_uri, method, body, headers)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1393, in _conn_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     response = conn.getresponse()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 1348, in getresponse
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     response.begin()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 316, in begin
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     version, status, reason = self._read_status()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 277, in _read_status
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/socket.py", line 669, in readinto
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self._sock.recv_into(b)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/ssl.py", line 1241, in recv_into
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self.read(nbytes, buffer)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/ssl.py", line 1099, in read
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self._sslobj.read(len, buffer)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] socket.timeout: The read operation timed out
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 
juancamilog commented 1 year ago

I hit this error today trying to launch a cluster with 16 nodes. My YAML config is set up to download a Docker image from a private GCR registry. Downloading the image takes a couple of minutes, although I'm not sure whether this is the cause of the timeout.

Terminating the cluster also takes longer than it should, since it stops/deletes the GCP instances one by one when it could do so in parallel (the idle instances incur extra cost in the meantime).
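
A sketch of the parallel variant (placeholder names; a client is built per call because `googleapiclient` service objects are not safe to share across threads):

```python
from concurrent.futures import ThreadPoolExecutor

import googleapiclient.discovery

def terminate_nodes_in_parallel(project: str, zone: str,
                                node_names: list) -> None:
    def delete(name: str):
        # Build a client per thread: googleapiclient's http transport
        # is not thread-safe.
        compute = googleapiclient.discovery.build('compute', 'v1')
        # delete() returns a zonal operation immediately; GCP then tears
        # the instances down concurrently.
        return compute.instances().delete(project=project, zone=zone,
                                          instance=name).execute()

    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(delete, node_names))
```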

concretevitamin commented 1 year ago

Thanks for the report @juancamilog. cc @suquark: according to the discussion above, it seems like having a GCP implementation of the new provisioner interface would solve this problem. Do you see an easy way to fix it in the current provisioner?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Michaelvll commented 1 year ago

This is being solved by #2681