Closed by @romilbhardwaj 11 months ago
Great finding @romilbhardwaj! I just tried it out with an infinite `_NODES_LAUNCHING_PROGRESS_TIMEOUT`, and it seems that `ray status` takes 6 minutes to find out that the first 10 workers have been successfully added to the ray cluster.
It is a bit tricky to increase the number, as that will increase the overhead of failing over to another region. For example, suppose we increase `_NODES_LAUNCHING_PROGRESS_TIMEOUT` to 10 minutes. If a user tries to launch 5 nodes, each worker node becomes available at the last minute of its timeout window, and the last one fails due to a quota issue (we don't have a way to distinguish a quota issue from resource unavailability for multi-node at the moment), then it will take head_provision_time + 10 × 4 > 40 minutes before our system starts to fail over to another region, i.e. it can take more than 10 × 40 min > 6 hours to go through the US regions on GCP.
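To make that arithmetic concrete, here is a quick back-of-the-envelope sketch (the head provisioning time and the GCP US region count are rough assumptions for illustration):

```python
# Worst-case failover cost if _NODES_LAUNCHING_PROGRESS_TIMEOUT were raised to 10 min.
timeout_min = 10          # per-round "no provisioning progress" timeout (proposed value)
workers = 4               # 5-node cluster: 1 head + 4 workers
head_provision_min = 5    # assumed head-node provisioning time

# Each worker appears right before its window expires and the last request never
# succeeds, so every window is fully consumed before we give up on the region.
per_region_min = head_provision_min + timeout_min * workers   # 45 minutes

us_regions = 10           # rough count of GCP US regions
total_hours = per_region_min * us_regions / 60                # ~7.5 hours
print(per_region_min, total_hours)
```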
Good point! I'm tempted to suggest using the Google Cloud Python API to check worker status (how many have been provisioned), since those calls are fairly fast, and making an early decision based on that instead of waiting for ray to report the ready worker count. However, that might be a little brittle since we're side-stepping the ray autoscaler...
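For reference, a minimal sketch of that idea, assuming the provider tags instances with a `ray-cluster-name` label and puts `-head-` in the head node's name as in the log below (the label name, zone handling, and filtering here are assumptions, not SkyPilot's actual code):

```python
import googleapiclient.discovery


def count_running_workers(project: str, zone: str, cluster_name: str) -> int:
    """Ask the GCP Compute API directly how many worker VMs are RUNNING."""
    compute = googleapiclient.discovery.build('compute', 'v1')
    resp = compute.instances().list(
        project=project,
        zone=zone,
        # Assumes instances are labeled with the cluster name at launch time.
        filter=f'labels.ray-cluster-name = "{cluster_name}"',
    ).execute()
    return sum(
        1 for inst in resp.get('items', [])
        if inst.get('status') == 'RUNNING' and '-head-' not in inst['name'])
```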
Another problem found when trying out this issue: our `sky down cluster_name` for the 50-node cluster does not work well, producing the following error.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Failed to terminate large-scale-50. If you want to ignore this error and remove the cluster from the status table, use `sky down --purge`.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] **** STDOUT ****
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:36,235 INFO commands.py:348 -- Checking External environment settings
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,638 INFO commands.py:403 -- Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,650 VWARN commands.py:317 -- Loaded cached provider configuration from /var/folders/bx/cnr_jmsj4n79c9p0q4v68xvw0000gp/T/ray-config-8b1e38e306017a2bfb0f13baf6f80f8b9d6c358b
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,651 WARN commands.py:323 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,175 INFO command_runner.py:394 -- Fetched IP: 34.172.138.58
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,177 VINFO command_runner.py:552 -- Running `ray stop`
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,177 VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_638d9623ae/12b6a76eef/%C -o ControlPersist=10s -o ConnectTimeout=120s gcpuser@34.172.138.58 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'`
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Stopped all 7 Ray processes.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] **** STDERR ****
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:35,625 INFO util.py:335 -- setting max workers for head node type to 0
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,650 INFO util.py:335 -- setting max workers for head node type to 0
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Warning: Permanently added '34.172.138.58' (ED25519) to the list of known hosts.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Shared connection to 34.172.138.58 closed.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:52,246 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-head-4cb2c524-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:51:08,761 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-00ff3de2-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:52:12,209 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-028f54f7-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:53:16,490 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-0ddbb16e-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:54:21,075 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-0eeba840-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:55:25,889 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-1ba846d1-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:56:29,988 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-3927d51f-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:57:34,198 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-3eb35fc8-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:58:38,301 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-4d9e523b-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:59:41,991 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-4edccf2f-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 22:00:45,790 INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-5b4b9871-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Traceback (most recent call last):
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/bin/ray", line 8, in <module>
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] sys.exit(main())
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2339, in main
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return cli()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return self.main(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] rv = self.invoke(ctx)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return _process_result(sub_ctx.command.invoke(sub_ctx))
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return ctx.invoke(self.callback, **ctx.params)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return __callback(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return f(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1223, in down
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] teardown_cluster(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 510, in teardown_cluster
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] provider.terminate_nodes(A)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/node_provider.py", line 163, in terminate_nodes
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] self.terminate_node(node_id)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 54, in method_with_retries
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return method(self, *args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 257, in terminate_node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] result = resource.delete_instance(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node.py", line 598, in delete_instance
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] self.resource.instances()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return wrapped(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 923, in execute
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] resp, content = _retry_request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 222, in _retry_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] raise exception
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 191, in _retry_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] resp, content = http.request(uri, method, *args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/google_auth_httplib2.py", line 218, in request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] response, content = self.http.request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1725, in request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] (response, content) = self._request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1441, in _request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] (response, content) = self._conn_request(conn, request_uri, method, body, headers)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1393, in _conn_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] response = conn.getresponse()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 1348, in getresponse
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] response.begin()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 316, in begin
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] version, status, reason = self._read_status()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 277, in _read_status
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/socket.py", line 669, in readinto
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return self._sock.recv_into(b)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/ssl.py", line 1241, in recv_into
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return self.read(nbytes, buffer)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/ssl.py", line 1099, in read
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] return self._sslobj.read(len, buffer)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] socket.timeout: The read operation timed out
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]
I hit this error today trying to launch a cluster with 16 nodes. My YAML config is set up to download a Docker image from a private GCR registry. Downloading the image takes a couple of minutes, although I'm not sure whether this is the cause of the timeout.
Terminating the cluster also takes longer than it should, since it stops/deletes the GCP instances one by one when it could do so in parallel (resulting in extra cost for the idle instances).
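A rough sketch of what parallel termination could look like, going through the GCP Compute API directly (instance discovery, error handling, and waiting for the delete operations are omitted; this side-steps the Ray node provider path that produced the trace above):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

import googleapiclient.discovery


def terminate_instances(project: str, zone: str, names: List[str]) -> None:
    """Fire off instance deletions concurrently instead of one by one."""

    def delete(name: str):
        # Build a client per call: googleapiclient service objects are not
        # thread-safe, so we avoid sharing one across worker threads.
        compute = googleapiclient.discovery.build('compute', 'v1')
        # instances().delete returns a zonal operation; we don't wait for it to
        # finish, so the requests (and the GCP-side deletions) overlap.
        return compute.instances().delete(
            project=project, zone=zone, instance=name).execute()

    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(delete, names))
```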
Thanks for the report @juancamilog. cc @suquark: according to the discussion above, it seems like having a GCP implementation of the new provisioner interface would solve this problem. Do you see an easy way to fix it in the current provisioner?
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This is being solved by #2681
Provisioning a multi-node cluster with `num_nodes` ~50 fails on GCP. It works on AWS, and it works on GCP for 2 nodes. Example YAML:
I can see the workers ready on the Google Cloud console, but the terminal shows `0 of 49 workers ready...`. `ray status` also shows waiting for workers (provisioning logs). Might be related to too short a timeout here? https://github.com/skypilot-org/skypilot/blob/2d4fee3a30ad0a7c8a5a7a7e0f5ff62f332a9708/sky/backends/cloud_vm_ray_backend.py#L63-L65