skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

sky.exceptions.FetchClusterInfoError in sky serve #3630

Open KarthikeyanVijayanEasyLLM opened 1 month ago

KarthikeyanVijayanEasyLLM commented 1 month ago

I got sky.exceptions.FetchClusterInfoError when running "sky serve up". I am using Azure.

After hitting that error, I retried "sky serve up" and got a new error:

azure.core.exceptions.ResourceExistsError: (DeploymentActive) Unable to edit or replace deployment 'ray-config': previous deployment from '6/4/2024 6:48:57 AM' is still active (expiration time is '6/11/2024 6:46:55 AM'). Please see https://aka.ms/arm-deploy-resources for usage details.
Code: DeploymentActive
Message: Unable to edit or replace deployment 'ray-config': previous deployment from '6/4/2024 6:48:57 AM' is still active (expiration time is '6/11/2024 6:46:55 AM'). Please see https://aka.ms/arm-deploy-resources for usage details.

RuntimeError: Errors occurred during provision; check logs above.

Version & Commit info:

KarthikeyanVijayanEasyLLM commented 1 month ago

Traceback for sky.exceptions.FetchClusterInfoError:

Traceback (most recent call last):
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/backend_utils.py", line 1355, in _query_head_ip_with_retries
    out = subprocess_utils.run(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/subprocess_utils.py", line 31, in run
    return subprocess.run(cmd,
           ^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'ray get-head-ip '/home/skypilot/.sky/generated/sky-serve-controller-3d76a700.yml'' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/skypilot/miniconda3/envs/sky6/bin/sky", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 367, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/cli.py", line 805, in invoke
    return super().invoke(ctx)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 367, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/cli.py", line 805, in invoke
    return super().invoke(ctx)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/cli.py", line 4049, in serve_up
    serve_lib.up(task, service_name)
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/serve/core.py", line 195, in up
    controller_job_id, controller_handle = sky.launch(
                                           ^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/execution.py", line 456, in launch
    return _execute(
           ^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/execution.py", line 271, in _execute
    handle = backend.provision(task,
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 367, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/backend.py", line 57, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2814, in _provision
    config_dict = retry_provisioner.provision_with_retries(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2027, in provision_with_retries
    config_dict = self._retry_zones(
                  ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/cloud_vm_ray_backend.py", line 1603, in _retry_zones
    handle.update_cluster_ips(
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2350, in update_cluster_ips
    cluster_feasible_ips = backend_utils.get_node_ips(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/backend_utils.py", line 1443, in get_node_ips
    head_ip = _query_head_ip_with_retries(cluster_yaml,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/skypilot/miniconda3/envs/sky6/lib/python3.11/site-packages/sky/backends/backend_utils.py", line 1378, in _query_head_ip_with_retries
    raise exceptions.FetchClusterInfoError(
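The two tracebacks are chained. "ray get-head-ip" exits with status 1, subprocess.run (which must have been called with check=True, since it raised CalledProcessError) turns that into a CalledProcessError, and _query_head_ip_with_retries re-raises it as FetchClusterInfoError. The line "The above exception was the direct cause of the following exception" is what Python prints when an exception is raised with "raise ... from ...". The sketch below is a minimal, hypothetical illustration of that retry-and-wrap pattern, not SkyPilot's actual code: the function name query_head_ip_with_retries, the retry count, the delay, and the local FetchClusterInfoError class are all assumptions.

```python
import subprocess
import sys
import time


class FetchClusterInfoError(Exception):
    """Hypothetical stand-in for sky.exceptions.FetchClusterInfoError."""


def query_head_ip_with_retries(cmd, max_attempts=3, delay_s=0.1):
    """Run cmd until it succeeds, returning its stripped stdout.

    Hypothetical sketch of a retry-and-wrap helper; the retry count and
    delay are assumptions, not SkyPilot's real values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # check=True makes a non-zero exit status raise CalledProcessError,
            # as in the first traceback of the report.
            out = subprocess.run(cmd, check=True, capture_output=True, text=True)
            return out.stdout.strip()
        except subprocess.CalledProcessError as e:
            if attempt == max_attempts:
                # "from e" sets __cause__, which makes Python print
                # "The above exception was the direct cause of ...".
                raise FetchClusterInfoError(
                    f"Failed to fetch head IP after {max_attempts} attempts"
                ) from e
            time.sleep(delay_s)


if __name__ == "__main__":
    # A command that always exits with status 1, standing in for a failing
    # "ray get-head-ip" call.
    failing_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
    try:
        query_head_ip_with_retries(failing_cmd)
    except FetchClusterInfoError as err:
        print(type(err.__cause__).__name__)  # prints "CalledProcessError"
```

Wrapping with "from e" rather than a bare raise keeps the original subprocess failure attached as __cause__, which is why the report shows both tracebacks.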