Closed HysunHe closed 1 week ago
Hi Zhanghao @Michaelvll and Tianxia @cblmemo, I just submitted this PR to enable SkyServe for OCI. Could you please help review it or assign someone to review it?
Thanks @cblmemo & @Michaelvll for your help on this PR. I just updated the PR comment to include the detailed test output for your reference. If you need a quick, temporary trial with an OCI account, please contact me via Slack DM.
Another nit: from the CLI output, it seems like this message (INFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled) is displayed whenever any sky serve command is called. Is it possible to suppress it?
Emm. This would be due to the env settings. Please see "Test2", which uses my PC.
Got it! LGTM.
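For anyone who does hit this message in their environment, here is a minimal sketch of one way to silence it with the standard `logging` module. The logger name `oci.circuit_breaker` is inferred from the `INFO:oci.circuit_breaker:` prefix in the output above; `quiet_logger` is a hypothetical helper, not something in this PR or in SkyPilot.

```python
import logging


def quiet_logger(name: str, level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of one named logger so lower-severity records are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Assumption: the logger name matches the prefix printed in the CLI output.
quiet_logger("oci.circuit_breaker")
```

This only affects records below WARNING from that one logger; other loggers keep their configured levels.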
One confusion I had is how OCI manages security rules with overlapping ports. Azure requires that overlapping rules have different priority levels. Does such a thing exist in OCI? Or could we only open new ports (using set difference/subtraction), so we don't need to worry about that?
We only create rules for new ports. Please see the code:
new_ports = resources_utils.port_ranges_to_set(ports)
existing_ports = resources_utils.port_ranges_to_set(
    existing_port_ranges)
if new_ports.issubset(existing_ports):
    # The ports are already contained in the existing rules; nothing to add.
    return
union_ports = new_ports.union(existing_ports)
union_port_ranges = resources_utils.port_set_to_ranges(union_ports)
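To make the set-difference idea from the question concrete, here is a small self-contained sketch. The helpers below are local stand-ins written for illustration; they only mirror the names used in the snippet above and are not the actual `resources_utils` implementations. `ports_to_open` is hypothetical.

```python
from typing import List, Optional, Set


def port_ranges_to_set(ranges: Optional[List[str]]) -> Set[int]:
    """Expand strings like '8080' or '30001-30020' into a set of port numbers."""
    ports: Set[int] = set()
    for item in ranges or []:
        if '-' in item:
            low, high = item.split('-')
            ports.update(range(int(low), int(high) + 1))
        else:
            ports.add(int(item))
    return ports


def port_set_to_ranges(ports: Set[int]) -> List[str]:
    """Collapse a set of ports back into sorted, contiguous range strings."""
    ranges: List[str] = []
    start = prev = None
    for port in sorted(ports):
        if start is None:
            start = prev = port
        elif port == prev + 1:
            prev = port
        else:
            ranges.append(str(start) if start == prev else f'{start}-{prev}')
            start = prev = port
    if start is not None:
        ranges.append(str(start) if start == prev else f'{start}-{prev}')
    return ranges


def ports_to_open(requested: List[str], existing: List[str]) -> List[str]:
    """Return only the ranges that are requested but not already covered."""
    missing = port_ranges_to_set(requested) - port_ranges_to_set(existing)
    return port_set_to_ranges(missing)


print(ports_to_open(['8080', '30001-30020'], ['30001-30010']))
# → ['8080', '30011-30020']
```

With this approach no new rule ever overlaps an existing one, so rule priority does not come into play. Note that the quoted snippet computes the union rather than the difference; the sketch shows the difference variant only.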
Hi @HysunHe, I tried this PR and got the following error:
(sky) ➜ skypilot git:(HysunHe/master) ✗ sky serve status http-old
Services
NAME      VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
http-old  -        -       NO_REPLICA  0/2       34.69.223.187:30001

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED  RESOURCES  STATUS            REGION
http-old      1   1        -         -         -          FAILED_PROVISION  -
http-old      2   1        -         -         -          FAILED_PROVISION  -
http-old      3   1        -         -         -          FAILED_PROVISION  -
http-old      4   1        -         -         -          FAILED_PROVISION  -
http-old      5   1        -         -         -          FAILED_PROVISION  -
http-old      6   1        -         -         -          FAILED_PROVISION  -
http-old      7   1        -         -         -          FAILED_PROVISION  -
http-old      8   1        -         -         -          FAILED_PROVISION  -
http-old      9   1        -         -         -          FAILED_PROVISION  -
http-old      10  1        -         -         -          FAILED_PROVISION  -
... (use --all to show all replicas)
(sky) ➜ skypilot git:(HysunHe/master) ✗ ssl http-old 1
Start streaming logs for launching process of replica 1.
I 11-15 03:37:33 storage.py:870] Storage type StoreType.AZURE already exists under storage account 'sky6356402b1bba56ce'.
I 11-15 03:37:33 replica_managers.py:84] Launching replica (id: 1) cluster http-old-1 with resources: {OCI(cpus=2+, ports=['8080'])}
E 11-15 03:37:33 ux_utils.py:117] Failed to run launch_cluster. Details: RuntimeError: Failed to launch the sky serve replica cluster http-old-1.
E 11-15 03:37:33 ux_utils.py:120] Traceback:
E 11-15 03:37:33 ux_utils.py:120] Traceback (most recent call last):
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 98, in launch_cluster
E 11-15 03:37:33 ux_utils.py:120] sky.launch(task,
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 11-15 03:37:33 ux_utils.py:120] return f(*args, **kwargs)
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 11-15 03:37:33 ux_utils.py:120] return f(*args, **kwargs)
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 489, in launch
E 11-15 03:37:33 ux_utils.py:120] return _execute(
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 270, in _execute
E 11-15 03:37:33 ux_utils.py:120] dag = sky.optimize(dag, minimize=optimize_target)
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/optimizer.py", line 127, in optimize
E 11-15 03:37:33 ux_utils.py:120] _check_specified_clouds(dag)
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/optimizer.py", line 1250, in _check_specified_clouds
E 11-15 03:37:33 ux_utils.py:120] raise exceptions.ResourcesUnavailableError(msg)
E 11-15 03:37:33 ux_utils.py:120] sky.exceptions.ResourcesUnavailableError: Task 'http-old' requires OCI which is not enabled. To enable access, change the task cloud requirement or run: sky check oci
E 11-15 03:37:33 ux_utils.py:120]
E 11-15 03:37:33 ux_utils.py:120] The above exception was the direct cause of the following exception:
E 11-15 03:37:33 ux_utils.py:120]
E 11-15 03:37:33 ux_utils.py:120] Traceback (most recent call last):
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/ux_utils.py", line 115, in run
E 11-15 03:37:33 ux_utils.py:120] self.func(*args, **kwargs)
E 11-15 03:37:33 ux_utils.py:120] File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 116, in launch_cluster
E 11-15 03:37:33 ux_utils.py:120] raise RuntimeError('Failed to launch the sky serve replica '
E 11-15 03:37:33 ux_utils.py:120] RuntimeError: Failed to launch the sky serve replica cluster http-old-1.
E 11-15 03:37:33 ux_utils.py:120]
I 11-15 03:37:49 replica_managers.py:155] Replica cluster http-old-1 is already terminated.
Shared connection to 34.69.223.187 closed.
IIUC, OCI is not enabled on the controller. My controller is on GCP. Should we add the OCI dependencies here?
OCI now supports open_ports. Move the OCI dependencies install from "if controller == Controllers.JOBS_CONTROLLER:" to "elif isinstance(cloud, clouds.OCI):"
This PR enables SkyServe for OCI.
Test1: Serve QWen-7B on 2 A10 instances (serve-qwen-7b.yaml file is under the examples/oci folder):
sky serve up serve-qwen-7b.yaml
Test Result:
......
I 11-12 22:59:05 cloud_vm_ray_backend.py:3252] ✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-12-22-56-39-066981/setup-*.log
D 11-12 22:59:06 cloud_vm_ray_backend.py:596] Added Task with options: num_cpus=0.25, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)
I 11-12 22:59:08 cloud_vm_ray_backend.py:3358] ⚙︎ Service registered.

Service name: sky-service-4e05
Endpoint URL: 146.235.200.24:30001
📋 Useful Commands
├── To check service status: sky serve status sky-service-4e05 [--endpoint]
├── To teardown the service: sky serve down sky-service-4e05
├── To see replica logs: sky serve logs sky-service-4e05 [REPLICA_ID]
├── To see load balancer logs: sky serve logs --load-balancer sky-service-4e05
├── To see controller logs: sky serve logs --controller sky-service-4e05
├── To monitor the status: watch -n10 sky serve status sky-service-4e05
└── To send a test request: curl 146.235.200.24:30001

✓ Service is spinning up and replicas will be ready shortly.
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$ sky serve status
INFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
sky-service-4e05  -        -       NO_REPLICA  0/2       146.235.200.24:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                    LAUNCHED    RESOURCES            STATUS    REGION
sky-service-4e05  1   1        http://192.18.130.152:8080  2 mins ago  1x OCI({'A10': 1})  STARTING  us-sanjose-1
sky-service-4e05  2   1        http://146.235.204.55:8080  2 mins ago  1x OCI({'A10': 1})  STARTING  us-sanjose-1
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$ sky serve status
INFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
Services
NAME              VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
sky-service-4e05  1        58s     READY   2/2       146.235.200.24:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                    LAUNCHED    RESOURCES            STATUS  REGION
sky-service-4e05  1   1        http://192.18.130.152:8080  5 mins ago  1x OCI({'A10': 1})  READY   us-sanjose-1
sky-service-4e05  2   1        http://146.235.204.55:8080  5 mins ago  1x OCI({'A10': 1})  READY   us-sanjose-1
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$ curl http://146.235.200.24:30001/v1/chat/completions -X POST -d '{"model": "Qwen2-7B-Instruct", "messages": [{"role": "user", "content": "Who are you?"}]}' -H 'Content-Type: application/json'
{"id":"chat-53558189196e42108f864160f92c24f8","object":"chat.completion","created":1731452944,"model":"Qwen2-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with generating human-like text, answering a wide range of questions, and providing information. Feel free to ask me anything, and I'll do my best to help!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":76,"completion_tokens":53},"prompt_logprobs":null}
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$
Test2: Network Security Group Per Cluster for open_ports/cleanup_ports
I 11-13 20:04:11 cloud_vm_ray_backend.py:3252] ✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-13-20-01-38-564927/setup-*.log
D 11-13 20:04:13 cloud_vm_ray_backend.py:596] Added Task with options: num_cpus=0.25, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)
I 11-13 20:04:16 cloud_vm_ray_backend.py:3358] ⚙︎ Service registered.

Service name: sky-service-ee3b
Endpoint URL: 167.234.215.42:30001
📋 Useful Commands
├── To check service status: sky serve status sky-service-ee3b [--endpoint]
├── To teardown the service: sky serve down sky-service-ee3b
├── To see replica logs: sky serve logs sky-service-ee3b [REPLICA_ID]
├── To see load balancer logs: sky serve logs --load-balancer sky-service-ee3b
├── To see controller logs: sky serve logs --controller sky-service-ee3b
├── To monitor the status: watch -n10 sky serve status sky-service-ee3b
└── To send a test request: curl 167.234.215.42:30001

✓ Service is spinning up and replicas will be ready shortly.
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve status
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
sky-service-ee3b  -        -       NO_REPLICA  0/1       167.234.215.42:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES            STATUS    REGION
sky-service-ee3b  1   1        http://138.2.230.128:8080  2 mins ago  1x OCI({'A10': 1})  STARTING  us-sanjose-1
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve status
Services
NAME              VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
sky-service-ee3b  1        2m 3s   READY   1/1       167.234.215.42:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES            STATUS  REGION
sky-service-ee3b  1   1        http://138.2.230.128:8080  6 mins ago  1x OCI({'A10': 1})  READY   us-sanjose-1
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ curl http://167.234.215.42:30001/v1/chat/completions -X POST -d '{"model": "Qwen2-7B-Instruct", "messages": [{"role": "user", "content": "Who are you?"}]}' -H 'Content-Type: application/json'
{"id":"chat-aae51c7b196546f09cfbb632bdcf87bc","object":"chat.completion","created":1731500064,"model":"Qwen2-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with generating human-like text, answering a wide range of questions, and providing information. Feel free to ask me anything, and I'll do my best to help!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":76,"completion_tokens":53},"prompt_logprobs":null}
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve down sky-service-ee3b
Terminating service(s) 'sky-service-ee3b'. Proceed? [Y/n]:
Service 'sky-service-ee3b' is scheduled to be terminated.
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve status
Services
No existing services.
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky status
Clusters
NAME                           LAUNCHED     RESOURCES                                                                STATUS  AUTOSTOP  COMMAND
sky-serve-controller-3973382d  13 mins ago  1x OCI(VM.Standard.E4.Flex$_4_16, disk_size=200, ports=['30001-30020']...  UP      10m       sky serve up serve-qwen-7...

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
No existing services.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh