skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.81k stars 513 forks

[OCI] Enable SkyServe for OCI #4338

Closed HysunHe closed 1 week ago

HysunHe commented 1 week ago

This PR enables SkyServe for OCI.

Test1: Serve QWen-7B on 2 A10 instances (serve-qwen-7b.yaml file is under the examples/oci folder):

sky serve up serve-qwen-7b.yaml

Test Result:

......
I 11-12 22:59:05 cloud_vm_ray_backend.py:3252] ✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-12-22-56-39-066981/setup-*.log
D 11-12 22:59:06 cloud_vm_ray_backend.py:596] Added Task with options: num_cpus=0.25, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)
I 11-12 22:59:08 cloud_vm_ray_backend.py:3358] ⚙︎ Service registered.

Service name: sky-service-4e05
Endpoint URL: 146.235.200.24:30001
📋 Useful Commands
├── To check service status: sky serve status sky-service-4e05 [--endpoint]
├── To teardown the service: sky serve down sky-service-4e05
├── To see replica logs: sky serve logs sky-service-4e05 [REPLICA_ID]
├── To see load balancer logs: sky serve logs --load-balancer sky-service-4e05
├── To see controller logs: sky serve logs --controller sky-service-4e05
├── To monitor the status: watch -n10 sky serve status sky-service-4e05
└── To send a test request: curl 146.235.200.24:30001

✓ Service is spinning up and replicas will be ready shortly.
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$ sky serve status
INFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
sky-service-4e05  -        -       NO_REPLICA  0/2       146.235.200.24:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                    LAUNCHED    RESOURCES           STATUS    REGION
sky-service-4e05  1   1        http://192.18.130.152:8080  2 mins ago  1x OCI({'A10': 1})  STARTING  us-sanjose-1
sky-service-4e05  2   1        http://146.235.204.55:8080  2 mins ago  1x OCI({'A10': 1})  STARTING  us-sanjose-1
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$ sky serve status
INFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
sky-service-4e05  1        58s     READY       2/2       146.235.200.24:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                    LAUNCHED    RESOURCES           STATUS    REGION
sky-service-4e05  1   1        http://192.18.130.152:8080  5 mins ago  1x OCI({'A10': 1})  READY     us-sanjose-1
sky-service-4e05  2   1        http://146.235.204.55:8080  5 mins ago  1x OCI({'A10': 1})  READY     us-sanjose-1
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$ curl http://146.235.200.24:30001/v1/chat/completions -X POST -d '{"model": "Qwen2-7B-Instruct", "messages": [{"role": "user", "content": "Who are you?"}]}' -H 'Content-Type: application/json'
{"id":"chat-53558189196e42108f864160f92c24f8","object":"chat.completion","created":1731452944,"model":"Qwen2-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with generating human-like text, answering a wide range of questions, and providing information. Feel free to ask me anything, and I'll do my best to help!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":76,"completion_tokens":53},"prompt_logprobs":null}
(sky) ubuntu@prd-20231114:~/Hysun/skypilot_oci/examples/oci$
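
The curl test above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request`, `extract_reply`, and `ask` are illustrative and are not part of SkyPilot. It assumes the service exposes the same `/v1/chat/completions` route and JSON shape shown in the curl output.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(endpoint: str, model: str, prompt: str, timeout: float = 30.0) -> str:
    """Send the request and return the assistant's reply (needs a live service)."""
    request = build_chat_request(endpoint, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read().decode("utf-8"))
```

With a running service, `ask("146.235.200.24:30001", "Qwen2-7B-Instruct", "Who are you?")` would send the same request as the curl command.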

Test2: Network Security Group Per Cluster for open_ports/cleanup_ports

I 11-13 20:04:11 cloud_vm_ray_backend.py:3252] ✓ Setup completed. View logs at: ~/sky_logs/sky-2024-11-13-20-01-38-564927/setup-*.log
D 11-13 20:04:13 cloud_vm_ray_backend.py:596] Added Task with options: num_cpus=0.25, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)
I 11-13 20:04:16 cloud_vm_ray_backend.py:3358] ⚙︎ Service registered.

Service name: sky-service-ee3b
Endpoint URL: 167.234.215.42:30001
📋 Useful Commands
├── To check service status: sky serve status sky-service-ee3b [--endpoint]
├── To teardown the service: sky serve down sky-service-ee3b
├── To see replica logs: sky serve logs sky-service-ee3b [REPLICA_ID]
├── To see load balancer logs: sky serve logs --load-balancer sky-service-ee3b
├── To see controller logs: sky serve logs --controller sky-service-ee3b
├── To monitor the status: watch -n10 sky serve status sky-service-ee3b
└── To send a test request: curl 167.234.215.42:30001

✓ Service is spinning up and replicas will be ready shortly.
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve status
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
sky-service-ee3b  -        -       NO_REPLICA  0/1       167.234.215.42:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES           STATUS    REGION
sky-service-ee3b  1   1        http://138.2.230.128:8080  2 mins ago  1x OCI({'A10': 1})  STARTING  us-sanjose-1
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve status
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
sky-service-ee3b  1        2m 3s   READY       1/1       167.234.215.42:30001

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES           STATUS    REGION
sky-service-ee3b  1   1        http://138.2.230.128:8080  6 mins ago  1x OCI({'A10': 1})  READY     us-sanjose-1
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ curl http://167.234.215.42:30001/v1/chat/completions -X POST -d '{"model": "Qwen2-7B-Instruct", "messages": [{"role": "user", "content": "Who are you?"}]}' -H 'Content-Type: application/json'
{"id":"chat-aae51c7b196546f09cfbb632bdcf87bc","object":"chat.completion","created":1731500064,"model":"Qwen2-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"I am Qwen, a large language model created by Alibaba Cloud. I am designed to assist with generating human-like text, answering a wide range of questions, and providing information. Feel free to ask me anything, and I'll do my best to help!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":23,"total_tokens":76,"completion_tokens":53},"prompt_logprobs":null}
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve down sky-service-ee3b
Terminating service(s) 'sky-service-ee3b'. Proceed? [Y/n]:
Service 'sky-service-ee3b' is scheduled to be terminated.
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky serve status
Services
No existing services.
(sky) hysunhe@HYHE-PF1ZGYCQ:~/prjdev/skypilot/skypilot_oci/examples/oci$ sky status
Clusters
NAME                           LAUNCHED     RESOURCES                                                                  STATUS  AUTOSTOP  COMMAND
sky-serve-controller-3973382d  13 mins ago  1x OCI(VM.Standard.E4.Flex$_4_16, disk_size=200, ports=['30001-30020']...  UP      10m       sky serve up serve-qwen-7...

Managed jobs
No in-progress managed jobs. (See: sky jobs -h)

Services
No existing services.

Tested (run the relevant ones):

HysunHe commented 1 week ago

Hi Zhanghao @Michaelvll and Tianxia @cblmemo, I just submitted this PR to enable SkyServe for OCI. Could you please review it or assign someone to help with the review?

HysunHe commented 1 week ago

Thanks @cblmemo & @Michaelvll for your help on this PR. I just updated the PR description to include the detailed test output for your reference. If you would like to try it quickly with a temporary OCI account, please contact me via Slack DM.

cblmemo commented 1 week ago

Another nit: from the CLI output, it seems this message (INFO:oci.circuit_breaker:Default Auth client Circuit breaker strategy enabled) is displayed whenever any sky serve command is called. Is it possible to suppress it?

HysunHe commented 1 week ago

Hmm, this would be due to the env settings. Please see "Test2", which was run on my PC.

cblmemo commented 1 week ago

Got it! LGTM.
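
For environments where the message does appear, one possible way to silence it is shown below. This is a sketch under an assumption: the `INFO:oci.circuit_breaker:` prefix matches Python's default logging format (`LEVEL:logger_name:message`), which suggests the record comes from a logger named `oci.circuit_breaker`. The helper `quiet_logger` is illustrative and is not part of SkyPilot or the OCI SDK.

```python
import logging


def quiet_logger(name: str, level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of a named logger so lower-level records are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Assumed logger name, inferred from the "INFO:oci.circuit_breaker:" prefix.
oci_logger = quiet_logger("oci.circuit_breaker")
```

After this call, INFO records from that logger are filtered out while warnings and errors still pass through.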

HysunHe commented 1 week ago

One confusion I had is how OCI manages security rules with overlapping ports. Azure requires that such rules have different priority levels. Does such a thing exist in OCI? Or could we open only the new ports (using set difference/subtraction), so we don't need to worry about that?

We only create rules for new ports. Please see the code:

        new_ports = resources_utils.port_ranges_to_set(ports)
        existing_ports = resources_utils.port_ranges_to_set(
            existing_port_ranges)
        if new_ports.issubset(existing_ports):
            # The ports are already covered by the existing rules; nothing to add.
            return

        union_ports = new_ports.union(existing_ports)
        union_port_ranges = resources_utils.port_set_to_ranges(union_ports)
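
The snippet above can be tried in isolation with the following sketch. The two helpers here are simplified stand-ins written for illustration; they mimic what the names `port_ranges_to_set` and `port_set_to_ranges` suggest, and are not SkyPilot's actual implementations. The wrapper `merged_port_ranges` is also hypothetical.

```python
from typing import List, Optional, Set


def port_ranges_to_set(ranges: List[str]) -> Set[int]:
    """Expand entries such as '8080' or '30001-30020' into a set of ports."""
    ports: Set[int] = set()
    for entry in ranges:
        if '-' in entry:
            low, high = entry.split('-')
            ports.update(range(int(low), int(high) + 1))
        else:
            ports.add(int(entry))
    return ports


def port_set_to_ranges(ports: Set[int]) -> List[str]:
    """Collapse a set of ports into sorted, contiguous range strings."""
    ranges: List[str] = []
    start = prev = None
    for port in sorted(ports):
        if start is None:
            start = prev = port
        elif port == prev + 1:
            prev = port
        else:
            ranges.append(str(start) if start == prev else f'{start}-{prev}')
            start = prev = port
    if start is not None:
        ranges.append(str(start) if start == prev else f'{start}-{prev}')
    return ranges


def merged_port_ranges(requested: List[str],
                       existing: List[str]) -> Optional[List[str]]:
    """Return the merged ranges, or None when nothing new needs opening."""
    new_ports = port_ranges_to_set(requested)
    existing_ports = port_ranges_to_set(existing)
    if new_ports.issubset(existing_ports):
        return None
    return port_set_to_ranges(new_ports.union(existing_ports))
```

Because overlapping requests collapse into a single set before ranges are rebuilt, a port never appears in two ranges of the result.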

cblmemo commented 1 week ago

Hi @HysunHe, I tried this PR and got the following error:

(sky) ➜  skypilot git:(HysunHe/master) ✗ sky serve status http-old
Services
NAME      VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT             
http-old  -        -       NO_REPLICA  0/2       34.69.223.187:30001  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED  RESOURCES  STATUS            REGION  
http-old      1   1        -         -         -          FAILED_PROVISION  -       
http-old      2   1        -         -         -          FAILED_PROVISION  -       
http-old      3   1        -         -         -          FAILED_PROVISION  -       
http-old      4   1        -         -         -          FAILED_PROVISION  -       
http-old      5   1        -         -         -          FAILED_PROVISION  -       
http-old      6   1        -         -         -          FAILED_PROVISION  -       
http-old      7   1        -         -         -          FAILED_PROVISION  -       
http-old      8   1        -         -         -          FAILED_PROVISION  -       
http-old      9   1        -         -         -          FAILED_PROVISION  -       
http-old      10  1        -         -         -          FAILED_PROVISION  -       
... (use --all to show all replicas)
(sky) ➜  skypilot git:(HysunHe/master) ✗ ssl http-old 1        
Start streaming logs for launching process of replica 1.
I 11-15 03:37:33 storage.py:870] Storage type StoreType.AZURE already exists under storage account 'sky6356402b1bba56ce'.
I 11-15 03:37:33 replica_managers.py:84] Launching replica (id: 1) cluster http-old-1 with resources: {OCI(cpus=2+, ports=['8080'])}
E 11-15 03:37:33 ux_utils.py:117] Failed to run launch_cluster. Details: RuntimeError: Failed to launch the sky serve replica cluster http-old-1.
E 11-15 03:37:33 ux_utils.py:120]   Traceback:
E 11-15 03:37:33 ux_utils.py:120] Traceback (most recent call last):
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 98, in launch_cluster
E 11-15 03:37:33 ux_utils.py:120]     sky.launch(task,
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 11-15 03:37:33 ux_utils.py:120]     return f(*args, **kwargs)
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
E 11-15 03:37:33 ux_utils.py:120]     return f(*args, **kwargs)
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 489, in launch
E 11-15 03:37:33 ux_utils.py:120]     return _execute(
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 270, in _execute
E 11-15 03:37:33 ux_utils.py:120]     dag = sky.optimize(dag, minimize=optimize_target)
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/optimizer.py", line 127, in optimize
E 11-15 03:37:33 ux_utils.py:120]     _check_specified_clouds(dag)
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/optimizer.py", line 1250, in _check_specified_clouds
E 11-15 03:37:33 ux_utils.py:120]     raise exceptions.ResourcesUnavailableError(msg)
E 11-15 03:37:33 ux_utils.py:120] sky.exceptions.ResourcesUnavailableError: Task 'http-old' requires OCI which is not enabled. To enable access, change the task cloud requirement or run: sky check oci
E 11-15 03:37:33 ux_utils.py:120] 
E 11-15 03:37:33 ux_utils.py:120] The above exception was the direct cause of the following exception:
E 11-15 03:37:33 ux_utils.py:120] 
E 11-15 03:37:33 ux_utils.py:120] Traceback (most recent call last):
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/ux_utils.py", line 115, in run
E 11-15 03:37:33 ux_utils.py:120]     self.func(*args, **kwargs)
E 11-15 03:37:33 ux_utils.py:120]   File "/home/gcpuser/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 116, in launch_cluster
E 11-15 03:37:33 ux_utils.py:120]     raise RuntimeError('Failed to launch the sky serve replica '
E 11-15 03:37:33 ux_utils.py:120] RuntimeError: Failed to launch the sky serve replica cluster http-old-1.
E 11-15 03:37:33 ux_utils.py:120] 
I 11-15 03:37:49 replica_managers.py:155] Replica cluster http-old-1 is already terminated.

Shared connection to 34.69.223.187 closed.

IIUC, OCI is not enabled on the controller. My controller is on GCP. Should we add the OCI dependencies here?

https://github.com/skypilot-org/skypilot/blob/fa798d7c095dbc6f2c3adaf55717db31d2ceb7c4/sky/utils/controller_utils.py#L191-L192
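
The two-part traceback above ("The above exception was the direct cause of the following exception") is what Python prints when one exception is raised from another with `raise ... from ...`. The sketch below reproduces that shape with stand-in names; the class and functions are illustrative only and do not reflect SkyPilot's actual code.

```python
from typing import Set


class ResourcesUnavailableError(Exception):
    """Stand-in for the error raised when a required cloud is not enabled."""


def check_specified_cloud(task_name: str, required_cloud: str,
                          enabled_clouds: Set[str]) -> None:
    """Fail if the task needs a cloud that is not enabled where it runs."""
    if required_cloud not in enabled_clouds:
        raise ResourcesUnavailableError(
            f'Task {task_name!r} requires {required_cloud} '
            'which is not enabled.')


def launch_cluster(cluster_name: str, task_name: str, required_cloud: str,
                   enabled_clouds: Set[str]) -> None:
    """Wrap the underlying failure so the root cause stays attached."""
    try:
        check_specified_cloud(task_name, required_cloud, enabled_clouds)
    except Exception as e:
        raise RuntimeError('Failed to launch the sky serve replica '
                           f'cluster {cluster_name}.') from e
```

Calling `launch_cluster('http-old-1', 'http-old', 'OCI', {'GCP'})` raises a `RuntimeError` whose `__cause__` is the `ResourcesUnavailableError`, so the root cause is the second-to-last error line in the log rather than the last.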

HysunHe commented 1 week ago

OCI now supports open_ports. Move the OCI dependency installation from "if controller == Controllers.JOBS_CONTROLLER:" to "elif isinstance(cloud, clouds.OCI):"
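
The effect of that move can be illustrated with a toy model. Everything in this sketch is hypothetical: the enum values, the string cloud names, and the placeholder dependency labels are invented for illustration, and the real logic in controller_utils.py may be structured differently.

```python
from enum import Enum
from typing import List


class Controllers(Enum):
    """Hypothetical stand-in for the controller types."""
    JOBS_CONTROLLER = 'jobs'
    SKY_SERVE_CONTROLLER = 'serve'


def deps_gated_by_controller(controller: Controllers,
                             enabled_clouds: List[str]) -> List[str]:
    """Before: OCI dependencies are installed only on the jobs controller."""
    deps = []
    for cloud in enabled_clouds:
        if cloud == 'GCP':
            deps.append('gcp-deps')  # placeholder label
        elif controller == Controllers.JOBS_CONTROLLER and cloud == 'OCI':
            deps.append('oci-deps')  # placeholder label
    return deps


def deps_gated_by_cloud(controller: Controllers,
                        enabled_clouds: List[str]) -> List[str]:
    """After: OCI dependencies are installed whenever OCI is enabled."""
    # `controller` is unused here; it is kept so both functions share a signature.
    deps = []
    for cloud in enabled_clouds:
        if cloud == 'GCP':
            deps.append('gcp-deps')
        elif cloud == 'OCI':
            deps.append('oci-deps')
    return deps
```

In this model, a serve controller with both GCP and OCI enabled gets only `gcp-deps` from the first function but both entries from the second, which corresponds to the failure reported above for a GCP-hosted controller launching OCI replicas.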