skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

FROZEN SCREEN BEFORE PRICING ++ PROVISIONING Loop #3454

Open kyegomez opened 2 months ago

kyegomez commented 2 months ago

SkyPilot keeps freezing when I try to serve something, and even when it does work it says PROVISIONING forever and never works across multiple clouds. I need help ASAP.

Service from YAML spec: sky_serve.yaml
Service Spec:
Readiness probe method:           GET /health
Readiness initial delay seconds:  1200
Replica autoscaling policy:       Autoscaling from 2 to 10 replicas        
Each replica will use the following resources (estimated):

SKY YAML

envs:
  MODEL_NAME: cogvlm-chat-17b
  HF_HUB_ENABLE_HF_TRANSFER: True

# service.yaml
service:
  # Only one readiness_probe key is allowed in this mapping; with the duplicate
  # below, the later readiness_probe: /health overrides the request-based probe,
  # which is why the Service Spec output above shows GET /health.
  # An actual request-based readiness probe would be:
  # readiness_probe:
  #   path: /v1/chat/completions
  #   post_data:
  #     model: $MODEL_NAME
  #     messages:
  #       - role: user
  #         content: Hello! What is your name?
  #     max_tokens: 1
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 2.5
    upscale_delay_seconds: 300
    downscale_delay_seconds: 1200

# Advanced Kubernetes configurations (optional).
# kubernetes:
#   # The networking mode for accessing SSH jump pod (optional).
#   #
#   # This must be either: 'nodeport' or 'portforward'. If not specified,
#   # defaults to 'portforward'.

#   #
#   # nodeport: Exposes the jump pod SSH service on a static port number on each
#   # Node, allowing external access using <NodeIP>:<NodePort>. Using this
#   # mode requires opening multiple ports on nodes in the Kubernetes cluster.
#   #
#   # portforward: Uses `kubectl port-forward` to create a tunnel and directly
#   # access the jump pod SSH service in the Kubernetes cluster. Does not
#   # require opening ports on the cluster nodes and is more secure. 'portforward'
#   # is used as default if 'networking' is not specified.
#   networking: portforward

#   # The mode to use for opening ports on Kubernetes
#   #
#   # This must be either: 'ingress' or 'loadbalancer'. If not specified,
#   # defaults to 'loadbalancer'.
#   #
#   # loadbalancer: Creates services of type `LoadBalancer` to expose ports.
#   # See https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service.
#   # This mode is supported out of the box on most cloud managed Kubernetes
#   # environments (e.g., GKE, EKS).
#   #
#   # ingress: Creates an ingress and a ClusterIP service for each port opened.
#   # Requires an Nginx ingress controller to be configured on the Kubernetes cluster.
#   # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress
#   # for details on deploying the NGINX ingress controller.
#   ports: loadbalancer

#   # Attach custom metadata to Kubernetes objects created by SkyPilot
#   #
#   # Uses the same schema as Kubernetes metadata object: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#objectmeta-v1-meta
#   #
#   # Since metadata is applied to all objects created by SkyPilot,
#   # specifying 'name' and 'namespace' fields here is not allowed.
#   # custom_metadata:
#   #   labels:
#   #     mylabel: myvalue
#   #   annotations:
#   #     myannotation: myvalue

#   # Additional fields to override the pod fields used by SkyPilot (optional)
#   #
#   # Any key:value pairs added here would get added to the pod spec used to
#   # create SkyPilot pods. The schema follows the same schema for a Pod object
#   # in the Kubernetes API:
#   # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#pod-v1-core
#   #
#   # # Some example use cases are shown below. All fields are optional.
#   # pod_config:
#   #   spec:
#   #     runtimeClassName: nvidia    # Custom runtimeClassName for GPU pods.
#   #     containers:
#   #       - env:                # Custom environment variables for the pod, e.g., for proxy
#   #         - name: HTTP_PROXY
#   #           value: http://proxy-host:3128
#   #         volumeMounts:       # Custom volume mounts for the pod
#   #           - mountPath: /foo
#   #             name: swarms
#   #             readOnly: true
#   #     volumes:
#   #       - name: swarms
#   #         hostPath:
#   #           path: /tmp
#   #           type: Directory
#   #       - name: swarms          # Use this to modify the /dev/shm volume mounted by SkyPilot
#   #         emptyDir:
#   #           medium: Memory
#   #           sizeLimit: 3Gi    # Set a size limit for the /dev/shm volume

# Fields below describe each replica.
resources:
  accelerators: {L4:8, A10g:8, A100:4, A100:8, A100-80GB:2}
  # cpus: 32+
  # memory: 512+
  # use_spot: True
  # disk_size: 512  # Ensure model checkpoints (~246GB) can fit.
  # disk_tier: best
  ports: 8000  # Expose to internet traffic.
  # spot_recovery: none

# workdir: ~/swarms-cloud/servers/cogvlm

setup: |
  docker build -t cogvlm .

run: |
  # Publish port 8000 so the readiness probe (GET /health) and external traffic can reach the container.
  docker run --gpus all -p 8000:8000 cogvlm

Version & Commit info:

concretevitamin commented 2 months ago

Hi @kyegomez, I just tried commit https://github.com/skypilot-org/skypilot/commit/1e4e871398e121708d3e9809c0a98b905bf9f212 and it also failed for me (though it did not freeze):

I 04-19 14:50:51 provisioner.py:553] Successfully provisioned cluster: sky-serve-controller-8a3968f2

...

E 04-19 14:52:16 subprocess_utils.py:84] ValueError: Failed to register service 'sky-service-1f06' on the SkyServe controller. Reason:
E 04-19 14:52:16 subprocess_utils.py:84] Traceback (most recent call last):
E 04-19 14:52:16 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
E 04-19 14:52:16 subprocess_utils.py:84]     return _run_code(code, main_globals, None,
E 04-19 14:52:16 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
E 04-19 14:52:16 subprocess_utils.py:84]     exec(code, run_globals)
E 04-19 14:52:16 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/service.py", line 260, in <module>
E 04-19 14:52:16 subprocess_utils.py:84]     _start(args.service_name, args.task_yaml, args.job_id)
E 04-19 14:52:16 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/service.py", line 147, in _start
E 04-19 14:52:16 subprocess_utils.py:84]     success = serve_state.add_service(
E 04-19 14:52:16 subprocess_utils.py:84]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/serve_state.py", line 225, in add_service
E 04-19 14:52:16 subprocess_utils.py:84]     _DB.conn.commit()
E 04-19 14:52:16 subprocess_utils.py:84] sqlite3.OperationalError: database is locked
E 04-19 14:52:16 subprocess_utils.py:84]
E 04-19 14:52:16 subprocess_utils.py:84]
RuntimeError: Failed to spin up the service. Please check the logs above for more details.

Could you install the latest nightly (pip uninstall -y skypilot; pip install "skypilot-nightly[..your clouds..]")? It worked for me on today's main branch, commit 24fcb44e7.
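
For reference, a minimal reinstall sketch (the aws,azure extras below are only placeholders; use the clouds you actually have enabled):

# Switch from the stable release to the nightly build.
pip uninstall -y skypilot
pip install "skypilot-nightly[aws,azure]"

# Confirm the new version and re-validate cloud credentials.
sky --version
sky check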

kyegomez commented 2 months ago

Yeah, now I'm getting this error with the provided llama3 file. It should accept it; maybe I don't have the right clouds enabled, but man, it's been error after error.

Service from YAML spec: sky_serve.yaml
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Service Spec:
Readiness probe method:           POST /v1/chat/completions {"model": "meta-llama/Meta-Llama-3-70B-Instruct", "messages": [{"role": "user", "content": "Hello! What is your name?"}], "max_tokens": 1}
Readiness initial delay seconds:  1200
Replica autoscaling policy:       Fixed 2 replicas
Spot Policy:                      No spot policy

Each replica will use the following resources (estimated):
I 04-19 17:59:22 optimizer.py:1208] No resource satisfying <Cloud>({'L40': 1}, ports=['8081']) on [AWS, Azure, RunPod].
I 04-19 17:59:22 optimizer.py:1208] No resource satisfying <Cloud>({'A40': 1}, ports=['8081']) on [AWS, Azure, RunPod].
I 04-19 17:59:22 optimizer.py:1212] Did you mean: ['A100-80GB:8']
I 04-19 17:59:22 optimizer.py:1208] No resource satisfying <Cloud>({'A100': 1}, ports=['8081']) on [AWS, Azure, RunPod].
I 04-19 17:59:22 optimizer.py:1212] Did you mean: ['A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:8', 'A10G:1', 'A10G:4', 'A10G:8']
I 04-19 17:59:22 optimizer.py:693] == Optimizer ==
I 04-19 17:59:22 optimizer.py:704] Target: minimizing cost
I 04-19 17:59:22 optimizer.py:716] Estimated cost: $0.5 / hour
I 04-19 17:59:22 optimizer.py:716]
I 04-19 17:59:22 optimizer.py:839] Considered resources (1 node):
I 04-19 17:59:22 optimizer.py:909] -------------------------------------------------------------------------------------------------------
I 04-19 17:59:22 optimizer.py:909]  CLOUD   INSTANCE                   vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 04-19 17:59:22 optimizer.py:909] -------------------------------------------------------------------------------------------------------
I 04-19 17:59:22 optimizer.py:909]  Azure   Standard_NV6ads_A10_v5     6       55        A10:1          eastus        0.45          ✔
I 04-19 17:59:22 optimizer.py:909]  AWS     g6.xlarge                  4       16        L4:1           us-east-1     0.80  
I 04-19 17:59:22 optimizer.py:909]  AWS     g5.xlarge                  4       16        A10G:1         us-east-1     1.01  
I 04-19 17:59:22 optimizer.py:909]  Azure   Standard_NC24ads_A100_v4   24      220       A100-80GB:1    eastus        3.67  
I 04-19 17:59:22 optimizer.py:909] -------------------------------------------------------------------------------------------------------
I 04-19 17:59:22 optimizer.py:909]
I 04-19 17:59:22 optimizer.py:927] Multiple Azure instances satisfy A10:1. The cheapest Azure(Standard_NV6ads_A10_v5, {'A10': 1}, ports=['8081']) is considered among:
I 04-19 17:59:22 optimizer.py:927] ['Standard_NV6ads_A10_v5', 'Standard_NV12ads_A10_v5', 'Standard_NV18ads_A10_v5', 'Standard_NV36ads_A10_v5', 'Standard_NV36adms_A10_v5'].
I 04-19 17:59:22 optimizer.py:927]
I 04-19 17:59:22 optimizer.py:927] Multiple AWS instances satisfy A10:1. The cheapest AWS(g5.xlarge, {'A10G': 1}, ports=['8081']) is considered among:
I 04-19 17:59:22 optimizer.py:927] ['g5.xlarge', 'g5.2xlarge', 'g5.4xlarge', 'g5.8xlarge', 'g5.16xlarge'].
I 04-19 17:59:22 optimizer.py:927]
I 04-19 17:59:22 optimizer.py:933] To list more details, run 'sky show-gpus A10'.
Launching a new service 'sky-service-c37d'. Proceed? [Y/n]: Y
Launching controller for 'sky-service-c37d'...
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-service-c37d>(run='# Start sky serve se...')
  resources: default instances.

To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

SKY YAML

# Serving Meta Llama-3 on your own infra.
#
# Usage:
#
#  HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
#   ENDPOINT=$(sky status --endpoint 8081 llama3)
#  
#   # We need to manually specify the stop_token_ids to make sure the model finish
#   # on <|eot_id|>.
#   curl http://$ENDPOINT/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{
#       "model": "meta-llama/Meta-Llama-3-8B-Instruct",
#       "messages": [
#         {
#           "role": "system",
#           "content": "You are a helpful assistant."
#         },
#         {
#           "role": "user",
#           "content": "Who are you?"
#         }
#       ],
#       "stop_token_ids": [128009,  128001]
#     }'
#
# Chat with model with Gradio UI:
#
#   Running on local URL:  http://127.0.0.1:8811
#   Running on public URL: https://<hash>.gradio.live
#
# Scale up with SkyServe:
#  HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN
#
# curl /v1/chat/completions:
#
#   ENDPOINT=$(sky serve status --endpoint llama3)
#   curl -L $ENDPOINT/v1/models
#   curl -L http://$ENDPOINT/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{
#       "model": "databricks/llama3-instruct",
#       "messages": [
#         {
#           "role": "system",
#           "content": "You are a helpful assistant."
#         },
#         {
#           "role": "user",
#           "content": "Who are you?"
#         }
#       ]
#     }'

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
  # MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

service:
  replicas: 2
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: Hello! What is your name?
      max_tokens: 1

resources:
  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
  cpus: 32+
  use_spot: True
  disk_size: 512  # Ensure model checkpoints can fit.
  disk_tier: best
  ports: 8081  # Expose to internet traffic.

setup: |
  conda activate vllm
  if [ $? -ne 0 ]; then
    conda create -n vllm python=3.10 -y
    conda activate vllm
  fi

  pip install vllm==0.4.0.post1
  # Install Gradio for web UI.
  pip install gradio openai
  pip install flash-attn==2.5.7

run: |
  conda activate vllm
  echo 'Starting vllm api server...'

  # https://github.com/vllm-project/vllm/issues/3098
  export PATH=$PATH:/sbin

  # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64 \
    2>&1 | tee api_server.log &

  while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001

concretevitamin commented 2 months ago

I'd suggest using sky launch <yaml> first. It's for troubleshooting whether launching a single instance works ;)
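
For example, a rough sketch of that check (test-replica is a hypothetical cluster name; adjust the env flags to your setup):

# Launch one replica's worth of resources directly, without SkyServe.
HF_TOKEN=xxx sky launch sky_serve.yaml -c test-replica --env HF_TOKEN

# Clean up afterwards.
sky down test-replica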

sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-service-c37d>(run='# Start sky serve se...')
  resources: default instances.

This suggests that even a default CPU-only serve controller cannot be launched. Could you run sky launch --down to see if that works? This is also just for getting past any initial quota/permission errors.
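
As a sketch (test-cpu is just a placeholder cluster name; -y skips the confirmation prompt):

# Launch a default CPU-only instance; --down enables autodown once jobs finish.
sky launch --down -c test-cpu -y

# Check cluster state.
sky status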

kyegomez commented 2 months ago

@concretevitamin it builds now with sky launch but we'll see if it passes provisioning. I'm able to launch a cluster but then it just says provisioning 24/7

concretevitamin commented 2 months ago

It's most commonly due to quota issues. You could use sky serve logs <service_name> 1 to check replica 1's provisioning logs. BTW, feel free to join https://slack.skypilot.co/ for quick debugging.
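
For example, using the service name from the logs above:

# Provisioning and setup logs for replica 1 of the service.
sky serve logs sky-service-c37d 1

# Overall status of the service and its replicas.
sky serve status sky-service-c37d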