skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Skypilot only wants to spawn 4 core cpu controller when sky serve up #4197

Open mainey opened 3 weeks ago

mainey commented 3 weeks ago

When I run sky serve up with a manifest that uses spot resources, SkyPilot insists on launching a controller instance with at least 4 vCPUs, even though my config specifies a 2-vCPU instance type (c6a.large).

~/.sky/config.yaml

serve:
  controller:
    resources:
      cloud: aws
      region: ap-southeast-1
      instance_type: c6a.large
      disk_size: 50

jobs:
  controller:
    resources:
      cloud: aws
      region: ap-southeast-1
      instance_type: c6a.large
      disk_size: 50

allowed_clouds:
  - aws

sky serve up prod.yaml

service:
  replica_policy:
    min_replicas: 1
    max_replicas: 1
    target_qps_per_replica: 2
  readiness_probe:
    path: /embeddings
    headers:
      Authorization: Bearer $AUTH_TOKEN
    post_data:
      model: $MODEL_NAME
      user: "user"
      input:
        "a"

resources:
  cloud: aws
  disk_tier: best
  use_spot: true
  disk_size: 100
  ports: 8000
  any_of:
    - cloud: aws
      region: ap-southeast-1
      accelerators: T4g

envs:
  #censored

setup: |
  #censored

run: |
  docker run --runtime nvidia --gpus all -p 8000:8000 \
    --env #CENSORED \
    censored \
    --model-id $MODEL_NAME \
    --port 8000

Error log

ValueError: c6a.large does not have enough vCPUs. c6a.large has 2.0 vCPUs, but 4+ is requested.

The above exception was the direct cause of the following exception:

ValueError: Serve controller resources is not valid, please check ~/.sky/config.yaml file and make sure serve.controller.resources is a valid resources spec. details:
  [valueerror] c6a.large does not have enough vcpus. c6a.large has 2.0 vcpus, but 4+ is requested.

Version & Commit info:

cblmemo commented 3 weeks ago

Hi @mainey! Thanks for reporting this. Could you try adding cpus: 2+ under the {jobs,serve}.controller.resources field?

This makes me wonder whether we should ignore the default settings (at least for CPUs) when we find a customized resource spec. cc @Michaelvll for a look here.

mainey commented 3 weeks ago

@cblmemo It works, thanks. Before that, I tried cpus: 2 and it didn't work, but cpus: 2+ works.

Edit: reopening the issue in case there is more discussion.
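
For reference, the workaround reported as working in this thread would look roughly like the following in ~/.sky/config.yaml. This is a sketch assembled from the reporter's original config plus the suggested cpus: 2+ line; all other fields are copied unchanged from the report, and it has not been verified beyond what the thread states.

```yaml
# ~/.sky/config.yaml (sketch based on the config posted in this issue)
serve:
  controller:
    resources:
      cloud: aws
      region: ap-southeast-1
      instance_type: c6a.large
      disk_size: 50
      cpus: 2+   # explicit CPU request, replaces the default 4+ seen in the error log

jobs:
  controller:
    resources:
      cloud: aws
      region: ap-southeast-1
      instance_type: c6a.large
      disk_size: 50
      cpus: 2+   # same override for the jobs controller

allowed_clouds:
  - aws
```

The error message suggests that, without an explicit cpus entry, a default request of 4+ vCPUs is applied on top of the chosen instance_type, which c6a.large (2 vCPUs) cannot satisfy.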

cblmemo commented 3 weeks ago

I just tried with cpus: 2 and it seems to work for me. Do you happen to still have the error logs from the cpus: 2 attempt?

serve:
  controller:
    resources:
      cloud: aws
      region: ap-southeast-1
      instance_type: c6a.large
      disk_size: 50
      cpus: 2