skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

stuck at "STARTING" when launching with a custom image on runpod #4285

Open alita-moore opened 2 weeks ago

alita-moore commented 2 weeks ago

I am trying to run a Docker server on RunPod that serves a FastAPI app on port 8000. When I provision the resources, the system gets stuck at status "STARTING". Here are the details:

resources:
  any_of:
    - image_id: docker:teamwoven/convert:sultan-1.0
      cloud: runpod
      ports: 8000
      accelerators: RTX4090:1
    - image_id: docker:teamwoven/convert:sultan-1.0
      cloud: aws
      ports: 8000
      accelerators: RTX4090:1

service:
  readiness_probe: /status

envs:
  SKYPILOT_DOCKER_USERNAME:
  SKYPILOT_DOCKER_PASSWORD:
  SKYPILOT_DOCKER_SERVER: docker.io

setup: |
  #

run: |
  /usr/local/bin/uvicorn extractor_inference.app:app --host "0.0.0.0" --port 8000 --workers 2
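
For reference, the readiness_probe above means SkyServe polls GET /status on port 8000 and only marks a replica READY once it returns HTTP 200. A minimal sketch of a compatible endpoint (only the /status route, the port, and the module path extractor_inference.app come from the YAML above; the handler body is an assumption):

# extractor_inference/app.py: hypothetical minimal app matching the run command
from fastapi import FastAPI

app = FastAPI()

@app.get("/status")
def status():
    # Any HTTP 200 response here satisfies the readiness probe.
    return {"status": "ok"}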

Launching produces the following log output:

> sky serve status                 
Services
NAME              VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT           
sky-service-0fc2  -        -       NO_REPLICA  0/2       3.223.6.236:30001  

Service Replicas
SERVICE_NAME      ID  VERSION  ENDPOINT                    LAUNCHED     RESOURCES                  STATUS    REGION  
sky-service-0fc2  2   2        http://89.187.159.54:40012  17 mins ago  1x RunPod({'RTX4090': 1})  STARTING  CZ      
sky-service-0fc2  3   3        http://89.187.159.54:40014  13 mins ago  1x RunPod({'RTX4090': 1})  STARTING  CZ    

and

> sky serve logs sky-service-0fc2 3
Start streaming logs for launching process of replica 3.
I 11-07 10:50:27 replica_managers.py:84] Launching replica (id: 3) cluster sky-service-0fc2-3 with resources: {RunPod({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000']), AWS({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000'])}
I 11-07 10:50:27 optimizer.py:1318] No resource satisfying AWS({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000']) on AWS.
I 11-07 10:50:27 optimizer.py:737] Target: minimizing cost
I 11-07 10:50:27 optimizer.py:750] Estimated cost: $0.7 / hour
I 11-07 10:50:27 optimizer.py:750] 
I 11-07 10:50:27 optimizer.py:885] Considered resources (1 node):
I 11-07 10:50:27 optimizer.py:955] -------------------------------------------------------------------------------------------------
I 11-07 10:50:27 optimizer.py:955]  CLOUD    INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 11-07 10:50:27 optimizer.py:955] -------------------------------------------------------------------------------------------------
I 11-07 10:50:27 optimizer.py:955]  RunPod   1x_RTX4090_SECURE   16      24        RTX4090:1      CA            0.74          ✔     
I 11-07 10:50:27 optimizer.py:955] -------------------------------------------------------------------------------------------------
Key already exists
I 11-07 10:50:29 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on RunPod CA.
W 11-07 10:50:30 instance.py:94] run_instances error: There are no longer any instances available with the requested specifications. Please refresh and try again.
W 11-07 10:50:30 cloud_vm_ray_backend.py:2017] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in CA for {AWS({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000']), RunPod({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000'])}. 
W 11-07 10:50:30 cloud_vm_ray_backend.py:2051] 
W 11-07 10:50:30 cloud_vm_ray_backend.py:2051] ↺ Trying other potential resources.
I 11-07 10:50:30 optimizer.py:1318] No resource satisfying AWS({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000']) on AWS.
I 11-07 10:50:30 optimizer.py:737] Target: minimizing cost
I 11-07 10:50:30 optimizer.py:750] Estimated cost: $0.7 / hour
I 11-07 10:50:30 optimizer.py:750] 
I 11-07 10:50:30 optimizer.py:885] Considered resources (1 node):
I 11-07 10:50:30 optimizer.py:955] -------------------------------------------------------------------------------------------------
I 11-07 10:50:30 optimizer.py:955]  CLOUD    INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 11-07 10:50:30 optimizer.py:955] -------------------------------------------------------------------------------------------------
I 11-07 10:50:30 optimizer.py:955]  RunPod   1x_RTX4090_SECURE   16      24        RTX4090:1      CZ            0.74          ✔     
I 11-07 10:50:30 optimizer.py:955] -------------------------------------------------------------------------------------------------
Key already exists
I 11-07 10:50:31 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on RunPod CZ.
I 11-07 10:51:08 provisioner.py:445] └── Instance is up.
I 11-07 10:52:21 provisioner.py:550] ✓ Cluster launched: sky-service-0fc2-3.  View logs at: ~/sky_logs/sky-2024-11-07-10-50-27-698777/provision.log
I 11-07 10:52:21 execution.py:299] ⚙︎ Mounting files.
I 11-07 10:52:23 cloud_vm_ray_backend.py:3244] ✓ Setup completed.
I 11-07 10:52:23 cloud_vm_ray_backend.py:3456] Multiple resources are specified for the task, using: RunPod({'RTX4090': 1}, image_id=docker:teamwoven/convert:sultan-1.0, ports=['8000'])
I 11-07 10:52:28 cloud_vm_ray_backend.py:3355] ⚙︎ Job submitted, ID: 1
I 11-07 10:52:28 cloud_vm_ray_backend.py:3391] 
I 11-07 10:52:28 cloud_vm_ray_backend.py:3391] 📋 Useful Commands
I 11-07 10:52:28 cloud_vm_ray_backend.py:3391] Job ID: 1
I 11-07 10:52:28 cloud_vm_ray_backend.py:3391] ├── To cancel the job:           sky cancel sky-service-0fc2-3 1
I 11-07 10:52:28 cloud_vm_ray_backend.py:3391] ├── To stream job logs:          sky logs sky-service-0fc2-3 1
I 11-07 10:52:28 cloud_vm_ray_backend.py:3391] └── To view job queue:           sky queue sky-service-0fc2-3
I 11-07 10:52:28 cloud_vm_ray_backend.py:3483] 
I 11-07 10:52:28 cloud_vm_ray_backend.py:3483] Cluster name: sky-service-0fc2-3
I 11-07 10:52:28 cloud_vm_ray_backend.py:3483] ├── To log into the head VM:     ssh sky-service-0fc2-3
I 11-07 10:52:28 cloud_vm_ray_backend.py:3483] ├── To submit a job:             sky exec sky-service-0fc2-3 yaml_file
I 11-07 10:52:28 cloud_vm_ray_backend.py:3483] ├── To stop the cluster:         sky stop sky-service-0fc2-3
I 11-07 10:52:28 cloud_vm_ray_backend.py:3483] └── To teardown the cluster:     sky down sky-service-0fc2-3

I 11-07 10:52:28 replica_managers.py:104] Replica cluster sky-service-0fc2-3 launched.
Start streaming logs for task job of replica 3...
Job ID not provided. Streaming the logs of the latest job.
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)

Version & Commit info:

alita-moore commented 2 weeks ago

This randomly started working 🤷‍♀️

concretevitamin commented 2 weeks ago

Maybe it was spending time initializing / pulling packages or checkpoints?

alita-moore commented 2 weeks ago

I'm not sure; I didn't see the logs moving at all, and the init script had finished. I'll let you know if it happens again.

Michaelvll commented 1 week ago

It might be an availability issue with RunPod, i.e. SkyServe kept trying to acquire resources from RunPod but failed due to availability. sky serve logs sky-service-0fc2 3 will show useful logs for that specific replica.
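
A quick way to verify the probe path on a stuck replica, using the cluster name and port from the output above (generic commands, not from this thread):

> ssh sky-service-0fc2-3
> curl http://localhost:8000/status

If curl fails or hangs while the uvicorn job is running, the readiness probe cannot succeed and the replica will not be marked READY.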