skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Storage][Serve] Surface the error message when SkyServe failed on workdir storage #3951

Open cblmemo opened 1 week ago

cblmemo commented 1 week ago

On the latest master, a service with a workdir fails to sync up its workdir due to an assertion error. Services without a workdir do not have this problem. This blocks a common use case of sky serve and also causes a number of smoke tests to fail. We should fix this.

$ sky serve up examples/serve/http_server/task.yaml
Service from YAML spec: examples/serve/http_server/task.yaml
Service Spec:
Readiness probe method:           GET /health
Readiness initial delay seconds:  20
Readiness probe timeout seconds:  15
Replica autoscaling policy:       Fixed 2 replicas
Spot Policy:                      No spot fallback policy

Each replica will use the following resources (estimated):
I 09-16 20:57:39 optimizer.py:719] == Optimizer ==
I 09-16 20:57:39 optimizer.py:730] Target: minimizing cost
I 09-16 20:57:39 optimizer.py:742] Estimated cost: $0.0 / hour
I 09-16 20:57:39 optimizer.py:742] 
I 09-16 20:57:39 optimizer.py:867] Considered resources (1 node):
I 09-16 20:57:39 optimizer.py:937] --------------------------------------------------------------------------------------------------------
I 09-16 20:57:39 optimizer.py:937]  CLOUD        INSTANCE             vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 09-16 20:57:39 optimizer.py:937] --------------------------------------------------------------------------------------------------------
I 09-16 20:57:39 optimizer.py:937]  Kubernetes   2CPU--2GB            2       2         -              kubernetes      0.00          ✔     
I 09-16 20:57:39 optimizer.py:937]  AWS          m6i.large            2       8         -              us-east-1       0.10                
I 09-16 20:57:39 optimizer.py:937]  Azure        Standard_D2s_v5      2       8         -              eastus          0.10                
I 09-16 20:57:39 optimizer.py:937]  GCP          n2-standard-2        2       8         -              us-central1-a   0.10                
I 09-16 20:57:39 optimizer.py:937]  RunPod       1x_RTXA4000_SECURE   6       16        RTXA4000:1     CA              0.34                
I 09-16 20:57:39 optimizer.py:937] --------------------------------------------------------------------------------------------------------
I 09-16 20:57:39 optimizer.py:937] 
Launching a new service 'sky-service-93ad'. Proceed? [Y/n]: 
I 09-16 20:57:42 controller_utils.py:600] Translating workdir to SkyPilot Storage...
I 09-16 20:57:42 controller_utils.py:625] Workdir 'examples/serve/http_server' will be synced to cloud storage 'skypilot-workdir-txia-5e091ebd'.
I 09-16 20:57:42 controller_utils.py:698] Uploading sources to cloud storage. See: sky storage ls
E 09-16 20:57:43 storage.py:902] Could not create StoreType.S3 store with name skypilot-workdir-txia-5e091ebd.
AssertionError: ('We only support one store type for now.', {})

Version & Commit info:

cblmemo commented 1 week ago

cc @landscapepainter - is this related to some recent storage update?

cblmemo commented 6 days ago

Update: this is caused by too many buckets having been created. After deleting all stale local storage, it works. We should surface this error to users.

https://github.com/skypilot-org/skypilot/blob/e870839aeed16c118c0eb1f4889efc20006c27c4/sky/data/storage.py#L1465-L1468

E 09-17 10:04:58 storage.py:902] Could not create StoreType.S3 store with name skypilot-workdir-memory-2972ca61. Error: Attempted to create a bucket skypilot-workdir-memory-2972ca61 but failed.Error: An error occurred (TooManyBuckets) when calling the CreateBucket operation: You have attempted to create more buckets than allowed
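
One way to surface the root cause could look like the sketch below. It is a standalone illustration, not SkyPilot's actual code: `StorageInitError`, `create_stores`, and `failing_s3` are hypothetical names, and the error text is copied from the log above. The idea is to record each store creation failure and raise with those messages when no store was created, so the user sees `TooManyBuckets` instead of the later `AssertionError`.

```python
from typing import Callable, Dict, List


class StorageInitError(RuntimeError):
    """Hypothetical error type that carries the underlying creation failures."""


def create_stores(creators: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Try to create each store; fail loudly with the root causes if none succeed.

    `creators` maps a store type name to a zero-argument callable that creates
    the store. Both are stand-ins for the real store construction logic.
    """
    stores: Dict[str, object] = {}
    errors: List[str] = []
    for store_type, create in creators.items():
        try:
            stores[store_type] = create()
        except Exception as e:  # record the cause instead of only logging it
            errors.append(f'{store_type}: {e}')
    if not stores:
        # Raise here, with the recorded causes, rather than letting a later
        # assertion on the number of stores hide them.
        raise StorageInitError(
            'No store could be created. Underlying errors:\n' + '\n'.join(errors))
    return stores


def failing_s3() -> object:
    """Stand-in for a bucket creation call that hits the account bucket limit."""
    raise RuntimeError(
        'An error occurred (TooManyBuckets) when calling the CreateBucket '
        'operation: You have attempted to create more buckets than allowed')


if __name__ == '__main__':
    try:
        create_stores({'S3': failing_s3})
    except StorageInitError as e:
        print(e)
```

With this shape, the message the user sees names the bucket limit directly, which also points at the workaround of deleting stale storage.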