skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.78k stars, 509 forks

[Storage] Fail early for COPY mode storage field without source specified #3477

Closed Michaelvll closed 2 months ago

Michaelvll commented 6 months ago

The following YAML fails under sky spot launch with FAILED_CONTROLLER when the bucket zhwu-bucket-test-2 is listed by sky storage ls on the laptop but does not exist on the spot controller.

resources:
  cpus: 2+

file_mounts:
  test.yaml: test.yaml
  /my_mount:
    name: zhwu-bucket-test-2
    store: r2
    mode: COPY

run: |
  ls /my_mount/llm

Error:

(test-cloudflare, pid=8291) Process Process-1:
(test-cloudflare, pid=8291) sky.exceptions.StorageSourceError: New storage object: source must be specified when using COPY mode.
(test-cloudflare, pid=8291) I 04-24 22:46:42 controller.py:475] Killing controller process 23421.
(test-cloudflare, pid=8291) I 04-24 22:46:42 controller.py:483] Controller process 23421 killed.
(test-cloudflare, pid=8291) I 04-24 22:46:42 controller.py:485] Cleaning up any spot cluster for job 6.
(test-cloudflare, pid=8291) sky.exceptions.StorageSourceError: New storage object: source must be specified when using COPY mode.
ERROR: Job 6 failed with return code list: [1] 

The error should surface early, when the user calls sky spot launch, instead of only after the controller is up and the job has been scheduled.
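A minimal sketch of what such a client-side check could look like, assuming the task YAML has already been parsed into a plain dict. Everything here is hypothetical and not SkyPilot's actual code: the helper names find_copy_mounts_without_source and validate_before_submit are invented for illustration, StorageSourceError is a local stand-in for sky.exceptions.StorageSourceError, and the MOUNT default for an omitted mode is an assumption. The check deliberately ignores local storage state, since the controller may not share it.

```python
from typing import Any, Dict, List


class StorageSourceError(ValueError):
    """Local stand-in for sky.exceptions.StorageSourceError (hypothetical)."""


def find_copy_mounts_without_source(file_mounts: Dict[str, Any]) -> List[str]:
    """Return mount paths whose storage spec uses COPY mode but has no source."""
    missing = []
    for mount_path, spec in file_mounts.items():
        # Plain "dst: src" file mounts are strings, not storage specs; skip them.
        if not isinstance(spec, dict):
            continue
        # Assumption: an omitted mode is treated as MOUNT.
        mode = str(spec.get('mode', 'MOUNT')).upper()
        if mode == 'COPY' and not spec.get('source'):
            missing.append(mount_path)
    return missing


def validate_before_submit(file_mounts: Dict[str, Any]) -> None:
    """Raise on the client before any controller is provisioned."""
    missing = find_copy_mounts_without_source(file_mounts)
    if missing:
        raise StorageSourceError(
            'source must be specified when using COPY mode: '
            + ', '.join(missing))


# The file_mounts section from the YAML in this issue, as a dict.
file_mounts = {
    'test.yaml': 'test.yaml',
    '/my_mount': {
        'name': 'zhwu-bucket-test-2',
        'store': 'r2',
        'mode': 'COPY',
    },
}

try:
    validate_before_submit(file_mounts)
except StorageSourceError as e:
    print(f'Rejected before launch: {e}')
```

Running the check at submission time, without consulting the laptop's storage records, would reject the YAML above even when zhwu-bucket-test-2 is known locally, which matches the state the controller starts from.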

Version & Commit info:

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stalled for 10 days with no activity.