skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Storage] Sky managed job controller unaware of local storages #3705

Open visatish opened 3 days ago

visatish commented 3 days ago

I've noticed that the local sky-managed storage state is not synced to the sky managed job controller, i.e. sky storage ls on the controller != sky storage ls in the local CLI from which the controller is launched. Thus, if you refer to a local storage by name and it was originally created in a region other than the default, it will be recreated in the default region.

I'm not sure if this is the intended behavior or a bug, but if it is intended, it should be documented better for first-time users. Otherwise you unknowingly end up with empty mount points (an empty container is created in the default region and mounted).


landscapepainter commented 3 days ago

Thanks for the elaboration. This seems to be a duplicate of #795.

romilbhardwaj commented 2 days ago

This should be fixed by https://github.com/skypilot-org/skypilot/pull/3671.

visatish commented 2 days ago

@romilbhardwaj is that quite the same issue? This issue is specifically w.r.t. using locally-created buckets in a managed job. https://github.com/skypilot-org/skypilot/issues/3666 seems to be creating new buckets. https://github.com/skypilot-org/skypilot/issues/795 seems to be the opposite of this issue once https://github.com/skypilot-org/skypilot/issues/3666 is supported - viewing controller-created buckets locally.

romilbhardwaj commented 2 days ago

Hey @visatish - I agree, I don't think #795 is related to this, and it is not solved yet.

#3671 should address this issue (and #3666, will test). #3671 now translates name:-based mounts to source:-based mounts when they are run on the controller.

E.g.,

/output/:
  name: my-local-storage
  store: s3
  mode: MOUNT

is translated to this when run on the controller:

/output/:
  source: s3://my-local-storage
  mode: MOUNT

As a result, if you already have a local storage, it will automatically be converted to a source-based storage, which the controller will interpret as "This object store already exists, simply use it. Do not create."
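
The translation described above can be sketched in Python roughly as follows. This is an illustrative sketch only, not the code from #3671: the function name translate_name_mounts and the STORE_TO_SCHEME table are hypothetical, and the sketch assumes that a mount with name: and store: but no source: refers to a bucket that already exists.

```python
from typing import Any, Dict

# Hypothetical mapping from a `store:` value to a bucket URI scheme.
# Only the s3 case appears in this thread; gcs -> gs is an added assumption.
STORE_TO_SCHEME = {"s3": "s3", "gcs": "gs"}


def translate_name_mounts(
    file_mounts: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Rewrite name-based storage mounts as source-based mounts.

    Mounts that already have a `source:`, or whose `store:` is missing
    or unknown, are copied through unchanged.
    """
    translated: Dict[str, Dict[str, Any]] = {}
    for mount_path, spec in file_mounts.items():
        store = spec.get("store")
        if "name" in spec and "source" not in spec and store in STORE_TO_SCHEME:
            # Keep every other key (e.g. mode) and replace name/store
            # with a single source URI pointing at the existing bucket.
            rest = {k: v for k, v in spec.items() if k not in ("name", "store")}
            source = f"{STORE_TO_SCHEME[store]}://{spec['name']}"
            translated[mount_path] = {"source": source, **rest}
        else:
            translated[mount_path] = dict(spec)
    return translated


if __name__ == "__main__":
    mounts = {
        "/output/": {"name": "my-local-storage", "store": "s3", "mode": "MOUNT"}
    }
    print(translate_name_mounts(mounts))
```

Applied to the /output/ example above, the sketch yields a mount with source: s3://my-local-storage and mode: MOUNT, matching the translated form shown in the comment.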

visatish commented 2 days ago

@romilbhardwaj ah okay, I was thrown off by the emphasis on newly-created buckets in https://github.com/skypilot-org/skypilot/pull/3671, but I now see how it will also fix this issue - that is also the workflow patch I have been using locally. Thanks for the clarification!

romilbhardwaj commented 1 day ago

Confirming this is resolved by #3671. Launched a task with my pre-existing storage object romil-output-bucket-15 shown in sky storage ls:

file_mounts:
  /outputs:
    name: romil-output-bucket-15
    store: s3
    mode: MOUNT

run: |
  ls -la /outputs
  cat /outputs/hello.txt

This loads my existing bucket as expected:

(t, pid=1417) total 9
(t, pid=1417) drwxr-xr-x 2 sky  sky  4096 Jun 29 17:51 .
(t, pid=1417) drwxr-xr-x 1 root root 4096 Jun 29 17:51 ..
(t, pid=1417) -rw-r--r-- 1 sky  sky    14 Jun 27 01:03 hello.txt
(t, pid=1417) Hello, world!

visatish commented 1 day ago

@romilbhardwaj thanks for confirming! And to clarify, store: is unnecessary here right? I.e. it should be part of the metadata.

romilbhardwaj commented 1 day ago

Yes, store: is not necessary in the above example.