skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Jobs] Allowing to specify intermediate bucket for file upload #4257

Open zpoint opened 2 weeks ago

zpoint commented 2 weeks ago

Feature for #3978

For managed jobs:

Even when jobs share the same bucket, each job writes to its own subdirectory named after its run ID, and the subdirectories for auto-created files are cleaned up after the job finishes.
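The per-job isolation described above can be sketched as follows (the helper name and `job-<run_id>/` layout are assumptions for illustration; the actual SkyPilot internals may differ):

```python
import uuid

def job_bucket_subpath(run_id: str, kind: str) -> str:
    """Build a per-job sub-path inside a shared intermediate bucket.

    Assumed layout: every upload for a job lives under its own
    'job-<run_id>/' prefix, so multiple jobs can share one bucket and
    cleanup can delete just that prefix when the job finishes.
    """
    return f"job-{run_id}/{kind}"

run_id = uuid.uuid4().hex[:8]
workdir_path = job_bucket_subpath(run_id, "workdir")
mounts_path = job_bucket_subpath(run_id, "local-file-mounts")
```

Two jobs launched against the same bucket would then never collide, because their `job-<run_id>/` prefixes differ.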

This also lets users who lack permission to create buckets manually specify an existing bucket name in ~/.sky/config.yaml.

Test plan

smoke test

pytest -s tests/test_smoke.py::test_managed_jobs_storage
pytest -s tests/test_smoke.py::TestStorageWithCredentials::test_bucket_sub_path --aws

custom test

(sky) ➜  cat ~/.sky/config.yaml
jobs:
  bucket:
    s3: "bucket-jobs-s3"
    gcs: "bucket-jobs-gcs"
    default: "bucket-jobs-default"
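The lookup semantics implied by this config (a cloud-specific bucket if one is set, otherwise the `default` entry) could be sketched as below; the function name and exact fallback behavior are assumptions, not the PR's actual implementation:

```python
def pick_intermediate_bucket(config: dict, store: str) -> str:
    """Pick the user-configured intermediate bucket for a store type,
    falling back to the 'default' entry when that store has no
    explicit bucket configured (assumed lookup semantics)."""
    buckets = config.get("jobs", {}).get("bucket", {})
    return buckets.get(store, buckets.get("default"))

cfg = {"jobs": {"bucket": {"s3": "bucket-jobs-s3",
                           "gcs": "bucket-jobs-gcs",
                           "default": "bucket-jobs-default"}}}
```

With this config, an S3 upload would target `bucket-jobs-s3`, while a store without its own entry (e.g. Azure) would fall back to `bucket-jobs-default`.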
(sky) ➜ cat ~/Desktop/hello-sky/work_dir_1.yaml
name: test_workdir_bucket_name_2

workdir: .

resources:
  cloud: aws
  instance_type: t3.small

file_mounts:
  # this will use the user config
  /checkpoint:
    name: zpoint-filemounts-bucket
    source: ~/Desktop/dir1
    mode: MOUNT
    store: azure

  # these will all use the same bucket configured under jobs -> bucket
  # in ~/.sky/config.yaml for bucket storage
  /dir1: ~/Desktop/dir1
  /dir2: ~/Desktop/dir2
  /dir3/dir3.py: ~/Desktop/dir1/dir1.py

run: |
  for i in {1..5}; do
    echo "Hello, SkyPilot World! $(date)"
    sleep 2
  done
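The routing rule in the `file_mounts` comments above (an entry with an explicit `store:` keeps its own named bucket; plain path mounts go to the shared jobs bucket) could look roughly like this sketch, with hypothetical helper and field names:

```python
def route_file_mounts(file_mounts: dict, jobs_bucket: str):
    """Split file_mounts into two groups (assumed behavior):
    - entries given as dicts with an explicit 'store' keep their own
      named bucket;
    - plain path entries are routed to the shared intermediate jobs
      bucket from ~/.sky/config.yaml."""
    explicit, shared = {}, {}
    for dst, src in file_mounts.items():
        if isinstance(src, dict) and "store" in src:
            explicit[dst] = src
        else:
            shared[dst] = {"bucket": jobs_bucket, "source": src}
    return explicit, shared

mounts = {
    "/checkpoint": {"name": "zpoint-filemounts-bucket",
                    "source": "~/Desktop/dir1",
                    "mode": "MOUNT",
                    "store": "azure"},
    "/dir1": "~/Desktop/dir1",
    "/dir2": "~/Desktop/dir2",
}
explicit, shared = route_file_mounts(mounts, "bucket-jobs-s3")
```

This matches the sync log below: `/checkpoint` goes to its own Azure bucket, while the workdir and the other mounts all land in the one S3 jobs bucket.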

Launch

(sky) ➜ sky jobs launch ~/Desktop/hello-sky/work_dir_1.yaml
⠋ Syncing ~/Desktop/dir1 -> https://xxx/zpoint-filemounts-bucket/
⠹ Syncing . -> s3://zpoint-bucket-s3/
⠧ Syncing ~/Desktop/dir1 -> s3://zpoint-bucket-s3/
⠦ Syncing ~/Desktop/dir2 -> s3://zpoint-bucket-s3/
⠴ Syncing /var/folders/83/zxqx914s57x310rfnhq8kk9r0000gn/T/skypilot-filemounts-files-aca97801 -> s3://zpoint-bucket-s3/

Looks good


romilbhardwaj commented 1 week ago

Thanks @zpoint! I'm trying to run this yaml:

resources:
  cloud: aws

file_mounts:
  ~/aws: ~/aws

workdir: ~/tmp-workdir

num_nodes: 1

run: |
  echo "Hello, world!"
  ls ~/aws
  ls .

The task output should show the contents of my workdir and file_mount. Instead, I get:

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-3ba1-romilb, pid=2100) Hello, world!
(sky-3ba1-romilb, pid=2100) job-89ab894f
(sky-3ba1-romilb, pid=2100) job-89ab894f
✓ Managed job finished: 1 (status: SUCCEEDED).

Related to this: https://github.com/skypilot-org/skypilot/pull/4257#discussion_r1837141465?
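A plausible reading of the output above: `ls` prints `job-89ab894f` because the shared bucket is mounted (or downloaded) from its root, exposing the per-job prefix directory itself rather than the files beneath it. A sketch of the fix on the download side, with a hypothetical helper, would include the sub-path in the source URI:

```python
def mount_source_uri(bucket: str, job_subpath: str) -> str:
    """Build the download/mount source for a shared intermediate
    bucket. The per-job sub-path must be part of the URI; mounting
    the bucket root would expose the 'job-<id>/' directory itself,
    which is the symptom seen in the logs above. Hypothetical helper,
    not the PR's actual code."""
    return f"s3://{bucket}/{job_subpath}".rstrip("/")
```

For example, the workdir mount should resolve to `s3://<bucket>/job-<id>/workdir`, not `s3://<bucket>/`.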

romilbhardwaj commented 1 week ago

I think https://github.com/skypilot-org/skypilot/pull/4257#issuecomment-2469050952 is still not resolved.

I ran sky jobs launch with this YAML:

resources:
  cloud: aws

file_mounts:
  ~/aws: ~/aws

workdir: ~/tmp-workdir

num_nodes: 1

run: |
  echo "Hello, world!"
  ls ~/aws
  ls .

And I'm still getting

├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-e0d7-romilb, pid=2076) Hello, world!
(sky-e0d7-romilb, pid=2076) job-24ad1167
(sky-e0d7-romilb, pid=2076) job-24ad1167

Instead of the actual contents of my workdir and mounted dir.

zpoint commented 1 week ago

@romilbhardwaj Thanks for the reply

4257 is now resolved, and I have also updated the PR description to describe the main changes.