skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.71k stars 495 forks

Files not uploading to S3 #1601

Closed pschafhalter closed 8 months ago

pschafhalter commented 1 year ago

When using GCP for compute and S3 for storage, files generated by the task aren't automatically uploaded to S3. My SkyPilot version is 0.2.2.

Minimal example to reproduce the issue:

  1. Create an AWS bucket, e.g. skypilot-s3-bug.
  2. Launch the following task with `sky launch -c test-s3-bug s3_bug.yaml`. The contents of s3_bug.yaml are the following:

    resources:
      cloud: gcp

    workdir: .

    file_mounts:
      /data:
        name: skypilot-s3-bug
        mode: MOUNT

    run: |
      echo "hello world" > /data/test.txt

  3. After the task completes, the AWS bucket is empty.
  4. Note that the file exists on the remote machine:

    $ ssh test-s3-bug
    $ cd /data/
    $ ls
    test.txt
    $ cat test.txt
    hello world

A workaround is to manually sync to S3 by SSHing into the remote machine and running `aws s3 sync /data s3://skypilot-s3-bug`.

concretevitamin commented 1 year ago

Thanks for the report and the workaround @pschafhalter! We use FUSE tools to mount a bucket and it's been observed that writes sometimes have consistency problems like this. We should look deeper.

Cc @romilbhardwaj.
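
The manual sync workaround from the original report could also be run as the last command of the task rather than over SSH. A minimal sketch, assuming the `aws` CLI is installed and has credentials on the VM; `sync_mount` and the `AWS_CLI` override are hypothetical names introduced here for illustration and are not part of SkyPilot:

```shell
# sync_mount DIR BUCKET
# Copies new or changed files from DIR to s3://BUCKET using `aws s3 sync`.
# AWS_CLI is a hypothetical override (defaults to `aws`) so the command can be
# previewed: with AWS_CLI=echo the arguments are printed instead of executed.
sync_mount() {
  "${AWS_CLI:-aws}" s3 sync "$1" "s3://$2"
}
```

Calling `sync_mount /data skypilot-s3-bug` runs the same command as the workaround. This only papers over the mount problem; it does not explain why the FUSE-mounted writes fail to reach the bucket.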

concretevitamin commented 1 year ago

@pschafhalter I removed the `workdir: .` field and ran this example with a new bucket name:

    sky launch -c dbg --down test.yaml --use-spot

A few minutes afterwards, `aws s3 ls <bucket>` did show the file.

A few things we can check

pschafhalter commented 1 year ago

Thanks for looking into this @concretevitamin.

With the provided config, the task completes successfully but the file does not show up. This is also happening for me in another task. Do you have an idea why `workdir: .` might cause the issue?

nakkaya commented 1 year ago

This is also happening on GCP with Cloud Storage. When the machine launches for the first time, everything works as expected and files are uploaded to the bucket. However, after starting a stopped VM and running a job on it, new files in the mounted folder are not uploaded to the bucket when the job completes, although they are present in the directory.

concretevitamin commented 1 year ago

@nakkaya Do you mean this could happen outside of SkyPilot?

nakkaya commented 1 year ago

@concretevitamin No, I meant machines stopped (`--autostop`) and started again by SkyPilot.

romilbhardwaj commented 1 year ago

Thanks for the report @nakkaya!

This is a known issue tracked in #1203. As a temporary workaround, can you try using `sky launch -c <your_cluster> --no-setup mytask.yaml`? This should re-mount any buckets.

To help us find a good solution to this, can you tell us a little more about your usage of SkyPilot - are you using it to run batch jobs through the job queue interface or are you ssh-ing into the machine for interactive development?
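
Before relying on writes to the mounted folder on a restarted VM, one way to tell whether the bucket is still mounted is to look the path up in the kernel mount table. A minimal sketch, assuming a Linux VM where `/proc/mounts` is available; `is_mounted` is a hypothetical helper, not a SkyPilot command, and the optional second argument exists only so the check can be tried against a sample file:

```shell
# is_mounted PATH [MOUNTS_FILE]
# Succeeds (exit status 0) if PATH is listed as a mount point in MOUNTS_FILE,
# which defaults to /proc/mounts. The second field of each line in that file
# is the mount point. Paths containing spaces are not handled by this sketch.
is_mounted() {
  awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

For example, `is_mounted /data || echo "/data is not a mount; files written there stay on local disk"` could be placed at the top of the `run` section. If the check fails, the `--no-setup` relaunch suggested above is the step that should restore the mount.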

nakkaya commented 1 year ago

@romilbhardwaj Thanks for the reply.

`sky launch -c <your_cluster> --no-setup mytask.yaml`

I am running a long-running job; I'll try this when it completes.

I primarily use it to run batch jobs through the job queue interface, but once in a while I do ssh into the instance to debug why something fails.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 8 months ago

This issue was closed because it has been stalled for 10 days with no activity.