skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[UX] Show which files are currently being uploaded during provisioning #1688

Open · romilbhardwaj opened this issue 1 year ago

romilbhardwaj commented 1 year ago

Paraphrased story from user:

Sometimes I have large temporary files in my workdir. When I run sky launch, it takes a long time uploading these large files, which I do not want. Worse, there's no message about which files are being uploaded, so I don't know why it's taking so long. You do show a warning, but it's meaningless since my workdir is large and I see that warning every time. I know you have gitignore integration, but that doesn't help since some of these files actually need to be uploaded. Can you show which file you're currently uploading, so I know what's going on and can Ctrl-C if needed?

landscapepainter commented 1 year ago

I will take on this issue

landscapepainter commented 1 year ago

Seems like there are 4 categories where we can show uploading/downloading status:

  1. Local to VM
    1. workdir sync
    2. direct sync under file_mounts
  2. Local to Cloud Storage
    1. bucket created through Task YAML
  3. Cloud Storage to VM
    1. direct sync under file_mounts
    2. Storage object COPY mode
  4. Cloud Storage to Cloud Storage (incomplete)
    1. x_to_y() functions under sky/data/data_transfer.py

I was thinking of starting with 1. Local to VM in the first PR and then moving on to the other categories in later PRs.

What do you think of this breakdown? Am I missing any items that should be included in the tasks above?
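
For concreteness, here is a minimal task YAML exercising categories 1-3 (the paths and bucket name are made up for illustration):

```yaml
# Illustrative only; paths and the bucket name are hypothetical.
workdir: ~/my-project                    # 1.1: workdir sync (Local to VM)
file_mounts:
  /remote/config.json: ~/config.json     # 1.2: direct sync under file_mounts
  /remote/public-data: s3://some-bucket  # 3.1: direct sync from cloud storage
  /remote/dataset:
    name: my-skypilot-bucket             # 2: bucket created through Task YAML
    source: ~/local/dataset              #    (Local to Cloud Storage)
    mode: COPY                           # 3.2: Storage object COPY mode
```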

concretevitamin commented 1 year ago

I ran into this today. The issue is that regular file_mounts show a helpful log file, but storage syncs don't.

Former:

I 03-15 17:32:09 cloud_vm_ray_backend.py:3396] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2023-03-15-17-25-12-029820/file_mounts.log
I 03-15 17:32:09 backend_utils.py:1196] Syncing (to 1 node): /xxx -> ~/yyy

I can tail the log file and see what's up.

Latter:

I 03-15 17:47:44 storage.py:1358] Created GCS bucket xxx in US-CENTRAL1 with storage class STANDARD
⠏ Syncing /xxx to gs://yyy

I think exposing the underlying tool's stdout in such a log file will be a big UX improvement.
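
A minimal sketch of the idea, assuming we shell out to gsutil (the command and log path here are illustrative, not SkyPilot's actual plumbing):

```python
import subprocess

def run_sync(cmd: list, log_path: str) -> None:
    """Run a sync command, appending its output to a user-tailable log."""
    with open(log_path, 'a') as f:
        # Record the exact command so users can tell cp from rsync.
        f.write('Running: ' + ' '.join(cmd) + '\n')
        f.flush()
        # Redirect the tool's stdout/stderr to the log instead of
        # hiding it behind a spinner.
        subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT, check=True)

# The user can then `tail -n100 -f /tmp/storage_sync.log` while this runs,
# just like with file_mounts.log today.
run_sync(['gsutil', '-m', 'rsync', '-r', '/xxx', 'gs://yyy'],
         '/tmp/storage_sync.log')
```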

romilbhardwaj commented 1 year ago

Bumping up the priority for this; it's important to give visibility into what's happening under the hood. A user said:

Each time I launch a script, skypilot spends a few minutes on this syncing, even though I have not changed the dataset. (See image below) Is it copying data?

cc @landscapepainter

concretevitamin commented 1 year ago

+1. It'd also be helpful if that log file included the exact command being run, as the user wondered whether it's gsutil cp or gsutil rsync.

landscapepainter commented 1 year ago

@romilbhardwaj @concretevitamin I'll try to resolve each feature one by one in separate PRs. I'm currently working on displaying a progress bar for files being synced during workdir and non-cloud file_mount syncs (Local to VM). It's mostly done; I just need to brush it up a bit.
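
Not the actual PR, just a rough sketch of one way to do this for the rsync-based Local-to-VM path, using rich for the bar (flags and structure are my assumptions):

```python
import subprocess
from rich.progress import Progress

def rsync_with_progress(src: str, dst: str) -> None:
    # Dry-run first so the bar has a total; --out-format=%n prints one
    # line per transferred file.
    base = ['rsync', '-a', '--out-format=%n', src, dst]
    dry = subprocess.run(base + ['--dry-run'], capture_output=True,
                         text=True, check=True)
    total = sum(1 for l in dry.stdout.splitlines()
                if l and not l.endswith('/'))

    with Progress() as progress:
        task = progress.add_task(f'Syncing {src} -> {dst}', total=total)
        proc = subprocess.Popen(base, stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            name = line.strip()
            if name and not name.endswith('/'):
                # Show the file currently being uploaded, then advance.
                progress.update(task, advance=1,
                                description=f'Syncing {name}')
        proc.wait()
```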

concretevitamin commented 1 year ago

Thanks! If the progress bar issues (multi-node; overriding the existing bar; etc.) are not easy to fix, then from a user's perspective even having that info (which files are being synced) in the log file would be very helpful.

romilbhardwaj commented 11 months ago

Bumping this again: I was uploading a big directory today and it would've been useful to see the logs of the underlying gsutil/aws s3 sync command.

landscapepainter commented 11 months ago

I'll go ahead and wrap this up by adding the logs for now.

romilbhardwaj commented 8 months ago

Bumping this... I'm often stuck at:

⠴ Syncing ~/mydata to gs://romil-test-bucket/

without any hints as to what is going on. Logs would be really nice to have here.

landscapepainter commented 8 months ago

Note: from an offline discussion with the team, we concluded that a refactor is necessary to support logging for the upload. The refactor involves migrating the sync of local file_mounts to cloud storage into _execute in execution.py. This is necessary to share the log path that is set when the backend is initialized in _execute; it is also more natural to keep the sync process in _execute.
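
A rough sketch of the shape of that refactor (all names here are hypothetical stand-ins; the real backend and log-path wiring differ):

```python
import os

class Backend:
    """Stand-in for the real backend; its log dir is set at initialization."""
    def __init__(self, log_dir: str):
        self.log_dir = log_dir

def sync_storage_mounts(task, log_path: str) -> None:
    """Hypothetical helper: sync local file_mounts to cloud storage,
    writing the underlying tool's output to log_path."""
    ...

def _execute(task, backend: Backend) -> None:
    # With the sync moved here, it can reuse the log directory chosen
    # when the backend was initialized, alongside file_mounts.log.
    sync_log = os.path.join(backend.log_dir, 'storage_sync.log')
    sync_storage_mounts(task, log_path=sync_log)
```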

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.