skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.22k stars 427 forks

[Storage] Investigate using rclone sync to replace gsutil rsync #2771

Open landscapepainter opened 8 months ago

landscapepainter commented 8 months ago

According to a user report, rclone sync is an order of magnitude faster than gsutil rsync, which SkyPilot uses to upload from the local node to cloud storage and to download from cloud storage to the local node.

The user reported a case of fetching a ~500GB dataset of millions of images and JSON files from storage to the node:

Yeah, in my case, it was a dataset with millions of images and json files. The full dataset was ~500GB. Not sure the average file size.

A loose benchmark over the four cases below shows that rclone sync is much faster than gsutil rsync when a huge number of small files is fetched from cloud storage (GS) to a node (GCP).

  1. uploading six 10GB files from local node (GCP) to cloud storage (GS)
  2. uploading 1,000 1MB files from local node (GCP) to cloud storage (GS)
  3. downloading six 10GB files from cloud storage (GS) to local node (GCP)
  4. downloading 1,000 1MB files from cloud storage (GS) to local node (GCP)
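A loose benchmark of this kind could be reproduced with a small harness like the sketch below. It is not the script used for the numbers in this issue; `make_files` and `time_command` are hypothetical helper names, and the harness only generates test files and measures the wall-clock time of whatever command it is given.

```python
import os
import subprocess
import time


def make_files(directory, count, size_bytes):
    """Create `count` files of `size_bytes` random bytes each in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        path = os.path.join(directory, f"file_{i:05d}.bin")
        with open(path, "wb") as f:
            # Each file is held in memory once; fine for 1MB files,
            # write in chunks instead for multi-GB files.
            f.write(os.urandom(size_bytes))


def time_command(cmd):
    """Run `cmd` (a list of arguments) and return elapsed wall-clock seconds.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

For the small-file upload case, one would call `make_files(src, 1000, 1024 * 1024)` and then time, for example, `["gsutil", "-m", "rsync", "-r", src, "gs://<bucket>"]` against `["rclone", "sync", src, "<remote>:<bucket>"]`, where the bucket and the rclone remote are placeholders that must already exist and be configured.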

[image: benchmark results]

We should investigate whether rclone sync has all the features SkyPilot needs to run smoothly, and replace gsutil rsync with rclone sync if feasible.

Update: Added a new benchmark that includes gsutil -m copy -r and uses a larger amount of data, 10,000 1MB files, as opposed to the 1,000 1MB files in the previous benchmark above.

[image: updated benchmark results]

dtran24 commented 6 months ago

Hey, happy to pick this up! Some thoughts so far:

Please let me know if I'm on the right track, or if there's anything else I should know before diving more into the implementation.

landscapepainter commented 6 months ago

Welcome to SkyPilot @dtran24!

The upload method is a great place to start. After that, you can look into _execute_file_mounts, together with make_sync_dir_command and make_sync_file_command in cloud_stores.py, for fetching files/dirs from GCS to the remote VM.

One thing to note is that IBM COS uses rclone as well, so we have an abstraction for rclone, data_utils.Rclone, that can be utilized.
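As a rough illustration of what an rclone-based directory sync command for the remote VM might look like, here is a minimal sketch. The function name `make_rclone_sync_dir_command` and its parameters are hypothetical and do not reflect the signatures in cloud_stores.py or data_utils.Rclone; it also assumes an rclone remote with the given name has already been configured on the VM.

```python
import shlex


def make_rclone_sync_dir_command(remote, bucket_path, destination, transfers=32):
    """Build a shell command that syncs a bucket path into a local directory.

    Hypothetical helper, not SkyPilot's API. `remote` is the name of an
    rclone remote assumed to be configured already on the machine that
    runs the command.
    """
    source = f"{remote}:{bucket_path}"
    return " ".join([
        # Make sure the target directory exists before syncing into it.
        "mkdir", "-p", shlex.quote(destination), "&&",
        "rclone", "sync",
        # Number of parallel file transfers; matters most for many small files.
        "--transfers", str(transfers),
        shlex.quote(source), shlex.quote(destination),
    ])
```

Calling `make_rclone_sync_dir_command("gcs", "my-bucket/data", "/data")` yields a single shell string that can be run on the VM. The value 32 for `--transfers` is an arbitrary choice for this sketch, not a tuned setting.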

sqr00t commented 6 months ago

Heya! I wonder if it's also worth exploring s5cmd. For AWS, there are some benchmarking results, to be taken with a grain of salt. Maybe an adaptor over this would allow a more unified API across clouds with better performance?

landscapepainter commented 6 months ago

Hey @sqr00t, thanks for sharing the benchmark results and the suggestion. We actually do have a PR for adding s5cmd with the CRT client! Are you an active user of s5cmd outside of SkyPilot? I was wondering if you encountered any edge cases while using it compared to the AWS CLI :)

aseriesof-tubes commented 5 months ago

Hi, I'm interested in working on this! If nobody's working on it, could I give it a shot?

landscapepainter commented 5 months ago

Hey @aseriesof-tubes, thanks for taking on this issue. It's a very important one for improving the user experience. I just assigned it to you.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.