skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Storage] Investigate `rclone mount` with VFS caching #3353

Open concretevitamin opened 3 months ago

concretevitamin commented 3 months ago

https://rclone.org/commands/rclone_mount/#vfs-file-caching

The goal is to (1) improve the performance of dataset reads/writes and checkpoint writes, and (2) support operations like appends, compared to regular bucket mounting via mode: MOUNT. rclone mount with VFS caching appears to use local SSDs for reads/writes while also syncing up to the object storage bucket.

This mode may mimic some CSYNC-like functionality #2336.
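A minimal illustration of goal (2), assuming a bucket mounted at /mnt/my-bucket with --vfs-cache-mode full (as in the snippet below) and a placeholder log file: an append that typically fails on a plain FUSE bucket mount should work here, because the file is staged in the local VFS cache before being uploaded.

# hypothetical example: append to an existing object through the VFS-cached mount
echo "step 1000 done" >> /mnt/my-bucket/train.log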

concretevitamin commented 3 months ago

On one GCP node:

file_mounts:
  ~/.config/rclone/rclone.conf: ~/.config/rclone/rclone.conf

setup: |
  set -e
  rclone version || { curl https://rclone.org/install.sh | sudo bash; }
  # rclone requires fuse3
  sudo apt-get update && sudo apt-get install fuse3 -y
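  # the mountpoint must exist before mounting (assuming it is not created elsewhere)
  sudo mkdir -p /mnt/my-bucket && sudo chown $(whoami) /mnt/my-bucket
  # --rc starts rclone's remote-control server, used by the `rclone rc vfs/refresh` call below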
  rclone mount gcs:my-bucket /mnt/my-bucket \
    --daemon --daemon-wait 0 \
    --allow-other --rc --vfs-cache-mode full
  rclone rc vfs/refresh

Writing a large file to /mnt/my-bucket/ triggers an immediate sync-up (CSYNC-like behavior), with good speed (probably due to caching):

$ bash dd_test.sh /mnt/my-bucket/testfile-rclone-vfs
Starting write test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.3448 s, 248 MB/s
Starting read test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.82129 s, 1.1 GB/s

Writing to a non-mounted path (after waiting for the background cloud sync triggered above to finish) is, for some reason, slower:

$ bash dd_test.sh /mnt/normal-disk
Starting write test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 38.8702 s, 110 MB/s
Starting read test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 19.2575 s, 223 MB/s

dd_test.sh:

#!/bin/bash
# Check if a test file name was provided
if [ -z "$1" ]; then
  echo "Usage: $0 <testfile>"
  exit 1
fi

# Use the provided argument as the test file name
TESTFILE="$1"

# Size of the test file
FILESIZE=4294967296 # 4 GB

# Block size
BLOCKSIZE=65536

# Write test
echo "Starting write test..."
dd if=/dev/urandom of="$TESTFILE" bs=$BLOCKSIZE count=$(($FILESIZE/$BLOCKSIZE)) conv=fdatasync oflag=direct 2>&1 | grep -E "copied|bytes"

# Read test
echo "Starting read test..."
dd if="$TESTFILE" of=/dev/null bs=$BLOCKSIZE 2>&1 | grep -E "copied|bytes"

shethhriday29 commented 1 month ago

Some findings after implementing a cache via Rclone:

shethhriday29 commented 1 month ago

For reference: draft pull request with the rclone cache implementation: #3455

romilbhardwaj commented 1 month ago

This is fantastic - thanks for sharing @shethhriday29!

What EBS disk_tier was used for these benchmarks?

Looks like this is a "high latency high throughput" (FUSE) vs "low latency low throughput" (Rclone, and potentially CSYNC?) situation:

  1. For write heavy workloads:
     a. Writing large files (e.g., Vicuna) - FUSE based approaches are faster than rclone.
     b. Writing many small files (e.g., logs, other workloads) - rclone is better, especially since it can provide POSIX compatibility.

  2. For read heavy workloads:
     a. Reading large files (e.g., datasets stored as parquet files) - FUSE is better (?)
     b. Reading many small files (e.g., datasets stored as individual files/records) - rclone may be faster (?)

cc @landscapepainter - did you have similar observations for csync? any thoughts?

shethhriday29 commented 1 month ago

Thank you, Romil! I can't 100% recall for the fio benchmarks, but at least for training Vicuna, we did have disk_tier: best.

shethhriday29 commented 1 month ago

Just did some of the fio testing on disk_tier: best and got the same general results — inconsistent speeds (sometimes blazingly fast reads/writes, otherwise very slow reads/writes).
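The exact fio jobs aren't included above; a generic sketch of the kind of sequential read/write test this refers to (block size, file size, and target path are assumptions):

# hypothetical fio invocations against the mounted bucket path
fio --name=writetest --directory=/mnt/my-bucket --rw=write --bs=64k --size=4G
fio --name=readtest --directory=/mnt/my-bucket --rw=read --bs=64k --size=4G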

landscapepainter commented 1 month ago

@romilbhardwaj I have not specifically benchmarked read workloads, but rclone VFS as used in #3455 by @shethhriday29 seemed to perform better than FUSE-based tools (goofys, gcsfuse) when the training script writes checkpoints, and even on par with CSYNC (#2336). So I looked into it in more depth and measured the time each mode (MOUNT, rclone VFS from #3455, and CSYNC) takes to write a checkpoint (i.e., the time the training script is blocked) by overriding the trainer class we use for Vicuna. I'm still organizing the data and will post more detailed results soon; here is the tl;dr:

rclone VFS is almost as fast as CSYNC at writing the checkpoint to the local SSD (CSYNC ~150s vs. rclone VFS ~175s), so it passes in terms of letting the training script continue as quickly as possible. But there are two issues.

I'll try to update more details soon, but just wanted to give everyone a heads up.

landscapepainter commented 1 month ago

@romilbhardwaj @shethhriday29 @concretevitamin I'll explain the data I shared in the previous comment in more depth.

To measure the checkpoint-writing time under each storage mode, I overrode the _save_checkpoint() method of the transformers.Trainer class used in the script to train Vicuna. A spot A100-80GB:8 instance with the default medium disk tier was used to train Vicuna 13B. For CSYNC, interval_seconds: 180 was set. No preemption happened while I was measuring.

The time it takes for the training script to write a checkpoint before continuing to train is as follows:

  gcsfuse MOUNT: 1245.24s
  rclone vfs: 186.39s
  CSYNC: 164.16s

And you can also see this visually from the W&B plots: [screenshot: W&B plots of checkpoint write time per storage mode, 2024-05-28]

The times for the training script to finish writing a checkpoint under CSYNC and rclone VFS are similar. But there are two concerns I discovered with rclone VFS.

1. It takes quite a while to sync the checkpoints from local disk to cloud storage, which increases the chance of the user losing a checkpoint when the spot instance gets preempted. Below are the times at which the training script starts writing the larger portion of the checkpoint (~60GB) and at which the corresponding files appear in cloud storage:

rclone vfs:
  - training script starts writing to the local SSD: 11:15:39
  - training script finishes writing to the local SSD: 11:18:30
  - the larger portion of the checkpoint is fully visible in cloud storage: 11:44:07

CSYNC:
  - training script starts writing to the local SSD: 10:27:01
  - training script finishes writing to the local SSD: 10:29:29
  - the larger portion of the checkpoint is fully visible in cloud storage: 10:31:39

As you can see, fully syncing from local disk to cloud storage takes much longer with rclone VFS (CSYNC ~2 min vs. rclone VFS ~26 min). One thing to note is that CSYNC's ~2 minutes will increase if we implement a feature to upload checkpoint files in the same order they were written to the local SSD.
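One way such a timeline can be captured is to poll the bucket for the checkpoint's completion marker; this is only a sketch, and the bucket name, checkpoint directory, and use of gsutil are assumptions rather than the method actually used above:

# hypothetical: wait until the checkpoint's 'complete' marker is visible in the bucket,
# then print the time at which it appeared
CKPT_DIR=checkpoint-1000   # placeholder checkpoint directory name
until gsutil -q stat gs://my-bucket/${CKPT_DIR}/complete; do
  sleep 10
done
date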

2. Some training frameworks require all files of a single checkpoint to exist before the checkpoint can be used; if any are missing, they crash rather than failing over to another checkpoint, and rclone VFS fails to maintain the required order. For DeepSpeed, if a checkpoint file written later by the training script (e.g., a file marking completion of the checkpoint) exists but one written earlier does not, it crashes rather than failing over to another checkpoint, which is an issue brought up by @concretevitamin. Hence, we want the order in which files appear in cloud storage to match the order in which they were written locally by the training script, just as MOUNT mode does. rclone VFS fails to address this, just like the current CSYNC. For our Vicuna training script with transformers.Trainer, the last file written as part of the checkpoint is called complete, marking that the entire checkpoint directory has been written; you can see this below in a bucket that was used in MOUNT mode.

[screenshot: bucket listing under MOUNT mode, with the complete file uploaded last, 2024-05-28]

But with rclone VFS, the order is not maintained, just like the current CSYNC: some files were uploaded to cloud storage after the complete file was uploaded.

[screenshots: bucket listings under rclone VFS, showing files uploaded after the complete file, 2024-05-28]

Since rclone VFS is not only slower than CSYNC but also fails to maintain the upload order, it seems beneficial for us to finish the one last piece of CSYNC by implementing ordered uploads.

concretevitamin commented 1 month ago

@landscapepainter @shethhriday29 In your benchmarks, what are the other VFS args (https://rclone.org/commands/rclone_mount/#vfs-file-caching) set to? Those may affect certain behavior like time taken to sync up to cloud storage.
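For reference, the VFS flags from that page that most directly affect write-back timing and local cache size (defaults to the best of my knowledge; worth double-checking against the docs):

--vfs-write-back (default 5s): how long to wait after a file is closed before uploading it
--vfs-cache-max-age (default 1h): how long objects stay in the local cache
--vfs-cache-max-size (default off): maximum total size of the local cache
--vfs-cache-poll-interval (default 1m): how often the cache is polled and cleaned
--cache-dir: where the VFS cache is stored on local disk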

shethhriday29 commented 1 month ago

I was mounting via:

rclone mount {bucket_rclone_profile}:{bucket_name} {mount_path} \
  --daemon --daemon-wait 0 \
  --allow-other --rc --vfs-cache-mode full && rclone rc vfs/refresh

landscapepainter commented 1 month ago

@concretevitamin @shethhriday29 Let me look into what other options we have to improve that performance.

landscapepainter commented 1 month ago

@concretevitamin @shethhriday29 @romilbhardwaj I did more research on some feasible options to improve performance of the current rclone vfs implementation, and it seems like I found what we wanted.

On top of @shethhriday29's implementation, I added the following two options:

From the following, you can see that with the newly added options:

  1. it took only ~8 minutes for the checkpoint files (~61 GB) to be completely uploaded to cloud storage from the moment the training script started writing them. rng_state_1.pth is the very first file written by the script, and complete is the last one.
  2. the order in which checkpoint files appear in cloud storage follows the order in which the training script created them.

[screenshot: bucket listing with the newly added rclone options, showing checkpoint files uploaded in creation order, 2024-05-29]

But we do need additional testing on CPU usage, how large the cache can get and its implications, and reading checkpoints back when a spot instance is preempted (especially the read-related options). We should also add options like --cache-dir to specify the cache directory, just as we did for CSYNC.
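For concreteness, a sketch of what the fuller mount invocation might look like; the two extra flag values (--vfs-cache-poll-interval 10s and --transfers 1) are the ones @shethhriday29 asks about below, and the --cache-dir path is only a placeholder:

rclone mount gcs:my-bucket /mnt/my-bucket \
  --daemon --daemon-wait 0 \
  --allow-other --rc \
  --vfs-cache-mode full \
  --cache-dir ~/.sky/rclone-cache \
  --vfs-cache-poll-interval 10s \
  --transfers 1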

landscapepainter commented 4 weeks ago

Two additional discoveries on goofys and latest gcsfuse version 2.2.0.

  1. The time the training script takes to write the checkpoint in MOUNT mode using goofys on S3 is on par with CSYNC and rclone with the additional options, which explains the high S3 throughput with goofys that @shethhriday29 discovered.

  2. I measured the same metric for the latest gcsfuse version, 2.2.0. We currently run 1.3.0 for MOUNT mode, and there is a good amount of performance improvement with the version update. I'll submit a PR for this update.

[screenshot: checkpoint write-time comparison including goofys on S3 and gcsfuse 2.2.0, 2024-05-30]

cc @romilbhardwaj @concretevitamin

shethhriday29 commented 4 weeks ago

Wow, this is awesome @landscapepainter! Are the additional options just --vfs-cache-poll-interval 10s and --transfers 1? Also, from the graph, it doesn't look like there's a big difference in the performance of rclone vfs with or without the additional options — am I missing something?