concretevitamin opened 3 months ago
On one GCP node:

```yaml
file_mounts:
  ~/.config/rclone/rclone.conf: ~/.config/rclone/rclone.conf

setup: |
  set -e
  rclone version || { curl https://rclone.org/install.sh | sudo bash; }
  # rclone requires fuse3
  sudo apt-get update && sudo apt-get install fuse3 -y
  rclone mount gcs:my-bucket /mnt/my-bucket \
    --daemon --daemon-wait 0 \
    --allow-other --rc --vfs-cache-mode full
  rclone rc vfs/refresh
```
Writing a large file to /mnt/my-bucket/ triggers an immediate sync up (CSYNC-like behavior) with good speed (probably some caching):
```
$ bash dd_test.sh /mnt/my-bucket/testfile-rclone-vfs
Starting write test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.3448 s, 248 MB/s
Starting read test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.82129 s, 1.1 GB/s
```
Writing to a non-mounted path (after waiting for the background cloud sync triggered above to finish) is, for some reason, slower:
```
$ bash dd_test.sh /mnt/normal-disk
Starting write test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 38.8702 s, 110 MB/s
Starting read test...
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 19.2575 s, 223 MB/s
```
dd_test.sh:

```bash
#!/bin/bash
# Check if a test file name was provided
if [ -z "$1" ]; then
  echo "Usage: $0 <testfile>"
  exit 1
fi

# Use the provided argument as the test file name
TESTFILE=$1

# Size of the test file
FILESIZE=4294967296  # 4 GB

# Block size
BLOCKSIZE=65536

# Write test
echo "Starting write test..."
dd if=/dev/urandom of="$TESTFILE" bs=$BLOCKSIZE count=$((FILESIZE / BLOCKSIZE)) conv=fdatasync oflag=direct 2>&1 | grep -E "copied|bytes"

# Read test
echo "Starting read test..."
dd if="$TESTFILE" of=/dev/null bs=$BLOCKSIZE 2>&1 | grep -E "copied|bytes"
```
Some findings after implementing a cache via Rclone:
List of initial benchmarks (link). [Ignore the sections labeled EBS; they are irrelevant in this context because those were direct reads/writes to EBS, not to S3 via an EBS cache.]
Even early on, I noticed that at some points the advantage of an rclone cache was very thin or non-existent compared to EBS, e.g. the 1 GB total / 1 MB block-size AWS/S3 test in the doc (3,258.70 MB/s without the cache vs. 3,633.64 MB/s with it).
However, these benchmarks (both `fio` and `dd`) became very inconsistent across repeated runs. Specifically, I found that for larger quantities of data, accessing cloud storage directly was a lot faster than going through an rclone cache (for a 5 GB / 1 MB block-size write, I was getting only 150 MB/s via the rclone cache and 2000+ MB/s without it).
After more investigation, we realized the following:
- The benchmark tests (dd/fio) were leveraging buff/cache (i.e., memory) for reads/writes, whereas the same benchmarks against the old method (goofys for AWS / gsutil for GCP) were not.
- As a result, we'd sometimes get blazingly fast access speeds for the rclone cache (>3000 MB/s) when the read/write fit within memory; when it didn't (at larger total sizes), it degraded to the baseline read/write speed of the local disk/flash storage itself (150-250 MB/s for EBS).
- Meanwhile, the original direct-access method (goofys for AWS / gsutil for GCP) did not vary as much in performance (1000-2000 MB/s) as total read/write size changed, because it is largely limited by the network and the throughput of S3 itself, which are NOT bounded by local-storage throughput the way the rclone cache is at larger read/write sizes.
This was confirmed by a trial run of fine-tuning Vicuna using both methods (first image: direct cloud storage access; second image: the rclone cache).
In short, we were blocked for only 4 minutes each time the 100 GB needed to be written with no rclone cache, vs. 5.5 minutes each time the same 100 GB needed to be written with the rclone cache.
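The buff/cache effect described above is easy to reproduce on any local filesystem: a plain buffered `dd` finishes once the data reaches the page cache, while `conv=fdatasync` (as used in `dd_test.sh`) forces it to the device first. A minimal sketch, with an illustrative file name and size:

```shell
# Buffered write: returns once the data is in the page cache, so the
# reported throughput can far exceed what the disk can actually sustain.
dd if=/dev/zero of=pc_test.bin bs=64K count=256 2>/dev/null

# Synced write: conv=fdatasync flushes to the device before dd exits,
# so the reported throughput reflects the real storage speed.
dd if=/dev/zero of=pc_test.bin bs=64K count=256 conv=fdatasync 2>/dev/null

# Both produce the same 16 MiB file; only the reported timing differs.
stat -c %s pc_test.bin
rm -f pc_test.bin
```

This is why the rclone-cache numbers look spectacular only while the working set fits in RAM.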
Despite this, there are still reasons to implement the rclone cache as an option for the user:
If the user has frequent reads/writes of files that are not large, then the lower latency provided by the rclone cache is advantageous (we can access the disk directly instead of waiting on cloud-storage latency).
Additionally, the rclone cache is quite important for supporting POSIX compatibility. This matters for a variety of things, including accessing Jupyter notebooks, which I WAS able to do in rclone-cache-mounted cloud storage but was not able to do via the old goofys method on AWS. (Screenshots of this are in the "initial benchmarks" doc linked at the top of this comment.)
However, this is not necessarily an endorsement of making the rclone cache the default mounting method; rather, it should be a supported option if the user chooses.
For reference: draft pull request with the rclone cache implementation, #3455.
This is fantastic - thanks for sharing @shethhriday29!
What EBS `disk_tier` was used for these benchmarks?
Looks like this is a "high latency, high throughput" (FUSE) vs. "low latency, low throughput" (rclone, and potentially CSYNC?) situation:

For write-heavy workloads:
a. Writing large files (e.g., Vicuna) - FUSE-based approaches are faster than rclone.
b. Writing many small files (e.g., logs, other workloads) - rclone is better, especially since it can provide POSIX compatibility.

For read-heavy workloads:
a. Reading large files (e.g., datasets stored as parquet files) - FUSE is better (?)
b. Reading many small files (e.g., datasets stored as individual files/records) - rclone may be faster (?)
cc @landscapepainter - did you have similar observations for csync? any thoughts?
Thank you, Romil! Cannot 100% recall for the `fio` benchmarks, but at least for training Vicuna, we did have `disk_tier: best`.

Just did some of the `fio` testing on `disk_tier: best` and got the same general results: inconsistent speeds (sometimes blazingly fast reads/writes, otherwise very slow reads/writes).
@romilbhardwaj I have not specifically benchmarked performance on read workloads, but the `rclone vfs` approach used in #3455 by @shethhriday29 seemed to perform better than FUSE-based tools (`goofys`, `gcsfuse`) when the training script writes checkpoints, and even on par with CSYNC (#2336). So I looked into it more in depth and measured the time taken by each of MOUNT, `rclone vfs` (#3455), and CSYNC mode to write a checkpoint (the time the training script is stuck) by overriding the trainer class we use for Vicuna. I'm still organizing the data, so I'll post more detailed results soon; here is the tl;dr:

`rclone vfs` is almost as fast as CSYNC at writing the checkpoint to the local SSD (CSYNC ~150s vs. `rclone vfs` ~175s), so it passes in terms of letting the training script continue as quickly as possible. But there are two issues:

1. It takes much longer to sync the checkpoint from local disk to cloud storage (CSYNC ~2min vs. `rclone vfs` ~26min).
2. It does not maintain the order in which checkpoint files appear in cloud storage; this is an issue for CSYNC as well, so we need to implement an additional feature.

I'll try to post more details soon, but just wanted to give everyone a heads up.
@romilbhardwaj @shethhriday29 @concretevitamin I'll explain the data I shared in the previous comment in more depth.
To measure the time of writing a checkpoint with each storage mode, I overrode the `_save_checkpoint()` method of the `transformers.Trainer` class we use in the script to train Vicuna. A spot A100-80GB:8 instance with the default medium disk tier was used to train Vicuna 13B. For CSYNC, `interval_seconds: 180` was set. No preemption happened while I was measuring.
The time it takes for the training script to write checkpoint before continuing to train is as follows:
- gcsfuse MOUNT: 1245.24s
- `rclone vfs`: 186.39s
- CSYNC: 164.16s
And you can also see this visually from W&B plots:
The time for the training script to complete writing a checkpoint is similar for CSYNC and `rclone vfs`. But there are two concerns I discovered with `rclone vfs`.
1. It takes quite a while to sync the checkpoints from local disk to cloud storage, which increases the chance of the user losing a checkpoint when the spot instance gets preempted. Below are the times at which the training script first starts to write the larger portion of the checkpoint (~60GB) and at which the corresponding checkpoint appears in cloud storage:
`rclone vfs`:
- training script starts to write on local SSD: 11:15:39
- training script completes writing on local SSD: 11:18:30
- all of the larger portion of the checkpoint appears on cloud storage: 11:44:07

CSYNC:
- training script starts to write on local SSD: 10:27:01
- training script completes writing on local SSD: 10:29:29
- all of the larger portion of the checkpoint appears on cloud storage: 10:31:39
As you can see, fully syncing between local and cloud storage takes quite a while for `rclone vfs` (CSYNC ~2min vs. `rclone vfs` ~26min). One thing to note: CSYNC's 2 minutes will increase if we implement a feature to upload checkpoint files in the order they were written to the local SSD.
2. Some training frameworks require all files of a single checkpoint to exist for it to be usable; otherwise they crash rather than failing over to another checkpoint, and `rclone vfs` fails to maintain this order. For DeepSpeed, if some checkpoint files written later by the training script (e.g., a file marking completion of the checkpoint) exist but a file written earlier does not, it will crash rather than fail over to another checkpoint, which is an issue brought up by @concretevitamin. Hence, we want the order in which files appear in cloud storage to match the order they were written locally by the training script, just like MOUNT mode. `rclone vfs` fails to address this problem, just like the current CSYNC. For our Vicuna training script with `transformers.Trainer`, the last file written as part of the checkpoint is called `complete`, marking that the entire checkpoint directory has been written; you can see this below from the cloud storage that was used in MOUNT mode.
But for `rclone vfs`, the order is not maintained, just like the current CSYNC: there are files uploaded to cloud storage after the `complete` file was uploaded.
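The ordering check itself can be scripted: if nothing was uploaded after the `complete` marker, it should be the newest entry in the checkpoint directory (run against the mounted bucket path in practice). A hypothetical sketch; the directory name, the `pytorch_model.bin` file, and the timestamps are made up for illustration:

```shell
# Simulate a checkpoint directory written in order, ending with "complete".
mkdir -p ckpt_demo
touch -d '2024-01-01 00:00:01' ckpt_demo/rng_state_1.pth
touch -d '2024-01-01 00:00:02' ckpt_demo/pytorch_model.bin
touch -d '2024-01-01 00:00:03' ckpt_demo/complete

# Newest-first listing: the completion marker must be first, otherwise
# something landed in storage after the checkpoint was declared done.
newest=$(ls -t ckpt_demo | head -n 1)
echo "$newest"   # prints: complete
rm -rf ckpt_demo
```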
Since `rclone vfs` is not only slower than CSYNC but also fails to maintain the order, it seems beneficial for us to complete the one last piece of CSYNC by implementing the feature to maintain the order.
@landscapepainter @shethhriday29 In your benchmarks, what are the other VFS args (https://rclone.org/commands/rclone_mount/#vfs-file-caching) set to? Those may affect certain behavior like time taken to sync up to cloud storage.
I was mounting via:

```
rclone mount {bucket_rclone_profile}:{bucket_name} {mount_path} --daemon --daemon-wait 0 --allow-other --rc --vfs-cache-mode full && rclone rc vfs/refresh
```
@concretevitamin @shethhriday29 Let me look into what other options we have to improve that performance.
@concretevitamin @shethhriday29 @romilbhardwaj I did more research on feasible options to improve the performance of the current `rclone vfs` implementation, and it seems I found what we wanted.
On top of @shethhriday29's implementation, I added the following two options:

- `--transfers 1`: limits concurrent uploads from the cache to cloud storage to 1. If there are multiple files to be uploaded from the cache, they are uploaded one by one in the order they were created (modified) in the cache. This resolves issue 2 mentioned in this comment.
- `--vfs-cache-poll-interval 10s`: sets the interval at which rclone checks for differences between the cache and cloud storage and triggers a sync. The default is 1 minute, which seems to have been the main cause of the delayed upload (~26min) mentioned in this comment.

From the following, you can see that with the newly added options:
- `rng_state_1.pth` is the very first file written by the script, and `complete` is the last one.

Result of `ls -lh --time-style=full-iso` at local:

Mounting `gcs` with `rclone vfs`:
But we do need additional testing on CPU usage, how large the cache can get and its implications, reading checkpoints when the spot instance gets preempted (especially the read-related options), etc. We should also add options like `--cache-dir` to specify the cache directory, just as we were doing for CSYNC.
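Putting the flags from this thread together, the mount command would look something like the sketch below (untested here; the `--cache-dir` path is a made-up example, and the bucket/mount paths follow the earlier GCP setup):

```shell
rclone mount gcs:my-bucket /mnt/my-bucket \
  --daemon --daemon-wait 0 \
  --allow-other --rc --vfs-cache-mode full \
  --transfers 1 \
  --vfs-cache-poll-interval 10s \
  --cache-dir /mnt/rclone-cache   # hypothetical cache location
rclone rc vfs/refresh
```

`--transfers 1` serializes uploads in cache-modification order, and `--vfs-cache-poll-interval 10s` shortens the default 1-minute sync poll.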
Two additional discoveries on `goofys` and the latest `gcsfuse` version, 2.2.0:

- The time the training script takes to write the checkpoint with MOUNT mode using `goofys` on S3 is on par with CSYNC and `rclone` with the additional options, which explains the high throughput on S3 with `goofys` that @shethhriday29 discovered.
- Measured the same metric for the latest `gcsfuse` version, 2.2.0. We currently have 1.3.0 running with MOUNT mode, and there was a good amount of performance improvement with the version update. I'll submit a PR for this update.
cc @romilbhardwaj @concretevitamin
Wow, this is awesome @landscapepainter! Are the additional options just `--vfs-cache-poll-interval 10s` and `--transfers 1`? Also, from the graph, it doesn't look like there's a big difference in the performance of `rclone vfs` with or without the additional options; am I missing something?
https://rclone.org/commands/rclone_mount/#vfs-file-caching
Goal is to (1) improve the performance of dataset reading/writing and checkpoint writing, and (2) support things like appends, compared to regular bucket mounting via `mode: MOUNT`.

`rclone mount` with VFS caching seems to use local SSDs for reads/writes while also syncing up to the object storage bucket. This mode may mimic some CSYNC-like functionality (#2336).