Comparison of bucket indexing tools

charlesbluca commented 4 years ago

As talked about in #7, regularly generating an index of all the files in a bucket/directory could be very useful in:

Generating/updating catalogs or databases of relevant datasets
Keeping track of files for synchronization purposes

There are a lot of tools that could do this work - some exclusive to specific cloud providers, others not. Some of these tools include:

gsutil (supports both Google Cloud and S3)
Rclone (supports a variety of cloud providers, including Google )
AWS CLI (supports only S3)
S3P (supports only S3)

I tested the above tools on both Google Cloud and S3 (when relevant) to get a sense of which would have the best utility in listing the entirety of a large bucket. Some basic parameters of the testing include:

Target bucket(s) - Pangeo's CMIP6 buckets in both Google Cloud and S3 storage; both are ~550 TB of data comprising of some 20,000,000+ files
Command - flat listing of all bucket's contents with size and modification time (relevant mainly for synchronization purposes)
Output redirection - currently all output is written to a file unaltered; this may change if we want to edit output in place before writing to file using something like sed

The output of these tests can be found here. Some observations:

For S3 listing, S3P was by far the fastest, running 4-6x faster than the AWS CLI listing (~40 min versus ~165 min)
For Google Cloud listing, Rclone was by far the fastest, running nearly 4x faster than the gsutil listing (~47 min versus ~173 min)
Both gsutil and Rclone had trouble listing S3 storage, with both commands failing to list the bucket within the 6 hour timeout; the listing of modification time likely influenced these results, as listing is significantly faster in both cases when excluding this information

Obviously additional testing of more cloud listing tools (MinIO client for example) would be ideal, but these results provide some motivation to dig deeper into Rclone and S3P to index CMIP6 data in Google Cloud and S3 storage, respectively.

charlesbluca commented 4 years ago

If using these tools for synchronization purposes, which would entail using a diff of two buckets' index files to check what must be transferred/deleted, it is important that the two files have the same formatting of size, modification time, and directory structure so that they can be compared properly.

Because none of the tools I listed above have the same formatting for their listings, to effectively use indexes for bucket synchronization we will either have to:

Use the same listing tool for all buckets we intend to keep synchronized
Reformat all indexes to follow one standard format

The second option seems preferable, as there are a variety of tools to edit the listing output before or after being written to file, and the extra time this would take seems negligible compared to the amount of time it would take to use a non-ideal listing tool.

charlesbluca commented 4 years ago

After some further investigation of the listing output, I see there is another issue with using index files for synchronization purposes - different listing tools have different ideas of what to list for modification time!

In particular, Rclone seems to work off of an internal modification time, which is unaltered when a file is copied to another bucket - if a file was last modified on Google Cloud on January 10th, and then copied over to S3 on a later date, Rclone would say both files were last modified on January 10th. In contrast, gsutil and AWS CLI (and consequently S3P) use the time a file was uploaded to its containing bucket as the modification time - in the previous example, the file on Google Cloud would've last been modified on January 10th, and the file on S3 would've last been "modified" on whatever date it was copied over.

A workaround to this problem would be to rely instead upon the checksums of files to test if they are identical, in particular MD5 hashes, which are supported on both Google Cloud and S3. The ability to generate file checksums is limited when using AWS CLI (only for individual files) and potentially unavailable in S3P (hard to tell as there is sparse documentation), but can be done while listing a bucket using gsutil or Rclone:

gsutil hash -m ... 
rclone hashsum MD5 ...

Rclone seems to be able to generate hashes on S3 storage significantly faster than modification times, which could make it a viable option for S3.

rabernat commented 4 years ago

Here's another one to try!

https://twitter.com/thundercloudvol/status/1326348841965264896

charlesbluca commented 4 years ago

Thanks for the info! I'll add cloud-files and MinIO Client to the general listing tests; maybe cloud-files could also have some functionality in Python code for tasks not suited for gcsfs or s3fs.

In terms of my testing with listing MD5 checksums, Rclone was still significantly faster than gsutil, and worked much faster in Google Cloud than S3 (30 minutes vs 210 minutes). However, it was able to complete the task within a timeout limit, which could potentially make it useful depending on how often we plan on indexing the buckets.

charlesbluca commented 4 years ago

CloudFile and MinIO Client both seem to perform similarly to Rclone when it comes to Google Cloud storage, getting a listing in ~30 minutes (no mod times or checksums). Unfortunately, they also share slower listing times when it comes to S3, taking around 3-4x longer.

So far, the optimal indexing tools seem to be Rclone for Google Cloud and S3P for S3 if synchronization isn't a concern, and Rclone listing checksums if it is (though this is still very slow in S3).

charlesbluca commented 4 years ago

It looks like S3P is able to generate MD5 checksums using it's each command instead of a standard ls; in fact, roughly the same output as rclone hashsum MD5 can be generated for an S3 bucket using:

s3p each --bucket target-bucket --map "js:(item) => console.log(item.ETag.slice(1,-1), item.Key)" --quiet

Since this tool is backed by s3api, there might be an equivalent to this command using AWS CLI, but I doubt that it would run nearly as fast. I'm going to test this S3P listing and look into CloudFiles and MinIO Client for similar functionality.

rabernat commented 4 years ago

Thanks for continuing to work on this Charles!

charlesbluca commented 4 years ago

The results of the S3P listing are significant - we are able to generate a list of all files with checksums in the same amount of time (sometimes less!) it would take to simply list the S3 bucket keys. This means we should be able to generate indexes of both buckets in around 30 minutes each, with or without checksums!

I'll continue testing on CloudFiles/MinIO to see if either can generate checksums for Google Cloud faster than Rclone, but as it is I expect to see some performance improvements in the synchronization process now that we're able to generate S3 checksums much faster.

rabernat commented 4 years ago

One final thing to keep in mind is the costs. Each of these API requests does have a tiny cost associated with it. Along with the speed information, it would be good to have a ballpark figure on how much each option costs.

charlesbluca commented 4 years ago

Good point - pricing is definitely the biggest motivator behind how often we plan to run these scripts.

Based on the pricing of S3 and Cloud Storage, it looks like a list operation is priced at $0.005 per 1,000 requests (or $5 per 1,000,000 requests). I'm still looking for clear numbers on how many objects a Cloud Storage list operation returns, but going off of S3's numbers this would be 1,000 objects per list operation.

Assuming our CMIP6 buckets have somewhere around 25,000,000 objects each now, we can get a ballpark figure of:

2 25,000,000 ($0.005 / 1,000) / 1,000 = $0.25

Per listing of both buckets, split roughly equally across both cloud providers. I don't expect this number to change dramatically from tool to tool, as from my understanding they tend to differ not in how many requests they send to get the listing, but more so in how those requests are sent (serial vs. parallel). In the case of my Rclone sync workflow, this would mean that (ignoring egress fees) there is a total cost of roughly $30 monthly to list out the buckets for comparison 4x daily, versus a $7.50 monthly cost if we opted to do this.

Another thing to take into account is egress fees, i.e. the cost of downloading the index files, which would likely be in excess of 2 GB each (although compression is definitely an option here). In Cloud Storage general egress (downloading a file to a non-cloud associated resource) is $0.12 / GB, while in S3 it is $0.09 / GB. We can see from this that the total egress costs of downloading both index files to one place would exceed the cost of listing out the buckets themselves. This serves as a motivator to move most of our cataloging/synchronization services to a GCP/AWS resource, so we could bring the egress fees down to minimal or even free (although we would still need to pay full egress fees for one of the index files to download it to a different cloud provider) .

pangeo-forge / cmip6-pipeline

Comparison of bucket indexing tools #9