martindurant / rfsspec

Rust python FS
MIT License

How to do a simple cat with s3 #7

Open rabernat opened 1 year ago

rabernat commented 1 year ago

Thanks for working on this Martin!

I'm trying to do a simple cat operation to look at performance. Here's what I tried:

import rfsspec

bucket = "cmip6-pds"
# 53 MB object
key = "CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/hfls/0.0.0"
url = f"s3://{bucket}/{key}"

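rs3 = rfsspec.s3.RustyS3FileSystem()  # assumed: default construction (rs3 is otherwise undefined here)
rs3.cat(url)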
# b'S3 ERRROR: <?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>InvalidBucketName</Code><Message>The specified bucket is not valid.</Message><BucketName>s3:</BucketName><RequestId>QR3RC7B9RMRTCTNN</RequestId><HostId>QoqRLQ03ZkncmistNC7OIEY8McgnijkP1j25CyHHLOON1MmlD0Xp5NuXdWBk5O6scsy0P1yjJ8w=</HostId></Error>'

rs3.cat(f"{bucket}/{key}")
# b'S3 ERRROR: <?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>cmip6-pds.s3-us-west-2.amazonaws.com</Endpoint><Bucket>cmip6-pds</Bucket><RequestId>3XWMHZ78EJ409A4J</RequestId><HostId>XIrupd9bzk4GqrN0TBsr44itH3BovkQ5WmKyljcyF6ka8fjBBvRRSUdxcm4nYMetst66bTKr8aU=</HostId></Error>'

I'm stuck! Can you help me figure out what to do?


For reference, here is what I am comparing with:

import s3fs
import boto3

s3 = s3fs.S3FileSystem()
s3_client = boto3.client('s3')

%timeit _ = s3.cat(url)
# 1.1 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit _ = s3_client.get_object(Bucket=bucket, Key=key)
# 326 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I've found that boto is 3-5x faster than s3fs for this basic operation, and this performance difference propagates itself through our whole stack. In trying to get to the bottom of this, I decided to try rfsspec for the first time.

rabernat commented 1 year ago

Ok I figured it out!

rs3 = rfsspec.s3.RustyS3FileSystem(endpoint_url="https://cmip6-pds.s3-us-west-2.amazonaws.com")
%timeit _ = rs3.cat(f"{bucket}/{key}")
# 1.84 s ± 611 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Am I doing something wrong? It's much slower than python fsspec.

martindurant commented 1 year ago

Quick note: s3_client.get_object(Bucket=bucket, Key=key) does NOT fetch the data, only the key metadata. I think you need response["Body"].read().
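
To make the boto3 side of the comparison fair, the timed call has to drain the response body. A minimal sketch, assuming a standard boto3 S3 client; the helper name `fetch_object_bytes` is made up for illustration:

```python
def fetch_object_bytes(client, bucket: str, key: str) -> bytes:
    """Download an object fully, so that timings include the data transfer."""
    # get_object returns once the response headers have arrived;
    # the payload is only transferred when the streaming Body is read.
    response = client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# Example usage (needs boto3 and network access):
#   import boto3
#   s3_client = boto3.client("s3")
#   %timeit _ = fetch_object_bytes(s3_client, bucket, key)
```

Timing this helper instead of the bare `get_object` call should give a number that is comparable with `s3.cat(url)`.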

martindurant commented 1 year ago

It seems the correct and much faster call is

rs3 = rfsspec.s3.RustyS3FileSystem(region="us-west-2")
%timeit out = rs3.cat(f"{bucket}/{key}")

This ends up almost as fast as s3fs (which is the same speed as boto3). It is still a bit slower, though, so I'll do some thinking. In the meantime, I've done some benchmarking that seems to show the opposite: a decent speedup for reading many small files, and equal speed when reading big files.

(attaching bench screenshots in next comment)

martindurant commented 1 year ago

[Two benchmark screenshots: "Screen Shot 2023-03-15 at 11 01 52" and "Screen Shot 2023-03-15 at 11 01 13"]

rabernat commented 1 year ago

Thanks Martin! Yes, with those changes all methods are basically the same.

martindurant commented 1 year ago

It is possible to find the region of any bucket simply, e.g.,

import requests

requests.head("https://cmip6-pds.s3.amazonaws.com").headers["x-amz-bucket-region"]

which could be cached; but I am surprised this is necessary.
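
A sketch of how that lookup might be cached, assuming the `x-amz-bucket-region` header is present on the response. `BucketRegionCache` and its `fetch_headers` parameter are hypothetical names for illustration, not part of rfsspec or s3fs:

```python
from typing import Callable, Dict, Mapping


def _head_with_requests(url: str) -> Mapping[str, str]:
    # Imported lazily so the caching logic does not itself depend on requests.
    import requests

    return requests.head(url, timeout=10).headers


class BucketRegionCache:
    """Remember each bucket's region so the HEAD request is made only once."""

    def __init__(
        self,
        fetch_headers: Callable[[str], Mapping[str, str]] = _head_with_requests,
    ) -> None:
        self._fetch_headers = fetch_headers
        self._regions: Dict[str, str] = {}

    def region(self, bucket: str) -> str:
        if bucket not in self._regions:
            headers = self._fetch_headers(f"https://{bucket}.s3.amazonaws.com")
            # Raises KeyError if S3 did not report a region for this bucket.
            self._regions[bucket] = headers["x-amz-bucket-region"]
        return self._regions[bucket]


# Example usage (needs requests and network access):
#   cache = BucketRegionCache()
#   cache.region("cmip6-pds")
```

The cached value could then be passed as `region=` when constructing `RustyS3FileSystem`, as in the call earlier in this thread.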

To be clear, the purpose of this implementation is efficiency in many-file batch and threaded calls, such as xarray's eager fetch of coordinates and dask operations with one or more chunks per partition.
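
For comparing implementations on that kind of workload, a small timing harness over a batch of paths may be more telling than a single large `cat`. This is only a sketch: `time_batch_fetch` is a hypothetical helper, and it assumes each fetcher accepts a list of paths (fsspec's `cat` does; whether rfsspec's `cat` does is an assumption here).

```python
import time
from typing import Callable, Dict, Sequence


def time_batch_fetch(
    fetchers: Dict[str, Callable[[Sequence[str]], object]],
    paths: Sequence[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds per fetcher for the whole batch."""
    best: Dict[str, float] = {}
    for name, fetch in fetchers.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(paths)  # fetch every path in a single batched call
            timings.append(time.perf_counter() - start)
        # The minimum is less sensitive to transient network noise than the mean.
        best[name] = min(timings)
    return best


# Example usage (needs network access and both libraries installed):
#   time_batch_fetch({"s3fs": s3.cat, "rfsspec": rs3.cat}, many_small_paths)
```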

rabernat commented 1 year ago

Totally. It makes sense that this benchmark is the same for all implementations. It's really bound by the speed of S3's response and network latency / bandwidth.