Open · rabernat opened this issue 1 year ago
I've found that boto is 3-5x faster than s3fs for this basic operation, and this performance difference propagates through our whole stack. In trying to get to the bottom of this, I decided to try rfsspec for the first time.

I'm trying to do a simple `cat` operation to look at performance. Here's what I tried:

```python
import rfsspec

rs3 = rfsspec.s3.RustyS3FileSystem(endpoint_url="https://cmip6-pds.s3-us-west-2.amazonaws.com")
%timeit _ = rs3.cat(f"{bucket}/{key}")
# 1.84 s ± 611 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Am I doing something wrong? It's much slower than Python fsspec. I'm stuck! Can you help me figure out what to do?

For reference, here is what I am comparing with:
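(The comparison snippet itself isn't reproduced here; a minimal sketch of this kind of boto3/s3fs timing, assuming `bucket` and `key` are already defined and the bucket is publicly readable, might look like:)

```python
import boto3
import s3fs

# boto3 timing: get_object() returns quickly with headers and a lazy body;
# see the note below about Body.read()
s3_client = boto3.client("s3")
%timeit response = s3_client.get_object(Bucket=bucket, Key=key)

# s3fs timing: cat() downloads the full object contents
fs = s3fs.S3FileSystem(anon=True)
%timeit _ = fs.cat(f"{bucket}/{key}")
```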
Quick note: `s3_client.get_object(Bucket=bucket, Key=key)` does NOT fetch the data, only the key metadata. I think you need `response["Body"].read()`.
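In code, the corrected boto3 read would be something like:

```python
response = s3_client.get_object(Bucket=bucket, Key=key)  # returns headers plus a lazy StreamingBody
data = response["Body"].read()  # this call is what actually downloads the object's bytes
```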
It seems the correct and much faster call is

```python
rs3 = rfsspec.s3.RustyS3FileSystem(region="us-west-2")
%timeit out = rs3.cat(f"{bucket}/{key}")
```

This ends up almost as fast as s3fs (which is the same speed as boto3). It is still a bit slower, though, so I'll do some thinking. In the meantime I've done some benchmarking that seems to show the opposite: a decent speedup for reading many small files, and equal speed when reading big files.

(attaching bench screenshots in next comment)
Ok, I figured it out! Thanks, Martin! Yes, with those changes all methods are basically the same.
It is possible to find the region of any bucket simply, e.g.,

```python
requests.head("https://cmip6-pds.s3.amazonaws.com").headers["x-amz-bucket-region"]
```

which could be cached; but I am surprised this is necessary.
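A cached lookup along those lines (a sketch; the helper name and the use of `functools.lru_cache` are just one way to do it) could be:

```python
import functools

import requests


@functools.lru_cache(maxsize=None)
def bucket_region(bucket: str) -> str:
    # S3 reports a bucket's home region in the x-amz-bucket-region header,
    # even for an unauthenticated HEAD request against the bucket URL
    resp = requests.head(f"https://{bucket}.s3.amazonaws.com")
    return resp.headers["x-amz-bucket-region"]


bucket_region("cmip6-pds")  # 'us-west-2'
```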
To be clear, the purpose of this implementation is efficiency in many-file batch and threaded calls, like xarray's eager fetch of coordinates and Dask operations with one or more chunks per partition.
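For instance, the target pattern is a single batched call over many keys rather than one request at a time (a sketch, assuming RustyS3FileSystem follows fsspec's convention that `cat` accepts a list of paths and returns a dict; the key names are hypothetical):

```python
# fetch many small objects in one concurrent batch
paths = [f"{bucket}/coords/chunk-{i}" for i in range(100)]  # hypothetical keys
contents = rs3.cat(paths)  # {path: bytes}, fetched concurrently rather than serially
```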
Totally. It makes sense that this benchmark is the same for all implementations; it's really bound by the speed of S3's response and network latency / bandwidth.

Thanks for working on this, Martin!