willgdjones opened this issue 4 years ago
You can see benchmarks by running pytest integration-tests/test_s3.py::test_s3_performance. This test uses the default buffer_size for smart_open.s3.open. You can probably increase performance substantially by increasing the buffer_size kwarg passed into smart_open.open. This might require increasing the memory allocated to your Lambda function if you use a very large buffer size.
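For reference, passing buffer_size through transport_params looks roughly like this (a minimal sketch with a placeholder bucket and key; 1 MiB chosen arbitrarily):

import smart_open

# Placeholder URI; buffer_size is in bytes.
with smart_open.open("s3://my-bucket/my-key.txt",
                     transport_params=dict(buffer_size=1024 * 1024)) as fin:
    for line in fin:
        pass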
Thanks for the response! I've been modifying the buffer_size value from 1024 to 262144, in increments that multiply by 4 each time (so 1024, 4096, ...), and I'm still getting a very similar transfer speed.
Just checked that the default rate is much higher. I've now benchmarked much higher values, from 4 * 128 * 1024 up to 32 * 128 * 1024 in steps increasing by 4x, but I'm still seeing similar results.
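A sketch of that sweep (placeholder bucket and key, timing a plain read-through per buffer size; not the exact code that was run):

import time
import smart_open

for buffer_size in (4 * 128 * 1024, 16 * 128 * 1024, 32 * 128 * 1024):
    start = time.perf_counter()
    with smart_open.open("s3://my-bucket/my-key.txt",
                         transport_params=dict(buffer_size=buffer_size)) as fin:
        for _ in fin:
            pass
    print(buffer_size, time.perf_counter() - start)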
What is "default rate" and how did you check it's "much higher"?
By "default rate" I mean DEFAULT_BUFFER_SIZE
that is defined here:
https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/s3.py#L38
Running the integration-tests from the root directory gives me:
(venv) ➜ smart_open git:(master) ✗ pytest integration-tests/test_s3.py
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --reruns --reruns-delay 1 integration-tests/test_s3.py
inifile: /Users/fonz/Documents/Projects/smart_open/tox.ini
rootdir: /Users/fonz/Documents/Projects/smart_open
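That error usually means the addopts configured in tox.ini reference options from the pytest-rerunfailures plugin (which provides --reruns and --reruns-delay) and the plugin isn't installed in the venv. Installing it with pip install pytest-rerunfailures, or overriding the ini options with pytest -o addopts="" integration-tests/test_s3.py, should let the collection proceed; this is inferred from the error message rather than anything specific to smart_open.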
When you say “streaming”, are you reading from S3, writing to S3, or both? I think buffer_size relates to reading and the min_part_size kwarg relates to writing.
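As a sketch of that distinction (placeholder bucket and keys, and assuming both kwargs are accepted via transport_params as described above):

import smart_open

# Reading: buffer_size (bytes) controls how much is buffered per fetch.
with smart_open.open("s3://my-bucket/input.txt",
                     transport_params=dict(buffer_size=128 * 1024)) as fin:
    data = fin.read()

# Writing: min_part_size (bytes) controls the multipart upload part size.
with smart_open.open("s3://my-bucket/output.txt", "w",
                     transport_params=dict(min_part_size=5 * 1024 * 1024)) as fout:
    fout.write(data)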
I'm specifically reading from S3. I seem to be able to download the 330 MB file in 6 seconds using boto3's get_object()["Body"].read(), but using smart_open this seems to take 159 seconds. Additionally, using get_object()["Body"].iter_lines() seems to iterate through the file in 8 seconds. I just want to check if I'm missing anything here!
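For concreteness, the comparison presumably looks something like this (a sketch with placeholder names, not the exact code that was run):

import time
import boto3
import smart_open

s3_client = boto3.client("s3")
bucket, key = "my-bucket", "my-key.gz"  # placeholders

# boto3: one bulk download of the whole object into memory.
start = time.perf_counter()
raw = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
print("boto3 read():", time.perf_counter() - start)

# smart_open: stream the same object line by line.
start = time.perf_counter()
with smart_open.open(f"s3://{bucket}/{key}") as fin:
    for _ in fin:
        pass
print("smart_open:", time.perf_counter() - start)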
20x slower is really weird. There should be very little overhead in smart_open, so the numbers ought to ± match.
Btw get_object()["Body"].iter_lines() didn't exist back then, maybe it's worth changing our "S3 read" implementation to that @mpenkov? Pros: less code in smart_open, easier maintenance, free updates when the boto API changes. Cons: ?
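A sketch of what that could look like (hypothetical, not smart_open's actual implementation; placeholder bucket and key):

import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="my-key.txt")["Body"]

# StreamingBody.iter_lines() yields each line as bytes with the newline stripped.
for line in body.iter_lines():
    pass  # handle each line here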
@willgdjones can you post a full reproducible example, with the exact code you're running? Both for the smart_open code and the native boto code. Thanks.
I've noticed that actually decompressing the file takes up a large amount of time that I was not previously factoring in. The following loop for the same file takes ~50 seconds:
with gzip.open(s3_client.get_object(Bucket=bucket, Key=key)["Body"]) as gf:
    for x in gf:
        pass
whereas this loop takes ~8 seconds:
for x in s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_lines():
    pass
The smart_open code I am running looks like this, and takes ~159 seconds:

from smart_open import open as new_open

for line in new_open(f"s3://{bucket}/{key}", transport_params=dict(buffer_size=32*128*1024)):
    pass
Is there any indication as to why this is the case? It seems strange that smart_open would take roughly 3x longer for gzipped files than the first approach; is it just because it decompresses chunk by chunk? I'm curious because we've been seeing similar issues with very slow streaming of small-to-medium-sized gzipped files via smart_open, with basically identical usage to what is described here.
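One way to separate network time from decompression time (a diagnostic sketch with placeholder names, not a fix) is to download the object fully and then time decompression of the in-memory bytes:

import gzip
import io
import time
import boto3

s3_client = boto3.client("s3")
bucket, key = "my-bucket", "my-key.gz"  # placeholders

# Download only: network cost.
start = time.perf_counter()
raw = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
print("download:", time.perf_counter() - start)

# Decompress and iterate only: CPU cost.
start = time.perf_counter()
with gzip.open(io.BytesIO(raw)) as gf:
    for _ in gf:
        pass
print("decompress:", time.perf_counter() - start)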
Hi all - thank you for the great package.
This is just a quick question about expected download rates. I'm seeing rates of ~2 MB/s when streaming data from an S3 bucket to a Lambda function in the same region. In total, I can stream a 300 MB file in ~159 seconds.
Are these rates to be expected using the package or is there something I am missing?
Thank you!