piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Streaming data transfer rates #457

Open · willgdjones opened this issue 4 years ago

willgdjones commented 4 years ago

Hi all - thank you for the great package.

This is just a quick question about expected download rates. I'm seeing rates of ~2 MB/s when streaming data from an S3 bucket to a Lambda function, both in the same region. In total, I can stream a 300 MB file in ~159 seconds.

Are these rates to be expected using the package or is there something I am missing?

Thank you!

petedannemann commented 4 years ago

You can see benchmarks by running pytest integration-tests/test_s3.py::test_s3_performance. This test uses the default buffer_size for smart_open.s3.open. You can probably increase performance substantially by increasing the buffer_size passed to smart_open.open via transport_params. This might require increasing the memory allocated to your Lambda function if you use a very large buffer size.
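
For concreteness, a minimal sketch of what that might look like (the bucket, key, and 4 MB figure below are placeholders, not values from this issue):

from smart_open import open

# Placeholder example: fetch larger chunks per S3 GET than the library
# default (smart_open.s3.DEFAULT_BUFFER_SIZE).
transport_params = {"buffer_size": 4 * 1024 * 1024}  # 4 MB per read
with open("s3://my-bucket/my-key.txt", "rb", transport_params=transport_params) as fin:
    for line in fin:
        pass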

willgdjones commented 4 years ago

Thanks for the response! I've been modifying the buffer_size value from 1024 up to 262144, increasing by 4x each time (1024, 4096, 16384, ...), and I'm still getting a very similar transfer speed.

willgdjones commented 4 years ago

Just checked that the default rate is much higher. I've now benchmarked much higher values, from 4 * 128 * 1024 up to 32 * 128 * 1024, in intervals increasing by 4x, but I'm still seeing similar results.
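
For anyone who wants to reproduce this, a rough timing loop along these lines should do (bucket and key below are placeholders; it assumes working AWS credentials):

import time
from smart_open import open

BUCKET, KEY = "my-bucket", "my-key"  # placeholders

for buf in (128 * 1024, 4 * 128 * 1024, 32 * 128 * 1024):
    start = time.perf_counter()
    with open(f"s3://{BUCKET}/{KEY}", "rb",
              transport_params={"buffer_size": buf}) as fin:
        for _ in fin:
            pass
    print(f"buffer_size={buf}: {time.perf_counter() - start:.1f}s")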

piskvorky commented 4 years ago

What is "default rate" and how did you check it's "much higher"?

willgdjones commented 4 years ago

By "default rate" I mean DEFAULT_BUFFER_SIZE that is defined here:

https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/s3.py#L38

willgdjones commented 4 years ago

Running the integration-tests from the root directory gives me:

(venv) ➜  smart_open git:(master) ✗ pytest integration-tests/test_s3.py
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --reruns --reruns-delay 1 integration-tests/test_s3.py
  inifile: /Users/fonz/Documents/Projects/smart_open/tox.ini
  rootdir: /Users/fonz/Documents/Projects/smart_open
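
(Side note for anyone hitting the same error: the --reruns / --reruns-delay options are injected by the project's tox.ini and are provided by the pytest-rerunfailures plugin, so this most likely just means that plugin isn't installed in the active virtualenv.)
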
petedannemann commented 4 years ago

When you say "streaming", are you reading from S3, writing to S3, or both? I think the buffer_size kwarg relates to reading and the min_part_size kwarg relates to writing.
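
Roughly something like this, if that helps (the bucket names and sizes below are made up):

from smart_open import open

# Reading: buffer_size controls how many bytes each S3 GET fetches.
with open("s3://my-bucket/input.txt", "rb",
          transport_params={"buffer_size": 8 * 1024 * 1024}) as fin:
    payload = fin.read()

# Writing: min_part_size controls the size of each multipart-upload part.
with open("s3://my-bucket/output.txt", "wb",
          transport_params={"min_part_size": 16 * 1024 * 1024}) as fout:
    fout.write(payload)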

willgdjones commented 4 years ago

I'm specifically reading from S3. I seem to be able to download the 330 MB file in 6 seconds using boto3's get_object().read(), but with smart_open this takes ~159 seconds.

willgdjones commented 4 years ago

Additionally, using get_object().iter_lines() seems to iterate through the file in 8 seconds.

I just want to check if I'm missing anything here!

piskvorky commented 4 years ago

20x slower is really weird. There should be very little overhead in smart_open, so the numbers ought to ± match.

Btw, get_object().iter_lines() didn't exist back when we wrote our S3 code; maybe it's worth changing our "S3 read" implementation to use it, @mpenkov? Pros: less code in smart_open, easier maintenance, free updates when the boto API changes. Cons: ?

@willgdjones can you post a full reproducible example, with the exact code you're running? Both for the smart_open code and the native boto code. Thanks.

willgdjones commented 4 years ago

I've noticed that decompressing the file actually takes up a large share of the time, which I was not previously factoring in. The following loop over the same file takes ~50 seconds:

import gzip
# s3_client is a boto3 S3 client; bucket and key point at the same gzipped file
with gzip.open(s3_client.get_object(Bucket=bucket, Key=key)["Body"]) as gf:
    for x in gf:
        pass

whereas this loop takes ~8 seconds:

for x in s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_lines():
    pass

The smart_open code I am running looks like this, and takes ~159 seconds:

from smart_open import open as new_open
for line in new_open(f"s3://{bucket}/{key}", transport_params=dict(buffer_size=32*128*1024)):
    pass
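
One caveat when comparing these numbers: if the key ends in .gz, smart_open decompresses transparently while it streams, so the ~159 seconds includes decompression as well as the transfer itself. To time the raw transfer alone, the transparent decompression can be switched off; a sketch (the exact flag depends on the smart_open version: older releases take ignore_ext=True, newer ones compression="disable"):

from smart_open import open as new_open

# Sketch: stream the raw compressed bytes, skipping transparent decompression.
# bucket and key as in the snippet above.
with new_open(f"s3://{bucket}/{key}", "rb",
              compression="disable",  # or ignore_ext=True on older smart_open
              transport_params=dict(buffer_size=32 * 128 * 1024)) as fin:
    while fin.read(1024 * 1024):
        pass
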
zyd14 commented 3 years ago

Is there any indication as to why this is the case? Seems strange that smart_open would take 3x longer for gzipped files than the first approach; is it just because it is decompressing chunk-by-chunk? Just curious because we've been seeing similar issues with very slow streaming of small to medium sized gzipped files via smart_open, with basically identical usage to what is described here.
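
For what it's worth, one experiment that helps isolate the cost (this is just a sketch mirroring the boto3 comparison above, not a smart_open feature) is to download the whole object with a single GET and then decompress it from memory, so the transfer time and the decompression time can be measured separately:

import gzip
import io

import boto3

s3_client = boto3.client("s3")

# bucket and key as in the snippets above.
# Fetch the entire compressed object in one go, then decompress locally.
raw = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
with gzip.open(io.BytesIO(raw)) as gf:
    for line in gf:
        pass

The obvious trade-off is holding the whole compressed object in memory, which may matter inside a Lambda function.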