piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

GCP download very slow for slightly large files #555

Open arunmk opened 4 years ago

arunmk commented 4 years ago

Problem description

I am trying to download a slightly large file (1.1 GB). The code below, using smart_open, takes a long time (15m40s), while gsutil cp takes about 25s. Google's storage.blob API is also quite fast (comparable to gsutil).

Steps/code to reproduce the problem

Code used:

import time
import sys
from smart_open import open as cloud_open

gcs_uri = "<redacted file name>"
dl_path = "./test.pkl"

current_secs_func = lambda: int(round(time.time()))

chunk_size = 256 * 1024 * 1024  # 256M
count = 0
with cloud_open(gcs_uri, mode="rb") as cloud_fd:  # Same slowness even with `transport_params={'min_part_size': chunk_size}`
    with open(dl_path, mode="wb+") as local_fd:
        print("Start time: ", current_secs_func())
        sys.stdout.flush()
        while True:
            current = current_secs_func()
            data = cloud_fd.read(chunk_size)
            print("Read chunk [{}] of at most size [{}] from [{}] to [{}] in [{}] secs".format(count, chunk_size, gcs_uri, dl_path, current_secs_func() - current))
            sys.stdout.flush()
            if not data:
                break
            count += 1
            current = current_secs_func()
            local_fd.write(data)
            print("Wrote chunk [{}] of at most size [{}] from [{}] to [{}] in [{}] secs".format(count, chunk_size, gcs_uri, dl_path, current_secs_func() - current))
            sys.stdout.flush()

Nearly every chunk read above takes close to 230s (the write to the output file on the local FS has sub-second latency).
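For reference, a minimal sketch of the direct google.cloud.storage baseline mentioned above (bucket and key are placeholders, not the redacted file):

import time
import google.cloud.storage

# Hypothetical bucket/key standing in for the redacted URI above.
bucket_name = "my-bucket"
key = "path/to/large-file.pkl"

client = google.cloud.storage.Client()
blob = client.bucket(bucket_name).blob(key)

start = time.time()
blob.download_to_filename("./test.pkl")  # one call, no manual chunking
print("Downloaded in {:.1f}s".format(time.time() - start))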

Versions

Output of the version checks:

Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-5.4.0-1011-gcp-x86_64-with-Ubuntu-20.04-focal
>>> print("Python", sys.version)
Python 3.7.7 (default, Apr 18 2020, 02:59:53)
[GCC 9.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 3.0.0
arunmk commented 4 years ago

I have tried various options, including using transport params and reading all 1.1 GB at once without chunking. They all land in a similar ballpark and are very slow compared to gsutil. I also initially tried v2.1.0, which took similar times.

mpenkov commented 4 years ago

@petedannemann Are you able to investigate?

petedannemann commented 4 years ago
pytest integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     2.1291  2.2363  2.1769  0.0431  2.1742  0.0688       2;0  0.4594       5           1
-------------------------------------------------------------------------------------------------------------

Yep, this is much slower than it should be. I remember running initial benchmarks during development and seeing numbers much lower than this. I'm not sure if something has changed in the code or if my memory is failing me / I ran improper benchmarks, but these numbers are definitely unacceptable. I can try to do some profiling soon to figure out where the bottlenecks are.

petedannemann commented 4 years ago
pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

ncalls  tottime percall cumtime percall filename:lineno(function)
930     1.3052  0.0014  1.3052  0.0014  ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
5       0.3016  0.0603  0.3016  0.0603  ~:0(<method 'do_handshake' of '_ssl._SSLSocket' objects>)
5       0.2555  0.0511  0.2555  0.0511  ~:0(<method 'connect' of '_socket.socket' objects>)
12      0.1437  0.0120  0.1437  0.0120  ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
5       0.0599  0.0120  0.0599  0.0120  ~:0(<method 'load_verify_locations' of '_ssl._SSLContext' objects>)

It seems like reading is the bottleneck and writing is performing fine. Since there are so many calls to read from sockets, buffering for reads is probably broken.
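One way to verify this (a sketch, assuming the default requests/urllib3 HTTP transport and a hypothetical gs:// URI) is to enable debug logging, which makes every underlying HTTP request visible while reading:

import logging
from smart_open import open as cloud_open

# At DEBUG level, urllib3 logs one line per HTTP request, so the number of
# range requests issued while streaming the blob can be counted directly.
logging.basicConfig(level=logging.DEBUG)

with cloud_open("gs://my-bucket/path/to/large-file.pkl", "rb") as f:
    f.read(1024 * 1024)  # a properly buffered reader should need only a few requests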

tylerkohn commented 3 years ago

Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!

petedannemann commented 3 years ago

> Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!

I don't have time right now to work on this. Feel free to look at it yourself. If you want an alternative, I'd suggest gcsfs
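For completeness, a minimal sketch of the gcsfs alternative suggested above (the path is a placeholder):

import gcsfs

fs = gcsfs.GCSFileSystem()  # uses default application credentials
# Stream the object much like a local file.
with fs.open("my-bucket/path/to/large-file.pkl", "rb") as f:
    data = f.read()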

petedannemann commented 3 years ago

Following up on this: buffering works as intended. I tested this by adding some logging to smart_open.gcs._RawReader:

def _download_blob_chunk(self, size):
    start = position = self._position
    if position == self._size:
        #
        # When reading, we can't seek to the first byte of an empty file.
        # Similarly, we can't seek past the last byte.  Do nothing here.
        #
        binary = b''
    elif size == -1:
        logger.info("downloaded")
        binary = self._blob.download_as_bytes(start=start)
    else:
        end = position + size
        logger.info(f"downloaded from bytes {start} to {end}")
        binary = self._blob.download_as_bytes(start=start, end=end)
    return binary

When reading a large file I see:

INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 0 to 262144
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 262144 to 524288
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 524288 to 786432
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 786432 to 1048576
INFO     smart_open.gcs:gcs.py:195 downloaded from bytes 1048576 to 1310720
...

Note that @arunmk seemed to be using min_part_size (a write-side parameter) instead of buffer_size, so I'm not sure his benchmarking actually changed the read buffering. @arunmk, can you try again using the buffer_size param?
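For reference, a sketch of passing buffer_size through transport_params (the URI and the 64 MiB value are illustrative only):

from smart_open import open as cloud_open

# buffer_size controls how many bytes each underlying GCS range request fetches
# while reading; min_part_size only affects multipart uploads (writes).
transport_params = {"buffer_size": 64 * 1024 * 1024}  # 64 MiB per request

with cloud_open("gs://my-bucket/path/to/large-file.pkl", "rb",
                transport_params=transport_params) as f:
    data = f.read()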

petedannemann commented 3 years ago

Note that comparing speeds to gsutil is not an apples-to-apples comparison; gsutil is almost certainly using gRPC to download the file over a streaming connection.

For next steps I will profile how long google.cloud.storage.Blob.upload_from_string and google.cloud.storage.Blob.download_as_bytes take to understand what overhead smart-open adds

petedannemann commented 3 years ago

benchmark of smart_open reads (I commented out the file writing in read_write)

pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

-------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in ms)             Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------
test_gcs_performance     762.5399  942.5238  849.3104  65.7851  835.9554  76.6498       2;0  1.1774       5           1
-----------------------------------------------------------------------------------------------------------------------

benchmark of google.cloud.storage.Blob.download_as_bytes (performed on the same file as above).

import google.cloud.storage

def download_blob(bucket, key):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()

def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    actual = benchmark(download_blob, bucket, key)
$ pytest --benchmark-cprofile=tottime benchmark_gcs_download.py

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance      802.8116  886.9596  834.0021  32.4173  823.4367  37.8678       1;0  1.1990       5           1
--------------------------------------------------------------------------------------------------------------------

benchmark of smart-open writes (I commented out the file reading in read_write and added a large min_part_size)

$ pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance

--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)            Min     Max    Mean  StdDev  Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance     1.0568  1.7717  1.2996  0.2871  1.2318  0.3631       1;0  0.7695       5           1
-------------------------------------------------------------------------------------------------------------

benchmark of google.cloud.storage.Blob.upload_from_string (performed on the same file as above).

import google.cloud.storage

def upload_blob(bucket, key, data):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).blob(key)
    blob.upload_from_string(data)

def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()
    actual = benchmark(upload_blob, bucket, key, data)
$ pytest --benchmark-cprofile=tottime benchmark_gcs_upload.py

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)          Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance      712.8647  746.4870  731.4032  15.3004  732.1812  28.5586       1;0  1.3672       5           1
--------------------------------------------------------------------------------------------------------------------

Read speed for smart-open is basically the same as using the google.cloud.storage library. Increasing buffer_size should substantially speed up downloading larger files. Writing via smart-open is slower.
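Since the write path shows more overhead, a sketch of an upload with a larger multipart part size, using the min_part_size parameter from the write benchmark above (paths and sizes are illustrative):

from smart_open import open as cloud_open

# Larger parts mean fewer multipart upload requests; GCS resumable uploads
# require chunk sizes in multiples of 256 KiB, which 256 MiB satisfies.
transport_params = {"min_part_size": 256 * 1024 * 1024}

with open("./test.pkl", "rb") as src, \
        cloud_open("gs://my-bucket/path/to/large-file.pkl", "wb",
                   transport_params=transport_params) as dst:
    while True:
        chunk = src.read(64 * 1024 * 1024)  # copy in large chunks
        if not chunk:
            break
        dst.write(chunk)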

piskvorky commented 3 years ago

Thanks for looking into this @petedannemann .

> Writing via smart-open is slower.

Do you know the cause?

arunmk commented 3 years ago

> Note that @arunmk seemed to be using min_part_size (a write-side parameter) instead of buffer_size, so I'm not sure his benchmarking actually changed the read buffering. @arunmk, can you try again using the buffer_size param?

@petedannemann I initially used the API calls without the buffer_size parameter. Because they were extremely slow, I added buffer_size to check if it would help. I have also tried large buffer sizes (1 GiB, etc.) and they did not help in any noticeable way.

arunmk commented 3 years ago

@petedannemann could you also mention the version of the library you used? I am not able to get to that codebase anymore, but I remember you were also able to replicate the slowness, as per the initial comment.

tylerkohn commented 3 years ago

Thanks @arunmk for following up on this. As @arunmk noted initially, the version we tested was 3.0.0, which is an older one now!

q-aaronzolnailucas commented 11 months ago

Does anyone know whether #599 and its resolution in #729 solve the slow reading issue? I'm trying to weigh up the benefits of upgrading our infrastructure to google-cloud-storage >= 2.6.0 for use with smart-open.