arunmk opened this issue 4 years ago
I have tried various options, including using a transport and trying to read all 1.1GB at once without chunking. They are all in a similar ballpark and very slow compared to gsutil. I also initially tried with v2.1.0, and that took similar times.
@petedannemann Are you able to investigate?
pytest integration-tests/test_gcs.py::test_gcs_performance
--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance 2.1291 2.2363 2.1769 0.0431 2.1742 0.0688 2;0 0.4594 5 1
-------------------------------------------------------------------------------------------------------------
Yep, this is much slower than it should be. I remember running initial benchmarks during development and seeing numbers much lower than this. I'm not sure if something has changed in the code or if my memory is failing me / I ran improper benchmarks, but these numbers are definitely unacceptable. I can try to do some profiling soon to figure out where the bottlenecks are.
pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance
ncalls tottime percall cumtime percall filename:lineno(function)
930 1.3052 0.0014 1.3052 0.0014 ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
5 0.3016 0.0603 0.3016 0.0603 ~:0(<method 'do_handshake' of '_ssl._SSLSocket' objects>)
5 0.2555 0.0511 0.2555 0.0511 ~:0(<method 'connect' of '_socket.socket' objects>)
12 0.1437 0.0120 0.1437 0.0120 ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
5 0.0599 0.0120 0.0599 0.0120 ~:0(<method 'load_verify_locations' of '_ssl._SSLContext' objects>)
Seems like reading is the bottleneck and writing is performing fine. Since we have so many calls to read from sockets it seems like buffering for reads is probably broken.
Hi @petedannemann have you had a chance to look into this more? We are hoping to use smart_open as soon as this is figured out. Thanks!
I don't have time right now to work on this. Feel free to look at it yourself. If you want an alternative, I'd suggest gcsfs
Follow-up on this: buffering works as intended. I tested this by adding some logging to smart_open.gcs._RawReader:
def _download_blob_chunk(self, size):
    start = position = self._position
    if position == self._size:
        #
        # When reading, we can't seek to the first byte of an empty file.
        # Similarly, we can't seek past the last byte. Do nothing here.
        #
        binary = b''
    elif size == -1:
        logger.info("downloaded")
        binary = self._blob.download_as_bytes(start=start)
    else:
        end = position + size
        logger.info(f"downloaded from bytes {start} to {end}")
        binary = self._blob.download_as_bytes(start=start, end=end)
    return binary
When reading a large file I see:
INFO smart_open.gcs:gcs.py:195 downloaded from bytes 0 to 262144
INFO smart_open.gcs:gcs.py:195 downloaded from bytes 262144 to 524288
INFO smart_open.gcs:gcs.py:195 downloaded from bytes 524288 to 786432
INFO smart_open.gcs:gcs.py:195 downloaded from bytes 786432 to 1048576
INFO smart_open.gcs:gcs.py:195 downloaded from bytes 1048576 to 1310720
...
Note that @arunmk seemed to be using min_part_size instead of buffer_size, so I'm not sure his benchmarking actually tested without buffering. @arunmk, can you try again using the buffer_size param?
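For reference, passing buffer_size through transport_params would look roughly like the sketch below. The bucket, key, and 64 MB value are illustrative only, based on the parameter name discussed in this thread:

import smart_open

# Illustrative: request a much larger read buffer than the ~256 KB chunks
# seen in the logs above, so a large file needs far fewer round trips.
tp = {'buffer_size': 64 * 1024 * 1024}

with smart_open.open('gs://smart-open/tests/performance.txt', 'rb', transport_params=tp) as fin:
    data = fin.read()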
Note that comparing speeds to gsutil is not an apples-to-apples comparison. gsutil is almost certainly using gRPC to download the file over a streaming connection.
For next steps I will profile how long google.cloud.storage.Blob.upload_from_string and google.cloud.storage.Blob.download_as_bytes take, to understand what overhead smart-open adds.
Benchmark of smart_open reads (I commented out the file writing in read_write):
pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance
-------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------
test_gcs_performance 762.5399 942.5238 849.3104 65.7851 835.9554 76.6498 2;0 1.1774 5 1
-----------------------------------------------------------------------------------------------------------------------
Benchmark of google.cloud.storage.Blob.download_as_bytes (performed on the same file as above):
import google.cloud.storage


def download_blob(bucket, key):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()


def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    actual = benchmark(download_blob, bucket, key)
$ pytest --benchmark-cprofile=tottime benchmark_gcs_download.py
------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance 802.8116 886.9596 834.0021 32.4173 823.4367 37.8678 1;0 1.1990 5 1
--------------------------------------------------------------------------------------------------------------------
Benchmark of smart-open writes (I commented out the file reading in read_write and added a large min_part_size; a rough sketch of such a write call follows the results below):
$ pytest --benchmark-cprofile=tottime integration-tests/test_gcs.py::test_gcs_performance
--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------
test_gcs_performance 1.0568 1.7717 1.2996 0.2871 1.2318 0.3631 1;0 0.7695 5 1
-------------------------------------------------------------------------------------------------------------
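For context, a smart_open write configured with a large min_part_size (as described for the benchmark above) might look roughly like the sketch below; the bucket, key, payload, and part size are illustrative, not the actual benchmark code:

import os
import smart_open

# Illustrative: a large min_part_size means fewer, larger parts are uploaded.
tp = {'min_part_size': 256 * 1024 * 1024}
payload = os.urandom(1024 * 1024)  # 1 MB of dummy data

with smart_open.open('gs://smart-open/tests/performance.txt', 'wb', transport_params=tp) as fout:
    fout.write(payload)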
Benchmark of google.cloud.storage.Blob.upload_from_string (performed on the same file as above):
import google.cloud.storage


def upload_blob(bucket, key, data):
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).blob(key)
    blob.upload_from_string(data)


def test_performance(benchmark):
    bucket = 'smart-open'
    key = 'tests/performance.txt'
    client = google.cloud.storage.Client()
    blob = client.bucket(bucket).get_blob(key)
    data = blob.download_as_bytes()
    actual = benchmark(upload_blob, bucket, key, data)
$ pytest --benchmark-cprofile=tottime benchmark_gcs_upload.py
------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance 712.8647 746.4870 731.4032 15.3004 732.1812 28.5586 1;0 1.3672 5 1
--------------------------------------------------------------------------------------------------------------------
Read speed for smart-open is basically the same as using the google.cloud.storage library. Increasing buffer_size should substantially speed up downloading larger files. Writing via smart-open is slower.
Thanks for looking into this @petedannemann.
"Writing via smart-open is slower."
Do you know the cause?
"Note that @arunmk seemed to be using min_part_size instead of buffer_size, so I'm not sure his benchmarking actually tested without buffering. @arunmk, can you try again using the buffer_size param?"
@petedannemann I initially used the API calls without the buffer_size parameter. Because they were extremely slow, I added the buffer_size parameter to check if it would help. I have also tried large buffer sizes (1 GiB, etc.) and they did not help in any noticeable way.
@petedannemann could you also mention the version of the library used? I am not able to get to that codebase anymore, but I remember you were also able to replicate the slowness, as per the initial comment.
Thanks @arunmk for following up on this. As Arun noted initially, the version we tested was 3.0.0, which is an older one now!
Does anyone know if #599 and its resolution #729 solve the slow reading issue? I'm trying to weigh up the benefits of upgrading our infrastructure to use google-cloud-storage >= 2.6.0 for use with smart-open.
Problem description
I am trying to download a slightly large file (1.1GB), and the attached code with smart_open takes a long time (15m40s), while a gsutil cp takes about 25s. The storage.blob API of Google is also quite fast (and comparable to gsutil).
Steps/code to reproduce the problem
Code used:
Nearly every chunk read above takes close to 230s. (Writing to the output file on the local FS has sub-second latency.)
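The attached code is not reproduced here; purely as a hypothetical illustration of the pattern described (chunked reads from GCS written to a local file), it would look something like the sketch below, with the bucket, paths, and chunk size as placeholders:

import smart_open

CHUNK_SIZE = 64 * 1024 * 1024  # placeholder; the issue does not state the chunk size used

with smart_open.open('gs://my-bucket/large-file.bin', 'rb') as fin:
    with open('/tmp/large-file.bin', 'wb') as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)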
Versions
Please provide the output of: