piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call #788

Closed: JohnHBrock closed this 7 months ago

JohnHBrock commented 11 months ago

Motivation

Bucket.get_blob requires a roundtrip to GCS, but Bucket.blob doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call smart_open.open() from milliseconds to microseconds. This is especially nice when you're repeatedly calling smart_open.open() for a large list of files.
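For illustration, here is a minimal sketch of the difference between the two calls (the bucket and key names are hypothetical):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket name

# Bucket.get_blob() issues a GET request to the GCS API and returns None
# if the object doesn't exist: one network roundtrip per call.
blob_with_roundtrip = bucket.get_blob("path/to/key")

# Bucket.blob() merely constructs a Blob object locally; no request is
# made, so it can't tell you whether the object actually exists.
blob_local_only = bucket.blob("path/to/key")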

Tests

To run the new benchmark test:

pytest integration-tests/test_gcs.py::test_gcs_performance_open

pytest-benchmark comparison for old code vs. new code: [benchmark screenshot omitted]
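For reference, a rough sketch of what such a test might look like using the pytest-benchmark fixture (the URI is hypothetical, and the actual test in this PR may differ):

import smart_open

def test_gcs_performance_open(benchmark):
    # benchmark() calls the function repeatedly and records timing stats.
    # Measuring only open() isolates the metadata roundtrip that the old
    # code path (Bucket.get_blob) performs and the new one skips.
    def _open():
        smart_open.open("gs://smart-open-test-bucket/benchmark.txt", "rb").close()

    benchmark(_open)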

cadnce commented 8 months ago

Reasonable suggestion, and the performance increase under the benchmark conditions is indeed significant! However, I would question the accuracy of the benchmark compared to real-life usage (as noted by your comment in the tests, "we don't need to use a uri that actually exists in order to call GCS's open()"). To make it more accurate, you should read/write a byte or two so that the same work happens in both scenarios.

Ignoring the benchmark for a second, when making this change there were two relevant considerations:

This means that to ensure there is no unexpected behaviour with this change, you would need to get the generation id for the blob. As I understand it, this is typically done using the blob.reload() function, which is the overhead you're noticing in the Bucket.get_blob() call: https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L756-L800 vs https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1176-L1281

I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary. But please correct me if you think my understanding is wrong!
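To spell out the reload() point, a small sketch (hypothetical bucket and key) of where the generation id comes from:

from google.cloud import storage

bucket = storage.Client().bucket("my-bucket")

blob = bucket.blob("path/to/key")
print(blob.generation)  # None: no metadata has been fetched yet

# Blob.reload() issues the GET request that Bucket.get_blob() performs
# internally, populating server-side metadata such as the generation id.
blob.reload()
print(blob.generation)  # e.g. 1700000000000000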

SimonBiggs commented 6 months ago

In my testing, this change causes opening a file that doesn't exist to no longer raise an error.
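A minimal reproduction of what I mean (hypothetical bucket and key):

import smart_open

# Under the original behavior, this raises at open() time because the
# object doesn't exist; after this change, open() succeeds and any
# failure surfaces only later, on the first read.
with smart_open.open("gs://my-bucket/no-such-key.txt", "rb") as fin:
    fin.read()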

ddelange commented 6 months ago

@SimonBiggs what do you get instead?

ddelange commented 6 months ago

I would tend to agree with @cadnce:

I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary.

mpenkov commented 6 months ago

In retrospect, merging this may have been premature. Here is the trade-off we're dealing with:

Pro: calling smart_open.open on a GCS URL is 1000x faster (*)
Con: breaking API change, as opening a non-existing file no longer raises an error (like the built-in open function does)

I think the benefit of the open function being 1000x faster is questionable, too. Presumably, the user is calling open because they want to do something with the returned file-like object, like reading or writing. These subsequent operations will require network access and thus consume time. So really, this PR isn't speeding things up by improving performance; it is postponing the cost of the network transfer from open-time to write-time.
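To make the trade-off concrete, a rough timing sketch (hypothetical URI):

import time
import smart_open

uri = "gs://my-bucket/large-file.txt"  # hypothetical

t0 = time.perf_counter()
fin = smart_open.open(uri, "rb")   # with this PR: microseconds, no network I/O
t1 = time.perf_counter()
data = fin.read()                  # the deferred network roundtrip happens here
t2 = time.perf_counter()
fin.close()

print(f"open: {t1 - t0:.6f}s, first read: {t2 - t1:.6f}s")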

I think the con is greater than the pro here. Yes, in general, faster is better, but being correct is far more important (if you take the behavior of the built-in open as the definition of "correct").

What do you think @piskvorky? Should we keep this behavior or roll it back?

piskvorky commented 6 months ago

Consistency with open is smart_open's "selling point". Our documentation explicitly promises:

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

So to me the answer is clear :)

mpenkov commented 6 months ago

OK, reverted.

mpenkov commented 6 months ago

Thank you @SimonBiggs for pointing out the issue. I've made a bugfix release 7.0.3 that reverts to the original behavior.
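For anyone who ran into this, upgrading should restore the original behavior, e.g. (assuming you install with the gcs extra):

pip install --upgrade "smart_open[gcs]>=7.0.3"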

cadnce commented 6 months ago

Glad y’all agree. We’ve been holding off upgrading until this was reverted :)