JohnHBrock closed this 7 months ago
Reasonable suggestion, and the performance increase under the benchmark conditions is indeed significant! However, I would question the accuracy of the benchmark compared to real-life usage (as noted by your comment in the tests: "we don't need to use a uri that actually exists in order to call GCS's open()"). To make it more accurate, you should read/write a byte or two so that the same work happens in both scenarios.
Setting the benchmark aside for a moment, there were two relevant considerations when making this change:
1. The `Bucket.get_blob` interface allows this code to adhere to the existing smart_open interface API, including its exceptions.
2. The `blob.open()` interface is unsafe if the underlying blob in GCS is updated or replaced (see https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#google_cloud_storage_blob_Blob_open). This means that to ensure there is no unexpected behaviour with this change, you would need to get the generation id for the blob. As I understand it, this is typically done using the `blob.reload()` function, which is exactly the overhead you're noticing in the `Bucket.get_blob()` call: compare https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L756-L800 with https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1176-L1281
I guess my point is that, based on the docs, I don't believe the GCS API call is unnecessary. But please correct me if you think my understanding is wrong!
In my testing, this change means that opening a file that doesn't exist no longer raises an error.
@SimonBiggs what do you get instead?
I would tend to agree with @cadnce:
> I guess my point is that, based on the docs, I don't believe the GCS API call is unnecessary.
In retrospect, merging this may have been premature. Here is the trade-off we're dealing with:
- Pro: calling `smart_open.open` on a GCS URL is 1000x faster (*)
- Con: breaking API change, as opening a non-existing file no longer raises an error (like the built-in `open` function does)
I think the benefit of the open function being 1000x faster is questionable, too. Presumably, the user is calling open because they want to do something with the returned file-like object, like reading or writing. These subsequent operations will require network access and thus consume time. So really, this PR isn't about speeding things up by improving performance; it is about postponing the cost of the network transfer from open-time to write-time.
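The trade-off can be seen in a toy model (pure Python, nothing from smart_open itself, all names illustrative): a lazy wrapper makes "open" instant, but merely moves the cost, and any failure, to the first read.

```python
import io


class LazyRemoteFile:
    """Toy model of deferring the transfer: construction ('open') is
    instant, and the expensive fetch (plus any missing-object error)
    happens on the first read instead."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable standing in for the network call
        self._buf = None

    def read(self):
        if self._buf is None:
            # The cost (and any error) postponed from open-time lands here.
            self._buf = io.BytesIO(self._fetch())
        return self._buf.read()


f = LazyRemoteFile(lambda: b"remote bytes")  # "opening" does no work at all
print(f.read())
```

Nothing is faster end-to-end; the latency has just been relocated to a point where the caller may no longer expect an existence error.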
I think the con is greater than the pro here. Yes, in general, faster is better, but being correct is far more important (if you take the behavior of the built-in open as the definition of "correct").
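For reference, this is the built-in behaviour being taken as the definition of "correct": a missing file fails eagerly at open time, not at first read. A quick self-contained check (the path is constructed to be missing):

```python
import os
import tempfile

# A path guaranteed missing: a fresh temporary directory with no files in it.
missing = os.path.join(tempfile.mkdtemp(), "no_such_file.txt")

try:
    open(missing, "rb")
    print("no error")
except FileNotFoundError:
    # The built-in open fails at open time; the argument here is that
    # smart_open should mirror this for remote URIs like gs://... as well.
    print("FileNotFoundError at open time")
```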
What do you think @piskvorky? Should we keep this behavior or roll it back?
Consistency with the built-in `open` is smart_open's "selling point". Our documentation explicitly promises:

> `smart_open` is a drop-in replacement for Python's built-in `open()`: it can do anything `open` can (100% compatible, falls back to native `open` wherever possible), plus lots of nifty extra stuff on top.
So to me the answer is clear :)
OK, reverted.
Thank you @SimonBiggs for pointing out the issue. I've made a bugfix release 7.0.3 that reverts to the original behavior.
Glad y’all agree. We’ve been holding off upgrading until this was reverted :)
Motivation
`Bucket.get_blob` requires a roundtrip to GCS, but `Bucket.blob` doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call `smart_open.open()` from milliseconds to microseconds. This is especially nice when you're repeatedly calling `smart_open.open` for a large list of files.

Tests
To run the new benchmark test:
`pytest-benchmark` comparison for old code vs. new code: