justindujardin / pathy

simple, flexible, offline capable, cloud storage with a Python path-like interface
Apache License 2.0

Pathy.exists() check might impact performance due to partial startswith check #109

Open yaelmi3 opened 1 year ago

yaelmi3 commented 1 year ago

env: python3.10, tested with GS

Consider the following case:

Pathy("gs://bucket/blob-not-there").exists()

In this case we first check whether the exact blob exists. If it doesn't, we go on to check for a partial match by running startswith against every blob in the bucket. This introduces two possible issues:

  1. In a bucket with a large number of blobs (ours has hundreds of thousands), this check can take unreasonably long
  2. If there is a prefix match, exists will return True, even though the matching blob might not be the one we are referring to

Possible solutions

  1. Avoid looking for a blob prefix
  2. Add a flag to exists, something like exact_match
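
The behaviour and the second proposed solution can be sketched with a simplified in-memory model. This is not Pathy's actual code: the function `exists_model`, its parameters, and the blob names are all hypothetical, and the bucket is assumed to be a plain list of blob names.

```python
from typing import Sequence


def exists_model(blob_names: Sequence[str], key: str, exact_match: bool = False) -> bool:
    """Hypothetical model of the lookup described in this issue (not Pathy's code)."""
    # Step 1: look for a blob whose name is exactly `key`.
    if key in set(blob_names):
        return True
    # Proposed flag: stop here instead of falling back to a prefix scan.
    if exact_match:
        return False
    # Step 2: fall back to scanning every blob name for a prefix match.
    return any(name.startswith(key) for name in blob_names)


# Made-up bucket contents for illustration only.
bucket = ["blob-not-there-2023.csv", "other/file.txt"]

prefix_result = exists_model(bucket, "blob-not-there")
exact_result = exists_model(bucket, "blob-not-there", exact_match=True)
print(prefix_result, exact_result)
```

With the fallback enabled, the first call reports the blob as existing because another name happens to start with the same string; with `exact_match=True` it does not, and the scan over all names is skipped.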

justindujardin commented 10 months ago

@yaelmi3 thanks for providing this review/analysis! 🙇

Could you construct a performance test that measures how slow it is and compares it with your suggested change? I can run it on all the cloud providers to get a sense of the impact if you write a script that works with the local-mode implementation.
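
One possible shape for such a script is sketched below. It is only an assumption about how the comparison could be structured: it times an in-memory stand-in rather than Pathy itself, so it ignores the cost of listing blobs over the network, and the names `make_bucket`, `exists_with_prefix_scan`, `exists_exact_only`, and `benchmark` are all hypothetical. A real test would swap the two lookup functions for calls against the local-mode implementation.

```python
import timeit
from typing import Dict, List, Set


def make_bucket(n_blobs: int) -> List[str]:
    # Generate made-up blob names; none of them starts with the missing key below.
    return [f"data/file-{i:07d}.bin" for i in range(n_blobs)]


def exists_with_prefix_scan(name_set: Set[str], names: List[str], key: str) -> bool:
    # Exact lookup first, then a startswith scan over every blob name.
    if key in name_set:
        return True
    return any(name.startswith(key) for name in names)


def exists_exact_only(name_set: Set[str], key: str) -> bool:
    # Suggested change: exact lookup only, no prefix fallback.
    return key in name_set


def benchmark(n_blobs: int = 200_000, repeats: int = 5) -> Dict[str, float]:
    names = make_bucket(n_blobs)
    name_set = set(names)
    missing = "data/blob-not-there"  # worst case: the scan visits every name
    scan = min(timeit.repeat(
        lambda: exists_with_prefix_scan(name_set, names, missing),
        number=1, repeat=repeats))
    exact = min(timeit.repeat(
        lambda: exists_exact_only(name_set, missing),
        number=1, repeat=repeats))
    # Best-of-N wall-clock seconds for a single call of each variant.
    return {"prefix_scan_s": scan, "exact_only_s": exact}


if __name__ == "__main__":
    print(benchmark())
```

Looking up a key that matches nothing is the worst case for the fallback, since the scan cannot stop early.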