seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License

Disable internal checksum validation in GCS client #96

Closed ranlu closed 1 year ago

ranlu commented 1 year ago

The Google Cloud Storage client has MD5 checksum validation enabled by default. Since CloudFiles does checksum validation itself, we should disable the checksum in the download_as_bytes call. This reduces overall time from 5s to 4s when downloading 31 files of 10-20 MB each.

For some reason, if I disable checksum validation in CloudFiles instead, there is no speed improvement.
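For reference, client-side validation of the kind CloudFiles performs amounts to comparing an MD5 of the downloaded bytes against the base64-encoded MD5 the service reports. A minimal sketch (the helper name and sample payload are illustrative, not CloudFiles' actual API):

```python
import base64
import hashlib

def validate_md5(content: bytes, expected_md5_b64: str) -> None:
    # Hypothetical helper: compare the MD5 of the downloaded bytes
    # against the base64-encoded MD5 digest the service reported.
    digest = base64.b64encode(hashlib.md5(content).digest()).decode("utf-8")
    if digest != expected_md5_b64:
        raise ValueError("MD5 mismatch: download may be corrupted")

payload = b"example object contents"
expected = base64.b64encode(hashlib.md5(payload).digest()).decode("utf-8")
validate_md5(payload, expected)  # passes silently on a clean download
```

When both the storage client and CloudFiles run this check, the work is done twice per object, which is the overhead this PR removes.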

william-silversmith commented 1 year ago

I wonder if this big difference comes from the Google storage client using the pure Python crcmod instead of the compiled version. Either way, it doesn't matter since the work is duplicative. Thanks for the PR, I'll test this.

william-silversmith commented 1 year ago

I looked into the GCS library, and we're using the same hash implementation (hashlib.md5). The only difference I can see is that they use a blocked reading strategy, so maybe for reasonably sized files that is slower?
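The two strategies produce identical digests either way; the only difference is whether the bytes are hashed in one call or fed to hashlib incrementally as chunks arrive (the payload and block size below are arbitrary):

```python
import hashlib

data = b"x" * (1024 * 1024)  # ~1 MB stand-in for a downloaded object

# One-shot hashing over the fully buffered download.
one_shot = hashlib.md5(data).hexdigest()

# Blocked hashing, updating the digest chunk by chunk.
blocked = hashlib.md5()
block_size = 256 * 1024
for i in range(0, len(data), block_size):
    blocked.update(data[i:i + block_size])

assert one_shot == blocked.hexdigest()
```

Any performance gap would come from per-chunk overhead in the read loop, not from the hash computation itself.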

william-silversmith commented 1 year ago

S3 probably has the same double validation logic. Trying to figure out what to do about it.

ranlu commented 1 year ago

I wonder if all or most of the vendors cloud-files supports provide checksum validation. If so, we could also disable the validation in cloud-files, but the speed penalty for GCS is a bit alarming.

william-silversmith commented 1 year ago

I'm trying to remember why I added download validation. It might be that I wasn't sure whether the GCS and S3 libraries were doing it. For uploads, I believe you have to manually provide the MD5 (which we do).
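For GCS uploads, the MD5 is supplied as a base64-encoded 16-byte digest (not the usual hex string) so the server can verify the object on receipt; computing it is a one-liner (the payload here is illustrative):

```python
import base64
import hashlib

payload = b"example upload contents"
# GCS expects the base64 encoding of the raw 16-byte MD5 digest
# in the object metadata, rather than a hex string.
md5_b64 = base64.b64encode(hashlib.md5(payload).digest()).decode("utf-8")
```

This server-side check on upload is independent of the download-side validation being discussed, so removing the latter doesn't weaken it.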

ranlu commented 1 year ago

Makes sense. I suppose the md5sum is needed to validate that the uploaded data is not corrupted.

ranlu commented 1 year ago

BTW, we should disable the checksum in the GCS client for range requests anyway. At the moment, when downloading sharded objects like skeletons, there are tons of errors like this if I enable logging:

2023-09-01 07:40:14 INFO     No MD5 checksum was returned from the service while downloading https://storage.googleapis.com/download/storage/v1/b/xxxxxxxx/o/skeletons_mip_1%2F0494.shard?alt=media (which happens for composite objects), so client-side content integrity checking is not being performed.

william-silversmith commented 1 year ago

I did some testing and I do see a decrease of about 1 second in user time when downloading 32 10 MB files using this PR. I got a similar improvement by removing CloudFiles' checksum code. I'll just accept your PR now, since the range requests are causing inappropriate logging from the Google checksum code.