ranlu closed this issue 1 year ago
I wonder if this big difference is due to the Google storage client using the pure Python crcmod instead of the compiled version. Either way, it doesn't matter, as it's duplicative work. Thanks for the PR, I'll test this.
I looked into the GCS library, and we're using the same hash implementation (hashlib.md5). The only difference I can see is that they use a blocked reading strategy, so maybe that's slower for reasonably sized files?
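For illustration, here is a minimal sketch of the two hashing strategies being compared (the function names are hypothetical, not CloudFiles or GCS internals). Both produce the same digest; the blocked version just spreads the work over many update calls, as a streaming download would:

```python
import hashlib

def md5_oneshot(data: bytes) -> str:
    # Hash the whole buffer in a single call.
    return hashlib.md5(data).hexdigest()

def md5_chunked(data: bytes, chunk_size: int = 256 * 1024) -> str:
    # Blocked reading strategy: feed the hash in fixed-size chunks,
    # incurring per-call overhead for each update.
    h = hashlib.md5()
    for i in range(0, len(data), chunk_size):
        h.update(data[i:i + chunk_size])
    return h.hexdigest()

data = b"x" * (10 * 1024 * 1024)  # ~10MB, similar to the files discussed
assert md5_oneshot(data) == md5_chunked(data)
```

The digests are identical by construction, so any timing difference between the two download paths would come from the chunked I/O and per-chunk bookkeeping rather than the hash itself.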
S3 probably has the same double validation logic. Trying to figure out what to do about it.
I wonder if all or most of the vendors cloud-files supports offer checksum validation. If so, we could also disable the validation in cloud-files, but the speed penalty for GCS is a bit alarming.
I'm trying to remember why I added download validation. It might be that I wasn't sure whether the GCS and S3 libraries were doing it. For uploads, I believe you have to manually provide the md5 (which we do).
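For reference, GCS expects the upload checksum as the base64-encoded raw MD5 digest (not the hex string), matching the md5Hash field in object metadata and the Content-MD5 header. A small sketch (gcs_md5 is a hypothetical helper name):

```python
import base64
import hashlib

def gcs_md5(data: bytes) -> str:
    # GCS wants the base64 encoding of the raw 16-byte MD5 digest,
    # not hashlib's hexdigest form.
    return base64.b64encode(hashlib.md5(data).digest()).decode("utf-8")

payload = b"hello world"
print(gcs_md5(payload))  # -> "XrY7u+Ae7tCTyyK7j1rNww=="
# e.g. set blob.md5_hash = gcs_md5(payload) before uploading the blob
```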
Makes sense. I suppose the md5sums are needed to validate that the uploaded data is not corrupted.
BTW, we should disable the checksum in the GCS client for range requests anyway. At the moment, when downloading sharded objects like skeletons, there are tons of messages like this if I enable logging:
2023-09-01 07:40:14 INFO No MD5 checksum was returned from the service while downloading https://storage.googleapis.com/download/storage/v1/b/xxxxxxxx/o/skeletons_mip_1%2F0494.shard?alt=media
(which happens for composite objects), so client-side content integrity checking is not being performed.
I did some testing, and I do see a decrease of about 1 second in user time when downloading 32 10MB files with this PR. I got a similar improvement by removing CloudFiles' checksum code. I'll just accept your PR now, since the range requests are causing inappropriate logging from the Google checksum code.
The Google storage client has md5 checksum validation enabled by default. Since CloudFiles does checksum validation itself, we should disable the checksum in the download_as_bytes call. This reduces overall time from 5s to 4s when downloading 31 files, 10-20M each. For some reason, if I disable checksum validation in CloudFiles instead, there is no speed improvement.
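A minimal sketch of the change being discussed, assuming google-cloud-storage's Blob.download_as_bytes and its checksum, start, and end parameters; download_unchecked and md5_matches are hypothetical helper names, not CloudFiles APIs:

```python
import base64
import hashlib

def download_unchecked(bucket_name, blob_name, start=None, end=None):
    # Pass checksum=None so the Google client skips its own MD5
    # validation; CloudFiles validates the content separately.
    # start/end perform a range request, for which composite objects
    # return no MD5 anyway.
    from google.cloud import storage  # lazy import; needs google-cloud-storage
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.download_as_bytes(start=start, end=end, checksum=None)

def md5_matches(data: bytes, expected_b64: str) -> bool:
    # CloudFiles-style client-side validation: compare against the
    # base64-encoded MD5 that GCS reports in object metadata.
    digest = base64.b64encode(hashlib.md5(data).digest()).decode("utf-8")
    return digest == expected_b64
```

With validation done once in md5_matches, the duplicate hashing inside the Google client (and its "No MD5 checksum was returned" log spam on range requests) goes away.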