seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License

S3 Interface: Multipart ETags(?) causing checksum mismatch #49

Closed: nkemnitz closed this issue 3 years ago

nkemnitz commented 3 years ago

Some of the files on the Cloudian (and probably other S3 backends) seem to have some kind of multipart identifier attached to the ETag - which is weird by itself, because I thought I didn't upload them as multipart... Anyway, the suffix confuses the checksum calculation.

Example:

from cloudfiles import CloudFiles

cf = CloudFiles("matrix://fafbv14-em/aligned/v1/16_16_40")
x = cf["129.shard"]

has an ETag (checksum) of cd8d2616dfa6cc80a06a846d3b3f6f30-14. ~Ignoring the -14 part results in a match with the expected checksum.~ Was wrong about that.

nkemnitz commented 3 years ago

Yeah, somehow those files must have been uploaded as multipart... not sure why. ETag calculation is a bit annoying: the number after the dash is the number of parts in that object's multipart upload. With that and the size of the full object, one can guess the size of each part, calculate the MD5 of each part, and then compute the final combined MD5 (the MD5 of the concatenated part digests).

E.g.: https://teppen.io/2018/10/23/aws_s3_verify_etags/
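For reference, a minimal sketch of that calculation in Python. It assumes the backend follows the commonly described S3 scheme (MD5 over the concatenated binary MD5 digests of the parts, followed by a dash and the part count) and that all parts except the last have the same size. The function names and the list of candidate part sizes are illustrative assumptions, not part of CloudFiles, and objects encrypted server-side may not follow this scheme at all.

```python
import hashlib
import math

# Part sizes (in bytes) to try when guessing. This list is an assumption and
# is not exhaustive; the uploading tool decides the real value.
COMMON_PART_SIZES = [5 * 2**20, 8 * 2**20, 15 * 2**20, 16 * 2**20, 64 * 2**20]


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute an S3-style multipart ETag for data split into fixed-size parts."""
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    # The combined hash covers the raw (binary) part digests, not their hex form.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


def matches_multipart_etag(data: bytes, etag: str, part_sizes=COMMON_PART_SIZES) -> bool:
    """Return True if some candidate part size reproduces the given ETag."""
    etag = etag.strip('"')  # ETags are often returned wrapped in double quotes
    if "-" not in etag:
        raise ValueError("not a multipart ETag: " + etag)
    num_parts = int(etag.rpartition("-")[2])
    for part_size in part_sizes:
        # Skip part sizes that cannot produce the advertised number of parts.
        if math.ceil(len(data) / part_size) != num_parts:
            continue
        if multipart_etag(data, part_size) == etag:
            return True
    return False
```

If none of the candidate part sizes reproduces the ETag, the content is not necessarily corrupt; the uploader may simply have used a part size that is not in the list.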

william-silversmith commented 3 years ago

That's odd, if these were created using CloudFiles. I wasn't aware it could even do that. Did you use s3cli or gsutil? Those can sometimes automatically switch to multipart.

nkemnitz commented 3 years ago

I am fairly sure that I did those with CloudVolume - it's the sharded version of Flywire. But I had to rework some files a while ago - maybe I used gsutil there, don't remember.

The real issue is that I can't read them with CloudFiles, though. For now, I set etag = None in interfaces.py for those ETags that contain the multipart suffix, so that my transfer finishes.
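Roughly, the workaround looks like the sketch below. This is not the actual code in interfaces.py; `is_multipart_etag` and `check_md5` are hypothetical helpers that only illustrate the idea of skipping the plain MD5 comparison when the ETag carries a `-N` suffix.

```python
import hashlib
import re

# 32 hex characters, a dash, then the part count, e.g. "cd8d...6f30-14".
MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-\d+$")


def is_multipart_etag(etag) -> bool:
    """Hypothetical helper: detect the multipart suffix on an ETag."""
    if etag is None:
        return False
    return MULTIPART_ETAG.match(etag.strip('"').lower()) is not None


def check_md5(content: bytes, etag) -> bool:
    """Hypothetical helper: validate content against a plain MD5 ETag.

    Returns True when the check passes or has to be skipped (no ETag, or a
    multipart ETag that a plain MD5 of the content cannot match).
    """
    if etag is None or is_multipart_etag(etag):
        return True  # skipped, not verified
    return hashlib.md5(content).hexdigest() == etag.strip('"').lower()
```

Skipping the check means those objects are downloaded without any integrity verification, so this is only a stopgap until multipart ETags are handled properly.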

william-silversmith commented 3 years ago

I'm surprised CF can download the multipart files without special logic. I was considering the multi-part logic in #7. I guess downloads are a lot easier.

william-silversmith commented 3 years ago

In any case, this is a good feature. I'll look into it some more. Thanks for the test case and info.