seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License

Incorrect (I think?) MD5IntegrityError #105

Open jasper-tms opened 1 month ago

jasper-tms commented 1 month ago

Hi Will,

I used cloudvolume to upload a simple greyscale image volume in precomputed format to google cloud, as I've done a million times. The upload seemed to succeed without issue. But if I try to download the data from google cloud using cloudvolume, I get a scary error:

In [1]: from cloudvolume import CloudVolume
/home/phelps/.virtualenvs/cloudvolume/lib/python3.10/site-packages/python_jsonschema_objects/__init__.py:113: UserWarning: Schema id not specified. Defaulting to 'self'
  warnings.warn("Schema id not specified. Defaulting to 'self'")

In [2]: vol = CloudVolume('gs://lee-lab_brain-and-nerve-cord-fly-connectome/templates/JRC2018_FEMALE.ng')
Using default Google credentials. There is no ~/.cloudvolume/secrets/google-secret.json set.

In [3]: im = vol[:]
Downloading:   0%|                                                                                                                | 0/2496 [00:01<?, ?it/s]
---------------------------------------------------------------------------
MD5IntegrityError                         Traceback (most recent call last)
Cell In[3], line 1
----> 1 im = vol[:]

File ~/.virtualenvs/cloudvolume/lib/python3.10/site-packages/cloudvolume/frontends/precomputed.py:551, in CloudVolumePrecomputed.__getitem__(self, slices)
    548 channel_slice = slices.pop()
    549 requested_bbox = Bbox.from_slices(slices)
--> 551 img = self.download(requested_bbox, self.mip)
    552 return img[::steps.x, ::steps.y, ::steps.z, channel_slice]

......

File ~/.virtualenvs/cloudvolume/lib/python3.10/site-packages/cloudfiles/cloudfiles.py:423, in CloudFiles.get.<locals>.download(path)
    421 if start is None and end is None:
    422   if server_hash_type == "md5":
--> 423     check_md5(path, content, server_hash)
    424   elif server_hash_type == "crc32c":
    425     check_crc32c(path, content, server_hash)

File ~/.virtualenvs/cloudvolume/lib/python3.10/site-packages/cloudfiles/cloudfiles.py:393, in CloudFiles.get.<locals>.check_md5(path, content, server_hash)
    390 computed_md5 = md5(content)
    392 if computed_md5.rstrip("==") != server_hash.rstrip("=="):
--> 393   raise MD5IntegrityError("{} failed its md5 check. server md5: {} computed md5: {}".format(
    394     path, server_hash, computed_md5
    395   ))

MD5IntegrityError: 380_380_380/0-64_0-64_0-64 failed its md5 check. server md5: I2RXOQeR8uEbpP1FfLfFPA== computed md5: tMYajCH5Z0AfeDH2Y+Q+ig==

I've never seen this before. I tried re-uploading the dataset and got the same problem, so I don't think it was a failed upload or corrupted data. The dataset also loads into neuroglancer just fine, and I can download the files with a gcloud storage cp command just fine. So I suspect that the issue may not actually be with the files but with how cloudfiles is attempting to validate the checksum. Not sure if it's relevant, but the specific cube that triggers the error is in fact all black (pixel values all 0) and is the top-left-most block in the dataset.

Do you have any idea what could be going on here? Can you reproduce the issue if you try to load this exact volume into memory via cloudvolume?

Thanks a lot!
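For reference, the comparison that fails (per the traceback) boils down to hashing the bytes the client received and comparing base64-encoded md5 digests with the padding stripped. A rough stdlib sketch of that check (not the actual cloudfiles code):

```python
import base64
import hashlib

def md5_b64(data: bytes) -> str:
    """Base64-encoded md5 digest, the format GCS reports and the error shows."""
    return base64.b64encode(hashlib.md5(data).digest()).decode()

def check_md5(server_hash: str, content: bytes) -> bool:
    # Mirrors the traceback: strip '=' padding before comparing.
    return md5_b64(content).rstrip("=") == server_hash.rstrip("=")

print(check_md5(md5_b64(b"hello"), b"hello"))  # True
print(check_md5(md5_b64(b"hello"), b"world"))  # False: raises MD5IntegrityError in cloudfiles
```

So the question is which bytes each side is hashing: the server-reported hash and the client-computed hash must be over the same representation of the object.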

william-silversmith commented 1 month ago

Oh wow, I can reproduce this error on the download side. I'll have to investigate. This is pretty weird.

nkemnitz commented 1 month ago

I ran into this issue when writing jpg/png images with CloudVolume while having gzip compression explicitly turned on (overriding the CloudVolume default for jpg/png). Maybe it's the same here? Jasper's images are jpg, and the remote objects have a custom header X-Goog-Stored-Content-Encoding: gzip.

jasper-tms commented 1 month ago

Yes, that's right, thanks Nico – I have a little library for format conversion, npimage, to which I just added support for saving as neuroglancer precomputed via cloudvolume. If the user asks for npimage.save(array, 'gs://bucket/path', compress=True) (or compress='lossy'), then I do give them both jpeg and gzip:

https://github.com/jasper-tms/npimage/blob/324a6d4ed2e98310e779001da86019a0d9f6b8c1/npimage/core.py#L268-L270

I thought this was the default behavior of cloudvolume, but perhaps it's not and I've ended up in a rarely used corner case? Are jpeg-encoded precomputed volumes typically not gzipped?

nkemnitz commented 1 month ago

Correct: for compressed image file formats such as JPG and PNG, gzip should not be necessary, because compression is already part of the format. (PNG uses the same compression algorithm as gzip, namely DEFLATE; and JPG uses Huffman coding, which is also a component of DEFLATE.) CloudVolume determines whether or not gzip compression is required here: https://github.com/seung-lab/cloud-volume/blob/master/cloudvolume/datasource/precomputed/common.py#L12-L19
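To illustrate why gzipping an already-compressed format buys nothing: high-entropy data (which is what JPEG's Huffman-coded output looks like) does not shrink under DEFLATE. A quick stdlib demo, using random bytes as a stand-in for a JPEG payload:

```python
import gzip
import os

# Random bytes approximate already-compressed (high-entropy) JPEG data.
jpeg_like = os.urandom(100_000)
recompressed = gzip.compress(jpeg_like)

# gzip cannot shrink incompressible input; it typically *grows* it
# slightly due to the gzip header, trailer, and block framing.
print(len(recompressed) / len(jpeg_like))  # ratio ~1.0, no savings
```

So layering gzip on top of jpeg costs a decompression pass on every read for essentially zero space saved.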

Still, somewhere CloudFiles seems to compare gzipped with ungzipped checksums for these "double compressed" files?
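If that's what's happening, the mismatch is guaranteed: GCS reports the md5 of the stored (gzipped) bytes, while a client that hashes transparently decompressed bytes computes a different digest. A small local simulation of that scenario (assumed mechanism, not traced through the actual cloudfiles code path):

```python
import base64
import gzip
import hashlib

def md5_b64(data: bytes) -> str:
    return base64.b64encode(hashlib.md5(data).digest()).decode()

# An all-zero chunk like the 64x64x64 block in the traceback.
payload = b"\x00" * (64 * 64 * 64)

# What the server stores when Content-Encoding: gzip is set.
stored = gzip.compress(payload, mtime=0)

server_hash = md5_b64(stored)   # server hashes the stored (gzipped) bytes
client_hash = md5_b64(payload)  # client hashes the decompressed bytes
print(server_hash == client_hash)  # → False: checksums can never agree
```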

jasper-tms commented 1 month ago

Great, I skipped gzipping when using jpeg encoding and that got rid of the checksum errors from cloudfiles. Thanks for the input, Nico.

If it really doesn't make sense to ever gzip when in jpeg encoding, you might think of enforcing that on the cloudvolume side, Will. (There still remains the question of why cloudfiles is getting confused about the checksums for these double-compressed files, but if you update cloudvolume to refuse to make such files, the problem is probably 90% solved in practice.)

Feel free to close this issue or not depending on whether you think you'll try to dive in and fix the checksum bug.

william-silversmith commented 1 month ago

I wonder if this is a bug in Google's library? I played around with this and it seems like blob.download_as_bytes(start=start, end=end, raw_download=True, checksum=None) is not respecting raw_download=True. The bytes returned are not gzip-encoded, so they must have been decompressed already, in which case the md5 check against the server's hash of the stored bytes is guaranteed to fail.