seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License

feat: perform md5 integrity checks #16

william-silversmith closed 4 years ago

william-silversmith commented 4 years ago

A small download test seemed to show no significant adverse performance impact from this computation.

One important ease-of-use consideration is how to handle strings that are passed in. Do we require strictly binary input to CloudFiles, or can we assume strings are utf8? If strings are not utf8, a simple remedy for the end user is to encode them to bytes before passing them to CF. However, Python 2.7 is trickier because str and bytes are the same type, so text is not as easily detected. We could simply disable this check for py27 or drop support.
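A minimal sketch of the "assume utf8" option, for Python 3 only. The helper name `to_bytes` is hypothetical and not part of CloudFiles; it only illustrates normalizing input to bytes before hashing or uploading.

```python
def to_bytes(content, encoding="utf-8"):
    """Return content as bytes; text is encoded with the given encoding.

    Hypothetical helper for illustration, not the CloudFiles API.
    On Python 3, str and bytes are distinct types, so the check is easy.
    """
    if isinstance(content, bytes):
        return content
    if isinstance(content, str):
        return content.encode(encoding)
    raise TypeError(f"expected str or bytes, got {type(content).__name__}")
```

A user whose text is not utf8 could call `content.encode(...)` with their own encoding first, in which case the helper passes the bytes through unchanged.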

Should we allow disabling the md5 checks? If a problem with them develops, an off switch would make it easier for users to work around it.

william-silversmith commented 4 years ago

The problem motivating this PR is that sometimes in cloudvolume when using fill_missing, erroneous black tiles appear. This is most likely caused by a server returning a success (200) response with a missing body. An md5 check could help if the metadata is not corrupted at the same time as the data. In most object storage systems, metadata is stored separately from data, so there is a chance we might detect some kinds of failures. The black tile issue is hard to replicate so we're kind of shooting in the dark.
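A rough sketch of the kind of check being discussed, assuming the server-side metadata exposes a base64-encoded MD5 digest (as GCS does). The names `md5_base64`, `verify_md5`, `IntegrityError`, and the `check` flag are all hypothetical and chosen for illustration; this is not the code in this PR.

```python
import base64
import hashlib


class IntegrityError(Exception):
    """Raised when downloaded content does not match the expected digest."""


def md5_base64(content: bytes) -> str:
    # Base64 of the raw 16-byte MD5 digest, not of the hex string.
    return base64.b64encode(hashlib.md5(content).digest()).decode("ascii")


def verify_md5(content: bytes, expected_md5, check: bool = True) -> bytes:
    """Return content if it matches expected_md5, else raise IntegrityError.

    The check is skipped when disabled or when no digest is available.
    """
    if not check or expected_md5 is None:
        return content
    actual = md5_base64(content)
    if actual != expected_md5:
        raise IntegrityError(f"md5 mismatch: expected {expected_md5}, got {actual}")
    return content
```

Under these assumptions, a 200 response with an empty body would fail the comparison as long as the stored digest describes the real, non-empty object.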

william-silversmith commented 4 years ago

https://cloud.google.com/storage/docs/hashes-etags#_MD5

william-silversmith commented 4 years ago

Experiment with a composite object:

With only download_as_string:

blob.component_count
>>> None
blob.md5_hash
>>> None
blob.etag
>>> None

With get_blob:

blob.component_count
>>> 3
blob.etag
>>> CNP9to3NlOsCEAE=
from cloudfiles.lib import md5
md5(content)
>>> gD660sBzjs3iiz7frBjubw==

For composite objects, the etag is not an md5 of the content.
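One way to cope with this, sketched under the assumption that the metadata's md5 and crc32c fields have already been read into plain values (the GCS documentation linked above says composite objects carry a crc32c but no MD5). The function name `pick_checksum` is hypothetical; the etag is deliberately never used as a digest.

```python
def pick_checksum(md5_hash, crc32c):
    """Choose which stored digest to verify against.

    Returns a (kind, value) pair. Prefers md5 when present, falls back to
    crc32c (e.g. for composite objects), and returns (None, None) when no
    digest is available so the caller can skip the check.
    """
    if md5_hash is not None:
        return ("md5", md5_hash)
    if crc32c is not None:
        return ("crc32c", crc32c)
    return (None, None)
```

Verifying a crc32c would need a third-party implementation, since the Python standard library does not provide one.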