laminlabs / lamindb-setup

Setup & configure LaminDB.
Apache License 2.0

Hashes for individual files in google cloud #748

Open Koncopd opened 2 months ago

Koncopd commented 2 months ago

UPath.stat for Google Cloud paths returns both "etag" and "md5Hash". We need to add this to https://github.com/laminlabs/lamindb-setup/blob/6bda3d8bc6c47c7707a79554149e8dc6a534e40f/lamindb_setup/core/upath.py#L745 There is some processing of these hashes, so I am not sure how to do this correctly: I don't know why this processing is needed at all, as I haven't worked with hashes. Right now we just ignore hashes for individual files on GCP, but not for folders, which is strange.

falexwolf commented 2 months ago

We first need to understand the difference between md5Hash and ETag. On AWS, for files below 50MB, the ETag is the md5 hash in hex representation.

Here, there seems to be a difference: Google uses a base64 representation.
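To make the hex vs. base64 difference concrete, here is a minimal sketch using only the Python standard library. It assumes that "md5Hash" is the base64 encoding of the raw 16-byte MD5 digest, and that the hex form is what we would compare against an AWS-style ETag. The helper names `md5_base64_to_hex` and `md5_hex_to_base64` are hypothetical and not part of lamindb-setup.

```python
import base64
import hashlib


def md5_base64_to_hex(md5_b64: str) -> str:
    """Convert a base64-encoded MD5 digest (assumed GCS "md5Hash" style) to hex."""
    return base64.b64decode(md5_b64).hex()


def md5_hex_to_base64(md5_hex: str) -> str:
    """Convert a hex MD5 digest (assumed S3 ETag style for simple uploads) to base64."""
    return base64.b64encode(bytes.fromhex(md5_hex)).decode("ascii")


# Both representations encode the same 16 raw digest bytes.
digest = hashlib.md5(b"hello world").digest()
hex_form = digest.hex()
b64_form = base64.b64encode(digest).decode("ascii")

print(hex_form)  # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(b64_form)  # XrY7u+Ae7tCTyyK7j1rNww==
```

If this assumption holds, the two representations are interchangeable, so the choice is mostly about which form we store consistently across providers.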

falexwolf commented 2 months ago

And yes: we need to store the hash for file-like artifacts on GCP.

Koncopd commented 2 months ago

For gcp it is described here https://cloud.google.com/storage/docs/hashes-etags

falexwolf commented 2 months ago

Oh, that's very interesting. I fear though that AWS doesn't support CRC32c. It looks better in every regard than md5...

Let's keep this issue open and see what we can do here in the future.

For the time being, we'd likely resort to md5.
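For reference, a rough sketch of what a CRC32C value looks like next to md5. This is a slow pure-Python implementation for illustration only, and a compiled library would likely be preferable in practice. It assumes the Google convention from the page linked above, i.e. the 4-byte checksum in big-endian order, base64 encoded. The function names `crc32c` and `crc32c_base64` are hypothetical.

```python
import base64


def _make_crc32c_table() -> list:
    """Build the lookup table for CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Compute the CRC32C checksum of data as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(data: bytes) -> str:
    """Encode the checksum as base64 of 4 big-endian bytes (assumed GCS convention)."""
    return base64.b64encode(crc32c(data).to_bytes(4, "big")).decode("ascii")


# Standard CRC32C check value for the ASCII string "123456789".
print(hex(crc32c(b"123456789")))  # 0xe3069283
print(crc32c_base64(b"123456789"))  # 4waSgw==
```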