starlinglab / authenticated-attributes

Authenticated Attributes project by the Starling Lab
MIT License

Hash Large Files Efficiently [Nice to have] #2

Open katelynsills opened 1 year ago

katelynsills commented 1 year ago

Some of the files that Starling archives are >65GB videos. Getting a CID through the usual mechanism for these large files is likely very slow. We may want to change the CID settings to use Blake3 rather than SHA256 on the UWAZI side, so that the end user can get a CID more quickly and the user experience improves.

On the Starling ingestion side, this would mean that Starling would need to produce the usual SHA256 CIDv1 as well as the Blake3 version, and put both in the Starling Hyperbee. It would be something like the following, where the brackets are replaced with the actual hashes:

key: [Blake3CID]/SHA256CID
value: [SHA256CID]

The attribute values for the entity will remain keyed by the usual SHA256CID.

This would require two lookups to get the usual metadata: one to resolve the Blake3 CID to the SHA256 CID, and a second to fetch the attributes keyed by the SHA256 CID. It also requires the user to trust Starling to have associated the Blake3 hash with the SHA256 hash correctly. However, for a casual user who simply wants to view the metadata, this is likely the most efficient approach.
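To make the two-step lookup concrete, here is a minimal sketch in Python. It uses a plain dict as a stand-in for the Hyperbee, and it assumes the attribute values can be fetched directly under the SHA256 CID; the real key layout for attributes may differ. The function names and placeholder CIDs are hypothetical, for illustration only.

```python
from typing import Optional


def mapping_key(blake3_cid: str) -> str:
    # Key layout proposed above: "[Blake3CID]/SHA256CID".
    return f"{blake3_cid}/SHA256CID"


def lookup_attributes(db: dict, blake3_cid: str) -> Optional[dict]:
    """Resolve a Blake3 CID to its attributes using two lookups.

    `db` is a plain dict standing in for the Hyperbee (an assumption).
    """
    # Lookup 1: Blake3 CID -> SHA256 CID. The user has to trust that
    # Starling recorded this association correctly.
    sha256_cid = db.get(mapping_key(blake3_cid))
    if sha256_cid is None:
        return None
    # Lookup 2: SHA256 CID -> attribute values (hypothetical layout).
    return db.get(sha256_cid)
```

A user who needs stronger guarantees could still compute the SHA256 CID themselves and skip the first lookup, at the cost of the slower hash.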

Methods

Both IPFS Kubo and the js-multiformats library have options to use hash functions other than SHA256.
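As a rough illustration of what changing the hash function means at the CID level, here is a sketch in Python (standard library only) that hashes a stream in fixed-size chunks and wraps the digest in a CIDv1 with the raw codec. This is not what Kubo does by default: Kubo chunks large files into a DAG, so its CID for the same file will generally differ. The helper names are hypothetical. The SHA256 path uses hashlib; for Blake3, one would pass a hasher with the same update/digest interface (for example from the third-party `blake3` package, which is an assumption here and is not imported).

```python
import base64
import hashlib
import io
from typing import BinaryIO

RAW_CODEC = 0x55  # multicodec code for raw bytes
SHA2_256 = 0x12   # multihash code for sha2-256
BLAKE3 = 0x1E     # multihash code for blake3


def stream_digest(stream: BinaryIO, hasher, chunk_size: int = 1 << 20) -> bytes:
    """Feed the stream to the hasher chunk by chunk, so the whole
    file never has to fit in memory."""
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
    return hasher.digest()


def cid_v1_raw(hash_code: int, digest: bytes) -> str:
    """Build a base32 CIDv1 string (raw codec) from a hash code and digest."""
    # Codes and lengths below 0x80 encode as a single varint byte,
    # which keeps this sketch simple.
    if hash_code >= 0x80 or len(digest) >= 0x80:
        raise ValueError("multi-byte varints are not handled in this sketch")
    cid_bytes = bytes([0x01, RAW_CODEC, hash_code, len(digest)]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # "b" is the multibase prefix for lowercase base32


# Usage with an in-memory stream standing in for a large video file.
demo = io.BytesIO(b"example video bytes")
sha_cid = cid_v1_raw(SHA2_256, stream_digest(demo, hashlib.sha256()))
```

With a Blake3 hasher, the same call would be `cid_v1_raw(BLAKE3, stream_digest(f, hasher))`, producing the second CID that Starling would store alongside the SHA256 one.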

benhylau commented 1 year ago

@makew0rld can you also look into Iroh? https://iroh.computer/design/iroh/#content-addressed-blobs

It is already using Blake3 and avoids the "lack of canonical CID-ing" issue we face with existing IPFS CIDs.

makew0rld commented 1 year ago

Part of the idea of using the CIDs is to make it easy to pull data off of the IPFS network, such as a WACZ file that all the attestations are about. We could use some sort of Iroh ID, but unless we are switching to storing all our data on Iroh, I'm not sure it would be worth it. And from my limited knowledge, Iroh is not ready enough for us to switch to it completely right now.