Change Instance-ID algorithm to BLAKE3

titusz commented 4 years ago

BLAKE3 turns out to be the ideal cryptographic hash for the Instance-ID. As stated by its developers, BLAKE3 is:

Much faster than MD5, SHA-1, SHA-2, SHA-3, and BLAKE2 (roughly 10x faster than SHA-256 in our tests).
Secure, unlike MD5 and SHA-1. And secure against length extension, unlike SHA-2.
Highly parallelizable across any number of threads and SIMD lanes, because it's a Merkle tree on the inside.
Capable of verified streaming and incremental updates, again because it's a Merkle tree.
A PRF, MAC, KDF, and XOF, as well as a regular hash.
One algorithm with no variants, which is fast on x86-64 and also on smaller architectures.

For details see: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
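
As a rough illustration of the incremental-update property listed above, here is a minimal sketch that hashes a byte stream chunk by chunk and checks that the result equals a one-shot hash. It assumes the third-party `blake3` Python package; if that is not installed, it falls back to the standard library's BLAKE2b only so the sketch still runs (BLAKE2b is not BLAKE3 and produces different digests). The name `instance_digest` is hypothetical and is not the ISCC Instance-ID function.

```python
import hashlib

try:
    # Third-party package (pip install blake3); assumed available, not in the stdlib.
    from blake3 import blake3 as new_hasher
except ImportError:
    # Stand-in so the sketch still runs: BLAKE2b is NOT BLAKE3.
    def new_hasher():
        return hashlib.blake2b(digest_size=32)


def instance_digest(chunks):
    """Hash an iterable of byte chunks incrementally and return the raw digest."""
    hasher = new_hasher()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.digest()


data = b"example file content" * 1000
chunked = [data[i:i + 4096] for i in range(0, len(data), 4096)]

one_shot = instance_digest([data])   # whole input in a single update
streamed = instance_digest(chunked)  # same input fed in 4 KiB chunks
```

Both calls yield the same 32-byte digest, so large files can be hashed without loading them fully into memory.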

lrosenthol commented 4 years ago

What about supporting multihash (https://multiformats.io/multihash/) which would (a) allow implementors to choose the right algorithm for their implementation and (b) support forward thinking (since all hashes will be broken at some point)?
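
For readers unfamiliar with the format, a multihash is a digest prefixed by two unsigned varints: the hash function code and the digest length. The following is a minimal sketch using only the standard library, with SHA-256 (code `0x12` in the multicodec table) as the example; the helper names `encode_varint` and `multihash` are illustrative and not the API of any multihash library.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash function code for sha2-256


def encode_varint(value):
    """Encode a non-negative integer as an unsigned varint (LEB128)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash(code, digest):
    """Prefix a raw digest with its function code and length."""
    return encode_varint(code) + encode_varint(len(digest)) + digest


mh = multihash(SHA2_256_CODE, hashlib.sha256(b"hello").digest())
# mh starts with 0x12 (function code) and 0x20 (32-byte length),
# followed by the 32 digest bytes: 34 bytes in total.
```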

titusz commented 4 years ago

Yes, we should think carefully about self-descriptiveness and forward compatibility. The ISCC is a composition of multiple hashes that can also be used separately if required. One way would be to give each component a 2-byte header in which we can encode the type, version, length, and eventually type-specific header information. Something like this (image: ISCC-Component-Structure)
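
To make the idea of a 2-byte component header concrete, here is a minimal bit-packing sketch. The field widths (4 bits each for type, version, length code, and flags) and the names `pack_header` and `unpack_header` are purely hypothetical; they are not taken from the ISCC specification, whose byte layout was still open at the time of this discussion.

```python
# Hypothetical layout (not the ISCC spec):
# bits 15-12 type | bits 11-8 version | bits 7-4 length code | bits 3-0 flags


def pack_header(ctype, version, length_code, flags=0):
    """Pack four 4-bit fields into a 2-byte big-endian header."""
    fields = (ctype, version, length_code, flags)
    if any(not 0 <= value <= 0xF for value in fields):
        raise ValueError("each header field must fit in 4 bits")
    word = (ctype << 12) | (version << 8) | (length_code << 4) | flags
    return word.to_bytes(2, "big")


def unpack_header(header):
    """Return (type, version, length code, flags) from the first 2 bytes."""
    word = int.from_bytes(header[:2], "big")
    return (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF


header = pack_header(ctype=2, version=1, length_code=3)
fields = unpack_header(header)
```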

lrosenthol commented 4 years ago

My point, though, @titusz, is that there is already a standard for this; see my link in the previous comment. There is no reason to reinvent the wheel.

titusz commented 4 years ago

@lrosenthol thank you for pointing this out. Adopting existing standards is indeed preferable where it makes sense. I am following the development of multiformats closely and have also been experimenting with multihash.

At the ISCC component level we currently have a 1-byte header plus 8-byte body structure. To conform to multihash we would need to add a minimum of 2 bytes of header data per component to indicate type and length (we also need to indicate version and subtype-specific flags). A multihash representation at the level of the full ISCC (4 components combined) might be a good idea.

Multihashes are presented in base16 (hex) encoding. For the printable representation, ISCC currently uses a more compact base58 encoding with a custom alphabet for human readability of the component type. So we would need to add the ISCC encoding to the multibase table and prepend another character per component. That brings us to at least 3 bytes of overhead per component while still missing the required version and subtype information.
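
The compactness argument can be checked with a small sketch that compares the hex and base58 text lengths of a 9-byte component (1-byte header plus 8-byte body). It assumes the common Bitcoin base58 alphabet, since the custom ISCC alphabet is not given in this thread, and the component bytes are made up for illustration.

```python
# Assumption: Bitcoin base58 alphabet, not the custom ISCC alphabet.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data):
    """Encode bytes as base58 text, keeping leading zero bytes as '1'."""
    number = int.from_bytes(data, "big")
    chars = []
    while number:
        number, remainder = divmod(number, 58)
        chars.append(ALPHABET[remainder])
    pad = len(data) - len(data.lstrip(b"\x00"))
    return ALPHABET[0] * pad + "".join(reversed(chars))


# Hypothetical component: 1-byte header followed by an 8-byte body.
component = bytes([0x10]) + bytes(range(1, 9))

hex_text = component.hex()        # 2 characters per byte
b58_text = b58encode(component)   # about 1.37 characters per byte
```

For 9 bytes the hex form takes 18 characters, while base58 needs at most 13, which is why every extra header byte matters for the printable code.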

Code compactness is a crucial design target for the ISCC. We have been collecting feedback on the expectations and requirements for the ISCC from a broad community. There are still some open questions on the final byte structure layout of the ISCC. We need to get those right in the first place. When that is stable there can be support for different printable encodings including multihash.

titusz commented 2 years ago

@lrosenthol I have opened an issue to make the next version of ISCC multiformats compatible: https://github.com/multiformats/multicodec/pull/252#issuecomment-996740862

iscc / iscc-specs

Change Instance-ID algorithm to BLAKE3 #87