multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License
338 stars 202 forks source link

Add sha256a #329

Closed oed closed 1 year ago

oed commented 1 year ago

This PR adds sha256a which is an associative hash function. It's defined as the sum of multiple sha2-256 hashes. You can read the specification here.

oed commented 1 year ago

Ping @rvagg :)

oed commented 1 year ago

What's the "content" in this case?

All of the data the sum of which is this hash.

If you used this in a CID, what would that mean?

Not sure, this is not the intention. Goal here is to use this as a way to have upgradability if we need to use another ahash function (using something else than sha256 under the hood)

if you broke up content, did a sha256 across each chunk, then combined them with sha256a then you'd have a "digest" that's valid for all possible ordered recombinations of those chunks.

This is actually the exact reason why we are using it (see our Recon spec, or the "Range-Based Set Reconciliation and Authenticated Set Representations” [arXiv:2212.13567] paper where the idea comes from)

aschmahmann commented 1 year ago

What's the "content" in this case?

All of the data the sum of which is this hash.

IIUC the distinction here is that most hash functions (i.e. all of the ones currently covered by multihash) take in bytes and spit out a hash. However, it seems like sha256a takes a list of bytes and spits out a hash which is a different abstraction.

While you could extract the part the part of sha256a that does bytes -> hash mapping, that would be the part that divides up a byte array into 32 bytes chunks and sums them up, which doesn't have much to do with sha256 which is largely what makes this awkward.

aschmahmann commented 1 year ago

@rvagg @vmx is the guidance in use for multihash still https://github.com/multiformats/multihash/tree/381cb3310e42d2263130781ab6fd559a46d40ab8/#non-cryptographic-hash-functions? Since collisions are an intentional part of the design of sha256a it seems like hash would be the tag to use (leaving aside the other issue of this not quite fitting the bytes -> digest abstraction hashes normally use).

rvagg commented 1 year ago

@aschmahmann yeah, it's a bit ambiguous in that it builds on a cryptographic hash function but then undoes some of the guarantees provided! The text in your link that's probably most relevant is:

are not suitable for content addressing systems

By that definition, if this isn't intended for use as a CID for addressing (acknowledging that there's even ambiguity in how CIDs get used in the wild, so even that isn't a clear demarcation!), then switching it to hash might be a better choice here to (a) avoid confusion and side-step ambiguity and (b) not set a problematic precedent, which is a common problem we have with the table.

Would you mind switching to the hash tag @oed? I think we'd be good with that.

oed commented 1 year ago

Makes sense, updated to hash. Thanks @rvagg @aschmahmann