multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License

Re-label Murmur3 as non-cryptographic hash functions #310

Status: Closed (IS4Code closed 1 year ago)

IS4Code commented 1 year ago

identity and Murmur3 hashes changed to hash, per multiformats/multihash#157.

rvagg commented 1 year ago

ooo, identity, I hadn't thought about that; the problem with changing that is that we do use it in CIDs and it's neither a cryptographic nor a noncryptographic hash.

I don't think we should be changing it. Maybe we should special-case this somewhere, perhaps in the multihash README / spec.

@vmx what's your take on this?

rvagg commented 1 year ago

oh, @vmx requested it in https://github.com/multiformats/multihash/issues/157! surprising.

So a problem with doing this is that it will end up being special-cased somewhere in the stack. go-multicodec does codegen from the table, so identity will no longer show up as a valid "multihash" for use with CIDs, and that's probably going to need a special case.

I think I'm -1 on changing it since this doesn't improve the situation, "identity" isn't a hash, it's not a one-way function, it's a special-case.
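The special-casing concern can be illustrated with a minimal sketch. It is not go-multicodec's actual codegen: the function `cid_hash_codes`, the `special_cases` parameter, and the table excerpt are hypothetical, and the status values and the murmur3-32 code are illustrative and should be checked against table.csv. It only shows that a consumer filtering on the `multihash` tag would drop identity once it is relabelled, unless it adds an explicit exception.

```python
import csv
import io

# Illustrative excerpt in the shape of the multicodec table (assumption:
# real rows also carry a description column, and codes should be verified).
TABLE_EXCERPT = """\
name,tag,code,status
identity,multihash,0x00,permanent
sha2-256,multihash,0x12,permanent
murmur3-32,hash,0x23,draft
"""


def cid_hash_codes(table_csv, special_cases=frozenset()):
    """Return {name: code} for rows a CID library would accept as hashes.

    Hypothetical helper: accepts rows tagged "multihash", plus any names
    listed explicitly in special_cases.
    """
    rows = csv.DictReader(io.StringIO(table_csv))
    return {
        row["name"]: int(row["code"], 16)
        for row in rows
        if row["tag"] == "multihash" or row["name"] in special_cases
    }


# Simulate the proposed relabelling of identity from "multihash" to "hash".
relabelled = TABLE_EXCERPT.replace("identity,multihash", "identity,hash")

before = cid_hash_codes(TABLE_EXCERPT)
after = cid_hash_codes(relabelled)
patched = cid_hash_codes(relabelled, special_cases={"identity"})
```

With the relabelled table, `after` no longer contains identity, and only the explicit `special_cases` argument brings it back in `patched`.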

vmx commented 1 year ago

> I think I'm -1 on changing it since this doesn't improve the situation, "identity" isn't a hash, it's not a one-way function, it's a special-case.

That's fine with me. I just thought "hash" is better than "multihash" even if it's wrong, but that might not be true. I agree with @rvagg here, so let's leave "identity" just as it is, in this weird special-case state. Sorry @IS4Code for the extra work, but please change it back.

ribasushi commented 1 year ago

Contrary opinion: we use multihash as a (bad) synonym for cryptographic hash or more specifically in the cid use case collision-resistant-hash. In other words we use the multihash column as a proxy for "is this at all adversary resistant?". In this context identity gives the strongest possible guarantee: there are no possible "collisions" to be found now or in the future against such "content addressing".

Leaving aside what multihash means: in the context of the table being edited, identity and murmur could not be further apart on the safety spectrum.

vmx commented 1 year ago

> Contrary opinion: we use multihash as a (bad) synonym for cryptographic hash or more specifically in the cid use case collision-resistant-hash. In other words we use the multihash column as a proxy for "is this at all adversary resistant?". In this context identity gives the strongest possible guarantee: there are no possible "collisions" to be found now or in the future against such "content addressing".

The question is not whether identity hash is cryptographic or not, it's whether it's a hash function or not. There are many definitions out there, but let's take the one from Wikipedia:

> A hash function is any function that can be used to map data of arbitrary size to fixed-size values.

In the identity hash case it doesn't hash to a fixed-size value, it depends on the input. => not a hash function.
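The size argument can be checked with a short sketch. It assumes the multihash layout of varint code, varint digest length, then digest, with 0x00 for identity and 0x12 for sha2-256; the helper names `varint` and `multihash` are made up for this example and are not a library API.

```python
import hashlib

IDENTITY = 0x00  # multicodec code for identity
SHA2_256 = 0x12  # multicodec code for sha2-256


def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def multihash(code: int, data: bytes) -> bytes:
    """Build <varint code><varint digest length><digest> for two codes."""
    if code == IDENTITY:
        digest = data  # identity: the "digest" is the input itself
    elif code == SHA2_256:
        digest = hashlib.sha256(data).digest()
    else:
        raise ValueError(f"unsupported code: {code:#x}")
    return varint(code) + varint(len(digest)) + digest


short, long_ = b"hi", b"x" * 1000
sha_lengths = [len(multihash(SHA2_256, d)) for d in (short, long_)]
identity_lengths = [len(multihash(IDENTITY, d)) for d in (short, long_)]
print(sha_lengths)       # [34, 34]
print(identity_lengths)  # [4, 1003]
```

The sha2-256 multihash stays at 34 bytes for both inputs, while the identity multihash grows with the input, which is the property the Wikipedia definition rules out.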

ribasushi commented 1 year ago

It's subtle. Focus on this part:

> In other words we use the multihash column as a proxy for "is this at all adversary resistant?"

vmx commented 1 year ago

> In other words we use the multihash column as a proxy for "is this at all adversary resistant?"

That depends on what you mean by "adversary resistant". If you mean "collision free", that's true. But for a hash function you usually also take preimage resistance into account, which means that you cannot get the original input data from the hash. For identity that's not the case: the digest is the input.

ribasushi commented 1 year ago

Preimage resistance in the context of CIDs is moot: their overwhelming purpose is to point to the original content, which is in turn discoverable in a location-less manner from multiple parties.

vmx commented 1 year ago

> Preimage resistance in the context of CIDs is moot

Multihashes are not only used for CIDs.

IS4Code commented 1 year ago

> In the identity hash case it doesn't hash to a fixed-size value, it depends on the input. => not a hash function.

Not disagreeing in general, but a multihash itself is also technically not a fixed-size value, or at least it shouldn't be treated as such, although its size does not depend on the size of the data. I've also seen situations where hashing individual large chunks of a file and concatenating the results could be treated as a hash of the whole file, despite being proportional to the original size.
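A minimal sketch of that chunk-and-concatenate scheme, assuming SHA-256 per chunk and a fixed chunk size; the function name `chunked_digest` is hypothetical and not taken from any particular tool:

```python
import hashlib


def chunked_digest(data: bytes, chunk_size: int) -> bytes:
    """Hash each fixed-size chunk with SHA-256 and concatenate the digests."""
    return b"".join(
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    )


one_chunk = chunked_digest(b"abcd", chunk_size=4)           # 1 chunk
three_chunks = chunked_digest(b"abcdefghij", chunk_size=4)  # 3 chunks
print(len(one_chunk), len(three_chunks))  # 32 96
```

The output length is 32 bytes per chunk, so it grows with the input even though every piece of it is an ordinary fixed-size hash.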

BigLep commented 1 year ago

2023-01-24 IPLD maintainer conversation: can we push ahead without identity (to avoid scope creep) and deal with that as a separate issue?

rvagg commented 1 year ago

just noticed that identity was removed from here, so this is merged now, thanks!