multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License
336 stars 201 forks source link

Add xxHash #301

Closed j2ghz closed 1 year ago

j2ghz commented 1 year ago

I'm not too sure on the naming, sources:

So I went with xxh-32, xxh-64, xxh3-64, xxh3-128 to align with the formatting of the rest of the table

vmx commented 1 year ago

Multihash is about "well-established cryptographic hash functions", as non-cryptographic hash functions are not suitable for content-addressing. xxHash is a non-cryptographic one. So I don't think it belongs to the table (at least not for multihashes).

@j2ghz what was your intended use for the multicodec code?

j2ghz commented 1 year ago

I'm building a tool that will use chunking, hashing and delta patching to minimize the amount of data that has to be transferred to replicate a file structure, especially as it's updated over time, with main focus on game mods. I chose XXHash for its speed and a small hash size (makes a big difference to the metadata size), and I don't think there's a benefit in using a cryptographic hash function in my case (not p2p/distributed, but centralized).

I missed the part about multihash focusing on cryptographic hash functions only, I guess I'll have to define it in the private use range. The reason I wanted to use multihash was as stated in the readme: "It is useful to write applications that future-proof their use of hashes, and allow multiple hash functions to coexist.". If I need to change the hash in the future, in a backwards compatible way, or others want to write other client or server implementations, multihash simplifies that.

rvagg commented 1 year ago

@vmx I wonder if we shouldn't be a bit more flexible with this, or perhaps even introduce a hash label? The problem is that we already have non-cryptographic hashes in here--murmur3* (and also identity) and we have plenty of insecure ones (md4, md5, sha1). Now we have a second request for a non-cryptographic @ #303.

If people know what they are doing, then using multihash to represent non-cryptographic hashes could have good use. Maybe we can compromise by making sure non-cryptographic entries say that in their comments section.

I would still recommend using a cryptographic hash, including in the case outlined here, but it's not my app and I understand the rationale here.

vmx commented 1 year ago

@rvagg Agreed, I thought the same. Having also non-cryptographic hashes within the multiformats ecosystem makes sense. I think they need to be in separate categories. We need to find a way to make sure we don't end up with non-cryptographic hashes in CIDs (with identity hash being the exception). But how would this be spec'ed out? Would we say only hashes in the "multihash" category are allowed in CIDs? But wouldn't "multihash" then be a kind of confusing name? Originally mulihashes were meant for content-addressing, but as those issues show, we now have cases beyond that. Can instead all hashes (also non-cryptographic one) be multihashes? Then perhaps we we should have a category called "cryptographic hash", "content-addressed hash", "ATM likely non-colliding hash", "secure hash", or "CID hash", and only allow those in CIDs (similar to what we should do with codecs, where we haven't been good with creating clear categories/spec'ing it out)?

IS4Code commented 1 year ago

Adding my own thoughts (since I proposed CRC32); I think all hashes/digests/checksums are suitable for the multihash category and should be inherently usable as multihashes (unless a particular service has specific requirements), since they are all used in the same way (get bytes, digest bytes, output bytes). What is different is how suitable they are in different environments: CRC32 and XXH32 would be fine for content-addressing in very small centralized environments (I actually had such a usage in mind when proposing CRC32), but obviously are very bad for global addressing. The same could be said for MD5 however, being a broken cryptographic hash.

If a particular service uses and validates multihashes, it must also have a list of supported algorithms, so it can decide which are acceptable. It could be based on the output size (20 bytes seems like a good dividing line), but having a sub-category for different strengths or intended purposes is also a good idea, so people don't pick a weak one by mistake. Perhaps the "description" field should be required to specify that?

rvagg commented 1 year ago

Notes from public IPLD sync meeting today:

rvagg commented 1 year ago

https://github.com/multiformats/multihash/pull/156 is merged, this just needs to be have the labels renamed to hash and we can probably merge it. Over to you @j2ghz.

j2ghz commented 1 year ago

Updated, should be ready