multiformats / multihash

Self describing hashes - for future proofing
https://multiformats.io/multihash/
MIT License

Specify Multihash without a specific binary encoding #160

Closed: vmx closed this issue 1 year ago

vmx commented 1 year ago

In the current spec, Multihashes are tied to a specific binary encoding. I propose splitting the Multihash spec into a definition of the values it describes and a default binary encoding.

The description of the values would talk about the hash type, size and the actual digest. It would be independent of how it is represented. For example, rust-multihash supports encoding a multihash using the SCALE codec, which is not the default binary encoding of the Multihash.

There would then be a default binary encoding (as it is today) with the varints.
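To make the split concrete, here is a minimal Python sketch of the two layers: an abstract value (code, size, digest) and the default binary encoding with unsigned varints. The names `Multihash`, `to_default_bytes` and `from_default_bytes` are made up for illustration and are not taken from any multihash library; the 9-byte varint limit from the unsigned-varint spec is left out for brevity.

```python
import hashlib
from typing import NamedTuple, Tuple


class Multihash(NamedTuple):
    """Abstract multihash value; says nothing about byte layout (hypothetical type)."""
    code: int      # hash function code, e.g. 0x12 for sha2-256
    size: int      # digest length in bytes
    digest: bytes  # the raw digest


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position of the next unread byte)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def to_default_bytes(mh: Multihash) -> bytes:
    """Default binary encoding: varint(code) + varint(size) + digest."""
    return encode_varint(mh.code) + encode_varint(mh.size) + mh.digest


def from_default_bytes(buf: bytes) -> Multihash:
    code, pos = decode_varint(buf)
    size, pos = decode_varint(buf, pos)
    digest = buf[pos:pos + size]
    if len(digest) != size:
        raise ValueError("truncated digest")
    return Multihash(code, size, digest)


mh = Multihash(0x12, 32, hashlib.sha256(b"hello").digest())
raw = to_default_bytes(mh)
print(raw[:2].hex())  # 1220: sha2-256 code, then a length of 32 bytes
```

Under the proposal, only `Multihash` would be "the spec"; `to_default_bytes` would be one encoding among several possible ones.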

rvagg commented 1 year ago

Seems fine to me I think; at least I can't think of a reason to not loosen the definition.

Aside: what is SCALE buying in rust-multihash? If I understand it correctly, it's going to take more bytes (the int encoding appears to be less efficient than varint) and not provide any real benefits for the combination of ints and byte array. Is someone in that ecosystem storing multihashes in a way that just a full binary string isn't suitable?

vmx commented 1 year ago

I don't know what people get from using SCALE for multihashes. It might be that they are using SCALE anyway and it is more convenient. E.g. if you have some structure that contains a multihash, you can encode the whole thing with SCALE.

aschmahmann commented 1 year ago

I'm curious about the value of calling a different binary encoding of the data a multihash as opposed to something that's convertible to/from multihash.

While I'd certainly rather people use the abstract multihash format (code, length, digest) than something like inventing new codes, making the definition of multihash abstract adds cognitive overhead to the spec and to the people who work with it. I.e. every definition of base64-encoded-multihash now needs to be replaced with something like base64-encoded-multihash-original-format. There's also some self-description lost, as multihash now requires more out-of-band information to read than it did previously.

If this is something there is demand for it might make more sense to leave multihash defined as it is, but call out that there exist alternate formats (which would be in need of alternate names and specs to define them) that abstractly match multihash and are convertible to/from it.

vmx commented 1 year ago

> i.e. every definition of base64-encoded-multihash now needs to be replaced with something like base64-encoded-multihash-original-format

I don't think so. It could state that the binary encoding is the default one. What I'm after is that I can say "I use Multihash", even if I use a different encoding.

rvagg commented 1 year ago

We do risk watering it all down even further though. While I don't really mind the idea of "I use Multihash" pointing to the general concept, the fact that we have all these squishy edges continues to make our work harder as people show up with weird cases that don't fit nicely into our box. The solution might not be to tightly define (and close) the box, because the real world is squishy, but maybe loosening it further would exacerbate those problems? I don't know. Recursively: this is one of those things that it's hard to build a solid case against. Going the default route of not changing anything without a really good case to do so, which I think is what @aschmahmann is suggesting, might be the better path? If we don't even know why on earth SCALE has value, maybe this is premature.

aschmahmann commented 1 year ago

Yeah, I don't think my view is too different from Rod's (or his summary of mine) and I don't feel super strongly.

While I work in a similar space as you both, perhaps my perspective is a little different given some of the community interactions I've observed, where people struggle with more "abstract specs" (e.g. large portions of IPLD, IPFS and libp2p) and those turn out to be harder conversations than with more concrete specs like multihash, multiaddr and cid. It doesn't mean that also having a more abstract definition here is necessarily a bad idea, just one I don't think we should walk into without some wider agreement and motivation. For example, clarifying why an alternative encoding like SCALE would be useful here and what the value is in saying "I use Multihash" vs "my format is 1:1 compatible with multihash".

Given that @darobin and others are working towards standardization + governance improvements (with multihash being one of the prime candidates) it might be prudent to wait for that before making conceptual changes that IIUC there's not currently a strong need for. You were both present on this PR https://github.com/multiformats/github-mgmt/pull/53 which is pushing in that direction.

rvagg commented 1 year ago

@vmx how are you feeling about the discussion so far? Is it worth pushing this any further? I think I'm more inclined to the negative on this one for all the reasons stated above, but it's not a strongly held opinion.

vmx commented 1 year ago

It was just an idea I wanted to get out there as I think Multihash and CID also make sense on the conceptual level without an encoding. But it doesn't seem that this is generally agreed upon, hence I'm closing this issue.