multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.

Add 'jcs' and 'urdna2015' canonicalization values. #261

Closed -- dmitrizagidulin closed this 1 year ago

dmitrizagidulin commented 2 years ago

Adds a new canonhash tag value that represents a combined canonicalization + hash operation (using the RDF Dataset Canonicalization algorithm URDNA2015, soon to be renamed URDCA2015).

Used in the hashlinking of Verifiable Credentials proposal to the W3C VC WG, specifically in the implementation of digestMultibase.

digestMultibase example:

MULTIBASE('base58btc', CANONICALIZE('urdca-2015-canon', MULTIHASH('sha256', <canonicalized input>)))
MULTIBASE('base58btc', CANONICALIZE('jcs-canon', MULTIHASH('sha256', <canonicalized input>)))
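
For illustration only, here is a rough Python sketch of that pipeline under some simplifying assumptions: the jcs_canonicalize, multihash_sha256, and multibase_base16 helper names are made up, JCS is approximated with sorted-key JSON serialization, and the base16 ('f') multibase prefix stands in for base58btc so the example needs only the standard library.

```python
import hashlib
import json

def jcs_canonicalize(doc: dict) -> bytes:
    # Rough stand-in for RFC 8785 (JCS): sorted keys, no insignificant whitespace.
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")

def multihash_sha256(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    # multihash framing: <varint code 0x12 = sha2-256><varint digest length><digest>
    return bytes([0x12, len(digest)]) + digest

def multibase_base16(data: bytes) -> str:
    # 'f' is the multibase prefix for lowercase base16 encoding.
    return "f" + data.hex()

credential = {"type": "VerifiableCredential", "issuer": "did:example:123"}
digest_multibase = multibase_base16(multihash_sha256(jcs_canonicalize(credential)))
print(digest_multibase)
```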
vmx commented 2 years ago

I only had a quick look at JCS and urdna2015. Do I understand it correctly that MULTIHASH('jcs', MULTIHASH('sha256', <canonicalized hashed VC>)) would still be a SHA-256 hash? Is the idea that you'd like to be able to determine that "this SHA-256 came from JSON canonicalized according to the JCS rules"?

dmitrizagidulin commented 2 years ago

Do I understand it correctly that MULTIHASH('jcs', MULTIHASH('sha256', <canonicalized hashed VC>)) would still be a SHA-256 hash? Is the idea that you'd like to be able to determine that "this SHA-256 came from JSON canonicalized according to the JCS rules"?

That's it, exactly. Tagging jcs as a multihash is not exactly right, but we're working within the limitation that MULTIHASH takes essentially one parameter when it really needs multiple params (see the discussion in issue 78, Parametrized Hashing, https://github.com/multiformats/multihash/issues/78, and issue 56, https://github.com/multiformats/multihash/issues/56).
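
To make that single-parameter limitation concrete, here is a small sketch of the multihash framing (single-byte varints assumed purely for brevity); nothing in the framing has room to say how the input was pre-processed:

```python
import hashlib

def multihash(code: int, digest: bytes) -> bytes:
    # <varint hash-function code><varint digest length><digest bytes>
    return bytes([code, len(digest)]) + digest

mh = multihash(0x12, hashlib.sha256(b"canonicalized bytes").digest())
# The only self-described facts here are "sha2-256" and "32 bytes"; there is
# no slot that says the input was first canonicalized with JCS or URDNA2015.
```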

dmitrizagidulin commented 2 years ago

OK, on further conversation it might be less confusing to people if this PR introduced a new tag (instead of overloading multihash). So instead, I propose adding a canonized hash tag.

vmx commented 2 years ago

In your original comment you mention hashlinking. Is the goal to use that multicodec code as part of a CID?

I'm asking as I think this request poses an interesting question. If I think in terms of a CID, where we specify the encoding as well as the hash algorithm, the question is, should this be the encoding information or the hash algorithm information?

To me a CID is self-describing about how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways of describing it:

  1. The encoding is JSON and the hash algorithm is "canonicalize things first, then do a SHA-256" hash.
  2. The encoding is canonicalized JSON and the hash algorithm is SHA-256.

In both cases you'd have all the information you need.
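
As a rough illustration of those two descriptions (the codec and hash names below are hypothetical placeholders, not registered multicodec entries):

```python
from collections import namedtuple

# A CID, ignoring the multibase wrapper and varint/length details.
CID = namedtuple("CID", ["version", "content_codec", "hash_function"])

# Option 1: plain JSON codec, combined "canonicalize then hash" hash entry.
option_1 = CID(1, "json", "jcs-then-sha2-256")

# Option 2: "canonicalized JSON" codec, plain sha2-256 hash entry.
option_2 = CID(1, "json-jcs-canonical", "sha2-256")
```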

dlongley commented 2 years ago

@vmx,

We would ideally like to design this in such a way that any hash algorithm from the multihash table could be used -- without having to create NxM combination codec values. So, we can express that some data was canonicalized with algorithm X (urdca2015 or jcs are the two most interesting values here right now) and then hashed with algorithm Y (any value from the multihash table). So we're just looking for the best way / format to allow this kind of parameterization so that all of the information needed (as you mentioned) is there.
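
A sketch of that parameterization idea, pairing hypothetical canonicalization codes with real multihash codes, to show how any canonicalizer X can combine with any hash Y without minting N×M combined table entries:

```python
# Hypothetical canonicalization codes (not registered anywhere).
CANONICALIZERS = {"jcs": 0x01, "urdca-2015": 0x02}
# Registered multihash codes for the hash functions.
HASHES = {"sha2-256": 0x12, "sha2-512": 0x13, "blake3": 0x1e}

def describe(canonicalizer: str, hash_fn: str) -> bytes:
    # Two codes side by side: <canonicalization code><hash-function code>.
    return bytes([CANONICALIZERS[canonicalizer], HASHES[hash_fn]])

# Every pairing is expressible without adding a combined entry per pair.
pairs = [describe(c, h) for c in CANONICALIZERS for h in HASHES]
assert len(pairs) == len(CANONICALIZERS) * len(HASHES)
```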

vmx commented 2 years ago

@dlongley This means that urdca2015 and jcs aren't about hashing at all; they are about the step before the hashing. I still assume you want to use this as part of a CID, so the only possible place to put this identifier is the data codec (the CID spec calls that the "multicodec codec type"). The information there is used to know how to encode/decode the bytes that were addressed with the CID. Is JCS always JSON and URDCA2015 always XML? Or could other data formats also be canonicalized with such algorithms?

dmitrizagidulin commented 2 years ago

@vmx

@dlongley This means that urdca2015 and jcs aren't about hashing at all; they are about the step before the hashing.

Right, exactly. They're essentially a second parameter to the multihash (what pre-processing steps must be taken with the data before hashing).

Is JCS always JSON and URDCA2015 always XML? Or could other data formats also be canonicalized with such algorithms?

JCS is always JSON. URDCA2015 applies to any sort of RDF-based linked data (which includes JSON-LD, Turtle, RDF/XML, N-Quads, etc.).

To me a CID is self-describing about how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways of describing it:

  1. The encoding is JSON and the hash algorithm is "canonicalize things first, then do a SHA-256" hash.
  2. The encoding is canonicalized JSON and the hash algorithm is SHA-256.

Right, so, this is the tricky part. I'd say the situation is closer to 1 -- the hash algorithm is "canonicalize things first, then do a SHA-256" hash. And the encoding (of the hash) is multibase. (I'm not sure it's necessary to specify the encoding of the pre-hash data, though, since the hash is a one-way operation.)

@vmx - would you be open to defining a new "canonized hash" tag?

rvagg commented 2 years ago

Finally found time to look at this and give my 2c.

  1. Firstly, I'd like to make a bit of space after the poseidon* entries because we can expect more of those; maybe bump this entry to 0xb503, or even find a different space for it around that area.
  2. I don't think I have an objection to making a new tag for this; it really is a different beast, and it's not like we have strong rules for that column anyway. It would probably be inappropriate to make it multihash, ipld, or even serialization, since it's not quite any of those.
  3. I think I could see a path to this being used in CIDs if you implement it as a faux-multihash. Our implementations have ways of abstracting the multihash part of a CID such that you just need to be able to produce a digest. So you could implement this as a layer on top of the existing multihash interfaces: take existing multihash implementations and wrap them in this thing, so the multihash part of the CID is really a multihash(multihash), although as far as the CID implementations are concerned it's just the one multihash (see the sketch after this list). That would be interesting to see working, and there may be hiccups along the way. I'm not sure it's a great idea, but it doesn't seem impossible.
  4. Having said all of that ^, using this for CIDs does feel a bit like a hack, to squish information into a CID because CIDv1 doesn't have the ability to convey quite enough information as it is. Maybe this goes into the wishlist bucket for CIDv2?
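
A rough Python sketch of the faux-multihash layering from point 3, with made-up class and parameter names (this is not any existing library's API): wrap an existing hasher with a canonicalization step so that, from the CID implementation's point of view, the result is still just one multihash-shaped digest.

```python
import hashlib
import json

class CanonicalizingHasher:
    """Wraps a canonicalization step around an ordinary hash function."""

    def __init__(self, code: int, canonicalize, hasher):
        self.code = code                  # placeholder combined multicodec code
        self.canonicalize = canonicalize  # bytes-producing canonicalizer
        self.hasher = hasher              # e.g. hashlib.sha256

    def digest(self, obj) -> bytes:
        raw = self.hasher(self.canonicalize(obj)).digest()
        # Emit multihash-style framing: <code><length><digest>
        # (single-byte code assumed here purely for brevity).
        return bytes([self.code, len(raw)]) + raw

jcs_sha256 = CanonicalizingHasher(
    code=0x7f,  # placeholder value, not a registered entry
    canonicalize=lambda o: json.dumps(o, sort_keys=True, separators=(",", ":")).encode(),
    hasher=hashlib.sha256,
)

# A CID builder that only expects "something that produces a multihash" could
# consume jcs_sha256.digest({"hello": "world"}) without knowing about the
# canonicalization layer at all.
```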
vmx commented 2 years ago

I'd like to check if I understood the current outcome correctly.

The urdca-2015-hash is used in the multihash part of the CID. So a CID would look like this (I leave out the size information bits for simplicity):

<v1><can-e.g.-be-json-turtle-xml><urdca-2015-hash><the-hash-digest>

This points to some data.

Now I retrieve the data and I want to create a CID out of it. I would only know that I need to canonicalize the data before hashing, but I wouldn't know which hash algorithm to use. Is that correct?
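
A tiny sketch of that round-trip gap, using placeholder codes (nothing here is a registered entry): a canonicalization-only code tells a reader which canonicalization to run, but not which hash function, so the CID cannot be re-derived from the retrieved data alone.

```python
# Placeholder codes, purely illustrative.
PLACEHOLDER_CODES = {
    0x01: ("urdca-2015", None),        # canonicalization named, hash unspecified
    0x02: ("urdca-2015", "sha2-256"),  # canonicalization and hash both named
}

def rehash_instructions(code: int) -> str:
    canonicalizer, hash_fn = PLACEHOLDER_CODES[code]
    return f"canonicalize with {canonicalizer}, then hash with {hash_fn or '...?'}"

print(rehash_instructions(0x01))  # the "...?" is exactly the open question here
```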

rvagg commented 1 year ago

@dmitrizagidulin any changes to this you want to pursue so we can get this over the line in some form?

dmitrizagidulin commented 1 year ago

Hi @rvagg, thanks for checking in. So, yeah, absolutely, we’ve got even more implementations in need of this mechanism on the way, so we definitely want to find some kind of solution. (I was chatting with @gobengo about this just yesterday, and he gave me a couple new vectors to consider.) So, let me review the issue and get back to you later today.

dmitrizagidulin commented 1 year ago

Hi @rvagg -- after some discussion with @gobengo, I've updated the PR (and resolved merge conflicts) to hopefully address some of your concerns.

Firstly, I'd like to make a bit of space after the poseidon* entries because we can expect more of those; maybe bump this entry to 0xb503, or even find a different space for it around that area.

Totally understood on wanting to make space -- I moved the JCS canonicalization entry to post-poseidon. If at all possible, we would really like to keep the urdna-2015-canon entry at 0xb403. (This is totally my fault -- I dropped the ball on resolving this PR, and meanwhile the 0xb403 code is being deployed to millions of Point-of-Sale systems (literally old-school cash registers) as part of a US-wide Age Verification project.)

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag. This is because json-jcs is essentially a standardized version of what dag-json does (sorts/canonicalizes JSON input so that it can be composed with hashing).
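
As a small illustration of that point (sorted-key json.dumps is only an approximation of both JCS and dag-json, used here just to show the shared idea of a deterministic serialization that can be composed with hashing):

```python
import json

a = {"z": 1, "a": {"k2": True, "k1": None}}
b = {"a": {"k1": None, "k2": True}, "z": 1}

def canonical(doc: dict) -> str:
    # Deterministic form: sorted keys, no insignificant whitespace.
    return json.dumps(doc, sort_keys=True, separators=(",", ":"))

# The same logical document always serializes to the same bytes, so a hash
# over those bytes is stable regardless of key order in the input.
assert canonical(a) == canonical(b)
```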

dlongley commented 1 year ago

@dmitrizagidulin,

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag.

Does that mean the existing implementations need to change? If not, why not?

dmitrizagidulin commented 1 year ago

@dmitrizagidulin,

We've also updated the tag for those two entries to re-use ipld (on Bengo's advice), instead of introducing a new 'canonhash' tag.

Does that mean the existing implementations need to change? If not, why not?

Hey @dlongley - no, no existing implementations need to change. The tag in the CSV file is conceptual / for organizing things into categories; it's not used in the code.

msporny commented 1 year ago

Thanks for the merge @rvagg.

To come back to ipld not being the right way to go, I agree. Can we just use "multiformat" for the tag name?

If not, what if we introduced a new "transformed-multihash" namespace? It's not clear to me what constitutes a "namespace" vs. a "multiformat".

rvagg commented 1 year ago

@msporny the tags really don't matter that much, so it's not worth getting too hung up about it. I imagine a future point where we refactor a bunch of the organisational stuff and they become more relevant, at which point we'll take a more holistic view of what we have and do some adjustment.

If something feels like it should be just "multiformat", then we should probably invent a new tag for it: if you're making something that could be described in a new multiformat spec, then make a tag as a new category. I'm not sure about "namespace"; I mostly treat those as networking / libp2p related, so usually not appropriate for hashing or encoding.

I'd be happy for someone to come up with a new tag for this, but maybe something broad enough that it can fit other things too? transformed-multihash might work; it's a little long, but it explains the purpose. multimultihash might be a bit too cute; compound-multihash is another option in the same theme.

RangerMauve commented 1 year ago

Can't believe I'm just seeing this now! Really glad that this has been put in place.

IMO IPLD is absolutely something that we should look into here, since we can use this as a component of IPLD-based database systems at large.