w3c / vc-data-model

W3C Verifiable Credentials v2.0 Specification
https://w3c.github.io/vc-data-model/

Clarify hashing algorithm used for relatedResource examples #1570

Open lemoustachiste opened 1 month ago

lemoustachiste commented 1 month ago

In this section: https://www.w3.org/TR/vc-data-model-2.0/#integrity-of-related-resources, the vc v2 context https://www.w3.org/ns/credentials/v2 has this corresponding digestMultibase value: uEres1usWcWCmW7uolIW2uA0CjQ8iRV14eGaZStJL73Vz

I am trying to implement logic in my verifier to check this assertion; however, using js-multiformats (https://www.npmjs.com/package/multiformats) I cannot figure out the correct steps to reproduce the same result.

Is the hashing algorithm for contexts and other resources detailed somewhere (https://www.w3.org/TR/controller-document/#algorithms maybe)? Or, better yet, is there an available library to do these conversions?

All in all, I think adding a link to the expected algorithm in the vc-data-model spec would greatly help implementers know how to reproduce the expected values.

Thanks

lemoustachiste commented 1 month ago

FWIW, I tried running the algorithm here: https://www.w3.org/TR/controller-document/#example-an-implementation-of-the-general-base-encoding-algorithm-above-in-javascript, and it gives the same result as this naïve script:

  // json, sha384, and base64url here come from the hashing/encoding libraries mentioned above
  const bytes = json.encode(vcV2Context);
  const sha384Digest = await sha384.hash(bytes);
  const computedDigest = base64url.encode(sha384Digest);

  // https://www.w3.org/TR/vc-data-model-2.0/#integrity-of-related-resources
  const targetDigest = 'uEres1usWcWCmW7uolIW2uA0CjQ8iRV14eGaZStJL73Vz';

However, that result does not match the targetDigest.

msporny commented 1 month ago

All in all, I think adding a link to the expected algorithm in the vc-data-model spec would greatly help implementers know how to reproduce the expected values.

We recently moved the definition of Multihash from Data Integrity to the Controller Document, specifically, this section:

https://www.w3.org/TR/controller-document/#multihash

Would referring to that section help?
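
For illustration, producing such a value would look roughly like this in Node.js (a non-normative sketch; 0x12 and 0x20 are the registered sha2-256 multihash code and 32-byte digest length):

  import { createHash } from 'node:crypto';

  // Sketch only: hash the exact resource bytes, prefix the sha2-256 multihash
  // header, then multibase-encode with the base64url prefix 'u'.
  function sha256DigestMultibase(resourceBytes) {
    const digest = createHash('sha256').update(resourceBytes).digest();
    const multihash = Buffer.concat([Buffer.from([0x12, 0x20]), digest]);
    // 'u' is the multibase prefix for unpadded base64url
    return 'u' + multihash.toString('base64url');
  }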

lemoustachiste commented 1 month ago

Hi @msporny,

I'm sorry I must be missing something in the trail of information.

The digestMultibase of the VC v2 context is defined here: https://www.w3.org/TR/vc-data-model-2.0/#example-use-of-the-relatedresource-and-digestmultibase-properties as uEres1usWcWCmW7uolIW2uA0CjQ8iRV14eGaZStJL73Vz.

As it starts with a u, I assume the encoding is base64url as defined here: https://www.w3.org/TR/controller-document/#multibase-0. I checked the expected alphabet and it matches that of Digital Bazaar's base64url-universal library (https://github.com/digitalbazaar/base64url-universal/blob/main/lib/base64url.js#L13), so I believe I am using the correct tool for the last step.

So I guess my issue is with the hashing part. In this section https://www.w3.org/TR/vc-data-model-2.0/#defn-relatedResource, it says that SHA-384 should be preferred, so that's what I'm using (SHA2).

I get the following byte array when hashing the VC v2 context with SHA-384, consistently across two different libraries:

Uint8Array(48) [
  229,  13, 164, 166, 212, 164,  82, 226, 235,
   28,  65, 158, 115, 116, 131, 118, 235, 106,
   86,  82, 124,  58, 252,  21, 169,  79,  98,
  246, 242, 196, 175,  45, 183,  96, 153, 169,
   41, 122, 101, 102,  10,  90, 114, 126,  98,
   59, 242, 119
]

But using DB's library to encode this hash to base64url, I get the following result: 5Q2kptSkUuLrHEGec3SDdutqVlJ8OvwVqU9i9vLEry23YJmpKXplZgpacn5iO_J3, which does not match the value and header specified in the spec.

I'm missing a piece of the puzzle and I can't figure out what it is. Can you guide me to your result?

Thanks

msporny commented 1 month ago

I'm missing a piece of the puzzle and I can't figure out what it is. Can you guide me to your result?

I expect that the Multihash values are base64url-encoded SHA-256 values, not SHA-384. I'll try to check here in a bit... it might also be that I messed up generating the example, since this part of the spec is fairly new and hasn't gotten much review.

There is disagreement in the WG as to whether SHA-384 really should be the suggested target hash value. The digestSRI folks seem to want 384, but many others in the WG don't feel that's necessary, given that the energy required to break SHA-256 is equivalent to repurposing every computer on the planet for billions of years, or to boiling every ocean on the planet 16,000+ times over. That said, I don't think the VCWG has it in them to debate the point more than it already has.

msporny commented 1 month ago

I'll try to check here in a bit...

Yeah, the test vectors were wrong/old. :(

I updated the spec to auto-generate the values instead of depending on humans to do it. The Multibase and Multihash encodings are also now explicitly specified in each example title. The commit that implemented that is here: 04576fa3c0c102f967f1bf053a2982c9251ca435

Does that address your concerns @lemoustachiste ?

lemoustachiste commented 1 month ago

@msporny yes I think that's way clearer like this, thanks...

...but I am still unable to find the correct algorithm to produce the same result as what's displayed... :/ Do you have a link to the script that handles the hashing and the encoding?

msporny commented 1 month ago

Do you have a link to the script that handles the hashing and the encoding?

Yes, here's the code that's running in the spec:

https://github.com/w3c/respec-vc/blob/main/index.js#L376-L467

There is a chance that there is a bug in my code, so I'm very interested to hear whether you're able to reproduce the output. Ideally, we'd need to figure this out by the end of this year to ensure that the correct test vectors are in the specification. Thank you for looking into this; independent verification of these test vectors is very important!
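
If it helps, an independent check could look roughly like this (a non-normative sketch; it assumes the value uses the 'u' base64url multibase prefix and that the two leading multihash bytes give the hash function code and digest length, e.g. 0x12 for sha2-256 and 0x20 for sha2-384):

  import { createHash } from 'node:crypto';

  // Sketch of an independent check: fetch the exact resource bytes, then
  // compare against the decoded digestMultibase value.
  async function verifyDigestMultibase(url, digestMultibase) {
    if (!digestMultibase.startsWith('u')) {
      throw new Error('expected a base64url multibase value');
    }
    const multihash = Buffer.from(digestMultibase.slice(1), 'base64url');
    // Assumed multihash layout: [hash function code, digest length, ...digest]
    const code = multihash[0];
    const length = multihash[1];
    const expected = multihash.subarray(2);
    const algorithm = { 0x12: 'sha256', 0x20: 'sha384' }[code];
    if (!algorithm || expected.length !== length) {
      throw new Error('unsupported or malformed multihash');
    }
    const bytes = Buffer.from(await (await fetch(url)).arrayBuffer());
    const actual = createHash(algorithm).update(bytes).digest();
    return actual.equals(expected);
  }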

lemoustachiste commented 3 weeks ago

Hi @msporny,

I was able to review what happens in the code and figure out where my discrepancy lies.

Here is my assumption flow for verifying a hash of a context file:

I was puzzled for a bit, but I figured it's because of whitespace, so I then tried TextEncoder.encode(JSON.stringify(vcV2Context, null, 2)) to get more formatting bytes out of it, but I still fall short of the 10035 bytes. That means there must be more line breaks or spaces in the physical file than I can recompute from the in-memory JSON.
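
For instance, this is roughly the comparison I'm describing (a sketch; the exact byte counts depend on the file as served):

  // Sketch: compare the served file bytes with a re-serialized in-memory copy.
  const response = await fetch('https://www.w3.org/ns/credentials/v2');
  const rawBytes = new Uint8Array(await response.arrayBuffer());
  const reserialized = new TextEncoder().encode(
    JSON.stringify(JSON.parse(new TextDecoder().decode(rawBytes)), null, 2)
  );
  // The lengths (and therefore the hashes) differ whenever the file's
  // whitespace and formatting don't match JSON.stringify's output.
  console.log(rawBytes.length, reserialized.length);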

What I'm thinking is that those whitespace characters shouldn't necessarily be accounted for as part of the hash, as they don't represent significant data as far as JSON-LD contexts are concerned.

Technically, in the browser, this would require verifier instances to hold physical files at a predictable path and load the data from those files (hence keeping the whitespace), rather than reading from memory as instantiated by the script. In practice, the latter is the obviously simpler option, albeit one that adds weight to packages, though that's a necessary trade-off to prevent various HTTP attacks on context file retrieval.

So, to prevent this kind of hash comparison mistake, a normalization should occur before hashing, and only the meaningful data should be hashed. Maybe just whitespace trimming? I'm not sure if there is a cleverer way to do it, such as JSON-LD normalization, but for context files.

davidlehn commented 2 weeks ago

@lemoustachiste The hashes provided are for the exact file bytes, with the whitespace, blank lines, and formatting they use. What you are probably looking for are hashes of the context files in JSON Canonicalization Scheme (JCS, RFC 8785) form. For reasons I don't recall, there was pushback about providing additional hashes for the JCS form. Once the contexts are locked down, it would be straightforward for a trusted third party or a special tool to provide JCS-based or other hashes.
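
For illustration, hashing the JCS form could look roughly like this (a sketch that assumes an RFC 8785 implementation such as the canonicalize npm package; note that the spec's published digests are over the raw file bytes, not this form):

  import canonicalize from 'canonicalize'; // an RFC 8785 (JCS) implementation
  import { createHash } from 'node:crypto';

  // Sketch: hash the JCS form of a context object instead of its raw file
  // bytes, so whitespace and formatting no longer affect the result.
  function jcsSha256Hex(contextObject) {
    const canonical = canonicalize(contextObject); // deterministic JSON string
    return createHash('sha256').update(canonical, 'utf8').digest('hex');
  }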

That being said, and not knowing the details of what you are doing, there may be an easy solution. If the contexts are pre-loaded, their hashes could be pre-checked and included in the bundle as well. When checking resource hashes at runtime, if the id and hash match what's in the known list, then no further check is needed when using the pre-loaded data.
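
In code, that pre-loaded approach might look roughly like this (a sketch; KNOWN_CONTEXTS and its entries are hypothetical placeholders):

  // Sketch of a bundled allow-list of context URLs and their pre-verified
  // digest values (all entries here are placeholders, not real values).
  const KNOWN_CONTEXTS = new Map([
    // ['https://www.w3.org/ns/credentials/v2',
    //  { document: vcV2Context, digestMultibase: '…' }],
  ]);

  function resolveRelatedResource(id, digestMultibase) {
    const known = KNOWN_CONTEXTS.get(id);
    if (known && known.digestMultibase === digestMultibase) {
      // The hash was checked when the bundle was built, so the cached
      // document can be used without re-fetching or re-hashing.
      return known.document;
    }
    return null; // fall back to fetching the resource and hashing its raw bytes
  }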

msporny commented 1 week ago

Yes, what @davidlehn says above. Some developers pushed back strongly on doing any sort of canonicalization, so we're stuck with what we have now (trying to canonicalize in any way will lead to objections). Yes, it doesn't make much sense, and it lets a single byte of whitespace throw the hash off, but that's where we are right now, and the approach won't change (because of the threat of objections from this group of developers).

I think we've done all we can in this issue. Do you have any objections to closing it, @lemoustachiste? (We're required to close/address all issues before going to the next stage in the standardization process.)

lemoustachiste commented 1 week ago

I think the workaround can work, but it seems like a sort of unofficial recipe, which I believe could lead to interoperability issues and, I suspect, false negatives.

causes a single byte of whitespace to throw the hash off

Exactly.

As a developer and implementer myself I would have preferred a clear and canonical way to ensure the data matches as expected so that with grounded rules, everyone knows what to expect and how to get there.

You can close this issue, but I believe it will come back.

Thanks

msporny commented 1 week ago

As a developer and implementer myself I would have preferred a clear and canonical way to ensure the data matches as expected so that with grounded rules, everyone knows what to expect and how to get there.

We say the following in the spec today:

Implementations MUST treat the base context value, located at https://www.w3.org/ns/credentials/v2, as already retrieved; the following value is the hexadecimal encoded SHA2-256 digest value of the base context file: 24a18c90e9856d526111f29376e302d970b2bd10182e33959995b0207d7043b9. It is possible to confirm the cryptographic digest above by running the following command from a modern Unix command interface line: curl -s https://www.w3.org/ns/credentials/v2 | openssl dgst -sha256.

Does the above match your expectation for "clear and canonical way"? It uses standard tooling available on most every operating system out there. What more than that could we do?
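
For completeness, the same check can be done in JavaScript (a sketch using the Web Crypto API; the expected hex digest is the one quoted above):

  // Sketch: reproduce `curl -s https://www.w3.org/ns/credentials/v2 | openssl dgst -sha256`
  const url = 'https://www.w3.org/ns/credentials/v2';
  const fileBytes = await (await fetch(url)).arrayBuffer();
  const digest = await crypto.subtle.digest('SHA-256', fileBytes);
  const hex = [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
  console.log(hex === '24a18c90e9856d526111f29376e302d970b2bd10182e33959995b0207d7043b9');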

You can close this issue, but I believe it will come back.

Well, clearly we don't want that either... :)

I'm out of ideas on what more we could do here -- if we do canonicalization, people in the WG (and some developers) will object. If we keep what's there, you're saying we're not providing a clear and canonical way (though, I don't know how we could be more clear about it). If you could suggest some spec text, that might help us figure out how to address your concern to your liking?