oauth-wg / oauth-selective-disclosure-jwt

https://datatracker.ietf.org/doc/draft-ietf-oauth-selective-disclosure-jwt/

SD-JWT Proposed alternative to inline, escaped JSON values #141

Closed sbutterfield closed 2 years ago

sbutterfield commented 2 years ago

Problem statement

Currently, in the sd-jwt spec, the recommended representation for JSON property values is to embed escaped JSON carrying the salted value over which the digest is computed. This adds complexity, room for error, and superfluous verbosity. Instead, I propose specifying that claims in an "_sd" container are themselves ONE OF the following:

(a) simple plain-text values
(b) an object of specified properties supporting blinded property values
(c) an object of specified properties supporting a blinded graph

There's a good balance of specificity and flexibility here (IMHO) for implementors to accommodate many different scenarios - as opposed to having to unescape and then detect the contents of an attribute's values. With strict conventions on certain attributes of an attribute block, relying parties are guaranteed some properties right away that can be validated before the actual values are read into memory - increasing reliability and security, and permitting faster failure sequences. In addition, writing unit and functional test automation suites for escaped stringified JSON is problematic at best and prone to some of the same formatting concerns as structured JSON values. It can also give rise to new inconsistent-ordering problems during serialized-representation testing.

However, per issue #27 and other discussions I could uncover, there is ongoing concern over canonicalization (c18n) for hash reproducibility. I completely understand why. What I'm proposing might be radioactive, but in this implementor's opinion, it is unburdening: I do not think it is the SD-JWT spec's problem to solve data-format compatibility across all languages and libraries. Furthermore, I don't think the SD-JWT spec needs to solve for all extended/intermediate representations of claim values - especially when it comes to VC use cases. In the spirit of good interface building and abstraction, I propose that it's possible to think of SD-JWT as foundational for other, more opinionated specifications.

If values placed in attributes need interpretation, then I'd suggest other means for doing so, such as metadata encoding for specific sections, SD-JWT issuer documentation, JSON Schema, or JCS (as an example, RFC 8785 has guidelines for strict int64 encoding). I humbly warn that normatively specifying that property values be escaped JSON AND that other important, specified k<>v pairs be stuffed into that escaped JSON is a mistake that will damage adoption of the spec. I don't think that taking ownership over the shape of claim values should be part of this spec, especially if the goal of a claim is to be "anything you want"…

Boiling down the problem of canonicalizing some value(s), it seems most important to address bitwise value-compatibility issues and the way serialization libraries might handle non-primitive types (as opposed to distressing about ordering). Perhaps there are ways through this that don't involve incomplete or burdensome canonicalization libraries. Several libraries responsible for producing consistent hashes over JSON (ie: node.js object-hash) internally detect and encode incompatible datatypes. Would it be possible for SD-JWT to specify that, prior to salting and hashing, an attribute's value must be:

1. Fully compacted (in the case of a JSON graph)
2. Encoded using utf-8 octal, base64, base64url, hex, or some other specific encoding scheme
3. Then salted and hashed?

On the whole, escaped stringified JSON values as the basis for the SVC should probably be rethought. This issue aims to start that conversation with some meaningful solutions.

By way of examples, you could instead represent SD-JWTs as:

blinded attribute block

"_sd": {
  "name": {
    "h": "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIDEzIGxhenkgZG9ncy4=", // base64url
    "s": "6YCYrdrSxs7q6dlO562YI6GhAktsBExFe6rcCZ+OX9I=", // base64url
    "id": "did:jwk:894fa94hg10AOEe82…#0", //claim specific DID URL binding
    "v": "Slim Shady"
  }
}

Example generation:

1. Hash over the compact-form node for "name", without the "h" property present.
2. Encode the resulting digest in base64url.
3. Insert the digest as property "h" into the "name" attribute block.

This enables a compact blinded presentation form of {"name": "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIDEzIGxhenkgZG9ncy4="}
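
A minimal sketch of the generation steps above (assuming SHA-256 and Node's crypto; the blindAttribute helper and the fixed member order are illustrative, not part of the proposal):

import { createHash } from 'crypto';

// Hypothetical helper: hash the compact form of the attribute block
// without the "h" property present, then base64url-encode the digest.
const blindAttribute = (block: { s: string; id?: string; v: unknown }): string => {
  const compact = JSON.stringify(block); // compact form, "h" absent
  return createHash('sha256').update(compact, 'utf8').digest('base64url');
};

const h = blindAttribute({
  s: '6YCYrdrSxs7q6dlO562YI6GhAktsBExFe6rcCZ+OX9I=',
  v: 'Slim Shady',
});
// The compact blinded presentation form is then JSON.stringify({ name: h })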

The underlying salt, which is never revealed unless the "name" property value is revealed, prevents a direct attack on the property - even when the label, type, and validation procedure for the property's value are known.

By letting JSON be JSON in the entire document structure, you enable an issuer to (optionally) be very specific about the structure of the value entered for an attribute. This further enhances security - especially if the signature (or even simply a hash) of the schema for the SD-JWT is described in the signed-over envelope (in a VC-SD-JWT).

blinded, nested eKYC graph

The standardized definition of what an eKYC data model is supposed to look like is a bit ambiguous, so I assume what you’re concerned with is something like OIDC...

"_sd": {
  "oidc_req": {
    "h": "...",
    "s": "...",
    "v": {
      "scope": [..., ..., ...],
      "response_type": "code",
      "client_id": "...",
      "redirect_uri": "...",
      "nonce": "..."
    }
  }
}

the "v" of the claim is valid, and validatable JSON. The semantic meaning of the outer, securely disclosable, claim name can be either domain-specific or use a JSON-LD context.

OR13 commented 2 years ago
"id": "did:jwk:894fa94hg10AOEe82…#0"

I wonder if these are iss or kid or both (as is the current example).

OR13 commented 2 years ago

Do you even need canonicalization? are these serialized formats ever stored in a "decoded" form?

For example, in JWTs.

{ header, payload, signature } is a fine way to store a token, if you are assured that member ordering of header and payload is preserved and whitespace is trimmed.

If you are worried about that, you store the entire thing encoded as a jwt... I don't know enough about sd-jwt, but it's possible that canonicalization can be accomplished in a "light mode" with normative statements, similar to the instructions for https://www.rfc-editor.org/rfc/rfc7638#section-3.2
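
For reference, a minimal sketch of the RFC 7638 construction (EC keys, SHA-256): member order is fixed by normative instruction rather than by a canonicalization library. Whether the same trick applies to SD-JWT values is an open assumption here.

import { createHash } from 'crypto';

// Build a new object containing only the required members, inserted in
// lexicographic order, serialize with no whitespace, then hash.
const ecThumbprint = (jwk: { crv: string; kty: string; x: string; y: string }): string => {
  const ordered = { crv: jwk.crv, kty: jwk.kty, x: jwk.x, y: jwk.y };
  return createHash('sha256')
    .update(JSON.stringify(ordered), 'utf8')
    .digest('base64url');
};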

sbutterfield commented 2 years ago
"id": "did:jwk:894fa94hg10AOEe82…#0"

I wonder if these are iss or kid or both (as is the current example).

It's a really good question, Orie. The way I thought of it was as the identifier kid - but it's something I just happened to add here for consideration. Instead of did:jwk, could it be any DID method URL allowing issuer-specific claim binding? Or "any party"-specific claim binding? Maybe putting did: there is the wrong idea altogether. I'd love to iterate on it.

sbutterfield commented 2 years ago

Do you even need canonicalization? are these serialized formats ever stored in a "decoded" form?

I think this is a key question I'm poking at here. Could sd-jwt be better off not over-specifying a canonicalized form, allowing room for implementation specs unique to the domain of use for the sd-jwt spec?

danielfett commented 2 years ago

As discussed on our call with Shawn yesterday, it is important to separate the representation of values in the credential (SD-JWT) from the actual values of properties that are resolved after verification of the presentation (SD-JWT and SD-JWT-Release). The use of escaped JSON in the credential is what allows SD-JWT to support any type, including objects, for property values. Ideally, anything that builds on top of SD-JWT will not need to handle the escaped JSON format, as (for example) the verification algorithm outputs a JSON document without any escaped JSON values for further processing by the application. I'm therefore not convinced that the "unusual" format for property values in the credential is a problem.

That said, the escaped JSON is not a great solution and raises a lot of eyebrows. But we think that it is the best solution we have. It is trivially easy to implement correctly, using any JSON library out there. The issue we're addressing is not solvable by prescribing a certain JSON schema (as it works on a higher layer). JCS would solve the problem, but increase implementation complexity. Even just compacting objects, as proposed below, is a relatively complex operation requiring a full JSON parser.
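
To illustrate the "trivially easy" point, a sketch of the general idea (not the draft's exact encoding): the issuer serializes the salt and value once, hashes that exact string, and that same string travels to the holder, so a verifier re-hashes the received bytes and never needs canonicalization.

import { createHash, randomBytes } from 'crypto';

const salt = randomBytes(32).toString('base64url');
// One serialization, done once by the issuer; any JSON library works.
const disclosure = JSON.stringify([salt, { given_name: 'Erika' }]);
// The digest goes into the signed SD-JWT; the disclosure string
// (appearing as escaped JSON) goes into the SVC.
const digest = createHash('sha256').update(disclosure, 'utf8').digest('base64url');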

OR13 commented 2 years ago

Sometimes you can avoid escaping by translating... I've used https://www.npmjs.com/package/json-pointer to do selective disclosure with merkle proofs before... I would rather have JSON Pointer as a dependency than see escaped JSON payloads or have to manage transforms that are only defined in the sd-jwt spec.

import pointer from 'json-pointer';

// Flatten an object into one single-entry JSON message per leaf,
// keyed by its JSON Pointer (e.g. '{"/name":"Slim Shady"}').
const objectToMessages = (obj: any) => {
  const dict = pointer.dict(obj);
  return Object.keys(dict).map(key =>
    // JSON.stringify handles escaping and non-string leaf values
    JSON.stringify({ [key]: dict[key] })
  );
};

// Rebuild an object from a (possibly partial) set of messages.
const messagesToObject = (messages: string[]) => {
  const obj: any = {};
  messages
    .map(m => JSON.parse(m))
    .forEach(m => {
      const [key] = Object.keys(m);
      pointer.set(obj, key, m[key]);
    });
  return obj;
};

export { objectToMessages, messagesToObject };
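
A usage sketch of the round trip (module path and values are illustrative):

import { objectToMessages, messagesToObject } from './messages';

const messages = objectToMessages({ name: 'Slim Shady', age: 49 });
// ['{"/name":"Slim Shady"}', '{"/age":49}']
const disclosed = messagesToObject([messages[0]]);
// { name: 'Slim Shady' } (only the first message is revealed)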

There are other approaches considered here: https://github.com/w3c-ccg/Merkle-Disclosure-2021/tree/main/packages/linked-data-proof/src/merkle/normalization

Consider building blocks that are widely available as a substitute for more normative spec definitions.

danielfett commented 2 years ago

I'm not following - can you give an example of how JSON Pointers or the code above would solve the hashing problem?

danielfett commented 2 years ago

Do you even need canonicalization? are these serialized formats ever stored in a "decoded" form?

For example, in JWTs.

Interestingly, JWTs are an example where normalization was avoided by just encoding the whole body into the token, not unlike what we're doing in SD-JWT. (The JWK Thumbprint data is treated differently, but that is a very controlled and small data set.)

sbutterfield commented 2 years ago

@danielfett, not sure if you saw this new bit from my write-up:

Would it be possible for SD-JWT to specify that, prior to salting and hashing, an attribute's value must be:

1. Fully compacted (in the case of a JSON graph)
2. Encoded using utf-8 octal, basexyz, hex, or some other specific encoding scheme
3. Then salted and hashed?

Similar to the JWP spec: pre-encode the value as a utf-8 octal byte array (or something else, I don't care), then salt and hash (JWP generates a proof instead).

sbutterfield commented 2 years ago

I think what you want to ensure is that the issuer's intended data format is preserved and therefore the hash is always reproducible. With a byte-level encoding, I cannot think of a language or library that screws this up (although I'm sure one exists), which is why many universal "XYZ to hash" producer libraries reify values in some encoded byte format.

OR13 commented 2 years ago

I'm not following - can you give an example how JSON pointers or the code above would solve the hashing problem?

https://github.com/w3c-ccg/Merkle-Disclosure-2021/blob/main/packages/linked-data-proof/src/merkle/normalization/__tests__/json-pointer.test.ts

https://github.com/transmute-industries/verifiable-data/tree/main/packages/merkle-proof#custom-hash-functions

Normalize to a set of messages which are built from json pointer.

Then use any multi-message scheme on those messages. (such as bbs+ signatures OR merkle set membership proofs)

That's what we did to create merkle proofs for selective disclosure of object-tree subsets.
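
A sketch of the first half (assuming SHA-256 and objectToMessages from the snippet above; the merkle tree or BBS+ construction over the leaves is out of scope here):

import { createHash } from 'crypto';

// Each JSON Pointer message becomes one leaf of a multi-message scheme.
const leaves = objectToMessages({ name: 'Slim Shady' }).map(m =>
  createHash('sha256').update(m, 'utf8').digest('base64url')
);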

sbutterfield commented 2 years ago

@OR13, if I catch your drift here, your approach would be to process the value sub-graph into JSON Pointers and then generate merkle set-membership proofs for each fragment - is that right? Then, use the concatenation of the merkle proofs as the value for the salted attribute?

OR13 commented 2 years ago

I showed how to convert an object to a set of messages.

You would want to apply the blinding to the messages.

The approach I showed is similar in that it relies on JSON encoding as strings... and then selective disclosure of those strings... which is then converted back to a selectively disclosed object.

Just sharing the approach, not sure how exactly it might map to sd-jwt.

danielfett commented 2 years ago

@sbutterfield: Are you thinking about putting the pre-encoded value into the document produced by the issuer and then sent to the holder (what we call SVC in SD-JWT)?

@OR13: I don't see how this approach would reduce complexity or make life easier for implementers. Why is an intermediate representation as a set of messages better than one where some values are JSON strings? In both cases, anybody working with the contents of the credential would need to apply some algorithms to convert it back into the original representation. With both approaches, the same data can be transported.

One of the main goals of SD-JWT is simplicity, which is why we have a fully working spec and four running implementations after only a couple of months of development. Right now, all that is needed to implement SD-JWT is a JSON library and a hash function.

bc-pi commented 2 years ago

The aesthetics of the string values with escaped JSON are not great, at best. But it obviates the need to do c18n/normalization with a straightforward approach that doesn't come with other baggage. I believe the draft should do a better job explaining that rationale but stick with the current approach.

sbutterfield commented 2 years ago

@danielfett Basically, yes. Let me try to clarify with some examples to see if it makes sense. The encoding methodology employed here is similar to what some standards already employ and seems to be used normatively in JPT.

First for a property value that we want to blind (let's use some arbitrary JSON):

{
  "array": [
    1,
    2,
    3
  ],
  "boolean": true,
  "color": "gold",
  "null": null,
  "number": 123,
  "object": {
    "a": "b",
    "c": "d"
  },
  "string": "Hello World"
}

It must first be put in compact form (normatively):

{"array":[1,2,3],"boolean":true,"color":"gold","null":null,"number":123,"object":{"a":"b","c":"d"},"string":"Hello World"}

Next, using a natively available JSON library function or an easily supported polyfill, encode the value as a Uint8Array:

const uc1 = '{"array":[1,2,3],"boolean":true,"color":"gold","null":null,"number":123,"object":{"a":"b","c":"d"},"string":"Hello World"}';
const encoder = new TextEncoder(); // always UTF-8
const uc1Uint8Array = encoder.encode(uc1); // the compact value as raw bytes
console.log("uc1: ", uc1Uint8Array.toString()); // comma-separated byte values
//"uc1: ", "123,34,97,114,114,97,121,34,58,91,49,44,50,44,51,93,44,34,98,111,111,108,101,97,110,34,58,116,114,117,101,44,34,99,111,108,111,114,34,58,34,103,111,108,100,34,44,34,110,117,108,108,34,58,110,117,108,108,44,34,110,117,109,98,101,114,34,58,49,50,51,44,34,111,98,106,101,99,116,34,58,123,34,97,34,58,34,98,34,44,34,99,34,58,34,100,34,125,44,34,115,116,114,105,110,103,34,58,34,72,101,108,108,111,32,87,111,114,108,100,34,125"

Now, it's possible to go in a number of directions with the representation... the Uint8Array string is quite simple and leaves little room for misinterpretation by an application during reproduction. Using the hashing algorithm specified in the security envelope, salt & hash the string to get your digest. I have not had any problems reproducing the hash locally using different languages and libraries. I've added arbitrary whitespace, kanji, etc. No issues getting the same representation back out.

"_sd": {
  "myjson": {
    "h": "OGI2OTUzNjEwNDg0MmFiY2QzYjFiNWJmMTgzYTE2ZjZmOWNiYjU5MWFkYzI2ZDJjNzE4YjM1MmZkYzMzNTRhNg==",
    "s": "6YCYrdrSxs7q6dlO562YI6GhAktsBExFe6rcCZ+OX9I=",
    "v": "123,34,97,114,114,97,121,34,58,91,49,44,50,44,51,93,44,34,98,111,111,108,101,97,110,34,58,116,114,117,101,44,34,99,111,108,111,114,34,58,34,103,111,108,100,34,44,34,110,117,108,108,34,58,110,117,108,108,44,34,110,117,109,98,101,114,34,58,49,50,51,44,34,111,98,106,101,99,116,34,58,123,34,97,34,58,34,98,34,44,34,99,34,58,34,100,34,125,44,34,115,116,114,105,110,103,34,58,34,72,101,108,108,111,32,87,111,114,108,100,34,125"
  }
}
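
For concreteness, a sketch of that salt-and-hash step (assuming SHA-256 and Node's crypto; exactly how the salt is combined with the value bytes is illustrative, since I haven't pinned that down above):

import { createHash } from 'crypto';

const s = '6YCYrdrSxs7q6dlO562YI6GhAktsBExFe6rcCZ+OX9I=';
const valueBytes = new TextEncoder().encode(uc1); // uc1 from the snippet above
const h = createHash('sha256')
  .update(Buffer.from(s, 'base64')) // salt bytes first, then value bytes
  .update(valueBytes)
  .digest('base64');
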
danielfett commented 2 years ago

So the string 123,34,97,114,114,97,121,34,58,91,49,44,50,44,51,93,44,34,98,111,111,108,101,97,110,34,58,116,114,117,101,44,34,99,111,108,111,114,34,58,34,103,111,108,100,34,44,34,110,117,108,108,34,58,110,117,108,108,44,34,110,117,109,98,101,114,34,58,49,50,51,44,34,111,98,106,101,99,116,34,58,123,34,97,34,58,34,98,34,44,34,99,34,58,34,100,34,125,44,34,115,116,114,105,110,103,34,58,34,72,101,108,108,111,32,87,111,114,108,100,34,125 is sent from the issuer to the wallet and then on to the verifier for the selectively disclosed claims, if I understand your proposal correctly. How does this improve on the current solution, which would be sending the string {\"array\": [1, 2, 3], \"boolean\": true, \"color\": \"gold\", \"null\": null, \"number\": 123, \"object\": {\"a\": \"b\", \"c\": \"d\"}, \"string\": \"Hello World\"}?

sbutterfield commented 2 years ago

How does this improve on the current solution, which would be sending the string

It's not inline escaped JSON.

danielfett commented 2 years ago

We have added an explanation of why we chose JSON encoding here: https://drafts.oauth.net/oauth-selective-disclosure-jwt/draft-ietf-oauth-selective-disclosure-jwt.html#the-challenge-of-canonicalization

I still consider escaped JSON the simplest, most robust, and easiest-to-implement solution. We will certainly not invent our own way of encoding bytes just to avoid unusual-looking strings in a document that most applications will never even deal with. (To this point, please also take a look at the proposed processing model and the examples in the appendix.)