w3c / did-core

W3C Decentralized Identifier Specification v1.0
https://www.w3.org/TR/did-core/

Separating abstract data model from syntaxes #103

Closed talltree closed 4 years ago

talltree commented 4 years ago

The first public working draft (FPWD) currently defines DID documents in two sections.

  1. Section 5 defines DID Documents.
  2. Section 6 defines DID Document Syntax.

However, Section 5 does not actually define DID documents abstractly, but rather as a collection of JSON properties. This may be because the first sentence of Section 6 currently says:

A DID document MUST be a single JSON object conforming to [RFC8259].

However, this directly contradicts the second paragraph of Section 6, which says:

Although syntactic mappings are provided for JSON and JSON-LD only, applications and services can use any other data representation syntax, such as JXD (JSON XDI Data, a serialization format for the XDI graph model), XML, YAML, or CBOR, that is capable of expressing the data model.

Besides resolving these obvious conflicts in the FPWD, a number of WG members have asserted that, because DIDs and DID documents operate at such a low level of Internet infrastructure—and are effectively protocol elements in DID resolution—the following design principles should apply:

  1. DID document structure should be defined abstractly, using a language designed for abstract data modeling such as UML.
  2. Syntaxes for expressing DID documents should be defined separately from the abstract data model.
  3. No syntax should have special status, i.e., each should define exactly how it implements the abstract data model in its own separate section of the spec.

If there is rough consensus on these design principles, then it would make sense to revise the current structure of the spec accordingly.

Note that this issue is not entirely orthogonal to #95 (Document Structure), so a decision on this issue may affect that one.

SmithSamuelM commented 4 years ago

Using an abstract data model for the DID document specification enables tooling that validates DID documents to operate on the modeled form, engendering validation in a generic sense. Step one: convert the DID doc to the abstract data model. Step two: validate in model space. The mapping between the abstract data model and each format-specific syntax should be round-trippable, and the syntax mapping should be defined in both directions. As is well known, smart contracts expressed in the Ethereum smart contract language Solidity are difficult to validate. The best validation approaches converted Solidity contracts into abstract data models that could be validated. A UML State Chart format would allow modeling of the DID method operations on a DID doc in a formal way.

Besides the universal validation advantages, an abstract data model will clear up and prevent many of the conflicts over items and syntax in the DID doc, as the discussions often devolve into clarifications between JSON and JSON-LD syntaxes rather than functional changes to the spec. Making the core spec abstract, and then letting each syntax decide the best (two-way round-trippable) way to implement the data model in that syntax, will allow the focus of discussions to be where it needs to be. Generic functionality will be syntax independent, removing any discussion of syntax from those discussions. Syntax-specific discussions will be attended by those impacted by that syntax and knowledgeable in it, thereby focusing those discussions. This would also be an advantage for DID resolution.
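To make that pipeline concrete, here is a minimal Rust sketch of "convert, then validate in model space" with a round-trippable syntax mapping. The `DidDocumentModel` type, the `SyntaxMapping` trait, and the fields shown are hypothetical illustrations, not anything defined by the spec.

```rust
// Hypothetical abstract data model: a syntax-independent DID document.
#[derive(Debug, Clone, PartialEq)]
struct DidDocumentModel {
    id: String,
    public_keys: Vec<String>,
}

// Each concrete syntax supplies a mapping defined in both directions.
trait SyntaxMapping {
    fn to_model(&self, input: &str) -> Result<DidDocumentModel, String>;
    fn from_model(&self, model: &DidDocumentModel) -> String;
}

// Step one: convert the DID doc to the abstract model.
// Step two: validate in model space, independent of the source syntax.
fn validate(model: &DidDocumentModel) -> Result<(), String> {
    if !model.id.starts_with("did:") {
        return Err("id must be a DID".into());
    }
    Ok(())
}

// Round-trip property: model -> syntax -> model must be lossless.
fn round_trips<S: SyntaxMapping>(syntax: &S, model: &DidDocumentModel) -> bool {
    syntax.to_model(&syntax.from_model(model)).as_ref() == Ok(model)
}
```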

SmithSamuelM commented 4 years ago

Using an abstract data model for the DID methods will enable more universal DID resolvers, and the DID methods in a given language may be auto-generated from the abstract data model (such as UML State Charts). This makes code validation against a DID method spec possible.

dhuseby commented 4 years ago

@talltree thank you for taking the time to write this up. I couldn't agree with you more. I spent this summer implementing a Rust crate for parsing DIDs and DID documents. It is now the primary DID handling code in Hyperledger Aries. After getting a mostly working crate together, I was so incensed at the state of the spec that I drank three beers and then wrote out a rant from an implementer's perspective.

The part of my rant that is applicable to this issue is Section 4.1: The DID document spec should be encoding agnostic. I completely agree with @talltree and @SmithSamuelM that we should concentrate more on what must be in a DID document than on how it is encoded. Everything else we want to specify around DID operations (e.g. controller binding, key management, service point linking, etc) is independent of the encoding of the DID documents.

So why specify that DID docs must be JSON-LD? More importantly, what are the reasons why we wouldn't want to bless one encoding? It turns out that many different industries have settled on specific encodings as standard, and if we want SSI to penetrate into those markets, DID documents will need to be encodable in those encodings.

I'm not talking about insignificant industries either. The entire legal profession has settled on PDF documents, for better or worse. Adobe and several other vendors (e.g. Docusign) are building out identity-related products meant to handle cryptographic credential distribution and management inside of PDF documents. Those products sound suspiciously like SSI and the PDF metadata sounds suspiciously like DID document data. If the DID spec outlined the data that should be in a DID doc and then formalized ways that data can be encoded in different formats, then support for PDFs would be possible for the legal industry.

The same goes for the financial industry. They recently settled on the ISO 20022 encoding for financial records and messaging. If we want to bring DID docs to the financial industry we'll need to support that. Same goes for national ID/drivers' license standards. Birth certificate encodings. Health record encodings. Hell, even climate scientists have settled on the HDF5 standard for all weather and geosat data sets. I think it would be a nice improvement if weather and climate instruments were digitally signing all of their data using SSI/DID credentials and enabling the provenance tracking of the data. But that would require the DID credentials to be encoded in HDF5.

To put a period at the end of my point, here are the recommendations that I made that are in full agreement with @talltree and @SmithSamuelM :

  1. Focus on what data must be in a DID document.
  2. Create a registry of encoding methods.
  3. Define the process by which a new encoding method can be included in the registry of standard encoding methods.
  4. Add JSON-LD as the first encoding method in the registry and move all of the encoding details out of the existing FPWD and into the JSON-LD encoding spec (i.e. the canonicalization rules, etc).

This also future proofs the DID spec because we can incorporate new encodings on the fly.
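As a rough illustration of recommendations 2 and 3 above, an encoding registry could be little more than a name-to-codec map. The `Codec` trait, `EncodingRegistry` type, and the minimal `DidDocumentModel` below are hypothetical sketches, not proposed spec text.

```rust
use std::collections::HashMap;

// Minimal stand-in for the shared abstract model (hypothetical).
struct DidDocumentModel {
    id: String,
}

// Hypothetical codec interface: each registered encoding decodes into, and
// encodes from, the same abstract model.
trait Codec {
    fn decode(&self, bytes: &[u8]) -> Result<DidDocumentModel, String>;
    fn encode(&self, model: &DidDocumentModel) -> Vec<u8>;
}

// A registry keyed by encoding name (e.g. "json-ld"), so new encodings can be
// added later without touching the core data model.
struct EncodingRegistry {
    codecs: HashMap<String, Box<dyn Codec>>,
}

impl EncodingRegistry {
    fn new() -> Self {
        EncodingRegistry { codecs: HashMap::new() }
    }

    fn register(&mut self, name: &str, codec: Box<dyn Codec>) {
        self.codecs.insert(name.to_string(), codec);
    }

    fn get(&self, name: &str) -> Option<&dyn Codec> {
        self.codecs.get(name).map(|c| c.as_ref())
    }
}
```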

dhuseby commented 4 years ago

In theory, if we follow the pattern of specifying the minimum data and then also specifying a registry of encoding methods, there's nothing keeping us from saying that X.509v3 certificates are a valid encoding method for a publicKey data unit as long as they contain the necessary fields. Based on my survey of the different methods of representing cryptographic key material, the only thing missing from the X.509v3 spec would be the "id", unless we abuse the "common name" (CN) member of the "Subject" field as specified in the latest PKIX RFCs to contain a DID URI instead of a URL.

This approach would certainly address issue #69.

dhuseby commented 4 years ago

But given my grok of DID, I don't think the existing certificate formats contain enough data to fully support DID. They certainly don't include any data to support the standard key management functions we seek to include in the standard.

dhuseby commented 4 years ago

But how cool would it be to make the set of DID specs able to incorporate the existing CA system? Credentials from a CA in the form of an EV certificate could work just as well as KYC credentials from your bank or DMV. We would just need to wrap them in the DID context so that they are actually useful for SSI.

dhuseby commented 4 years ago

I am now laughing to myself at the thought of the did:ca: method.

In all seriousness though, I think we're getting somewhere if we are able to boil down the suite of DID specs to the core tenets of cryptographic identity, such that the CA system turns out to be one narrow implementation of the standard.

selfissued commented 4 years ago

I have worked on specs* where there was an abstract data model spec and a companion concrete binding spec, and do you know what? Developers hated it!

They said that it was overly confusing to have to read both specs in parallel to try to piece together what they actually needed to implement. In the end, overwhelming feedback from developers caused us to abandon that approach. We folded all the normative requirements in the abstract spec into the concrete spec, and do you know what? The result was much clearer and easier for developers to use.

I'll also add that, as an editor, maintaining the two parallel specs, keeping them in sync, and figuring out which statements belonged in the abstract spec and which belonged in the concrete spec was a special kind of hell. I wouldn't wish it on anyone.

Please, let's not go down this rathole. Let's create a great JSON DID spec. If we later want to create a parallel concrete representation in another data format, we can do that. But let it be usable on its own, just as JWT [RFC 7519] and CWT [RFC 8392] are parallel but each usable on its own without reference to the other.

* These specs were OpenID Connect Messages (the abstract form) and OpenID Connect Standard (the HTTP/OAuth binding of the abstract form). Developers revolted and insisted that we merge them to create OpenID Connect Core before we made OpenID Connect final.

SmithSamuelM commented 4 years ago

@selfissued It's interesting to hear about your experience with an abstract data model spec (which spec was it?). One of my concerns is that we are currently supporting two specs: a JSON version and a JSON-LD version. Only it's not two encodings, it's some Frankensteinian combination of both in one encoding. Having been an editor on other specs (CEA 852, 852.1, 709.X), I find that focusing on the data model, even when there is a preferred implementation language, is a good thing. If there is a strong use case for more than one encoding (which is already the case for DIDs, given both JSON and JSON-LD), then IMHO tracking multiple encodings becomes very difficult indeed if there is not an abstract data model. Alternatively, if there is only one encoding, then having both an abstract data model and an encoding is more complicated, and I would be on your side in saying let's not have an abstract data model.

So if you don't mind responding: for the spec you mentioned, was there ever more than one encoding? If not, then it's not a convincing case against having one. As a developer I like having examples for implementation purposes, but if I have to support more than one implementation then I want a canonical specification that is the source of truth; otherwise I spend way too much time trying to determine what is implementation detail and what is canonical detail. That seems to be what has been happening with the DID spec: way too much time spent bike-shedding JSON-LD vs JSON, and not having a clean spec for implementing either of them. As cryptographic primitives, DIDs should have more than one encoding. We want universality and portability. The price of that is creating an unambiguous spec. There are two paths to clarity: a single canonical implementation encoding, or a single canonical data model. If we anticipate multiple encodings, then the latter is better in the long run.

dhuseby commented 4 years ago

@selfissued we're already doing the split between abstract and concrete with the DID method specs, and it seems to work just fine. There's a template for what a DID method spec needs to define, and implementors define those things for their spec. I wrote/maintain the Git DID method spec and it works just fine.

BTW, I'm also proposing in other issues that the existing DID spec be broken up into: DID URI spec, DID key material spec, DID service endpoint spec, and the overall DID doc spec. I think it will be pretty easy to specify what the data model is given the very specific nature of what we are trying to accomplish.

selfissued commented 4 years ago

Answering @SmithSamuelM's question, the last version of the abstract spec was https://openid.net/specs/openid-connect-messages-1_0-20.html, the last version of the concrete binding spec was https://openid.net/specs/openid-connect-standard-1_0-21.html, and they were replaced by the combined specification https://openid.net/specs/openid-connect-core-1_0.html, which became a standard.

SmithSamuelM commented 4 years ago

@selfissued It looks like all three versions of the spec are essentially the same; all use HTTP for non-normative examples. So maintaining two versions would be confusing. What I believe is being proposed here is not to follow the example of OpenID, but instead to have only one full spec with appendices that are normative annotated code examples. The normative binding material would then be simplified: normative examples, but not a complete duplicate of the spec. A developer has one spec that defines the data model, plus a set of normative examples for the specific encoding, annotated with clarifying comments. This is in contradistinction to essentially copying and pasting the full text of the spec for each encoding. I would not want the encodings to be stand-alone specifications, as appears to be the case for the OpenID examples you gave. That would be a nightmare. =) Instead the encodings would be annotated normative or compliant code examples. Spec maintenance then becomes generating the annotations, such as field and block type representation, and making sure that the encodings pass compliance tests, as opposed to the editors being forced to synchronize multiple standalone copies of the same spec that differ only in the normative code examples. One could then use the simplest, most common encoding (JSON in this case, or pseudocode) to provide illustrative but non-normative examples throughout the body of the spec.

talltree commented 4 years ago

@selfissued (which, I just have to say, in the context of the work we are doing, is about the coolest handle ever): thank you for the concrete examples of the problem of doing entirely separate specs for abstract data model and concrete encoding model. I agree that could become a nightmare.

What I had in mind when I raised this issue is exactly the model @SmithSamuelM describes: one spec that defines the abstract data model in one section and then uses subsections or appendices for defining each encoding. So the end result is one spec regardless. If someone wants a different encoding after we are done, they can either write a separate spec or convince us to version the main spec.

I also believe this could help us nicely partition the work. Everyone who cares about the abstract data model can collaborate on that, and then those who care about a particular encoding can collaborate on that encoding. Encoding-specific issues stay with the encoding teams, and only issues with the abstract data model are handled by the abstract data model team.

Note that this approach will also address issue #92. And I think it will help us decide about #95.

dhuseby commented 4 years ago

I fully agree with the above two statements. I also want to clarify my 2p on this since I look at these problems from an implementor's perspective. When it comes to encodings there are exactly two things we care about:

  1. The representation of constants (e.g. algorithm identifiers, key usage restrictions, service endpoint types, dictionary key names in encodings like JSON that store key names).
  2. The canonicalization algorithm used to encode a data object into a form that can be digitally signed/verified.

So what I would like to see is lists of constants used to identify all of the blessed algorithm types, key usage restrictions, service endpoint types, key management function types, and the constants from the data model, such as the name of each part of a key record.

Then for each encoding appendix, there would be a section that maps the constants to their values in the encoding (e.g. "RsaSignature2018" and "keyEncoding" in JSON).

After that I expect there to be a short section on the canonicalization algorithm used for that particular encoding. So the existing stuff about the JSON encoding would move to the JSON encoding appendix.
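A tiny Rust sketch of what such a per-encoding constants table and canonicalization hook might look like; the enum variants, the JSON names, and the `canonicalize_json` stub are illustrative placeholders, not values from any actual registry.

```rust
use std::collections::HashMap;

// Abstract constants defined once by the data model...
#[derive(Hash, PartialEq, Eq)]
enum AbstractConstant {
    PublicKeyField,
    ServiceEndpointField,
    RsaSignatureSuite,
}

// ...and a per-encoding table giving each constant its concrete spelling.
// The JSON names below are illustrative, not normative.
fn json_constants() -> HashMap<AbstractConstant, &'static str> {
    HashMap::from([
        (AbstractConstant::PublicKeyField, "publicKey"),
        (AbstractConstant::ServiceEndpointField, "service"),
        (AbstractConstant::RsaSignatureSuite, "RsaSignature2018"),
    ])
}

// Each encoding appendix would also name the canonicalization routine applied
// before signing/verifying; this stub only stands in for that step.
fn canonicalize_json(doc: &str) -> String {
    doc.trim().to_string()
}
```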

Then the only other thing we care about is extensibility: how the lists of constants can be expanded to include experimental features that may get included in future revisions of the spec. For instance, HTTP used to recommend using "X-my-header" for new experimental or private headers. We should address how a vendor would extend this spec to cover a novel key encoding or novel algorithm that isn't already accounted for in the spec. It may be as simple as including a URI/URL to the canonical reference on the non-standard data.

What you did in the OpenID spec is not at all what I was thinking about. My approach is informed by my years in video game programming, where we had an overall data model and then specific encodings for each target video game hardware (e.g. PC, Xbox, PlayStation, etc). All of our tools were built around assumptions about the core data model (i.e. renderable objects always have a mesh ID referencing the mesh data), and all of the load/save functions were smart enough to detect a specific encoding and do the translations on the fly when needed.

Having a core data model allows implementors to detect malformed DID docs regardless of encoding. That's where we ultimately want to be. I shouldn't care if a DID doc comes in as JSON over an HTTPS GET, or is loaded from the metadata in a PDF, or is scanned from the back of a physical driver's license. I should be able to write code that validates the core data model and can work with DID documents from any source. What gets me excited is the thought that I could have just one implementation that loads DID documents from scanned driver's licenses and can then immediately forward them to the universal resolver over an HTTP request as JSON-encoded data. Or loading a DID from a HIPAA-encoded medical record and storing it in a PDF that is a medical bill. If we don't do this separation, we'll be endlessly hand-coding hacks to stuff JSON-LD data into all of these non-JSON-LD-aware data systems.
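For illustration, a sketch of one validator over the abstract model fed by decoders for different sources; the hand-rolled JSON lookup and the stubbed PDF extractor below are placeholders standing in for real parsing code, and every name is hypothetical.

```rust
// Hypothetical abstract model shared by every encoding.
struct DidDocumentModel {
    id: String,
}

// One validator works on the abstract model, regardless of where the document
// came from (JSON over HTTPS, PDF metadata, a scanned license, ...).
fn validate(doc: &DidDocumentModel) -> Result<(), String> {
    if doc.id.starts_with("did:") { Ok(()) } else { Err("not a DID".into()) }
}

// Illustrative decoders for two of the sources mentioned above.
fn from_json(json: &str) -> Result<DidDocumentModel, String> {
    // A real implementation would use a JSON parser; this sketch just looks
    // for an "id" member by hand to stay dependency-free.
    let start = json.find("\"id\"").ok_or("missing id")?;
    let value = json[start..].split('"').nth(3).ok_or("malformed id")?;
    Ok(DidDocumentModel { id: value.to_string() })
}

fn from_pdf_metadata(_bytes: &[u8]) -> Result<DidDocumentModel, String> {
    // Stub: stands in for whatever metadata extraction a real tool would do.
    Err("PDF metadata extraction not sketched here".into())
}
```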

jricher commented 4 years ago

I think the real answer is what the current document is attempting: to have an abstractable data model, but have that model expressed as a known concrete data format. Any other serializations are into and out of that format. This puts hard limits on representation of values, composition of objects, and other items like that. If it can't be represented in the JSON serialization, then it can't be represented. That way you get out of problems like XML attributes and comments, which have no simple JSON equivalent, and instead get to have one data format that can be used across all things. We tried to do this with the VC Data Model spec, with language added towards the end of its lifetime declaring that all serialization and encoding needed to be lossless, bidirectional, and deterministic across any format. I think this could have been helped by having the VC Data model fundamentally expressed as a JSON document explicitly, instead of the implied JSON-LD that's there today.

Regardless, while there's definitely value in an abstract model, there's more value in what @selfissued says above about concrete bindings to real representations. And if you can make your concrete representation translatable to different formats, then you've essentially won for both sides.

selfissued commented 4 years ago

To clarify my position on this following a phone conversation with @talltree, I'm fine with us working on multiple DID encodings as needed by community use cases, provided each is in a separate specification and that the first encoding we work on is JSON. If the JSON encoding also can be used as an abstract or prototype encoding that the others normatively reference, all the better.

SmithSamuelM commented 4 years ago

Mike,

My concern is that the separate spec, versus an appendix or addendum to the spec, might be a hill I would not choose to die on. It's clear that we might be able to win only one battle in this war, and that is the abstract data model. Attempting to also win the separate-spec battle might lose us the war.

Sam

msporny commented 4 years ago

The data model has been separated from the syntaxes in PR #186, which was merged yesterday.

You can view the new layout in the latest published spec:

https://www.w3.org/TR/did-core/

Closing this issue unless there are objections.