Pull out `id` from `credentialSubject`. Change `credentialSubject` to `claims`.

w3c / vc-data-model

W3C Verifiable Credentials v2.0 Specification

https://w3c.github.io/vc-data-model/


Pull out `id` from `credentialSubject`. Change `credentialSubject` to `claims`. #1130

Closed decentralgabe closed 1 year ago

decentralgabe commented 1 year ago

The current credentialSubject property is confusing for implementers and anyone inspecting a verifiable credential. It would be more meaningfully named claims, which makes it abundantly clear that the section is for the claims being made in the credential.

This issue is compounded by the current usage of the id property, which is an optional property within the current credentialSubject. There are two sources of confusion here:

Needing to check for an optional property within a confusingly named credentialSubject property to learn whom the subject values are about (if they're about no one why are they there?)
Needing to understand that the id property is a special property, with special processing rules, that identifies the party whom the credential subject is about <--- this should be confusing enough for you to agree to rename the property.
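To make the first point concrete, here is a minimal Python sketch of the lookup a consumer has to perform today. The helper name `credential_subject_ids` is hypothetical and not part of any VC library; the sketch only assumes that credentialSubject is a single object or an array of objects, and that id may be absent from any of them.

```python
from typing import Any, Dict, List, Optional


def credential_subject_ids(vc: Dict[str, Any]) -> List[Optional[str]]:
    """Return the id of each credential subject, or None where id is omitted."""
    subjects = vc.get("credentialSubject", [])
    # credentialSubject may be a single object or an array of objects.
    if isinstance(subjects, dict):
        subjects = [subjects]
    return [subject.get("id") for subject in subjects]


with_id = {
    "credentialSubject": {
        "id": "did:example:ebfeb1f712ebc6f1c276e12ec21",
        "degree": {"type": "BachelorDegree"},
    }
}
without_id = {"credentialSubject": {"degree": {"type": "BachelorDegree"}}}

print(credential_subject_ids(with_id))
print(credential_subject_ids(without_id))
```

The second call yields a list containing `None`, which is the case a caller has to decide how to handle.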

So, I propose two changes:

Before

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3.org/ns/credentials/examples/v2"
  ],
  "id": "http://example.edu/credentials/3732",
  "type": ["VerifiableCredential", "UniversityDegreeCredential"],
  "issuer": "https://example.edu/issuers/565049",
  "validFrom": "2010-01-01T00:00:00Z",
  "credentialSubject": {
    "id": "did:example:ebfeb1f712ebc6f1c276e12ec21",
    "degree": {
      "type": "BachelorDegree",
      "name": "Bachelor of Science and Arts"
    }
  }
}

After

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3.org/ns/credentials/examples/v2"
  ],
  "id": "http://example.edu/credentials/3732",
  "type": ["VerifiableCredential", "UniversityDegreeCredential"],
  "issuer": "https://example.edu/issuers/565049",
  "validFrom": "2010-01-01T00:00:00Z",
  "subject": "did:example:ebfeb1f712ebc6f1c276e12ec21",
  "claims": {
    "degree": {
      "type": "BachelorDegree",
      "name": "Bachelor of Science and Arts"
    }
  }
}

By elevating subject to a top-level property we follow the existing pattern used by issuer. We remove ambiguity about whom the credential is about. By renaming credentialSubject to claims we give a meaningful name to the property and remove confusion about its usage, including removing the need to parse its value for any special-case properties. Implementers will rejoice.
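As a rough illustration of the two changes, the following Python sketch maps the Before shape onto the After shape. The function name `to_proposed_shape` is an assumption made for this example, and the sketch covers only the single-subject case.

```python
import copy
from typing import Any, Dict


def to_proposed_shape(vc: Dict[str, Any]) -> Dict[str, Any]:
    """Move credentialSubject.id to a top-level subject and rename the rest to claims."""
    out = copy.deepcopy(vc)  # leave the caller's credential untouched
    subject = out.pop("credentialSubject")
    if not isinstance(subject, dict):
        raise ValueError("this sketch only handles a single credentialSubject object")
    subject_id = subject.pop("id", None)
    if subject_id is not None:
        out["subject"] = subject_id
    out["claims"] = subject
    return out


before = {
    "id": "http://example.edu/credentials/3732",
    "issuer": "https://example.edu/issuers/565049",
    "credentialSubject": {
        "id": "did:example:ebfeb1f712ebc6f1c276e12ec21",
        "degree": {"type": "BachelorDegree", "name": "Bachelor of Science and Arts"},
    },
}
proposed = to_proposed_shape(before)
print(proposed["subject"], proposed["claims"])
```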

Caveats

The only thing that this breaks, as far as I am aware, is multiple subjects. I believe this can be handled like so...

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3.org/ns/credentials/examples/v2"
  ],
  "id": "http://example.edu/credentials/3732",
  "type": ["VerifiableCredential", "NameCredential"],
  "issuer": "https://example.edu/issuers/565049",
  "validFrom": "2010-01-01T00:00:00Z",
  "subject": ["did:example:ebfeb1f712ebc6f1c276e12ec21", "did:example:6f1c276e12ec21ebfeb1f712ebc"],
  "claims": [
    { "name": "Alice" },
    { "name": "Bob" }
  ]
}

OR13 commented 1 year ago

Here is a JSON-LD preview

<http://example.edu/credentials/3732> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.w3.org/2018/credentials#VerifiableCredential> .
<http://example.edu/credentials/3732> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.w3.org/ns/credentials/examples#NameCredential> .
<http://example.edu/credentials/3732> <https://www.w3.org/2018/credentials#issuer> <https://example.edu/issuers/565049> .
<http://example.edu/credentials/3732> <https://www.w3.org/2018/credentials#validFrom> "2010-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<http://example.edu/credentials/3732> <https://www.w3.org/ns/credentials/examples#claims> _:c14n0 .
<http://example.edu/credentials/3732> <https://www.w3.org/ns/credentials/examples#claims> _:c14n1 .
<http://example.edu/credentials/3732> <https://www.w3.org/ns/credentials/examples#subject> "did:example:6f1c276e12ec21ebfeb1f712ebc" .
<http://example.edu/credentials/3732> <https://www.w3.org/ns/credentials/examples#subject> "did:example:ebfeb1f712ebc6f1c276e12ec21" .
_:c14n0 <https://www.w3.org/ns/credentials/examples#name> "Bob" .
_:c14n1 <https://www.w3.org/ns/credentials/examples#name> "Alice" .

dlongley commented 1 year ago

-1, the property is not "for claims" (in the absence of a credential subject), but to express claims about a credential subject.

When we speak of "claims" in the VCDM, we are not talking about a container or bucket of arbitrarily modeled data, but rather, claims are "subject - property - value" statements. The top-level subject is the credential subject. Each subject is expressed as a JSON object, identified by the id property with its other properties (other JSON keys) holding the value component(s) for each claim. The same modeling applies recursively with nested data. This is how the data model works and is expressed today in the spec.

andresuribe87 commented 1 year ago

I agree with Dave's comment. That said, I think it's important to acknowledge that implementers are getting confused. I think part of the issue is that VCDM is about a graph of information, but people are thinking about it in JSON terms. Are there other changes that we could do to clarify such confusions?

decentralgabe commented 1 year ago

@dlongley while what you said is factually correct I have a hard time understanding it and bet that the majority of the working group, and an even greater majority of the implementers don't understand it either. That's a problem.

Further, any language about "subject - property - value" statements is non-normative, so I cannot find this argument convincing.

dlongley commented 1 year ago

More on how things work today from an older issue:

https://github.com/w3c/vc-data-model/issues/931#issuecomment-1508999231

dlongley commented 1 year ago

@decentralgabe,

@dlongley while what you said is factually correct I have a hard time understanding it and bet that the majority of the working group, and an even greater majority of the implementers don't understand it either. That's a problem.

I agree that we should elaborate more.

Further, any language about "subject - property - value" statements is non-normative, so I cannot find this argument convincing.

Well, no informative explanation is going to be "normative" (a contradiction of terms). What's normative is the reference that says the base data model representation is JSON-LD compact form. You can read the JSON-LD spec for all the details I mentioned above in a normative way as a testable expression of the data model. However, I agree that we could do a better job explaining things -- but it is going to be non-normative / informative text by its very nature.

selfissued commented 1 year ago

I agree with the direction this issue takes us in.

TallTed commented 1 year ago

@selfissued — Would you please elaborate on the direction you see this issue taking us?

melvincarvalho commented 1 year ago

I suspect those coming from the JSON world would favour the approach of @decentralgabe and those coming from the RDF world would favour the approach of @dlongley

The proposed changes by @decentralgabe aim to reduce the complexity of the Verifiable Credentials data model by renaming credentialSubject to claims and elevating id within it to a top-level subject property. This indeed simplifies the model in the context of JSON representations and could lead to easier understanding and implementation.

I understand and respect the concerns raised by @dlongley and others regarding the inherent graph nature of the data model and the potential loss of expressiveness with this simplification. However, I believe that a balance between simplicity and expressiveness is necessary to ensure wide adoption.

I don't want to make a proposal here, as there are already some and one more may not help. But as an illustration, there might be a middle ground: a term like 'verifies' could be more intuitive to those coming from both the JSON world and the RDF world, and it is also contained in the name of the project.

msporny commented 1 year ago

-1 because we've gone down this road before (see link).

We have spent significant time in previous iterations of the WG discussing this topic: https://github.com/w3c/vc-data-model/issues/1128#issuecomment-1546324637

I'll also note that it's not clear what object the "claims" refer to, and if we use claims, it can only refer to one object (the subject, presumably).

That is, issuer can have claims asserted about it as well...

  "issuer": {
    "id": "https://example.edu/issuers/565049",
    "name": "Example University",
    "image": "https://example.edu/images/logo.png"
  }

as can any arbitrary object (via extension point, like status) expressed at the top level of a VC or VP.

"status": {
    "id": "https://university.example/credentials/status/3#94567",
    "type": "StatusList2021Entry",
    "statusPurpose": "revocation",
    "statusListIndex": "94567",
    "statusListCredential": "https://university.example/credentials/status/3"
}

So, it doesn't make any sense to have a single property called claims in the VC because an issuer makes claims about far more than just the subject in a VC.

melvincarvalho commented 1 year ago

CredentialSubject is confusing tho. What about validates?

decentralgabe commented 1 year ago

@msporny good point, I think subjectClaims would alleviate that concern.

No matter what, an issuer is making claims about something. When that something is a subject, it is much clearer to explicitly call out that subject using a specific subject property. The subject could be the issuer. It is unclear what the behavior is when there is no subject property as the spec allows for today.

Edit: I agree with a lot of the original arguments in https://github.com/w3c/vc-data-model/issues/480

jandrieu commented 1 year ago

No matter what, an issuer is making claims about something. When that something is a subject, it is much clearer to explicitly call out that subject using a specific subject property. The subject could be the issuer.

@decentralgabe You may be mistakenly thinking that VCs have just a single subject. VCs like a marriage license actually contain claims for at least four subjects (officiant, spouse1, spouse2, and witness).

It is unclear what the behavior is when there is no subject property as the spec allows for today.

Where does the spec "allow for" a subject property? Currently, there is no subject property, there is a "credentialSubject" property that allows for sets of claims about different credential subjects in the JSON-LD manner.

If we had a subject property that was not in the JSON-LD manner, sure, you could have an array of ids for those parties, but then you would still need to associate the specific claims with each of those subjects.

In short, you'd still need a way to represent the triples that say "subjectX predicateA objectB" for each of those subjects. These are the claims about SubjectX where each VC may have multiple subjects.

The credentialSubject property is that array that links statements about subjects to identifiers for those subjects. Having a separate property that just has subject identifiers is unnecessary and redundant with the id subproperty of an entry in the credentialSubject array.

To @dlongley's point, this is normatively defined today:

What's normative is the reference that says the base data model representation is JSON-LD compact form.

Having better language explaining this would be an improvement. However, getting rid of this JSON-LD pattern of representing statements about subjects, would, I think, misalign the data model relative to the semantic and profoundly break the fundamental data model that VCs are based on.

-1 to this adjustment.

Also, I doubt we'll get consensus on this change.

decentralgabe commented 1 year ago

@jandrieu

You may be mistakenly thinking that VCs have just a single subject. VCs like a marriage license actually contain claims for at least four subjects (officiant, spouse1, spouse2, and witness).

No, in fact in my initial post I stated the contrary and gave an example on how you would address multiple subjects. It appears that I would need to amend my post to represent the correlation between the subject property and the specific claim. See the following example:

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3.org/ns/credentials/examples/v2"
  ],
  "id": "http://example.edu/credentials/3732",
  "type": ["VerifiableCredential", "NameCredential"],
  "issuer": "https://example.edu/issuers/565049",
  "validFrom": "2010-01-01T00:00:00Z",
  "subject": ["did:example:ebfeb1f712ebc6f1c276e12ec21", "did:example:6f1c276e12ec21ebfeb1f712ebc"],
  "subjectClaims": [
    { "subjectId": "did:example:ebfeb1f712ebc6f1c276e12ec21", "name": "Alice" },
    { "subjectId": "did:example:6f1c276e12ec21ebfeb1f712ebc", "name": "Bob" }
  ]
}

Where does the spec "allow for" a subject property? Currently, there is no subject property, there is a "credentialSubject" property that allows for sets of claims about different credential subjects in the JSON-LD manner.

This was not clear. We allow for a credentialSubject without an id property -- that's confusing.

Having better language explaining this would be an improvement. However, getting rid of this JSON-LD pattern of representing statements about subjects, would, I think, misalign the data model relative to the semantic and profoundly break the fundamental data model that VCs are based on.

Are you and @dlongley asserting that what I proposed is impossible to represent in JSON-LD? That it is impossible to have separate subject and subjectClaims properties? If that is truly the case then it seems that our tooling is forcing us into strange JSON representations that are unhelpful to developers, and we should strongly reconsider them.

dlongley commented 1 year ago

@decentralgabe,

Are you and @dlongley asserting that what I proposed is impossible to represent in JSON-LD? If that is truly the case then it seems that our tooling is forcing us into strange JSON representations that are unhelpful to developers, and we should strongly reconsider them.

No, I am not making that assertion. I also find what you proposed to be bizarre in plain JSON. If I had to model a car in JSON, I'd probably do it something like this:

{
  "id": "some VIN",
  "type": "Car",
  "color": "red",
  "engine": {
    "id": "some serial number",
    "type": "InternalCombustionEngine",
    "cylinders": 8
  },
  ...
}

I would NOT do this:

{
  "subject": ["some VIN", "some serial number"],
  "claimsBucket": [{
    "type": "Car",
    "color": "red",
    "engine": "???"
  }, {
    "type": "InternalCombustionEngine",
    "cylinders": 8
  }]
}

And then expect people to map the different positions in "subject" to the positions in "claimsBucket" to understand where the IDs applied. I would actually find that to be non-idiomatic JSON and quite frustrating. Notably, JSON-LD was designed to serve idiomatic JSON by layering linked data on top of it with a goal of getting as close as possible to "zero edits / changes".

It's true that sometimes people create JSON that is more or less "whatever" -- but that doesn't make for a consistent nor compositional data model. It requires everything to be understood in some bespoke way. Instead, it's better to represent your objects as ... JSON objects ... and your object's properties as properties of that object (JSON keys) and your object's properties values as ... the values of those properties. Then nest away based on properties that link to other objects as values. This is all quite natural modeling, IMO.

So I wouldn't endorse your suggestion as a way to do plain JSON. It assumes a very simplistic, non-compositional model with a lot of external (not internally present) information to understand it (like the mapping of subject positions to positions in a big claims bucket).
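A small Python sketch of the positional-mapping concern, using the Alice and Bob example from earlier in the thread. The helper names `names_by_position` and `names_by_id` are made up for illustration. The point is only that pairing two parallel arrays silently misattributes claims once one array is reordered, while objects that carry their own id are unaffected by order.

```python
from typing import Any, Dict

ALICE = "did:example:ebfeb1f712ebc6f1c276e12ec21"
BOB = "did:example:6f1c276e12ec21ebfeb1f712ebc"


def names_by_position(doc: Dict[str, Any]) -> Dict[str, str]:
    """Pair subject[i] with claims[i]; correct only if both arrays stay in sync."""
    return {sid: claim["name"] for sid, claim in zip(doc["subject"], doc["claims"])}


def names_by_id(doc: Dict[str, Any]) -> Dict[str, str]:
    """Read the id from the same object that holds the claim."""
    return {subject["id"]: subject["name"] for subject in doc["credentialSubject"]}


positional = {"subject": [ALICE, BOB], "claims": [{"name": "Alice"}, {"name": "Bob"}]}
# Same data, but the subject array was reordered somewhere along the way.
reordered = {"subject": [BOB, ALICE], "claims": positional["claims"]}
nested = {"credentialSubject": [{"id": BOB, "name": "Bob"}, {"id": ALICE, "name": "Alice"}]}

print(names_by_position(positional))  # Alice and Bob attributed as intended
print(names_by_position(reordered))   # names now attached to the wrong ids
print(names_by_id(nested))            # order of the array does not matter
```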

dlongley commented 1 year ago

@decentralgabe,

Notably this change from your original suggestion:

{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3.org/ns/credentials/examples/v2"
  ],
  "id": "http://example.edu/credentials/3732",
  "type": ["VerifiableCredential", "NameCredential"],
  "issuer": "https://example.edu/issuers/565049",
  "validFrom": "2010-01-01T00:00:00Z",
  "subject": ["did:example:ebfeb1f712ebc6f1c276e12ec21", "did:example:6f1c276e12ec21ebfeb1f712ebc"],
  "subjectClaims": [
    { "subjectId": "did:example:ebfeb1f712ebc6f1c276e12ec21", "name": "Alice" },
    { "subjectId": "did:example:6f1c276e12ec21ebfeb1f712ebc", "name": "Bob" }
  ]
}

Looks just like what we have today, except there's a special "subjectId" property instead of "id" (which is what is consistently used for all IDs for any object in our model today) and the name "subjectClaims" instead of "credentialSubject". Then there's the extra "subject" property, which seems redundant. In short, I think what we have today is simpler and achieves the same goal with more consistency.

msporny commented 1 year ago

@decentralgabe wrote:

It appears that I would need to amend my post to represent the correlation between the subject property and the specific claim. See the following example:

   ...
  "subject": ["did:example:ebfeb1f712ebc6f1c276e12ec21", "did:example:6f1c276e12ec21ebfeb1f712ebc"],
  "subjectClaims": [
    { "subjectId": "did:example:ebfeb1f712ebc6f1c276e12ec21", "name": "Alice" },
    { "subjectId": "did:example:6f1c276e12ec21ebfeb1f712ebc", "name": "Bob" }
   ...

hmm, let me fix that for you by renaming subjectClaims to credentialSubject and subjectId to id:

   ...
  "subject": ["did:example:ebfeb1f712ebc6f1c276e12ec21", "did:example:6f1c276e12ec21ebfeb1f712ebc"],
  "credentialSubject": [
    { "id": "did:example:ebfeb1f712ebc6f1c276e12ec21", "name": "Alice" },
    { "id": "did:example:6f1c276e12ec21ebfeb1f712ebc", "name": "Bob" }
   ...

Once you do that, subject becomes redundant and the only difference between what we have today vs. what you've proposed is the name of the properties.

decentralgabe commented 1 year ago

@msporny no, you miss the point that having subjectId is only necessary when there are multiple subjects. Maybe there's a more elegant way. I would assume ordering would implicitly make this property unnecessary, but I've heard LD does not maintain ordering.

I'd assert that 90% of the time (if not more) a credential is about a single subject. The data model today makes that fact confusing, and doesn't require the identification of that subject.

@dlongley we could debate what idiomatic JSON looks like. JWTs have top level subject identifiers and separate claims and seem to work just fine. In fact they have much broader adoption than VCs. Because a VC must have a credentialSubject, it makes sense to have a top-level subject which refers to the id of the subject(s) the credential is about.

It is quite clear in the diagram at the beginning of this section that there is a set of credential metadata, and then claims. To not call the claims claims is confusing. Credential metadata includes other statements -- like issuer, evidence, status, etc.

Additionally, @dlongley, going back to your earlier comment...the spec clearly states that a claim is a statement about a subject. It would follow that having a section claims implicitly refers to those statements being made about a subject.

Recapping: we've contorted the data model in a terribly confusing way that is self-contradictory, given the sections I linked above. Let's reduce this confusion by calling subjects subjects and claims claims.

dlongley commented 1 year ago

@decentralgabe,

@dlongley we could debate what idiomatic JSON looks like. JWTs have top level subject identifiers and separate claims and seem to work just fine.

I recommend just using a JWT if it works for you. There's no point in duplicating that standard here.

In fact they have much broader adoption than VCs.

For different use cases. One reason VCs were invented years back was because JWTs on their own did not come with features to allow people to easily express open world data in consistent, expressible ways with extensible decentralized semantics. The standard that allows people to do that is JSON-LD, so VCs are built on top of it.

JWTs did not have uptake in the space VCs were created to fill. We would have just used JWTs if they had the kind of data modeling and features that it seems you're now suggesting we remove in favor of the constrained and simplistic JWT approach. The JWT approach works for the set of use cases JWTs were designed for: simple authorization and authentication tokens. The vast majority of JWT use cases look practically the same, using a very limited, but very reusable set of JWT claims. If that's all you need, use JWTs. But it doesn't make any sense to make VCs behave just like JWTs when JWTs are already a standard.

It is quite clear in the diagram at the beginning of this section that there is a set of credential metadata, and then claims. To not call the claims claims is confusing. Credential metadata includes other statements -- like issuer, evidence, status, etc.

Additionally, @dlongley, going back to your earlier comment...the spec clearly states that a claim is a statement about a subject. It would follow that having a section claims implicitly refers to those statements being made about a subject.

There's more than one subject in every VC. The credential itself is a subject. The credential subject is a subject. The issuer is a subject -- and so on. The graph of information is a collection of statements that are "subject - property - value", where the "subject" is whatever the properties and values apply to. This is why we use the term "credentialSubject" to refer specifically to the credential subject and the claims made about it -- and to distinguish it from other subjects in the graph. We don't just use the generic term "subject" for this because then it's that much easier to confuse the two (generic "subject" with "credential subject") ... like it seems you just did.

So for a VC:

{
  "@context": "...",
  "id": "this is the ID of subject A",
  "type": ["VerifiableCredential", "..."],
  "issuer": {
    "id": "this is the ID of subject B",
    "name": "Some Issuer"
  },
  "credentialSubject": {
    "id": "this is the ID of subject C, *the credential subject*",
    "aPropertyOfTheCredentialSubject": "foo"
  }
}

This can be expressed as a set of statements (subject - property - value) that form a graph of information:

"this is the ID of subject A" - "type" - "VerifiableCredential"
"this is the ID of subject A" - "issuer" - "this is the ID of subject B"
"this is the ID of subject B" - "name" - "Some Issuer"
"this is the ID of subject A" - "credentialSubject" - "this is the ID of subject C, *the credential subject*"
"this is the ID of subject C, *the credential subject*" - "aPropertyOfTheCredentialSubject" - "foo"

Recapping: we've contorted the data model in a terribly confusing way that is self-contradictory, given the sections I linked above. Let's reduce this confusion by calling subjects subjects and claims claims.

It's not contorted nor self-contradictory, there's just a misunderstanding. I believe the confusion here may actually be coming from removing the qualifier "credential" from "credentialSubject" leaving only "subject" behind ... with no way to differentiate it from every other subject in the graph of information. Another source of confusion could be from people in our group contextually and colloquially using "the subject" to mean "the credential subject". But the spec talks about more than just "the credential subject", it talks about "subject" as a generic "thing" in a subject-property-value statement (aka "claim"). I agree it would be good to see if there's some more informative text that would help alleviate confusion here.
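The subject - property - value listing earlier in this comment can be derived mechanically from the nested JSON. Below is a minimal Python sketch of that flattening. The function name `to_statements` is hypothetical, and the sketch assumes every nested object carries an id (it does not handle objects without one, and it is not a JSON-LD processor).

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[str, str, Any]


def to_statements(node: Dict[str, Any]) -> List[Statement]:
    """Flatten nested JSON objects into (subject, property, value) statements."""
    statements: List[Statement] = []
    subject = node["id"]  # assumption: every object is identified
    for prop, value in node.items():
        if prop in ("id", "@context"):
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Link to the nested object by its id, then describe that object.
                statements.append((subject, prop, item["id"]))
                statements.extend(to_statements(item))
            else:
                statements.append((subject, prop, item))
    return statements


vc = {
    "@context": "...",
    "id": "A",
    "type": ["VerifiableCredential"],
    "issuer": {"id": "B", "name": "Some Issuer"},
    "credentialSubject": {"id": "C", "aPropertyOfTheCredentialSubject": "foo"},
}

for statement in to_statements(vc):
    print(" - ".join(str(part) for part in statement))
```

Every statement has its own subject: A (the credential), B (the issuer), and C (the credential subject) all appear in the first position.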

msporny commented 1 year ago

@msporny no, you miss the point that having subjectId is only necessary when there are multiple subjects.

If that's the case, then you're saying that the identifier for the subject is optional, which is exactly what we have in the spec today. The only difference then becomes that the identifier of the subject of the credential is separated from the claims for the subject of the credential. Separating an identifier from the data that it's associated with does not seem like an improvement.

Maybe there's a more elegant way. I would assume ordering would implicitly make this property unnecessary, but I've heard LD does not maintain ordering.

Then you're talking about keeping the ordering of two array values in an object in sync, which seems less than ideal.

JSON-LD (and this goes down to the RDF model) does not maintain order in a *set*, which is true for any set-based data structure -- ordering is not maintained. That's just the pure mathematical definition of a set. JSON-LD (again, really the RDF data model) also has the concept of a list, which does preserve ordering. So, LD can do both unordered sets and ordered lists.

I'd assert that 90% of the time (if not more) a credential is about a single subject. The data model today makes that fact confusing, and doesn't require the identification of that subject.

No, 100% of the time, a verifiable credential contains information about multiple subjects. These include at least: the issuer, the credential itself, and the credentialSubject.

It seems like when you say "subject", you mean "credentialSubject"... and not the more general "subject" in the "subject-property-value" sense.

Additionally, @dlongley, going back to your earlier comment...the spec clearly states that a claim is a statement about a subject. It would follow that having a section claims implicitly refers to those statements being made about a subject.

Yes, but which subject?

Recapping: we've contorted the data model in a terribly confusing way that is self-contradictory, given the sections I linked above. Let's reduce this confusion by calling subjects subjects and claims claims.

I hope it's clear that by doing that, it confuses things further and doesn't simplify anything.

melvincarvalho commented 1 year ago

JSON-LD (and this goes down to the RDF model) does not maintain order in a set, which is true for any set-based data structure -- ordering is not maintained. That's just the pure mathematical definition of a set. JSON-LD (again, really the RDF data model) also has the concept of a list, which does preserve ordering. So, LD can do both unordered sets and ordered lists.

While it's accurate to state that pure mathematical sets do not maintain order, it's not entirely accurate to say that JSON-LD doesn't maintain order in a "set". This assertion seems to conflate the concept of a "set" in mathematical terms with the concept of a "set" in the context of programming languages and data structures.

In JSON-LD, an unordered collection of items is typically represented as an array. JSON, the underlying data format for JSON-LD, maintains the order of elements in an array. However, when JSON-LD is converted to RDF, which is a graph-based data model, that order is typically lost because RDF does not inherently support ordered collections. To preserve order, RDF provides a specific construct, the RDF List, but this is not commonly used due to its complexity.

Therefore, while JSON-LD can represent both ordered and unordered collections, it is not accurate to say that it doesn't maintain order in a "set". The truth of this statement largely depends on the context: it's true in the context of RDF, but not in the context of JSON, i.e., only when converted to RDF is the ordering lost.

Edit: quick example of how json-ld and RDF differ as a set:

the array : // legal in json(-ld)

{
  "@context": {
    "@vocab": "http://example.org/"
  },
  "numbers": [1, 2, 2, 3] 
}

the array : // not representable as-is in RDF: becomes a new array, the duplicate 2 is dropped, no order preserved

{
  "@context": {
    "@vocab": "http://example.org/"
  },
  "numbers": [2, 1, 3]
}
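A toy Python model of the same effect, assuming an RDF graph can be approximated as a set of (subject, predicate, object) tuples. The helper `to_triple_set` and the blank node label `_:b0` are illustrative only, and this is not a real JSON-LD to RDF conversion.

```python
from typing import Any, Dict, Set, Tuple

Triple = Tuple[str, str, Any]


def to_triple_set(node_id: str, vocab: str, doc: Dict[str, Any]) -> Set[Triple]:
    """Model a graph as a set of triples, so duplicates and array order vanish."""
    triples: Set[Triple] = set()
    for key, value in doc.items():
        if key == "@context":
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.add((node_id, vocab + key, item))
    return triples


doc = {"@context": {"@vocab": "http://example.org/"}, "numbers": [1, 2, 2, 3]}
triples = to_triple_set("_:b0", doc["@context"]["@vocab"], doc)
numbers = sorted(obj for _, _, obj in triples)

print(len(doc["numbers"]), "array entries ->", len(triples), "triples")
print(numbers)
```

When order and duplicates matter, JSON-LD offers the `@list` keyword (for example via `"@container": "@list"` in the context) to keep the values as an ordered list.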

OR13 commented 1 year ago

@msporny just commenting on your status list example, this JSON looks nicer, it saves space, and I think the RDF representation is also cleaner:

"status": {
    "id": "https://university.example/credentials/status/3#94567",
    "type": "StatusList2021Entry",
    "purpose": "revocation",
    "index": "94567",
    "credential": "https://university.example/credentials/status/3"
}

Note that "purpose", "index" and "credential" are already in the context of the type "StatusList2021Entry"... so repeating the string "status" is wasteful in both JSON and RDF.

As is repeating the word "credential" in "VerifiableCredential".

dlongley commented 1 year ago

@OR13,

Note that "purpose", "index" and "credential" are already in the context of the type "StatusList2021Entry"... so repeating the string "status" is wasteful in both JSON and RDF.

As is repeating the word "credential" in "VerifiableCredential".

For some history on why some terms have been prefixed:

It's important that terms be @protected to allow for simpler JSON-based processing. This means that terms cannot be redefined unless you're using a new nested structure or a different type of object. For a long time with JSON-LD (1.0) we did not have property-scoped or type-scoped contexts. Now we do with 1.1 (which essentially came out during VC 1.0) -- so repeating words has become less of a problem with strongly typed data and type-scoped contexts. This makes your suggestion much easier -- as we can define the properties you're mentioning on just objects typed with StatusList2021Entry. Other types can redefine and reuse the same terms in other ways. We couldn't do that in the past -- those properties had to be defined across every type, leading to potential annoyance when people wanted to reuse the same words with different meanings for different types of objects.

In short, I'm pretty sure the above issue with taking the approach you're suggesting has been mitigated now -- but other issues may remain.
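To illustrate what a type-scoped context buys, here is a hedged Python sketch. The context below is hypothetical: the IRIs under https://example.org/ are placeholders rather than the real StatusList2021 vocabulary, `ExampleGrant` is an invented type, and `resolve_term` is a toy lookup, not the JSON-LD expansion algorithm. It only shows the idea that the same short term can mean different things on differently typed objects.

```python
from typing import Any, Dict, Optional

# Hypothetical JSON-LD 1.1 style context with two type-scoped contexts.
CONTEXT: Dict[str, Any] = {
    "@version": 1.1,
    "id": "@id",
    "type": "@type",
    "StatusList2021Entry": {
        "@id": "https://example.org/status#StatusList2021Entry",
        "@context": {
            "@protected": True,
            "purpose": "https://example.org/status#statusPurpose",
            "index": "https://example.org/status#statusListIndex",
        },
    },
    "ExampleGrant": {
        "@id": "https://example.org/grants#ExampleGrant",
        "@context": {
            "@protected": True,
            "purpose": "https://example.org/grants#purpose",
        },
    },
}


def resolve_term(context: Dict[str, Any], node_type: str, term: str) -> Optional[str]:
    """Look up a term in the context scoped to node_type, then at the top level."""
    type_def = context.get(node_type)
    scoped = type_def.get("@context", {}) if isinstance(type_def, dict) else {}
    value = scoped.get(term, context.get(term))
    return value if isinstance(value, str) else None


print(resolve_term(CONTEXT, "StatusList2021Entry", "purpose"))
print(resolve_term(CONTEXT, "ExampleGrant", "purpose"))
print(resolve_term(CONTEXT, "ExampleGrant", "index"))
```

The term `purpose` resolves to a different IRI for each type, and `index` is simply undefined outside `StatusList2021Entry`.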

decentralgabe commented 1 year ago

First, many thanks to @dlongley and @msporny. I had a gap in understanding that you've helped me overcome, and I appreciate your clear explanations and patience. I understand why in JSON-LD land it makes sense to have a nested property for ID. I'm still not clear on why the id property in credentialSubject is optional, when credentialSubject is mandatory. Maybe this has some use in reducing correlation risk? Still, having a required manner to know who claims are about seems valuable.

More broadly, I am worried that few members in the group share the understandings you've conveyed which adds some significant risk to the group in developing and implementing the spec. I'm not sure the best way to overcome this, and it's clearly out of scope of this issue, but I feel like it's something we should address...

Foremost, with my newfound understanding of the data model, I understand why the credentialSubject property is named as it is. I still believe this is an issue, particularly because the term "subject" is overloaded. It does not just refer to the party to whom claims are being issued, it could refer to a number of parties within a VC. I believe we could benefit from ubiquitous language—a single-word term that means "credential subject" without needing to always write out credentialSubject, similar to how "issuer" just means "issuer." Perhaps something like target or recipient but less terrible.

The concern over whether credentialSubject is separate from credentialSubjectId is a different discussion. I see value in decoupling the identifier of subjects (whether issuer or credentialSubject) from their claims; however, it may just be better to focus on the naming confusion of the credentialSubject property unless others share my concern here.

I'd like to revisit select comments from an issue raised by @rieksj a few years back, #408 (and before that #207). The issue seemed to mostly not go anywhere because v1 was in CR at the time. Now that we're working on v2, before CR, this is the right time to address the issue, should we be able to find consensus.

There are some strong articulations of my intentions which I'd like to recap. A few selected highlights:

NOTE: These comments are years old and it's very possible the authors' positions have changed.

@brentzundel wrote:

This doesn't change the fact that it is confusing for others. We say that a verifiable credential contains claims, then show a data model that (for valid, yet pedantic reasons) has a credentialSubject property. We then have to explain that that's where claims should go, and then explain why the property is not just called "claims," since that is what a verifiable credential supposedly contains. Rather than requiring this educational moment every time someone new looks at the data model, I support the proposal to change credentialSubject back to claims.

@msporny wrote:

The specification doesn't state that explicitly because there was a contingent of people both inside and outside of the WG that didn't want that sort of binding to be made between the VC spec and JSON-LD (this was the JSON-only contingent). So, we were not able to get to making any statements of the sort in the specification that would have reached consensus. From what I remember, we didn't try elaborating on that fact in https://www.w3.org/TR/vc-data-model/#syntactic-sugar, which we probably could do without a big scuffle in the group... it would be a non-normative statement because it's just stating a fact of JSON-LD. However, there would be no such assertion for JSON, because JSON doesn't have the concept of graph/nodes/node identifiers, etc. So, it would still remain woefully underspecified for the JSON expression.

JSON-LD, if applicable, needs to be introduced at the beginning of the specification and incorporated into the explanations of what a Claim is and what a Credential is, because JSON-LD introduces requirements that are unnatural and will be unexpected for most intelligent readers.

The initial specification did this... and after multiple objections, we had to remove that language from the specification and enable a JSON-only mode as well (which didn't answer many of the important points you're raising). That is, we did not have consensus at the time to do what you are suggesting... quite the opposite, there was a consistent opposition to doing what you're suggesting (which I personally support) until we removed that from the specification.

+1 and this is the issue. The spec takes a weak position (by necessity) on being a fully JSON-LD specification. Because of this we're left in a no man's land of LD/JSON that leads people (like me) to be confused about what the data model is actually defining and why. In a sense, I would rather see a completely LD data model to reduce this ambiguity.

David-Chadwick commented 1 year ago

@decentralgabe "We allow for a credentialSubject without an id property -- that's confusing." Personally, I do not see any confusion in this. A subject is identified by a whole set of attributes such as name, age, DoB, etc. Having an "id" is just one such attribute that a subject may or may not have. Thus making the "id" optional is perfectly natural to me.

decentralgabe commented 1 year ago

@David-Chadwick does that imply that with an id the credential subject is to be identified purely by the id or by the id and other claims in the credentialSubject?

David-Chadwick commented 1 year ago

The id is defined as being globally unambiguous; therefore, on its own, it can identify the subject. Note that unambiguous does not mean unique, as a subject can have multiple different IDs of different types, e.g. email addresses are also globally unambiguous IDs. The id in the VCDM is defined to be a URI. However, other combinations of "non-id" attributes can also unambiguously identify the subject, such as the combination of name, address and DoB. The differentiating factor of an ID from other identifying attributes is that it can be used on its own to identify the subject.
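To make the optional id concrete, here is a sketch of the degree credential from the top of this issue with the `id` removed from `credentialSubject` (and, for brevity, the credential's own `id` left out as well). The values are the example values from the original post. Whether the remaining claims are enough to identify a subject is exactly the question under discussion.

```json
{
  "@context": [
    "https://www.w3.org/ns/credentials/v2",
    "https://www.w3.org/ns/credentials/examples/v2"
  ],
  "type": ["VerifiableCredential", "UniversityDegreeCredential"],
  "issuer": "https://example.edu/issuers/565049",
  "validFrom": "2010-01-01T00:00:00Z",
  "credentialSubject": {
    "degree": {
      "type": "BachelorDegree",
      "name": "Bachelor of Science and Arts"
    }
  }
}
```

In this form the credential still makes claims about a subject, but a verifier has no single URI for that subject and would have to rely on the other attributes present.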

msporny commented 1 year ago

@decentralgabe wrote:

In a sense, I would rather see a completely LD data model to reduce this ambiguity.

Yes, so would I. That said, you've heard the individuals in the WG that have consistently opposed such a model. To be fair, JSON-LD is more difficult to use than JSON, and this is because it makes an attempt at consistent data modelling (such as, "How do you globally identify an object?" or "Is the data model a tree (no cycles) or a graph (cycles allowed)?" or "How do you disambiguate terms?", and so on). It does this while trying to enable developers to only buy into as much of JSON-LD as they need. It's this latter property that enables JSON-LD to be embedded in Web pages for schema.org (no JSON-LD processing, or really any knowledge of how JSON-LD works, is needed) or be embedded in a VC-JWT (again, no JSON-LD processing needed for those copy-pasting code into a template and then signing it).

What's been happening more recently is that some folks in the WG want to understand the JSON-LD underpinnings more deeply (and want to make sure that others understand them more deeply), or we've got some developer rough edges around how people are using JSON-LD in VCs, and that has led to conflict (which is expected).

The usage of JSON-LD in the core data model has always been a balancing act... use just enough of it to be useful to all communities, explain just enough of it to be useful to implementers, but always try to avoid going to an extreme and requiring that authors and implementers understand everything there is to know about JSON-LD before writing their first VC (that would clearly be a failure). To put it in perspective, none of us understand the depths of how the V8 engine works, nor spend considerable time in ECMAScript Working Groups, but still rely on the JavaScript language to get work done on the Web. Developers use tail recursion, promises, null coalescing, and cryptographic libraries without understanding how they're actually implemented behind the scenes. Only a very few people on the planet need to understand how that stuff is implemented at depth.

So, IMHO, one of our jobs in the WG is to expose a set of primitives to developers and implementers that are useful and easy to reason about without exposing them to all the gory details of the underlying technologies. We want to drive copy-paste behavior that "just works" instead of having to learn a PhD's worth of CS to use the standards we've created. That doesn't mean we won't have rough edges when we're done... we're just trying to smooth down as many of the rough edges together as possible.

... and that's why I don't think that going to an extreme on RDF and JSON-LD is going to help us either. People just don't have the time to learn those technologies at depth, and JSON-LD was created to cater to a large subset of developers that use JSON, but need some of the decentralized data modelling properties that JSON-LD brings to the table.

dlongley commented 1 year ago

@msporny,

...but need some of the decentralized data modelling properties that JSON-LD brings to the table.

And I would say that many need it without needing to realize it (IMO, that's a good thing) -- to make everything hang together in the decentralized three party model.

The trick here is finding the right balance of what to expose to most people. It's a question of what they need to understand to make use of the standards, not what they could find out if they follow all the links down the various rabbit holes. Individuals can always do that on their own. We could spend lots of time trying to more accessibly surface what's in those rabbit holes in our own specs but with little positive impact on most people (and perhaps the opposite) by doing it.

decentralgabe commented 1 year ago

Thanks @msporny and @dlongley, I agree with your responses. If there were a "vision" document or similar for the VCDM, I think what you wrote would be immensely helpful to add there.

My concerns have been alleviated, except for the confusion over the name "credentialSubject" -- I still think there's room for more clarity.

melvincarvalho commented 1 year ago

That said, you've heard the individuals in the WG that have consistently opposed such a model. To be fair, JSON-LD is more difficult to use than JSON

There is an education gap. @msporny wrote a beautiful piece on his blog about this roughly 10 years ago, explaining the differences. I've spent quite a bit of time looking for it but didn't find it. It's somewhere in the archive of the internet.

One of the main points was that arrays are not first class in linked data, yet they are commonly used in JS/JSON. This ties into the idea of trees vs graphs, and unordered vs ordered data. It's subtle how much we rely on ordering even when it's not mandated. Silly example, but: imagine that a file like package.json came back in a random order each time it was changed. In a Set model it doesn't matter, but in practice developers would be up in arms.
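The point about arrays can be sketched in JSON-LD itself. By default a JSON-LD array is treated as an unordered set of values; order is only preserved when the term is declared with `"@container": "@list"`. The `tags` and `steps` terms and their `https://example.org/...` IRIs below are made up for illustration.

```json
{
  "@context": {
    "tags": "https://example.org/vocab#tags",
    "steps": {
      "@id": "https://example.org/vocab#steps",
      "@container": "@list"
    }
  },
  "tags": ["alpha", "beta"],
  "steps": ["first", "second"]
}
```

A processor is free to return the `tags` values in any order, while `steps` keeps its order because it is declared as a list. A plain JSON consumer would see two ordinary arrays and likely assume both are ordered.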

I'd love to reread that blog post. I think there could be ways to close the information gap.

msporny commented 1 year ago

I'd love to reread that blog post. I think there could be ways to close the information gap.

Maybe this two-parter?

http://manu.sporny.org/2014/json-ld-origins/ http://manu.sporny.org/2014/json-ld-origins-2/

There is an education gap.

IMHO, it's a tooling gap more than an education gap. Though, education always helps... it educates more people so that more people can develop more tooling.

To put this in perspective, I don't know how JSON Schema works, but I've had to write a JSON Schema parser that would build an internal model and render it to HTML. It was (and continues to be) an absolutely awful and horrible experience due to the language design... yet, people continue to use it (and some even love JSON Schema). Why is that? Well... there's enough tooling to help make it usable by people that have next to no idea about how the technology actually works (and that's a great thing).

We suffer from tooling problems at the cutting edge... it's always been that way for new technologies (or new uses of old technologies)... and there is no amount of writing about how something works that solves that tooling problem. You have to develop and release the tooling to improve things. So, as much as I am a fan of educating people... that's not where most of the effort is needed these days -- it's in software libraries that implement the standards, and tooling that makes use of those software libraries to make developers more productive (until the AIs eat the developers, that is :P).

Just my $0.02, which could be varying degrees of wrong. :)

TallTed commented 1 year ago

email addresses are also globally unambiguous IDs

Be careful with such assertions. Email addresses may be globally unambiguous, but only with a timestamp, as domains may change hands, just as email addresses within domains may. Even with that temporal specificity, an email address may reach a single entity's mbox, or it may reach a group's shared mbox, or it may distribute to a group of mboxes (with a count ranging from zero to n) and thereby their individual or group owner entities, etc.

In other words, email addresses are NOT globally unambiguous IDs.

melvincarvalho commented 1 year ago

In other words, email addresses are NOT globally unambiguous IDs.

Doesn't this temporal aspect apply to all identifiers including all URIs?

TallTed commented 1 year ago

Doesn't this temporal aspect apply to all identifiers including all URIs?

Yes. Temporal details should be included as attributes of every graph (whether named or not) wherein some URI is minted, used, and/or referenced.

That said, the HTTP/S URI scheme is defined differently than the MAILTO URI scheme, and the HTTP/S URI scheme definition includes that there be a singular referent denoted by a given URI, even if that referent is defined to be a collection/group, and if you are abiding by principles of Linked Data and/or RDF, dereferencing that URI will yield a description which pinpoints the members of that group — or, sometimes sufficiently, at least informs that the referent is a group, though its specific membership may not be available.

MAILTO URIs have no such general dereferenceability, though specific deployments may impose some means by which to glean such a list of members.

My objection to the assertion that email addresses are also globally unambiguous IDs stands.

decentralgabe commented 1 year ago

Closing due to lack of consensus/interest.