
`@protected` creates unresolvable conflicts when the same term is defined in two contexts top-level #443

Open trwnh opened 1 month ago

trwnh commented 1 month ago

I've just encountered issue #424 (and the related #361 as well), and I'm in a similar situation: https://www.w3.org/ns/controller/v1 defines alsoKnownAs top-level alongside @protected: true, while https://www.w3.org/ns/activitystreams defines alsoKnownAs in a different namespace (as: vs sec:, loosely).

From controller/v1:

{
  "@context": {
    "@protected": true,
    "id": "@id",
    "type": "@type",

    "alsoKnownAs": {
      "@id": "https://w3id.org/security#alsoKnownAs",
      "@type": "@id",
      "@container": "@set"
    },
//...

From activitystreams:

{
  "@context": {
    "@vocab": "_:",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "as": "https://www.w3.org/ns/activitystreams#",
    // ...
    "alsoKnownAs": {
      "@id": "as:alsoKnownAs",
      "@type": "@id"
    }
    // ...

Putting activitystreams before controller/v1 causes the later definition to override the earlier one, as expected (but not as desired):

{
  "@context": ["https://www.w3.org/ns/activitystreams", "https://www.w3.org/ns/controller/v1"],
  "type": "Person",
  "id": "http://person.example",
  "alsoKnownAs": "https://person.example"  // sec:alsoKnownAs
}

which expands to:

[
  {
    "https://w3id.org/security#alsoKnownAs": [  // should be https://www.w3.org/ns/activitystreams#alsoKnownAs
      {
        "@id": "https://person.example"
      }
    ],
    "@id": "http://person.example",
    "@type": [
      "https://www.w3.org/ns/activitystreams#Person"
    ]
  }
]

But putting activitystreams after controller/v1 triggers the error due to @protected: true:

{
  "@context": [
"https://www.w3.org/ns/controller/v1",  // uses @protected
"https://www.w3.org/ns/activitystreams"  // will trigger the redefinition error
],
  "type": "Person",
  "id": "http://person.example",
  "alsoKnownAs": "https://person.example"
}

which throws:

jsonld.SyntaxError: Invalid JSON-LD syntax; tried to redefine a protected term.

JSON-LD 1.1, 4.1.11 Protected term definitions (https://www.w3.org/TR/json-ld11/#protected-term-definitions), describes two exceptions. The first is when the definition is the same, which is not applicable here. The second is for property-scoped context definitions, which is unworkable here because the single top-level object is intended to be both an Actor and a Controller Document.
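
For reference, the property-scoped exception only lets you override the protected term for values nested under the property that carries the scoped context -- something like the sketch below (the profile term and its IRI are made up purely for illustration) -- which is exactly what doesn't help when alsoKnownAs needs to appear on the top-level object:

{
  "@context": [
    "https://www.w3.org/ns/controller/v1",
    {
      "profile": {
        "@id": "https://example.org/vocab#profile",
        "@context": {
          "alsoKnownAs": {
            "@id": "https://www.w3.org/ns/activitystreams#alsoKnownAs",
            "@type": "@id"
          }
        }
      }
    }
  ],
  "id": "http://person.example",
  "profile": {
    "alsoKnownAs": "https://person.example"  // as:alsoKnownAs, but only inside "profile"
  }
}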

To verify, here's a type-scoped context definition that errors out:

{
  "@context": [
    "https://www.w3.org/ns/controller/v1",
     {
       "Person": {
         "@id": "https://www.w3.org/ns/activitystreams#Person",
         "@context": {
           "alsoKnownAs": {  // triggers the redefinition error
             "@id": "https://www.w3.org/ns/activitystreams#alsoKnownAs"
           }
         }
       }
     }],
  "type": "Person",
  "id": "http://person.example",
  "alsoKnownAs": "https://person.example"
}

And to reiterate, a property-scoped context definition can't be used because the alsoKnownAs property is top-level. So the way I see it, there's nothing that can be done to resolve this in a "plain JSON" compatible way except:

This leads me to think that @protected is a generally poorly-thought-out mechanism that greatly increases the likelihood of such conflicts. Without it, as a producer I could simply redefine the term later, for example by putting the activitystreams context last, or by using a local context object that comes after both remote contexts:

{
  "@context": [
  "https://www.w3.org/ns/controller/v1",  // needs to remove @protected
  "https://www.w3.org/ns/activitystreams"  // as:alsoKnownAs will override controller/v1's sec:alsoKnownAs
],
  "type": "Person",
  "id": "http://person.example",
  "alsoKnownAs": "https://person.example"  // as:alsoKnownAs
}

or

{
  "@context": [
"https://www.w3.org/ns/activitystreams",  // defines as:alsoKnownAs
"https://www.w3.org/ns/controller/v1",  // redefines sec:alsoKnownAs as @protected 
{
"alsoKnownAs": {
  "@id": "https://www.w3.org/ns/activitystreams#alsoKnownAs",  // won't work unless controller/v1 removes @protected
  "@type": "@id"
}
}],
  "type": "Person",
  "id": "http://person.example",
  "alsoKnownAs": "https://person.example"  // as:alsoKnownAs
}

I'm not sure the existence of @protected accomplishes its stated goal of "prevent[ing] this divergence of interpretation", nor that the rationale, that "plain JSON" implementations relying on a given specification will only traverse properties defined by that specification, sufficiently addresses the issue of conflicts (or that it is a valid assumption in the first place). The issue arises when two specifications define the same term and both specifications apply to the current object or document. It effectively leads to a hard incompatibility where it is impossible to implement both specs fully; you have to pick between them.

If there's an option I'm not aware of I'd like to hear it.

dlongley commented 1 month ago

There's a typo in the controller document v1 context; it should instead use the activitystreams vocab for alsoKnownAs. A bug fix will address this particular case.
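
(Presumably the corrected term definition in controller/v1 would look something along these lines, keeping the rest of the definition intact:)

"alsoKnownAs": {
  "@id": "https://www.w3.org/ns/activitystreams#alsoKnownAs",
  "@type": "@id",
  "@container": "@set"
}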

That being said, the whole point of protection is to enforce a particular term definition in a particular place when a particular context is present. So it is not a bug that it is doing this, but a feature. And it does require coordination to share terms across contexts in the same place (by ensuring the term definitions match). That's a requirement for the feature to work. You can only use other term definitions when you bring in a property-scoped context (as mentioned), because decentralized extensibility (in this case, reuse of the same term with a different definition) is only considered safe in different areas of the JSON tree in the same document.

Of course, if specs and / or implementations allow for JSON-LD compaction to be performed, then significantly more flexibility is possible. All of these designs are around finding a balance for different kinds of consumers in a sufficiently large decentralized ecosystem, some who will only accept static documents and others who might use compaction prior to consumption. This of course creates constraints.
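
As a rough sketch of that flexibility (the aka term below is just an arbitrary consumer-chosen name): a flexible consumer can expand whatever it receives and then compact it against a local context of its own choosing, regardless of which term the producer used:

// expanded form of the received document:
[
  {
    "@id": "http://person.example",
    "@type": ["https://www.w3.org/ns/activitystreams#Person"],
    "https://w3id.org/security#alsoKnownAs": [
      { "@id": "https://person.example" }
    ]
  }
]

// the same data compacted against a consumer-chosen context:
{
  "@context": {
    "as": "https://www.w3.org/ns/activitystreams#",
    "aka": { "@id": "https://w3id.org/security#alsoKnownAs", "@type": "@id" }
  },
  "@id": "http://person.example",
  "@type": "as:Person",
  "aka": "https://person.example"
}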

trwnh commented 1 month ago

> the whole point of protection is to enforce a particular term definition in a particular place when a particular context is present. So it is not a bug that it is doing this, but a feature. And it does require coordination to share terms across contexts in the same place (by ensuring the term definitions match). That's a requirement for the feature to work.

If I'm reading this correctly, are you saying that two context authors are required to coordinate whenever there is a term conflict? This seems unworkable given the open-world assumption. If any single context author decides to make their term definition(s) @protected, then this creates problems for anyone else who defines the term differently. Essentially, one author doing it means that this author gets supremacy over the "plain JSON" and that their context declaration needs to come last or else the JSON-LD parser will throw a redefinition error. Two authors doing it will create an unresolvable error.

It seems to me like this unnecessarily makes things way more complicated for polyglots or anyone wanting to implement multiple overlapping specs. If for example schema.org decided to protect their context, it would become impossible to use both activitystreams and schema.org, because numerous top-level properties like name are shared across both contexts. A developer producing documents with "@context": ["https://schema.org", "https://www.w3.org/ns/activitystreams"] in this example would be creating irreconcilably unprocessable JSON-LD documents, because as:name is seen as a redefinition of schema:name. This means that either the developer will be forced to write their own context document (even if they don't understand JSON-LD), or that some downstream consumer will have to postprocess the unprocessable JSON-LD to replace the context with their own corrected one.
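
To make that hypothetical concrete (to be clear, schema.org's published context does not currently use @protected; the flag below is purely illustrative, and both term definitions are paraphrased rather than verbatim):

// hypothetical protected schema.org context (illustrative only):
{
  "@context": {
    "@protected": true,
    "schema": "http://schema.org/",
    "name": { "@id": "schema:name" }
  }
}

// activitystreams (paraphrased):
{
  "@context": {
    "as": "https://www.w3.org/ns/activitystreams#",
    "name": "as:name"
  }
}

// a producer combining the two would be emitting unprocessable JSON-LD,
// because context processing itself fails when as:name redefines the
// (hypothetically) protected schema:name:
{
  "@context": ["https://schema.org", "https://www.w3.org/ns/activitystreams"],
  "name": "Alyssa"
}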

I don't see a situation that can possibly work smoothly so long as anyone uses @protected. If the aim is to ensure that terms don't get redefined, then this feels like a backfire because the actual result is that the entire document becomes unprocessable; instead of not understanding some number of redefined terms and having them appear to be missing ("I can't find schema:name, I only have as:name, but all the other schema: properties are as expected"), you end up not understanding the entire document ("my parser is giving me an error, I can't do anything with this unless I replace their context with what I am guessing they meant").

dlongley commented 1 month ago

@trwnh,

Apologies, I would have written a shorter response if I had more time.

> If I'm reading this correctly, are you saying that two context authors are required to coordinate whenever there is a term conflict?

No, I'm saying that the @protected feature was created for use by specifications that do require significant coordination to decide what the immutable definitions for certain terms in certain documents ought to be. This coordination may be done over the period of several years, in a standards working group. The @protected feature is to explicitly prohibit different definitions for the same terms in the same places in JSON documents. There is no way for two (or more) different context authors to coordinate to sort out a term conflict here, because using a definition different from what is written in the spec is prohibited. The coordination must happen prior to the spec becoming a standard.

This prohibition exists for a good reason: to enable both rigid and flexible implementations to interoperate.

It is used when there is a spec that expresses, in detail, a data model and JSON format, such that implementers who read the spec can write rigid implementations "in the context of" the data as expressed in the specification. In other words, from this perspective, these specs are no different from any other specification designed around information expressed in JSON (with no capability to transform conforming documents into some other expression).

These rigid implementations treat the URLs in the @context field as simple document type + version identifiers. No JSON-LD library or API calls are needed to work with conforming documents, as conformance requires that these fields match specific values and that the documents have an expected structure.

However, behind these @context values are actual JSON-LD context documents that are processable by more flexible implementations. These flexible implementations are able to use JSON-LD libraries to understand the data (potentially even without the spec, through "follow your nose") or to transform the data into a different expression that their code is expecting. By using the @protected keyword in these contexts, an enforcement process is introduced by which the same interpretation is guaranteed to be used across these different implementation approaches (or a protected term error will be thrown).

Of course, enabling these two approaches at once has trade offs. Nothing is for free. Coordination is required while creating the spec and, as is always required when using a JSON spec, a conforming document must not deviate from what's in the spec or reuse terms (JSON keys) to mean something other than what is in the spec. Simply put: the use of a spec and the @protected feature to increase interoperability across implementations of differing complexity reduces some decentralized extensibility in exchange for allowing less complex (but interoperable) consumers.

> This seems unworkable given the open-world assumption.

It's workable, and only slightly more constrained, i.e., you can't "just use whatever term definitions you want" in your documents and expect them to be consumable by simpler implementations that are unable to understand your changes. The most basic and commonly reused term definitions from a spec are immutable.

If it helps, this can be thought of as extending the set of JSON keys that JSON-LD already doesn't allow redefinition of, i.e., all keywords (e.g., @context, @id, @type). I don't think this constraint makes JSON-LD "unworkable given the open-world assumption", as you say. By using the @protected feature, a context author just reduces the set of immutable JSON keys a little further beyond what JSON-LD already restricts in its own spec.
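
For instance, a context like the following is already rejected by any conforming processor with a keyword redefinition error, whereas aliasing a keyword (e.g. "id": "@id") remains fine; @protected simply extends that same kind of immutability to ordinary terms:

{
  "@context": {
    "@id": "https://example.org/vocab#identifier"  // invalid: JSON-LD keywords cannot be redefined
  }
}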

Specs that use this feature require the more complex implementations to express their documents in a more rigid way (really, in a specific context) in order to enable simpler implementations to exist. However, you can, of course, express all the information you want using other terms that the spec doesn't mark as @protected. The more complex implementations can then transform incoming documents into whatever contexts they want to (using whatever terms they want to) for consumption.

It is true that when a spec uses this feature it might become incompatible with another spec that also tries to enable these two types of implementations: you can't have a single document be expressed using two contexts that are in conflict with one another. Note that the Activity Streams work tried to enable simpler consumers too, it just didn't use the @protected feature (IIRC, it wasn't available at the time). A consequence of this is that anyone can change the definition of a term defined by the Activity Streams context (by using the @context field), but the simpler implementations do not detect it. This creates semantic confusion which can lead to a variety of serious problems. Newer specifications can avoid this by using @protected in their contexts to actually surface these errors -- so that no valid implementation can use such a document (as you say, the document becomes "unprocessable").

> This means that either the developer will be forced to write their own context document (even if they don't understand JSON-LD), or that some downstream consumer will have to postprocess the unprocessable JSON-LD to replace the context with their own corrected one.
>
> ...
>
> If the aim is to ensure that terms don't get redefined, then this feels like a backfire because the actual result is that the entire document becomes unprocessable; instead of not understanding some number of redefined terms and having them appear to be missing ("I can't find schema:name, I only have as:name, but all the other schema: properties are as expected")

Your concerns are certainly heard -- but it's important to remember that one of the constraints is that the simplest implementations do not use a JSON-LD library at all. To enable these implementations, document authors have to work within the constraints in the specification: you can't change certain term definitions in certain places. Simply allowing any definition to be used without throwing any errors won't solve this problem, it will just create semantic confusion. As always, I (and many others) am all ears for a better solution to this problem (given the constraints), but allowing semantic confusion to happen isn't an acceptable outcome -- so this is the best solution we've landed on for now.

msporny commented 1 month ago

@trwnh wrote:

> b) convince whoever is responsible for controller/v1 to redefine alsoKnownAs with the activitystreams-namespaced @id instead of the security-namespaced one

Hi, that's me ("whoever is responsible for controller/v1") :)

It's a bug, thanks for catching it; that context is fairly new and hasn't been put through its paces yet.

Feel free to raise a PR on controller/v1 to fix the issue, or I will do it when I get around to addressing the issue you raised in that repository.

trwnh commented 1 month ago

> Your concerns are certainly heard -- but it's important to remember that one of the constraints is that the simplest implementations do not use a JSON-LD library at all. To enable these implementations, document authors have to work within the constraints in the specification: you can't change certain term definitions in certain places. Simply allowing any definition to be used without throwing any errors won't solve this problem, it will just create semantic confusion.

This is part of my concern, though: a document producer who does not use JSON-LD, but declares two well-known remote context documents, because the specs tell them to, or because they think that's what they need to do.

What this producer has just done is completely invisible to "plain JSON" consumers (who aren't aware of any term definitions let alone the possibility of redefining one or that this might conflict). But even the most basic of JSON-LD processors now has to deal with the mess that was created by this incompatible context declaration. I'm not entirely convinced of the fail-fast-and-hard approach here; maybe the JSON-LD processing algorithm could use an additional flag that converts these errors to warnings? This would allow the processor to at least have something to process, provided that they are willing to accept the semantic confusion. (Any errors in schema would be caught "further down the chain", so the document may be discarded later if it is unsuitable for further processing.)

Essentially, the use of @protected in any context document needs to come with a heavy disclaimer that it severely limits compatibility. "Be careful, this can prevent adaptation" feels like it's not making the consequences fully clear. There should probably be language added around using multiple context documents, and how the use of @protected in any one of them can create problems depending on the order you declare those contexts, or on whether any of the others likewise declare @protected. It should be clearly called out that "warning, the JSON-LD document may become unprocessable" is even a possibility, so that context publishers can carefully consider this possible consequence before just slapping a @protected in there.

msporny commented 1 month ago

> maybe the JSON-LD processing algorithm could use an additional flag that converts these errors to warnings?

This has been discussed before and identified as a really bad idea. What you describe is in the class of errors that can lead to security compromises. If you want to ignore these sorts of security compromises, don't use @protected, but if you don't use @protected, don't expect people to depend on your context in situations where security is important.

When this sort of thing happens (overriding errors of a detected term conflict), it is definitely a problem that must not be ignored. Doing so would be like a static analysis tool for a non-memory range checked language finding out that you're using memory after it has been freed and allowing the practice to continue happening -- it's a recipe for something really bad happening to the code in production.

trwnh commented 1 month ago

Okay, if you say @protected should be used to help avoid security compromises, then when should one not use @protected? It still feels like the feature is being overused, and most of the cases of conflicting terms I've encountered appear to be primarily in the class of semantic errors where two term definitions differ in @id but could be taken to represent the same concept (owl:equivalentClass or owl:equivalentProperty). Things like as:name vs schema:name, or as:mediaType vs schema:encodingFormat. If there is any difference between the two terms, it is in the spec processing level, and in what those terms imply for the processing of other terms; for example, as:mediaType has implications for as:content or as:href, whereas schema:encodingFormat might have implications for schema:contentUrl or more generally for a schema:CreativeWork.
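
(For what it's worth, that kind of correspondence can be written down in JSON-LD directly -- something along these lines, if one accepts that the two properties really are equivalent:)

{
  "@context": {
    "owl": "http://www.w3.org/2002/07/owl#",
    "as": "https://www.w3.org/ns/activitystreams#",
    "schema": "http://schema.org/"
  },
  "@id": "as:name",
  "owl:equivalentProperty": { "@id": "schema:name" }
}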

To be clear, I think this kind of thing (where a certain interpretation is required) somewhat strongly indicates that perhaps application/ld+json is no longer sufficient as a content type for that document, and a dedicated media format with its own processing semantics might be required (like application/activity+json or application/vc). Somewhat unfortunately, it looks like going this route also implicitly locks down extensibility, with one context document being given supremacy over any others. This is probably fine for documents of that content type that only use that context, or might augment it with a few additional term definitions... but an interpretation where a document may wish to conform to multiple types is not possible.

In light of that, perhaps the use of @protected should be advised (or reserved?) only in cases where you are no longer doing (or you expect your consumers to no longer be doing) "generic JSON-LD". Maybe some language along the lines of "if you use @protected, consider defining your own media type separately from application/ld+json, because the use of @protected significantly constrains the semantics and processability of the document" -- this leaves the problem of "conforming to multiple types" unsolved (and that might be a larger problem that JSON-LD itself cannot solve on its own), but at least it sets the expectations correctly.

pchampin commented 1 week ago

This was discussed during the json-ld meeting on 13 November 2024.

View the transcript

Issue Discussion

bigbluehat: We're working through the project list.

gkellogg: added issues that are class 1-3.

subtopic w3c/json-ld-syntax#436

<gb> Issue 436 URI in Profile triggers CORS Unsafe Request Header Byte rule (by azaroth42) [spec:w3c] [needs discussion] [tag-needs-resolution]

gkellogg: might just create "tokens" for profile parameters.

gkellogg: tokens not being namespaced is mitigated by the fact that the media-type is the namespace.

bigbluehat: So, it treats the media-type as the namespace.
… Profile parameters not having a colon is wide-reaching

gkellogg: not sure how we update guidance for using profile parameters.

bigbluehat: This would be a breaking change for web annotations.
… That would mean web annotations needs their own media type.

niklasl: dlehn's reply may mean this isn't as horrible as it seems.
… I think the datasets working group has done something with this.

pchampin: This doesn't seem to be a problem where things can't work, but making them work is tricky, due to pre-flight requests.
… If we expect a server to support profile-based content-negotiation, it doesn't come automatically.
… If you want to support this, you'll also need to support pre-flight requests.


pchampin: This is difficult to configure and easily forgotten.

<gb> Issue 436 URI in Profile triggers CORS Unsafe Request Header Byte rule (by azaroth42) [spec:w3c] [needs discussion] [tag-needs-resolution]

bigbluehat: There were some suggestions for defining enumerated values (tokens).

<pchampin> I think it wouldn't hurt to define "short names" for the profiles in addition to the currently defined IRIs

bigbluehat: The key is to not make it a breaking change.
… This would affect the media-type registration.

niklasl: Aren't link headers defined similarly, where there are pre-defined tokens and IRIs may also be used?

bigbluehat: Browsers have made decisions which are affecting what we can do.

<bigbluehat> > When processing the "profile" media type parameter, it is important to note that its value contains one or more URIs and not IRIs. In some cases it might therefore be necessary to convert between IRIs and URIs as specified in section 3 Relationship between IRIs and URIs of [RFC3987].

https://www.w3.org/TR/json-ld11/#iana-considerations

<niklasl> application/ld+json;profile="http://iiif.io/api/presentation/3/context.json"

niklasl: I think it would be good to add tokens. Rob's specific problem is more about the other uses of profiles.
… I wonder if our solution would be considered a solution for the issue; maybe parts of the issue can't be solved in the JSON-LD spec. Might recommend IIIF to use profile negotiation.
… But, using pre-flight does work, so that would be on their end.
… It's more that we put forward the design pattern and it has become more tricky.

bigbluehat: The ramifications of this are not just expand/compact/... Rob's point is for other specifications that used the same pattern.
… Now we know to avoid it.

<niklasl> See also: https://www.w3.org/TR/dx-prof-conneg/ (and https://profilenegotiation.github.io/I-D-Profile-Negotiation/I-D-Profile-Negotiation.html )

bigbluehat: There's reason to document this in the best-practices document. How this affects other specs would mean that they cannot treat profile as being extensible, and will need a new media type.

gkellogg: we might create a registry to allow other specifications to add their profile parameters without needing a new media-type.

bigbluehat: niklasl shared a document on using the profile parameter for content negotiation.

pchampin: Reaching out to the TAG would be a good idea, as other specs rely on this, and they would be impacted.
… I'd like to see their thoughts and how much we should make the effort to try to change this.
… Regarding the spec, note that this is a working draft which has been inactive for a while. This might not be the strongest argument to take before the TAG. (The dataset exchange WG)
… Part of the reason that spec is stalled is that there are contentious discussions with IETF on where it belongs.

<niklasl> From the dx-prof-conneg draft: During 2018, DXWG members had a longer discussion with the JSON-LD WG at the annual forum TPAC in Lyon, France and it was concluded that the "profile” parameter in the Accept and Content-Type headers should be seen to convey profiles that are specific to the Media Type [such as JSON-LD's expanded .... ]

pchampin: But, is there enough interest in IETF to continue the work?

niklasl: There are aspects of the draft that go into whether the profile parameter of the media type is the right way to go.
… I appreciate the design of IIIF and Activity Streams more when not looking at it from an RDF perspective.
… These are more useful at the intersection of JSON and RDF, which makes it easier to create specifications in a distributed way.
… If I believed (from RDF perspective) that format is irrelevant, general content negotiation works well.
… I can see how the TAG might argue from one of these perspectives. Maybe we shouldn't invent media-types on the fly.

<pchampin> https://www.w3.org/TR/vc-data-model-2.0/#media-type-precision

pchampin: Regarding the value of using JSON-LD media-type with parameter vs a new media-type, VC has had to rely on this for a while.
… The current solution is to have a dedicated media-type with additional language to explain the relationship between the two media types.
… We might point other specs to that solution.

<niklasl> +1 to mentioning that "third" point of view (very pertinent IMHO)

bigbluehat: I think we need to move on and come back to this issue.
… It would be great to write some of these things up on the issue so that we have something coherent to bring to the TAG.
… IETF has shifted their approach, and we're stuck in the middle. In the meantime, let's collect thoughts in the issue.
… I don't think we know enough to lay out the preferred solution.
… If we go the short-name route, we run the risk of turning into a registry.

<bigbluehat> w3c/json-ld-syntax#443

<gb> Issue 443 `@protected` creates unresolvable conflicts when the same term is defined in two contexts top-level (by trwnh) [spec:editorial] [wr:commenter-agreed-partial] [class-2]


pchampin commented 1 week ago

This was discussed during the json-ld meeting on 13 November 2024.

View the transcript

w3c/json-ld-syntax#443

bigbluehat: This dove-tails with the profile-parameter conversation for other communities
… If a media type expects a context to exist, they would inject one if not provided.
… We could make other discussion issues from comments in this issue.

niklasl: IIRC, Activity Streams says you should put their context last because of this issue.
… If you use short names that have meaning, you must lock them down.

dlehn: I need to re-review the issue.
… In the case of the controller, it would be to change the activity streams URL, but that's kind of strange. People expect terms to be gathered in one place.

<niklasl> Maybe what is asked for is how to use this design pattern to have partial extensibility, extensions which are always subordinate to the "hardcoded" context (that may evolve)?

dlehn: This would conflict with other things where JWT is also used.

pchampin: The comment at the end is interesting as it resonates with TPAC discussions.
… There are two types of JSON-LD, one which is more about the RDF semantics, the other is about general representation of knowledge.
… I sympathize that we should make this more clear, but don't think it's a bug in the spec.

bigbluehat: There's a tension around generic JSON-LD, which is endlessly pluggable and which confuses people.
… In this view, JSON-LD isn't the end product, but by adding @protected you constrain it to a use case, as you are using very specific terminology and limiting the extension points.
… At TPAC there was a discussion about other directions, such as schema.org, or whether we are moving to specific content formats with self-defined semantics.
… Maybe this is not a syntax change, but a best practices note. If you're in ld+json land you can do what you want, but if you're in something that provides more constraints, you may need different solutions.

<niklasl> +1 for best practice

<anatoly-scherbakov> +1

<gkellogg> +1

dlehn: It seems to be a bit more than best-practices as you need to tell people how to get around the rules.

dlehn: It's nice when things live together.

bigbluehat: In the future, maybe there would be a way to link from the spec to BP.

<bigbluehat> PROPOSAL: Address the concerns around when to use `@protected` (which were raised in https://github.com/w3c/json-ld-syntax/issues/443) through new content in the JSON-LD Best Practices document.

<gb> Issue 443 `@protected` creates unresolvable conflicts when the same term is defined in two contexts top-level (by trwnh) [spec:editorial] [wr:commenter-agreed-partial] [class-2]

<bigbluehat> +1

<niklasl> +1

<pchampin> +1

<gkellogg> +1

<anatoly-scherbakov> +1

<TallTed> +1

<dlehn> +1

dlehn: Is it more "when" or "how" to use @protected?

RESOLUTION: Address the concerns around when to use `@protected` (which were raised in https://github.com/w3c/json-ld-syntax/issues/443) through new content in the JSON-LD Best Practices document.

bigbluehat: We can make it a "best practice" and notify the commenter.

<niklasl> ... and *why* to...

bigbluehat: @protected needs more content.

<dlehn> "... when, how, and why to use ..."