w3c / rch-wg-charter

Charter proposal for an “RDF Dataset Canonicalization and Hash Working Group”
https://w3c.github.io/rch-wg-charter/

Vague mentions of json-ld context work item need clarification #83

Closed danbri closed 2 years ago

danbri commented 3 years ago

The Working Group will also provide standard ways to represent this vocabulary in various RDF serialization formats, such as by providing JSON-LD contexts for JSON-LD serializations. (See the separate explainer document for more detailed technical backgrounds and for the terminology used in this context.)

and later

The specification may also define one or more JSON-LD Context documents to be used by a JSON-LD serialization.

The word "context" does not appear in the explainer, and afaik there is no normative dependency on json-ld since the work is syntax-neutral. The charter or explainer could usefully explain what's needed from a json-ld context, what ongoing maintenance, security, privacy, longevity, and uptime commitments w3c (including systems team) expects to make if the context definitions are an integral part of using json-ld for secured RDF. How will non-JSON-LD formats match whatever the context does? Is it a syntactic sugar kind of a mechanism, or a must-have? The word "may" suggests the former, in which case there should be chartered work to assess the potential costs and risks for using json-ld contexts in this way.

These mentions should be clarified. Perhaps, for example, the expectation is that the content of the context will be served or directly included locally in applications? Or signed? cached, etc?

See also https://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/ for previous W3C operational woes in this area.

My understanding is that it is widely considered proprietary or non-standard to extract RDF triples from a json-ld instance document without having up to date content of all relevant context files to hand (including the potential context alluded to in the charter). Is this a misreading? This WG is predicated on a workflow based entirely on processing RDF triples into canonicalizable graphs, so we need to understand how these pieces fit together.

iherman commented 3 years ago

The Working Group will also provide standard ways to represent this vocabulary in various RDF serialization formats, such as by providing JSON-LD contexts for JSON-LD serializations. (See the separate explainer document for more detailed technical backgrounds and for the terminology used in this context.)

and later

The specification may also define one or more JSON-LD Context documents to be used by a JSON-LD serialization.

The word "context" does not appear in the explainer, and afaik there is no normative dependency on json-ld since the work is syntax-neutral.

Yes, it is, and should be, syntax-neutral.

The first formulation is indeed open to misunderstanding, though; this "context" is not that "context", ie, the word "context" in the parenthetical sentence refers to the whole paragraph and not to a JSON-LD @context.

Actually, the word "standard" is also incorrect or can be misunderstood. So I propose to modify the sentence to:

The Working Group will also provide preferred ways to represent this vocabulary in various RDF serialization formats.

Ie, no "standard" way to represent the vocabulary just preferred, and no reference to the JSON-LD term, which is way too specific for this paragraph.

As for

The specification may also define one or more JSON-LD Context documents to be used by a JSON-LD serialization.

I would change 'to be' to 'may be', because there is no obligation. (Specific, higher-level applications or protocols may cast a context file in concrete, but it is not for this WG to decide that.)

The charter or explainer could usefully explain what's needed from a json-ld context, what ongoing maintenance, security, privacy, longevity, and uptime commitments w3c (including systems team) expects to make if the context definitions are an integral part of using json-ld for secured RDF.

The JSON-LD context should be a possibility. These questions are only relevant if the context file is the only way to express this in JSON-LD. There is no pre-eminent role for JSON-LD here, nor does the charter say that.

How will non-JSON-LD formats match whatever the context does?

With all due respect: that is not a real issue. A JSON-LD @context is just a high-level set of macros to turn JSON into a set of triples. That is where the 'match' happens.
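As a minimal illustration of that "macro" behaviour (a sketch assuming the Python pyld library; the document, terms, and IRIs are made up):

```python
from pyld import jsonld

doc = {
    "@context": {"name": "http://schema.org/name"},
    "@id": "http://example.org/alice",
    "name": "Alice"
}

# Expansion applies the context "macros": the term "name" becomes a full IRI.
print(jsonld.expand(doc))
# [{'@id': 'http://example.org/alice',
#   'http://schema.org/name': [{'@value': 'Alice'}]}]

# Producing triples happens on that expanded form.
print(jsonld.to_rdf(doc, {'format': 'application/n-quads'}))
# <http://example.org/alice> <http://schema.org/name> "Alice" .
```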

Is it a syntactic-sugar kind of mechanism, or a must-have? The word "may" suggests the former, in which case there should be chartered work to assess the potential costs and risks of using json-ld contexts in this way.

I do not think this is an issue for this WG.

These mentions should be clarified. Perhaps, for example, the expectation is that the content of the context will be served or directly included locally in applications? Or signed? cached, etc?

See also https://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/ for previous W3C operational woes in this area.

This is a real problem area. But not for this WG imho.

danbri commented 3 years ago

A JSON-LD @context is just a high-level set of macros to turn JSON into a set of triples. That is where the 'match' happens.

I hope that you or @pchampin can find time to walk @samuelweiler through the consequences of this - "just" is an understatement.

Instance data that uses a remote context by URL thereby gives whoever controls the content dereferenced from that URL the power to determine the specific triples emitted during W3C-standard JSON-LD parsing.

This is not something to be done casually. You could imagine timing mischief, where a compromised context URL was only swapped out for a few seconds, minimizing chances of detection. Security folks are much more imaginative about this stuff than I am, but most of them won't know these obscure details from the RDF / JSON-LD world. Mischievous JSON-LD context payloads could for example (in some settings) switch the value of any "rdf:type" statements to an arbitrary different URI value.

To repeat the example I mentioned when we spoke: recently a mistaken change to schema.org's context file immediately broke unit tests in Apache Jena: unchanged software, running over unchanged inputs, gave different outputs, because a document on schema.org had a one-character change (specifically, in https://schema.org/docs/jsonldcontext.jsonld the vocab setting briefly changed from 'http:' to 'https:').
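A small sketch of that failure mode (illustrative only, using the Python pyld library and inline contexts rather than the actual schema.org file): the same JSON body, run through the same code, produces different quads when one character of the context drifts.

```python
from pyld import jsonld

body = {"@id": "http://example.org/thing", "name": "A thing"}

ctx_before = {"@vocab": "http://schema.org/"}
ctx_after = {"@vocab": "https://schema.org/"}   # the one-character drift

for ctx in (ctx_before, ctx_after):
    doc = dict(body, **{"@context": ctx})
    print(jsonld.to_rdf(doc, {'format': 'application/n-quads'}), end="")
# <http://example.org/thing> <http://schema.org/name> "A thing" .
# <http://example.org/thing> <https://schema.org/name> "A thing" .
```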

Hosting a context file that is an intrinsic element of security systems worldwide would be a significant and potentially never-ending burden on the systems team. Do the systeam review charters? /cc @gosko @swickr @ylafon

iherman commented 3 years ago

Cc @msporny @dlongley who have much more experience with this in practice

iherman commented 3 years ago

A note to those who were not on an earlier discussion: the JSON-LD Context file we are talking about is not relevant to the, strictly speaking, security issues themselves. That process (e.g., canonicalization of the graph, hashing it, signing it) is done on the RDF triples, and JSON-LD (or any context file) is orthogonal to this flow and has nothing to do with it.

What we are talking about is a context file used to express the signature, ie, it represents a vocabulary to say things like: "this is the hash function that was used for the final signature of the graph", "this is the signature value", "this is the public key that should be used to check the signature". That is more or less it.

The issues about JSON-LD context files are real. Future JSON-LD WG-s might have to look into that process again. But it is not in the scope of this Working Group imho.

dlongley commented 3 years ago

@danbri,

I expect that any JSON-LD @contexts created by the WG will be described as static and that, if used in production systems, they must be loaded locally by software (or their integrity must be cryptographically verifiable) as opposed to being loaded remotely from the Web -- precisely for the reasons you mention. There would be text similar to that used in specs such as the VC data model spec, but the precise text will be up to the WG. JFYI, the practice of versioning and then keeping/installing local copies of static JSON-LD @contexts is already quite prevalent in the "LD proofs community" that is interested in this work.
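As a rough sketch of that kind of policy (assuming the Python pyld library; the file paths and digest values are placeholders, and the exact loader return shape may differ between library versions): contexts are only ever loaded from vetted local copies, each copy's digest is checked before use, and nothing is fetched from the open Web at runtime.

```python
import hashlib
import json
from pyld import jsonld

# Vetted, versioned copies shipped with the application, plus their
# expected SHA-256 digests (values here are placeholders).
LOCAL_CONTEXTS = {
    "https://www.w3.org/2018/credentials/v1":
        ("contexts/credentials-v1.jsonld", "ab4ddd9a..."),
}

def static_loader(url, options=None):
    entry = LOCAL_CONTEXTS.get(url)
    if entry is None:
        # Refuse anything that has not been vetted, rather than fetching it.
        raise ValueError("Refusing to load unvetted remote context: " + url)
    path, expected_digest = entry
    with open(path, "rb") as f:
        raw = f.read()
    if hashlib.sha256(raw).hexdigest() != expected_digest:
        raise ValueError("Local context copy failed integrity check: " + url)
    return {"contextUrl": None,
            "documentUrl": url,
            "document": json.loads(raw)}

jsonld.set_document_loader(static_loader)
```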

kdenhartog commented 3 years ago

The issues about JSON-LD context files are real. Future JSON-LD WG-s might have to look into that process again. But it is not in the scope of this Working Group imho.

Oh please no! This is honestly the biggest footgun in using JSON-LD signatures today. Addressing this concern at this layer is, I think, important for any JSON-LD document that is to be integrity protected, not just the ones that get standardized.

I definitely think we should allow this as a topic of discussion for the WG to consider rather than declaring it out of scope before the WG has begun.

iherman commented 3 years ago

The issues about JSON-LD context files are real. Future JSON-LD WG-s might have to look into that process again. But it is not in the scope of this Working Group imho.

Oh please no! This is honestly the biggest footgun in using JSON-LD signatures today. Addressing this concern at this layer is, I think, important for any JSON-LD document that is to be integrity protected, not just the ones that get standardized.

@kdenhartog can you tell us what "this concern" is, precisely, that you would want to be in scope?

kdenhartog commented 3 years ago

The concern around the impact of resolving a context on the resulting canonicalized quads. One of the things I was hoping to put forth as a possible solution for JSON-LD formats is that they MUST use a context which has integrity protections, or should error on producing canonicalized quads. This way we're keeping the integrity guarantees within the scope of this working group. I definitely agree that availability of contexts should be left out of scope, though, which is what I understood @danbri's original concern to be.

More concretely, when I retrieve a context, I would expect a canonicalization algorithm to check that a hash contained within the URL of a context (either within the resource path or in a query parameter is what I had in mind) matches the hash of the content retrieved. Ideally, by using the canonicalization algorithm on the retrieved context data itself.
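A rough sketch of that check (purely illustrative; the query-parameter name and error handling are assumptions, not proposal text):

```python
import hashlib
import urllib.parse
import urllib.request

def fetch_context_with_integrity(url):
    query = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
    expected = query.get("sha256", [None])[0]
    if expected is None:
        raise ValueError("Context URL carries no integrity digest")
    content = urllib.request.urlopen(url).read()
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected:
        # Treat a mismatch as fatal rather than producing quads from
        # unverified context data.
        raise ValueError(f"Context digest mismatch: {actual} != {expected}")
    return content

# e.g. fetch_context_with_integrity("https://example.org/ctx.jsonld?sha256=...")
```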

Since I edited my response after you responded, I'll add what I'm after here as well. All I'd ask is that this at least be in scope for the WG to consider rather than declaring it out of scope before the WG is chartered. If the WG isn't able to reach consensus on the topic, so be it, but I'd hope we'd at least have the chance to discuss it since it has practical impacts on the expected integrity guarantees within JSON-LD.

iherman commented 3 years ago

@kdenhartog, thanks.

More concretely, when I retrieve a context, I would expect a canonicalization algorithm to check that a hash contained within the URL of a context (either within the resource path or in a query parameter is what I had in mind) matches the hash of the content retrieved. Ideally, by using the canonicalization algorithm on the retrieved context data itself.

The canonicalization algorithm, as to be defined by this WG, is serialization independent, defined on the level of RDF graphs and datasets. It is not bound to JSON-LD or any other serialization. My feeling is that what you describe would be a separate layer on top of the basic set of building blocks defined by this WG: what happens if the canonicalization (plus hash, proofs, etc.) is applied to data which is serialized in JSON-LD or, more exactly, in JSON-LD that uses an external context file.

What this means is that there is a non-negligible amount of extra standardization work to be done for something like that.

To avoid any misunderstandings: I do believe that this is valuable and, at some point, necessary work to be done. The only thing I am saying is that this is not naturally in the scope of this WG or, to put it another way, it would require a non-negligible addition and an extra deliverable in the charter, which is not an obligation we should commit to lightly (time, manpower, etc.).

One thing we could do is to add a non-normative WG Note deliverable looking at those issues, with the proviso that, once this WG has completed its work, the result could be moved towards a Recommendation in a subsequent, re-chartered WG.

msporny commented 3 years ago

One thing we could do is to add a non-normative WG Note deliverable looking at those issues, with the proviso that, once this WG has completed its work, the result could be moved towards a Recommendation in a subsequent, re-chartered WG.

I'll note that we've already provided guidance on this, and the thing that @kdenhartog has identified as a footgun has been explored in other WGs, with guidance to implementers:

https://w3c.github.io/vc-data-model/#contexts says:

The data available at https://www.w3.org/2018/credentials/v1 is a static document that is never updated and SHOULD be downloaded and cached.

https://w3c.github.io/vc-data-model/#base-context says:

The base context, located at https://www.w3.org/2018/credentials/v1 with a SHA-256 digest of ab4ddd9a531758807a79a5b450510d61ae8d147eab966cc9a200c07095b0cdcc, can be used to implement a local cached copy.

There is already a ton of guidance here for caching and remote loading of contexts:

https://www.w3.org/TR/json-ld11/#loading-documents

A documentLoader can be useful in a number of contexts where loading remote documents can be problematic:

  • Remote context documents should be cached to prevent overloading the location of the remote context for each request. Normally, an HTTP caching infrastructure might be expected to handle this, but in some contexts this might not be feasible. A documentLoader implementation might provide separate logic for performing such caching.
  • Certain well-known contexts may be statically cached within a documentLoader implementation. This might be particularly useful in embedded applications, where it is not feasible, or even possible, to access remote documents.
  • For security purposes, the act of remotely retrieving a document may provide a signal of application behavior. The judicious use of a documentLoader can isolate the application and reduce its online fingerprint.

In other words, I expect the same sort of language to be placed in the LDI spec and I don't consider it out of scope. It'll come up, and we'll do the exact same thing that we've done in every other WG that loads remote information for the generation of digital signatures -- we'll tell people that they should only work with copies that they have vetted and/or have been vetted by a trusted party.

iherman commented 3 years ago

In other words, I expect the same sort of language to be placed in the LDI spec and I don't consider it out of scope. It'll come up, and we'll do the exact same thing that we've done in every other WG that loads remote information for the generation of digital signatures -- we'll tell people that they should only work with copies that they have vetted and/or have been vetted by a trusted party.

I agree that this information is already spread around different places. But being spread around that way is not really helpful. Hence my proposal to collect all this information/guidance/whatever in one place to provide a stable reference; a separate WG Note sounds like a good place for this.

In other words, I expect the same sort of language to be placed in the LDI spec and I don't consider it out of scope.

To avoid misunderstandings: when I argued this to be out of scope, I meant out of normative scope. I do not think any of those guidances are normative, nor are they clear-cut enough today to be turned into normative statements.

dlongley commented 3 years ago

@iherman,

Standard means to canonicalize and hash the data expressed in a JSON-LD context file. Note that I am not sure that "retrieved context data" can be interpreted as an RDF graph by itself, i.e., the canonicalization algorithm could not be applied to it automatically (@dlongley can tell me if I am wrong here).

You're not wrong. A JSON-LD context file could be canonized using JCS though -- prior to hashing. But, in any case, I agree that we may want to produce a note as a place for all the information/guidances/whatever around JSON-LD context loading for security/privacy-conscious applications.
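A minimal sketch of that "JCS then hash" step (json.dumps with sorted keys and compact separators only approximates JCS / RFC 8785 — full JCS also prescribes specific number serialization rules, so a real implementation would use a proper JCS library):

```python
import hashlib
import json

def context_digest(context_obj):
    # Approximate JCS: deterministic key order, no insignificant whitespace.
    canonical = json.dumps(context_obj, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(context_digest({"@context": {"name": "http://schema.org/name"}}))
```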

gkellogg commented 3 years ago

It seems to me that this is something of a self-correcting issue. If a dataset were signed and serialized to JSON-LD and a supporting context were to change, an attempt to verify the signature would also fail.

I agree that guidance on a static context with a published digest should be described in a note, and a future JSON-LD version could verify this digest upon load, but that could just as well be done via a document loader now.

kdenhartog commented 3 years ago

I guess my concern with leaving this as a non-normative note is that it has largely been ignored in practice. Especially since many of the contexts that I've seen in VCs are using schema.org, which is in my opinion the best json-ld context available today on the web in terms of a semantic ontology for JSON-LD, and which hasn't implemented this best practice.

My hope was to try and build consensus within this WG to get normative statements in here to change things from being suggested best practices to normative requirements that force implementations to make changes to maintain the integrity of JSON-LD-specific formats. This way we could use this normative requirement as a way of clearly delineating contexts that can be relied on versus ones that can't, based on the guarantees that publishers are willing to commit to.

I think you raise a good point, @iherman, about the fact that placing this within this WG as potentially normative would certainly place a burden on the WG in time and effort to focus on a specific format rather than on the generalized RDF canonicalization. Having too much focus on a particular format will likely lead to a bit too much churn if this is a problem specific to JSON-LD only, and it may make sense to address this problem normatively within another WG or at a later date (e.g. within specific JSON-LD suites).

For my own context, since I'm not as familiar with the other RDF formats: is this a problem that is commonly encountered with those other formats, or is this really just a JSON-LD-specific format issue that all other formats have solved or don't encounter? If it's actually the case that this is a JSON-LD-only problem, I can live with this being left out of scope for this WG, and finding another standard where this requirement can be normatively stated (I'll be aiming to put this in the suite layer rather than JSON-LD) will be good enough for me.

kdenhartog commented 3 years ago

It seems to me that this is something of a self-correcting issue. If a dataset were signed and serialized to JSON-LD and a supporting context were to change, an attempt to verify the signature would also fail.

Yeah, the problem is usually more prevalent when a property isn't in the context during signing: the properties then get silently dropped during the canonicalization step. This had our whole team racking their heads when we were trying to figure out why signatures were randomly dropping properties, thinking we needed to check the input document against the output one post-signature.
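A small illustration of that silent-drop behaviour (assuming the Python pyld library; the document and context are made up): a key with no mapping in the context simply disappears from the quads, without any error.

```python
from pyld import jsonld

doc = {
    "@context": {"name": "http://schema.org/name"},   # no entry for "nickname"
    "@id": "http://example.org/alice",
    "name": "Alice",
    "nickname": "Ally"
}

print(jsonld.to_rdf(doc, {'format': 'application/n-quads'}))
# <http://example.org/alice> <http://schema.org/name> "Alice" .
# "nickname" is dropped without error, so it is silently excluded from
# whatever gets canonicalized and signed.
```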

iherman commented 3 years ago

For my own context, since I'm not as familiar with the other RDF formats: is this a problem that is commonly encountered with those other formats, or is this really just a JSON-LD-specific format issue that all other formats have solved or don't encounter? If it's actually the case that this is a JSON-LD-only problem, I can live with this being left out of scope for this WG, and finding another standard where this requirement can be normatively stated (I'll be aiming to put this in the suite layer rather than JSON-LD) will be good enough for me.

@kdenhartog yes, it is a JSON-LD-specific problem. Other serialization formats (RDFa, Turtle, RDF/XML) do not have the concept of a @context that can be fetched at parsing time. The RDFa Working Group played with the idea at some point, but decided not to pursue it.

Is there a consensus to add a non-normative deliverable? I am happy to provide a PR along those lines, alongside (or instead of?) the minor changes I proposed in https://github.com/w3c/lds-wg-charter/issues/83#issuecomment-843157747.

iherman commented 3 years ago

@kdenhartog

I guess my concern with leaving this as a non-normative note is that it has largely been ignored in practice. Especially since many of the contexts that I've seen in VCs are using schema.org, which is in my opinion the best json-ld context available today on the web in terms of a semantic ontology for JSON-LD, and which hasn't implemented this best practice.

My apologies, @kdenhartog, I meant to react to this. I agree that there is a danger of the note being ignored (although if we consider this note as some sort of “incubation” for a future normative spec, it may be different). The practical issue, beyond what we discussed, is also the fact that some of the tasks listed in https://github.com/w3c/lds-wg-charter/issues/83#issuecomment-843998182, namely the extension of JSON-LD to include a reference to the “metadata” assigned to the context file, should not be done in this Working Group in the first place; that should be a normative extension to the JSON-LD spec or left up to the IETF (for hash link finalization).

pchampin commented 3 years ago

Is there a consensus to add a non-normative deliverable?

+1, although I would be equally happy to include it in the description of the 'Primer or Best Practice document'. E.g. by adding:

"In particular, as the security of an application is only as strong as the weakest component, these documents will contain guidance and caveats about how to securely use and combine the building blocks provided by the normative deliverables."

philarcher commented 3 years ago

Trying to summarise this thread in my head ...

I had the same thought as @gkellogg - change the @context file and that changes the triples, so the signature won't verify. That's a plus in my mind.

Also, yes, we're talking about abstract RDF, not JSON-LD. If we were only talking about JSON-LD, forget the LD part and just sign the JSON (as many would say is the right thing to do anyway). But that then means we're failing several of our use cases, so, no - this is not a JSON-LD thing.

I don't speak for schema.org of course (@danbri does that) but I would be very surprised if their context file were ever declared immutable. Remember what it's for - it's to help the search engines make sense of Web content. And that changes. Dan gave me an example some years ago of a schema property whose definition changed. Why? Because people were using it to mean what the new definition said, not what the original one said. AIUI, it is dynamic by design. If you want to exchange secure, immutable data, schema.org's context file is a really bad choice. (Please correct me if I'm wrong here Dan).

And that's the issue that we keep coming back to. What does LD Integrity mean when different people have different ideas about what something means? The phrase "single source of truth" comes up in my world a lot. I have to stop myself screaming at such obvious nonsense. What does LD Integrity mean when it can depend on external resources? Forget the JSON-LD context: if a triple says "ex:alice ex:address https://example.com/myAddress" - that's an external reference, and that external reference can change, completely independently of the person making the assertion. That's an issue and, yes, by golly we need to address that. It MAY be the case that the integrity spec has something to say about noting any signatures/hashes/whatever in external resources - dunno - that's why we need a WG.

+1 to @pchampin on making this part of the Primer/Best Practice doc rather than suggesting another, separate doc. We need to be clear what integrity means in this paradigm. It cannot mean everything here is true now and forever. It can only mean some version of "I use my signature to assert these things at this time for the purpose of X. Do not assume this purpose extends to Y" - which is why I'm pleased with PR 86.

@kdenhartog these sorts of issues are ones we will need to give attention to if the WG is chartered. I'm sure @peacekeeper and I will want to give time to this discussion - and it's not going to be trivial. From my POV, that Primer and the UCR doc, are no less important than the normative standards.

And we'll be looking for editors on day 1 :-)

kdenhartog commented 3 years ago

Is there a consensus to add a non-normative deliverable? I am happy to provide a PR along those lines, alongside (or instead of?) the minor changes I proposed in #83 (comment).

Ok, I'm satisfied with leaving this as a non-normative note, and intend to fit this in a more specific JSON-LD layer. At this time, I'm thinking that's going to best fit within LDI suite(?) definitions, and I will start to introduce those concepts in some of the drafts we've already got within the CCG for now to get the ball moving in the right direction on this. Something along these lines, delineating that it will be out of scope normatively but discussed non-normatively within the group, would be good.

I don't speak for schema.org of course (@danbri does that) but I would be very surprised if their context file were ever declared immutable. Remember what it's for - it's to help the search engines make sense of Web content. And that changes. Dan gave me an example some years ago of a schema property whose definition changed. Why? Because people were using it to mean what the new definition said, not what the original one said. AIUI, it is dynamic by design. If you want to exchange secure, immutable data, schema.org's context file is a really bad choice. (Please correct me if I'm wrong here Dan).

This is incredibly helpful context around their context file. 😄 I do plan to raise that as a point of discussion within their repo and see if there's appetite for it, but if not I will look to some other ways in which we can manage these things better. For me, I care less about the immutability of the context and more about the integrity of the semantics, which is why signing the quads is more valuable to me than signing plain JSON. Couple this with a common semantic ontology and it starts to enable some really cool interoperability stories at the data layer. Hence why I find this problem so important.

@kdenhartog these sorts of issues are ones we will need to give attention to if the WG is chartered. I'm sure @peacekeeper and I will want to give time to this discussion - and it's not going to be trivial. From my POV, that Primer and the UCR doc, are no less important than the normative standards.

I'm not quite at a point where I can commit time to this note yet, but I can definitely say my interest in this is large enough that I plan to be a key contributor to the topic and content of that note if this WG is chartered.

msporny commented 3 years ago

I'm glad to see this discussion heading in a very reasonable direction. Security discussions rarely end in an "it's secure" or "it isn't secure" conclusion. Rather, it is a series of assumptions and input parameters, coupled with an array of attacks you are trying to defend against, and then a set of conclusions that only hold given the assumptions and input parameters under the given attacks.

While I can understand the desire for normative language around depending on remote information, the best we have ever been able to achieve there are SHOULD statements (which, as all of you know, tend to be as toothless as NOTEs). So the best I've ever seen a WG achieve is to document the attacks, their defense strategies, and the thinking around those (typically in what is called the Security Considerations section of a W3C specification).

iherman commented 3 years ago

I have created PR #87 adding the reference to the json-ld issues to the non-normative primer and best practice deliverables.

iherman commented 3 years ago

@danbri, in view of the recent changes (see also #87) are you o.k. closing this issue?

TallTed commented 3 years ago

Hey, folks,

Please try always to wrap @context in backticks, including in quote blocks, so as not to constantly ping that poor github user ...

(In this thread, I note comments from @danbri and @philarcher which could be edited.)

pchampin commented 2 years ago

closing, as the new version (with a reduced scope) does not contain these mentions of JSON-LD contexts anymore.