JSON-LD Context processing in HTML Documents

msporny commented 5 years ago

From this issue in the Verifiable Claims Working Group with regard to the new "full Processor" conformance class: https://github.com/w3c/vc-data-model/issues/585

@gkellogg wrote:

As this is something that may not be necessary in certain embedded environments, the notion of processor classes was introduced to allow a pure JSON Processor to conform without processing HTML. But, a full Processor is expected to do this.

@msporny wrote:

To be clear, I really dislike this feature of JSON-LD 1.1 because it raises the burden of Full JSON-LD 1.1 processors to contain an HTML processor (which is a massive requirement) on top of doing JSON-LD processing. I also think this is going to really damage the adoption of JSON-LD 1.1 and make it so much easier for people to argue against it... hell, even I would argue against "Full JSON-LD processors" (and plan to if this feature goes to REC).

@gkellogg wrote:

I appreciate your position, but JSON-LD in HTML is probably the biggest use case right now (although that will likely change with adoption of VC and WoT). JSON-LD in HTML is a reality that the spec needs to recognize and legitimize.

I agree that processing JSON-LD content in HTML is a primary use case and the WG should support it.

I disagree that people are publishing JSON-LD Contexts in HTML, that came out of nowhere. I can see what the WG is trying to do, but this issue is an example of my concern: https://github.com/w3c/vc-data-model/issues/585

You have someone suggesting that we pull in a JSON-LD Context file via an HTML document without understanding the technical burden in doing so. They don't understand that publishing a JSON-LD Context as an HTML document will require full processors.

I also note that expressing JSON-LD Contexts in HTML was not contemplated in any of the input documents to the JSON-LD WG and as such, the group is skirting very close to being in violation of their charter by adding this feature:

https://www.w3.org/2018/03/jsonld-wg-charter.html
https://github.com/json-ld/json-ld.org/wiki/Changes-in-Community-Group-Drafts-Targeted-for-1.1
https://json-ld.org/presentations/JSON-LD-Update-TPAC-2017/assets/player/KeynoteDHTMLPlayer.html

There are two major issues with this new set of features:

Enabling JSON-LD Contexts in HTML documents will silently increase the burden of consuming JSON-LD by small form factor implementations (IoT, WoT, etc.). I haven't even considered the security implications here, but I can probably create something where a native JS processor uses a different JSON-LD Context than one that doesn't do DOM processing.
There is an implied hierarchy of "good" and "not as good" in the new conformance classes. For example, it sounds like having a "full Processor" would be better than having a "pure JSON Processor".

Making the following changes to the specification would be an improvement:

Rename "full Processor" to "HTML Processor".
Remove the ability to use text/html files as JSON-LD Contexts as pure JSON Processors are not capable of processing them, which will lead to a variety of issues related to developer ergonomics.

pchampin commented 5 years ago

Rename "full Processor" to "HTML Processor".

I agree that the original naming may give the wrong impression (that other processors are somehow incomplete), and discourage some people from adopting JSON-LD. "HTML Processor" is a little misleading, but a better name could indeed be found.

Remove the ability to use text/html files as JSON-LD Contexts as pure JSON Processors are not capable of processing them, which will lead to a variety of issues related to developer ergonomics.

The argument was raised that JSON-LD Contexts are bona fide JSON-LD documents, and so it would be difficult to argue that a ~~Full~~ "Extended" processor could sometimes load JSON-LD from HTML, and sometimes not... I think this is a valid argument.

That being said, we could address your concern by replacing the Note, at the beginning of section 7, by a Warning, stating "not available in a Pure JSON-LD Processor" rather than "available in a Full Processor". And possibly hinting that content-negotiation is a more "portable" solution?...

dlongley commented 5 years ago

@pchampin,

And possibly hinting that content-negotiation is a more "portable" solution?...

I don't think we should merely "possibly hint" at this; my preference would be to make it a requirement that you MUST make your @context available as JSON. But, short of my own preferences, we should be very clear that you SHOULD do so and that if you don't, your @context won't work with every JSON-LD processor, only those that add the extra HTML feature set. I think we should be strongly encouraging JSON over HTML, but allow HTML for documentation purposes.
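As a rough illustration of the content-negotiation approach discussed here, the following Python sketch builds a request that prefers `application/ld+json` and checks whether a response media type can be read without an HTML parser. The function names and the `q=0.9` weighting are illustrative assumptions, not anything taken from the JSON-LD specifications.

```python
from urllib.request import Request

JSON_LD = "application/ld+json"


def build_context_request(url: str) -> Request:
    """Build (but do not send) a request that prefers JSON-LD, then plain JSON."""
    # The q-value is an arbitrary illustrative preference, not a spec requirement.
    return Request(url, headers={"Accept": f"{JSON_LD}, application/json;q=0.9"})


def usable_by_json_only_processor(content_type: str) -> bool:
    """Return True if the media type can be read without an HTML parser."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return (
        media_type == JSON_LD
        or media_type == "application/json"
        or media_type.endswith("+json")
    )
```

A publisher that answers such a request with `text/html` only would fail the second check, which is the ecosystem split this comment warns about.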

msporny commented 5 years ago

And possibly hinting that content-negotiation is a more "portable" solution?...

I feel stronger about this than @dlongley does... don't open up the Pandora's box of reading JSON-LD Contexts from HTML. Remove the feature. The only argument that I can see for it is that it's a "neat feature" in the academic completeness sense... but JSON-LD was never meant to be an academically complete mechanism... it was supposed to help developers publish JSON-LD, but not become so complex that it blows your foot off when you try to use it. Having this feature means that developers will inevitably publish their JSON-LD Context as HTML only, which will cause a split in the ecosystem between "We expect you to publish via HTML" and "We expect you to publish not via HTML".

msporny commented 5 years ago

"HTML Processor" is a little misleading, but a better name could indeed be found.

Isn't the only feature that the "full" processor has over the JSON-only one the fact that it parses stuff from HTML?

pchampin commented 5 years ago

Isn't the only feature that the "full" processor has over the JSON-only one the fact that it parses stuff from HTML?

Yes, but "HTML Processor" makes it sound like it can only process HTML...

gkellogg commented 5 years ago

And possibly hinting that content-negotiation is a more "portable" solution?...

I feel stronger about this than @dlongley does... don't open up the Pandora's box of reading JSON-LD Contexts from HTML. Remove the feature. The only argument that I can see for it is that it's a "neat feature" in the academic completeness sense... but JSON-LD was never meant to be an academically complete mechanism... it was supposed to help developers publish JSON-LD, but not become so complex that it blows your foot off when you try to use it. Having this feature means that developers will inevitably publish their JSON-LD Context as HTML only, which will cause a split in the ecosystem between "We expect you to publish via HTML" and "We expect you to publish not via HTML".

This was not added because it's a "neat feature", but as a response to concerns raised in #43. If JSON had a built-in commenting feature, it would likely not be necessary.

Because of this, and because of the need to normatively describe the in-the-wild JSON-LD in HTML scenarios, we provided a mechanism to do this. Once you describe JSON-LD in HTML, then allowing that for contexts and frames is a logical progression, particularly when the extraction is described in the document loader, which is the standard way to fetch all remote content.

The fact that it came up in w3c/vc-data-model#585 just goes to show a general need to be able to document contexts, and containing the context in the documenting HTML is likely a better way to keep them from diverging than using different resource formats.

I agree with @dlongley that we should better describe the potential for splitting the ecosystem by recommending (SHOULD) that publishers provide an application/ld+json version via content-negotiation and not depend on a processor's conformance with HTML processing.

azaroth42 commented 5 years ago

With chair hat on...

I also note that expressing JSON-LD Contexts in HTML was not contemplated in any of the input documents to the JSON-LD WG and as such, the group is skirting very close to being in violation of their charter

Could you point out where in the charter it says that we can only introduce features described in input documents to the WG? Because that would also preclude features like @protected, as far as I'm aware. I don't think that's, thus, relevant here unless you can find somewhere that says we're constrained in this way?

And with chair hat off ...

I agree with @gkellogg that if we say that a context is JSON-LD, and that JSON-LD can be expressed in a script element of an HTML page, then the implication is that a context can be expressed in a script element of an HTML page. If I recall correctly, @danbri has brought up this issue as a frustration of web developers.

The possible routes forward seem to be:

Allow contexts in HTML with warnings and processor conformance statement [current]
Keep contexts as JSON-LD, but introduce a rule that they cannot be in HTML [my understanding of Manu's proposal]
Make contexts a separate, non-JSON-LD media type, with a rule they must be separate documents. [Alternative that I don't think anyone actually likes, and arguably backwards incompatible]
Remain silent [I think it's too late for this - the issue has been raised, we have to address it]

I agree with @pchampin that "extended" is better than "full", along with a big warning about contexts in HTML being complicated in the spec.

BigBlueHat commented 5 years ago

JSON-LD in HTML exists and even informatively--when viewed from the HTML-perspective: https://html.spec.whatwg.org/#the-script-element:attr-script-type-4

In the current spec-space, it's already possible to extract JSON-LD from HTML and use it as JSON(-LD)--because that's how data blocks work with any embedded format (CSV, YAML, etc.).

We have gone beyond simply echoing that fact in the syntax document and instead baked additional processing steps into the API.

Shifting things into the documentLoader space does help from an architectural layering concern, but this "context in HTML" usage raises a whole host of architectural and community concerns. It effectively moves us from the current world of extracting-then-using the embedded JSON-LD into one where HTML becomes a valid representation of JSON-LD itself.

We need to work to re-narrow our focus at this stage, and go back to the "simplest thing that could possibly work."

iherman commented 5 years ago

This issue was discussed in a meeting.

No actions or resolutions
Contexts in HTML
Ivan Herman: https://github.com/w3c/json-ld-syntax/issues/172
Rob Sanderson: Summary: in the spec we say that (normatively) json-ld can be included in script el. There is now a requirement on <base>. It was noted that contexts are also jsonld. Hence, it is permissible to have contexts embedded in script tags inside html. This means that processors need to be able to process that.
… We all agree that this is an extension to normal proc mode. Either we need to say that contexts have a special role, contexts are not jsonld, or we need to accept that contexts can be embedded in html and processors should have to be able to say that they support processing them.
Manu Sporny: Some context wrt VC. Purely json-based processors find information using context. Someone said it would be nice to have human-readable context. Argument in favor of this feature.
… Person said, It would be nice to see jsonld in html. But I don’t want the burden of jsonld processor supporting html.
… We all agree that JSON-LD in HTML is a huge use case (e.g. schema.org)
… I think pulling in contexts from html is controversial
… 2 questions
… 1: does jsonld context in html greatly increase jsonld usage?
… I think the answer is no
… There are other ways to solve issues people would have to want html for contexts.
… 2: is this going to create interop issues?
… Is this going to cause ecosystem to change by other processors to start failing?
… I think this is going to create issues.
… Some people are going to start publishing contexts as html only.
… Even if we say you should not do this.
… The damage this feature could create is far greater than possible benefits.
… I have more reasons, but this is the biggest argument.
… We should wait until there is more demand for this feature. We could do it in the future if really needed.
Benjamin Young: “This means that processors need to be able to process that.”
Benjamin Young: +1 to everything Manu said.
… This is the part of what Rob said in start that jsonld in html normative somehow begets this notion that we have to …
… jsonld in html has always been normative thanks to the data block in script tag
… we just described it better
… comes from HTML5 spec.
… Using single URL to specify context and its documentation is interesting. (Conneg can be used)
… Overhead of making this possible is too big for processors.
… This is a nuclear weapon to kill a small bird.
… There are less damaging ways to solve the problems discussed.
Dave Longley: +1 to manu and bigbluehat
Manu Sporny: +1 to bigbluehat !
Benjamin Young: We need to be more careful than we have before, before introducing new things like these.
Rob Sanderson: ref - https://www.w3.org/TR/2018/WD-json-ld11-20181214/#embedding-json-ld-in-html-documents
Rob Sanderson: in 1.1, we made it our problem, so we have to solve it.
… I want to channel danbri. Search engines want to include info in their knowledge graph that they find on the web as jsonld.
… schema.org, or at least the engines, currently assume do not process contexts at all.
… If you have a template in your website, with multiple schema.org definitions, you could put into your CMS as a data script block to push this into every single page.
… search engines would be able to see these blocks
… by having google’s clusters waiting to process jsonld in page. Publishers would be required to not embed into page.
… why not have it as include contexts object?: when multiple people responsible for editing context. Also, if there are template-driven CMSs being used, you want to stripe jsonld over different templates being used. This would cause data blocks being used multiple times.
Dave Longley: Many of these things can be solved by saying that serving should happen with application/ld+json
… I think there are many cases not being considered wrt complexity
… many use cases not covered on template-based html pages
… Such as dynamic pages when generated client-side with javascript
… We shouldn’t get into that space.
… We should say that context MUST be served with proper content type
Manu Sporny: I could not follow schema.org use case. Danbri should write this up. We should do a deep analysis on this use case, to see what could address his concern.
… There are a bunch of assumptions in that use case
… e.g. people could create their own non-schema.org contexts. This would add a huge amount of complexity.
… it would be good to have dan involved.
Rob Sanderson: +1 to dlongley and manu
Manu Sporny: Also, it feels like this is migrating away from BPs.
… We are learning a lot from security around publishing contexts.
… We had discussions on the type of attacks, if you could publish contexts as html.
… So there are security concerns around this feature
… Concern around complexity, interoperability, …
… A long list of reasons for saying that this is not spec-ready.
… So we have to get use-case right. And see if it can be solved with current feature-set. Only if needed, we should look further into this html issue.
Ivan Herman: Manu said many of the things that I wanted to say. We need danbri to raise his voice.
… We have to rely on documented requirements
Rob Sanderson: I agree
Benjamin Young: I think what you described, if it’s on danbri’s previous desire to have this in jsonld, then this is a request. Dan has expressed multiple times that search engines want to understand page contents. This is different to giving identifier that serves contexts in html.
… We are going to end up with RDF dataset that is compiled of multiple contexts.
Gregg Kellogg: no, doesn’t work that way
Benjamin Young: Generating a graph is not about coupling jsonld context identifier algorithm.
Gregg Kellogg: I don’t think it is practical for many CMSs to do content negotiation (like github pages)
… we have to re-characterize what jsonld in html is.
… I agree that these things start to increase complexity and raising barrier.
… We need to reconsider what processing jsonld in html means.
Manu Sporny: +1 to re-characterize how processors process JSON-LD in HTML.
Rob Sanderson: We are not going to solve this today.
… I will reach out to danbri to see if he wants to engage.
Gregg Kellogg: He may be at WebConf

azaroth42 commented 5 years ago

From @danbri, posted with permission, after discussion with @gkellogg:

We are uncomfortable that our site (by virtue of our context url) has implicitly become a software component in a system where we don't even really know the other software components. I am considering turning off the context serving at weekend to encourage caching and more robust clients.

1b. Aside: it could be interesting to have a best practice note about how software components fetching contexts might identify themselves incl versions in http requests (user agent)

We are unhappy that the expectation of content negotiation on our home page blocks us from moving to 100% static-served site.

If we could have a small snippet of jsonld in our homepage, pointing off to a separate url with our giant big context file, that would be great

We are not interested in putting the whole context into our homepage; it is way too big. Similar issues may hold for Wikidata at some point.

We appreciate the reluctance to entangle the pure json nature of json-ld with html, but note that the success of json-ld was achieved in large part through just such an entangling

rubensworks commented 5 years ago

1b. Aside: it could be interesting to have a best practice note about how software components fetching contexts might identify themselves incl versions in http requests (user agent)

:+1: Related to this, that best practice note should also talk about caching of contexts.

We are unhappy that the expectation of content negotiation on our home page blocks us from moving to 100% static-served site.

One possible solution for this would be to allow a link header to be added to HTML documents that points towards contexts. (This may not solve all static site use cases though, as platforms like GitHub pages don't support custom link headers AFAIK)
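To make the Link header idea concrete, here is a hedged Python sketch that looks through a Link header value for a target advertised as `rel="alternate"` with `type="application/ld+json"`. The choice of `alternate` as the relation and the helper name `find_jsonld_alternate` are assumptions for illustration, and the regular expressions are a simplification, not a complete RFC 8288 parser.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Simplified patterns: a <target> followed by ;name=value parameters.
_LINK_PATTERN = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)'
)
_PARAM_PATTERN = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def find_jsonld_alternate(link_header: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the first JSON-LD alternate link, if any."""
    for target, raw_params in _LINK_PATTERN.findall(link_header):
        params = {}
        for name, quoted, bare in _PARAM_PATTERN.findall(raw_params):
            params[name.lower()] = quoted or bare
        rels = params.get("rel", "").lower().split()
        media_type = params.get("type", "").lower()
        if "alternate" in rels and media_type == "application/ld+json":
            # Relative targets are resolved against the URL that was requested.
            return urljoin(base_url, target)
    return None
```

With this shape, a processor that receives HTML could follow the header to a JSON representation without parsing the HTML body at all.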

msporny commented 5 years ago

We are uncomfortable that our site (by virtue of our context url) has implicitly become a software component in a system where we don't even really know the other software components. I am considering turning off the context serving at weekend to encourage caching and more robust clients.

I think that this would be a good thing to do. Provide guidance on aggressively caching the schema.org context (or packaging it with software implementations).

msporny commented 5 years ago

We are unhappy that the expectation of content negotiation on our home page blocks us from moving to 100% static-served site.

Then state that the new schema.org context will be served from: "https://schema.org/v1" -- make that the context, say that "https://schema.org/" is an alias for "https://schema.org/v1" and note that you will turn off content negotiation for "https://schema.org/" at the beginning of 2020.

msporny commented 5 years ago

If we could have a small snippet of jsonld in our homepage, pointing off to a separate url with our giant big context file, that would be great

Why? Seems like extra complexity... just say that the new schema.org context file is at: https://schema.org/v1 and be done with it. The schema.org context is so large that implementations will ship with it or aggressively cache it. Speaking from our implementation experience, at one point a bug caused us to go out to the web and fetch schema.org for every digital signature we did and our dev environment suffered horribly - massive performance hit. We now ship with static copies of schema.org... we never go out to the network to get the massive context (and that is the way it should be). The only issue, of course, is there is no versioning for schema.org... but we haven't had an issue w/ that yet. We may have an issue when people start digitally signing schema.org content and expecting those signatures to stay valid for 3-5 years while schema.org shifts underneath them.
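The "ship with it or aggressively cache it" approach can be sketched roughly as follows in Python. The class name `CachingContextLoader` and the injected `fetch` callable are hypothetical; this is not the API of any particular JSON-LD library, only an illustration of consulting bundled copies before touching the network.

```python
from typing import Callable, Dict, Optional


class CachingContextLoader:
    """Resolve context URLs from bundled copies first, then an in-memory cache."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        bundled: Optional[Dict[str, dict]] = None,
    ):
        self._fetch = fetch                # hypothetical network fetcher supplied by the caller
        self._cache = dict(bundled or {})  # contexts shipped with the software

    def load(self, url: str) -> dict:
        if url not in self._cache:
            # Only reached for contexts that were neither bundled nor seen before.
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

A loader built with a bundled copy of a large context never calls `fetch` for that URL, which avoids the kind of per-signature network round trip described above.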

msporny commented 5 years ago

We are not interested in putting the whole context into our homepage; it is way too big. Similar issues may hold for Wikidata at some point.

Yes, correct, so we don't need the JSON-LD Context processing in HTML documents feature. No one is asking for that feature.

msporny commented 5 years ago

We appreciate the reluctance to entangle the pure json nature of json-ld with html, but note that the success of json-ld was achieved in large part through just such an entangling

I don't understand this statement. There are a number of us that are attempting to make JSON-LD work w/ pure JSON environments in a more harmonious way and have made great strides towards that with the help of JSON-LD 1.1's @protected feature. @danbri, could you please explain what you meant by the comment above?

BigBlueHat commented 5 years ago

The only issue, of course, is there is no versioning for schema.org... but we haven't had an issue w/ that yet.

There sort of is...but it could be better. For instance, all the versions are in a directory on GitHub: https://github.com/schemaorg/schemaorg/tree/master/data/releases

The 3.7 context file (for instance) lives at https://github.com/schemaorg/schemaorg/blob/104238766458b465e6a60cc7d049f887c542563a/data/releases/3.7/schemaorgcontext.jsonld

That's versioned--via git sha's--but not tagged in git (which would help) nor made available as "the 3.7 context file" from the release history page. All of that would help certainly.

BigBlueHat commented 5 years ago

From @danbri, posted with permission, after discussion with @gkellogg:

@azaroth42 it would be helpful (if possible) to see more of that thread, or to make this an actual conversation/call (again, if possible). Without it, it's not clear we're all talking about the same thing(s).

gkellogg commented 5 years ago

@BigBlueHat this was from hallway conversations at the Web Conference, so no thread to refer to. @danbri should clarify his position, but IIRC, they could turn off content-negotiation for http(s)://schema.org and return a stub context in a script tag which references the actual JSON-LD version of the context, which could help their usage. So, for example, the schema.org web page might look something like the following:

<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Generated from headtags.tpl -->
    <meta charset="utf-8" >
    <link rel="shortcut icon" type="image/png" href="docs/favicon.ico"/>
    <link rel="stylesheet" type="text/css" href="docs/schemaorg.css" />
    <link rel="stylesheet" type="text/css" href="docs/prettify.css" />
    ...
    <script type="application/ld+json">{"@context": "https://schema.org/docs/jsonldcontext.jsonld"}</script>
    ...
</head>
</html>

Presently, content-negotiation does a redirect to https://schema.org/docs/jsonldcontext.jsonld, so this would simplify their hosting infrastructure.

BigBlueHat commented 5 years ago

Right, but it would vastly increase the amount of work a JSON-LD processor must do.

Given this as a data document:

{"@context": "https://schema.org/",
 "@type": "Person",
 "name": "me"}

The processor (without a cached context it says is valid for https://schema.org/) would need to...

GET the default (HTML) response from https://schema.org/
Parse that looking for data blocks (i.e. <script type="application/ld+json">)
1. with the added requirement that one of them says it's a context file?
Extract that JSON-LD datablock
Parse it.
If valid, GET the @context value(s).
Parse those to create a single active context for the data document.

The processing requirements go from "use an HTTP(S) client" to "use an HTTP(S) client and HTML parser (which possibly supports JavaScript)."
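A minimal Python sketch of the extraction steps listed above, assuming the HTML has already been fetched and that no JavaScript needs to run. It uses only the standard library; the names `DataBlockExtractor` and `context_references` are illustrative, and the sketch does not decide which data block "is" the context, which is exactly the open question raised in the list.

```python
import json
from html.parser import HTMLParser


class DataBlockExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # raw text of each JSON-LD data block found
        self._buffer = None  # list of text chunks while inside a data block

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            raw_type = dict(attrs).get("type") or ""
            script_type = raw_type.split(";")[0].strip().lower()
            if script_type == "application/ld+json":
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def context_references(html_text):
    """Return the @context values found in the JSON-LD data blocks of html_text."""
    extractor = DataBlockExtractor()
    extractor.feed(html_text)
    extractor.close()
    references = []
    for raw in extractor.blocks:
        try:
            document = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(document, dict) and "@context" in document:
            references.append(document["@context"])
    return references
```

Each returned reference would still have to be fetched and parsed as a context, so even this small sketch adds an HTML parsing stage and at least one extra request compared with loading a JSON context directly.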

danbri commented 5 years ago

There is a massive amount of json-ld embedded within html. Tools without the capability to extract it are ignoring one of the biggest applications of json-ld. So perhaps the burden is not quite so huge?

On Fri, 14 Jun 2019 at 16:54, BigBlueHat notifications@github.com wrote:

Right, but it would vastly increase the amount of work a JSON-LD processor must do.

Given this as a data document:

{"@context": "https://schema.org/", "@type": "Person", "name": "me"}

The processor (without a cached context it says is valid for https://schema.org/) would need to...

GET the default (HTML) response from https://schema.org/

Parse that looking for data blocks (i.e. <script type="application/ld+json">)

with the added requirement that one of them says it's a context file?

Extract that JSON-LD datablock

Parse it.

If valid, GET the @context value(s).

Parse those to create a single active context for the data document.

The processing requirements go from "use an HTTP(S) client" to "use an HTTP(S) client and HTML parser (which possibly supports JavaScript)."


BigBlueHat commented 5 years ago

@danbri certainly if you're already in that space doing that thing, you're all set. 😃 But if you're in a "pure" JSON-LD environment (database, IoT, etc), you'd very much want to avoid having higher processing requirements.

gkellogg commented 5 years ago

Right, but it would vastly increase the amount of work a JSON-LD processor must do.

Given this as a data document:
{"@context": "https://schema.org/",
 "@type": "Person",
 "name": "me"}
The processor (without a cached context it says is valid for https://schema.org/) would need to...

GET the default (HTML) response from https://schema.org/

Parse that looking for data blocks (i.e. <script type="application/ld+json">)

with the added requirement that one of them says it's a context file?

Extract that JSON-LD datablock

Parse it.

If valid, GET the @context value(s).

Parse those to create a single active context for the data document.

The processing requirements go from "use an HTTP(S) client" to "use an HTTP(S) client and HTML parser (which possibly supports JavaScript)."

Tools really need to cache contexts, anyway, so this might serve as an added incentive to do so.

iherman commented 5 years ago

This issue was discussed in a meeting.

RESOLVED: close #172 as addressed by #204
Benjamin Young: See Syntax issue #172
Benjamin Young: This issue is very related. Originally, extracting JSON-LD from HTML. This can now be done with a simple link header.
… schema.org for example does not want to use conneg, so this is good for this. Proposed closing based on the last PRs.
Gregg Kellogg: The behavior is slightly modified if you request context. Document loader will not add text/html from request. The API is not affected too much.
… If you will deal with HTML, like schema.org, then you can achieve a compatibility level with processing JSON-LD in HTML, instead of doing it mid-processing.
Dave Longley: Everything is untangled, and is cleaner now.
Ivan Herman: Users should be warned that they don’t define context as part of an HTML file.
Gregg Kellogg: We don’t have text saying that it can’t be done. We just removed the text saying that it can be done.
Ivan Herman: Because it can be done in theory?
Gregg Kellogg: Syntax doesn’t say anything about it. API doc explicitly excludes HTML.
Proposed resolution: close #172 as addressed by #204 (Benjamin Young)
Rob Sanderson: +1
Dave Longley: +1
Benjamin Young: +1
Ruben Taelman: +1
Ivan Herman: +1
Gregg Kellogg: +1
Resolution #3: close #172 as addressed by #204

w3c / json-ld-syntax

JSON-LD Context processing in HTML Documents #172