w3c / json-ld-syntax

JSON-LD 1.1 Specification
https://w3c.github.io/json-ld-syntax/

JSON-LD Context processing in HTML Documents #172

Closed msporny closed 5 years ago

msporny commented 5 years ago

From this issue in the Verifiable Claims Working Group with regard to the new "full Processor" conformance class: https://github.com/w3c/vc-data-model/issues/585

@gkellogg wrote:

As this is something that may not be necessary in certain embedded environments, the notion of processor classes was introduced to allow a pure JSON Processor to conform without processing HTML. But, a full Processor is expected to do this.

@msporny wrote:

To be clear, I really dislike this feature of JSON-LD 1.1 because it raises the burden of Full JSON-LD 1.1 processors to contain an HTML processor (which is a massive requirement) on top of doing JSON-LD processing. I also think this is going to really damage the adoption of JSON-LD 1.1 and make it so much easier for people to argue against it... hell, even I would argue against "Full JSON-LD processors" (and plan to if this feature goes to REC).

@gkellogg wrote:

I appreciate your position, but JSON-LD in HTML is probably the biggest use case right now (although that will likely change with adoption of VC and WoT). JSON-LD in HTML is a reality that the spec needs to recognize and legitimize.

I agree that processing JSON-LD content in HTML is a primary use case and the WG should support it.

I disagree that people are publishing JSON-LD Contexts in HTML, that came out of nowhere. I can see what the WG is trying to do, but this issue is an example of my concern: https://github.com/w3c/vc-data-model/issues/585

You have someone suggesting that we pull in a JSON-LD Context file via an HTML document without understanding the technical burden in doing so. They don't understand that publishing a JSON-LD Context as an HTML document will require full processors.

I also note that expressing JSON-LD Contexts in HTML was not contemplated in any of the input documents to the JSON-LD WG and as such, the group is skirting very close to being in violation of their charter by adding this feature:

https://www.w3.org/2018/03/jsonld-wg-charter.html
https://github.com/json-ld/json-ld.org/wiki/Changes-in-Community-Group-Drafts-Targeted-for-1.1
https://json-ld.org/presentations/JSON-LD-Update-TPAC-2017/assets/player/KeynoteDHTMLPlayer.html

There are two major issues with this new set of features:

Making the following changes to the specification would be an improvement:

pchampin commented 5 years ago

Rename "full Processor" to "HTML Processor".

I agree that the original naming may give the wrong impression (that other processors are somehow incomplete), and discourage some people from adopting JSON-LD. "HTML Processor" is a little misleading, but a better name could indeed be found.

Remove the ability to use text/html files as JSON-LD Contexts as pure JSON Processors are not capable of processing them, which will lead to a variety of issues related to developer ergonomics.

The argument was raised that JSON-LD Contexts are bona fide JSON-LD documents, and so it would be difficult to argue that a Full "Extended" processor could sometimes load JSON-LD from HTML, and sometimes not... I think this is a valid argument.

That being said, we could address your concern by replacing the Note, at the beginning of section 7, by a Warning, stating "not available in a Pure JSON-LD Processor" rather than "available in a Full Processor". And possibly hinting that content-negotiation is a more "portable" solution?...

dlongley commented 5 years ago

@pchampin,

And possibly hinting that content-negotiation is a more "portable" solution?...

I don't think we should merely "possibly hint" at this; my preference would be to make it a requirement that you MUST make your @context available as JSON. But, short of my own preferences, we should be very clear that you SHOULD do so and that if you don't, your @context won't work with every JSON-LD processor, only those that add the extra HTML feature set. I think we should be strongly encouraging JSON over HTML, but allow HTML for documentation purposes.

msporny commented 5 years ago

And possibly hinting that content-negotiation is a more "portable" solution?...

I feel stronger about this than @dlongley does... don't open up the Pandora's box of reading JSON-LD Contexts from HTML. Remove the feature. The only argument that I can see for it is that it's a "neat feature" in the academic completeness sense... but JSON-LD was never meant to be an academically complete mechanism... it was supposed to help developers publish JSON-LD, but not become so complex that it blows your foot off when you try to use it. Having this feature means that developers will inevitably publish their JSON-LD Context as HTML only, which will cause a split in the ecosystem between "We expect you to publish via HTML" and "We expect you to publish not via HTML".

msporny commented 5 years ago

"HTML Processor" is a little misleading, but a better name could indeed be found.

Isn't the only feature that the "full" processor has over the JSON-only one the fact that it parses stuff from HTML?

pchampin commented 5 years ago

Isn't the only feature that the "full" processor has over the JSON-only one the fact that it parses stuff from HTML?

Yes, but "HTML Processor" makes it sound like it can only process HTML...

gkellogg commented 5 years ago

And possibly hinting that content-negotiation is a more "portable" solution?...

I feel stronger about this than @dlongley does... don't open up the Pandora's box of reading JSON-LD Contexts from HTML. Remove the feature. The only argument that I can see for it is that it's a "neat feature" in the academic completeness sense... but JSON-LD was never meant to be an academically complete mechanism... it was supposed to help developers publish JSON-LD, but not become so complex that it blows your foot off when you try to use it. Having this feature means that developers will inevitably publish their JSON-LD Context as HTML only, which will cause a split in the ecosystem between "We expect you to publish via HTML" and "We expect you to publish not via HTML".

This was not added because it's a "neat feature", but as a response to concerns raised in #43. If JSON had a built-in commenting feature, it would likely not be necessary.

Because of this, and the need to normatively describe the in-the-wild JSON-LD in HTML scenarios, we provided a mechanism to do this. Once you describe JSON-LD in HTML, then allowing that for contexts and frames is a logical progression, particularly when the extraction is described in the document loader, which is the standard way to fetch all remote content.

The fact that it came up in w3c/vc-data-model#585 just goes to show a general need to be able to document contexts, and containing the context in the documenting HTML is likely a better way to keep them from diverging than using different resource formats.

I agree with @dlongley that we should better describe the potential for splitting the ecosystem by recommending (SHOULD) that publishers provide an application/ld+json version via content-negotiation and not depend on a processor's conformance with HTML processing.
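
A minimal client-side sketch of the content-negotiation recommendation above, assuming a processor that prefers the application/ld+json representation and wants to detect when it has been handed HTML instead. The helper names `build_context_request` and `is_json_representation`, and the q-value in the Accept header, are illustrative assumptions; nothing here is taken from the spec text.

```python
from urllib.request import Request

# Media type a pure JSON processor can handle without an HTML parser.
JSONLD = "application/ld+json"


def build_context_request(url: str) -> Request:
    """Build (but do not send) a GET request that prefers JSON-LD.

    Hypothetical helper: the q-value is an illustrative choice.
    """
    accept = f"{JSONLD}, application/json;q=0.9"
    return Request(url, headers={"Accept": accept})


def is_json_representation(content_type: str) -> bool:
    """Return True if a Content-Type header value names a JSON media type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in (JSONLD, "application/json") or media_type.endswith("+json")
```

A processor without HTML support could call `is_json_representation` on the response's Content-Type and fail early with a clear error when a publisher serves only text/html.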

azaroth42 commented 5 years ago

With chair hat on...

I also note that expressing JSON-LD Contexts in HTML was not contemplated in any of the input documents to the JSON-LD WG and as such, the group is skirting very close to being in violation of their charter

Could you point out where in the charter it says that we can only introduce features described in input documents to the WG? Because that would also preclude features like @protected, as far as I'm aware. So I don't think that's relevant here, unless you can find somewhere that says we're constrained in this way.

And with chair hat off ...

I agree with @gkellogg that if we say that a context is JSON-LD, and that JSON-LD can be expressed in a script element of an HTML page, then the implication is that a context can be expressed in a script element of an HTML page. If I recall correctly, @danbri has brought up this issue as a frustration of web developers.

The possible routes forward seem to be:

I agree with @pchampin that "extended" is better than "full", along with a big warning about contexts in HTML being complicated in the spec.

BigBlueHat commented 5 years ago

JSON-LD in HTML exists, and is even mentioned informatively when viewed from the HTML perspective: https://html.spec.whatwg.org/#the-script-element:attr-script-type-4

In the current spec-space, it's already possible to extract JSON-LD from HTML and use it as JSON(-LD)--because that's how data blocks work with any embedded format (CSV, YAML, etc.).

We have gone beyond simply echoing that fact in the syntax document and instead baked additional processing steps into the API.

Shifting things into the documentLoader space does help from an architectural layering concern, but this "context in HTML" usage raises a whole host of architectural and community concerns. It effectively moves us from the current world of extracting-then-using the embedded JSON-LD into one where HTML becomes a valid representation of JSON-LD itself.

We need to work to re-narrow our focus at this stage, and go back to the "simplest thing that could possibly work."

iherman commented 5 years ago

This issue was discussed in a meeting.

azaroth42 commented 5 years ago

From @danbri, posted with permission, after discussion with @gkellogg:

  1. We are uncomfortable that our site (by virtue of our context url) has implicitly become a software component in a system where we don't even really know the other software components. I am considering turning off the context serving at weekend to encourage caching and more robust clients.

1b. Aside: it could be interesting to have a best practice note about how software components fetching contexts might identify themselves incl versions in http requests (user agent)

  2. We are unhappy that the expectation of content negotiation on our home page blocks us from moving to 100% static-served site.

  3. If we could have a small snippet of jsonld in our homepage, pointing off to a separate url with our giant big context file, that would be great

  4. We are not interested in putting the whole context into our homepage; it is way too big. Similar issues may hold for Wikidata at some point.

  5. We appreciate the reluctance to entangle the pure json nature of json-ld with html, but note that the success of json-ld was achieved in large part through just such an entangling

rubensworks commented 5 years ago

1b. Aside: it could be interesting to have a best practice note about how software components fetching contexts might identify themselves incl versions in http requests (user agent)

:+1: Related to this, that best practice note should also talk about caching of contexts.

We are unhappy that the expectation of content negotiation on our home page blocks us from moving to 100% static-served site.

One possible solution for this would be to allow a link header to be added to HTML documents that points towards contexts. (This may not solve all static site use cases though, as platforms like GitHub pages don't support custom link headers AFAIK)
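
A rough sketch of how a client might act on such a Link header, assuming the server advertises the context with a `type="application/ld+json"` parameter (for example alongside `rel="alternate"`). The function name `find_jsonld_link` is hypothetical, and the parser is deliberately simplified: it handles only well-formed headers and does not implement all of RFC 8288.

```python
import re
from typing import Optional

# Matches one link-value: <target> followed by its ;param=value pairs.
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def find_jsonld_link(link_header: str) -> Optional[str]:
    """Return the first link target whose type is application/ld+json.

    Returns None when no such link is present.
    """
    for target, raw_params in _LINK_RE.findall(link_header):
        # Collect parameters, accepting both quoted and bare values.
        params = {
            name.lower(): (quoted or bare)
            for name, quoted, bare in _PARAM_RE.findall(raw_params)
        }
        if params.get("type", "").lower() == "application/ld+json":
            return target
    return None
```

The returned target may be relative, so a caller would still need to resolve it against the URL of the HTML document before fetching it.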

msporny commented 5 years ago

We are uncomfortable that our site (by virtue of our context url) has implicitly become a software component in a system where we don't even really know the other software components. I am considering turning off the context serving at weekend to encourage caching and more robust clients.

I think that this would be a good thing to do. Provide guidance on aggressively caching the schema.org context (or packaging it with software implementations).

msporny commented 5 years ago

We are unhappy that the expectation of content negotiation on our home page blocks us from moving to 100% static-served site.

Then state that the new schema.org context will be served from: "https://schema.org/v1" -- make that the context, say that "https://schema.org/" is an alias for "https://schema.org/v1" and note that you will turn off content negotiation for "https://schema.org/" at the beginning of 2020.

msporny commented 5 years ago

If we could have a small snippet of jsonld in our homepage, pointing off to a separate url with our giant big context file, that would be great

Why? Seems like extra complexity... just say that the new schema.org context file is at: https://schema.org/v1 and be done with it. The schema.org context is so large that implementations will ship with it or aggressively cache it. Speaking from our implementation experience, at one point a bug caused us to go out to the web and fetch schema.org for every digital signature we did and our dev environment suffered horribly - massive performance hit. We now ship with static copies of schema.org... we never go out to the network to get the massive context (and that is the way it should be). The only issue, of course, is there is no versioning for schema.org... but we haven't had an issue w/ that yet. We may have an issue when people start digitally signing schema.org content and expecting those signatures to stay valid for 3-5 years while schema.org shifts underneath them.
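
The "ship static copies" approach described above could be sketched roughly as follows. This is not the implementation referred to in the comment; the class name `StaticContextLoader` and its interface are assumptions made for illustration, and the loader simply refuses any URL it has no bundled copy for instead of going to the network.

```python
import json
from typing import Any, Dict


class StaticContextLoader:
    """Hypothetical document loader that serves contexts only from a local cache."""

    def __init__(self, bundled: Dict[str, str]) -> None:
        # Map of context URL -> JSON text shipped with the application.
        self._bundled = dict(bundled)
        # Parsed contexts, filled lazily on first use.
        self._parsed: Dict[str, Any] = {}

    def load(self, url: str) -> Any:
        """Return the parsed context for url, never touching the network."""
        if url not in self._parsed:
            if url not in self._bundled:
                raise LookupError(f"no bundled context for {url}")
            self._parsed[url] = json.loads(self._bundled[url])
        return self._parsed[url]
```

Failing loudly on unknown URLs is a design choice in this sketch: it makes an accidental per-operation network fetch, like the bug described above, show up as an error instead of a silent performance hit.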

msporny commented 5 years ago

We are not interested in putting the whole context into our homepage; it is way too big. Similar issues may hold for Wikidata at some point.

Yes, correct, so we don't need the JSON-LD Context processing in HTML documents feature. No one is asking for that feature.

msporny commented 5 years ago

We appreciate the reluctance to entangle the pure json nature of json-ld with html, but note that the success of json-ld was achieved in large part through just such an entangling

I don't understand this statement. There are a number of us that are attempting to make JSON-LD work w/ pure JSON environments in a more harmonious way and have made great strides towards that with the help of JSON-LD 1.1's @protected feature. @danbri, could you please explain what you meant by the comment above?

BigBlueHat commented 5 years ago

The only issue, of course, is there is no versioning for schema.org... but we haven't had an issue w/ that yet.

There sort of is...but it could be better. For instance, all the versions are in a directory on GitHub: https://github.com/schemaorg/schemaorg/tree/master/data/releases

The 3.7 context file (for instance) lives at https://github.com/schemaorg/schemaorg/blob/104238766458b465e6a60cc7d049f887c542563a/data/releases/3.7/schemaorgcontext.jsonld

That's versioned--via git sha's--but not tagged in git (which would help) nor made available as "the 3.7 context file" from the release history page. All of that would help certainly.

BigBlueHat commented 5 years ago

From @danbri, posted with permission, after discussion with @gkellogg:

@azaroth42 it would be helpful (if possible) to see more of that thread, or to make this an actual conversation/call (again, if possible). Without it, it's not clear we're all talking about the same thing(s).

gkellogg commented 5 years ago

@BigBlueHat this was from hallway conversations at the Web Conference, so no thread to refer to. @danbri should clarify his position, but IIRC, they could turn off content-negotiation for http(s)://schema.org and return a stub context in a script tag which references the actual JSON-LD version of the context, which could help their usage. So, for example, the schema.org web page might look something like the following:

<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Generated from headtags.tpl -->
    <meta charset="utf-8" >
    <link rel="shortcut icon" type="image/png" href="docs/favicon.ico"/>
    <link rel="stylesheet" type="text/css" href="docs/schemaorg.css" />
    <link rel="stylesheet" type="text/css" href="docs/prettify.css" />
    ...
    <script type="application/ld+json">{"@context": "https://schema.org/docs/jsonldcontext.jsonld"}</script>
    ...
</head>
</html>

Presently, content-negotiation does a redirect to https://schema.org/docs/jsonldcontext.jsonld, so this would simplify their hosting infrastructure.

BigBlueHat commented 5 years ago

Right, but it would vastly increase the amount of work a JSON-LD processor must do.

Given this as a data document:

{"@context": "https://schema.org/",
 "@type": "Person",
 "name": "me"}

The processor (without a cached context it says is valid for https://schema.org/) would need to...

  1. GET the default (HTML) response from https://schema.org/
  2. Parse that looking for data blocks (i.e. <script type="application/ld+json">)
    1. with the added requirement that one of them says it's a context file?
  3. Extract that JSON-LD datablock
  4. Parse it.
  5. If valid, GET the @context value(s).
  6. Parse those to create a single active context for the data document.

The processing requirements go from "use an HTTP(S) client" to "use an HTTP(S) client and an HTML parser (which possibly supports JavaScript)".
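
Steps 2 to 4 of the list above can be sketched with Python's standard-library HTML parser. This is only an illustration under stated assumptions: the names `JsonLdScriptExtractor` and `extract_jsonld` are hypothetical, the network fetches in steps 1 and 5 and the context processing in step 6 are left out, and no JavaScript is executed.

```python
import json
from html.parser import HTMLParser
from typing import Any, List


class JsonLdScriptExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_block = False
        self._buffer: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            # Ignore any media type parameters after ";".
            if script_type.split(";", 1)[0].strip() == "application/ld+json":
                self._in_block = True
                self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_block:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_block:
            self.blocks.append("".join(self._buffer))
            self._in_block = False


def extract_jsonld(html: str) -> List[Any]:
    """Find the data blocks, extract them, and parse each one as JSON."""
    parser = JsonLdScriptExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(block) for block in parser.blocks]
```

Even this small sketch shows the extra dependency being debated: a processor that only ever receives application/ld+json needs none of it.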

danbri commented 5 years ago

There is a massive amount of json-ld embedded within html. Tools without the capability to extract it are ignoring one of the biggest applications of json-ld. So perhaps the burden is not quite so huge?

On Fri, 14 Jun 2019 at 16:54, BigBlueHat notifications@github.com wrote:

Right, but it would vastly increase the amount of work a JSON-LD processor must do.

Given this as a data document:

{"@context": "https://schema.org/", "@type": "Person", "name": "me"}

The processor (without a cached context it says is valid for https://schema.org/) would need to...

  1. GET the default (HTML) response from https://schema.org/
  2. Parse that looking for data blocks (i.e. <script type="application/ld+json">)
    1. with the added requirement that one of them says it's a context file?
  3. Extract that JSON-LD datablock
  4. Parse it.
  5. If valid, GET the @context value(s).
  6. Parse those to create a single active context for the data document.

The processing requirements go from "use an HTTP(S) client" to "use an HTTP(S) client and an HTML parser (which possibly supports JavaScript)".


BigBlueHat commented 5 years ago

@danbri certainly if you're already in that space doing that thing, you're all set. 😃 But if you're in a "pure" JSON-LD environment (database, IoT, etc), you'd very much want to avoid having higher processing requirements.

gkellogg commented 5 years ago

Right, but it would vastly increase the amount of work a JSON-LD processor must do.

Given this as a data document:

{"@context": "https://schema.org/",
 "@type": "Person",
 "name": "me"}

The processor (without a cached context it says is valid for https://schema.org/) would need to...

  1. GET the default (HTML) response from https://schema.org/
  2. Parse that looking for data blocks (i.e. <script type="application/ld+json">)

    1. with the added requirement that one of them says it's a context file?
  3. Extract that JSON-LD datablock
  4. Parse it.
  5. If valid, GET the @context value(s).
  6. Parse those to create a single active context for the data document.

The processing requirements go from "use an HTTP(S) client" to "use an HTTP(S) client and an HTML parser (which possibly supports JavaScript)".

Tools really need to cache contexts, anyway, so this might serve as an added incentive to do so.

iherman commented 5 years ago

This issue was discussed in a meeting.