w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

Is it acceptable to use HTML for the serialization of some infoset items, or should it all be in separate (JSON) file? #193

Closed iherman closed 6 years ago

iherman commented 6 years ago

This discussion has permeated many of the various issues (e.g., lately, #159, #181, or #186). It would help to get this design principle settled once and for all. In practice, the issue is whether the "entry (HTML) page" can be used to hold the infoset items or not.

Note that the answer may not be clear-cut, and may depend on the nature of the infoset items. Indeed, it is different if:

  1. the item is, loosely speaking, some sort of metadata, i.e., expressible via an HTML <meta> or <link> element (e.g., creation or modification date, or links to an ONIX file)
  2. the item is a slightly more complex structure that cannot be expressed fully in the HTML header (e.g., language and base direction)
  3. the item is something that would naturally be expressed in HTML, and often is (e.g., the Table of Contents)

Another aspect that influences this decision is whether the WP consists of a single HTML file (which is also the entry page), with adjunct files like CSS or images. This is the typical case for, e.g., a scholarly journal article.

iherman commented 6 years ago

Trying to collect pros and cons, based also on earlier discussions. (Let us try to collect all the Pro/Con arguments in the most concise manner possible to make an informed decision...)

Note that, although we are not discussing WAM-s, the arguments in the section on the same issue in the WAM document (and the links in there) are also relevant.

llemeurfr commented 6 years ago

Note that editing such an infoset by hand would be equally difficult in the HTML and JSON cases. Whatever the choice between a highly specific web page and a JSON structure, an authoring tool seems mandatory, which leads to an additional Con.

HadrienGardeur commented 6 years ago

Just a few quick notes first:

I'd also like to list an additional con: may require additional network requests that could block the processing of the WP.

If I discover a publication through one of its chapters, this means that:

Since I'll only be able to discover these additional HTML resources through the manifest, this means that these fetch requests (plus all the processing related to HTML) will have to be done sequentially and not in parallel.

The majority of the pros listed by @iherman could also be challenged IMO because they're mixing up two different issues:

In the case of a single-HTML document, I don't think that using <meta> + <link> + potentially RDFa is in any way better than just embedding JSON-LD in the HTML document.

There are fewer semantic issues with JSON-LD (the metadata is not necessarily about the document that contains it) and I would argue that it's easier to author JSON-LD than RDFa.
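To make the "not necessarily about the document that contains it" point concrete, here is a minimal sketch. The payload, the URLs, and the `describe` helper are all hypothetical; the fallback for a missing `@id` is a simplification for illustration, not what JSON-LD itself prescribes.

```python
import json

# Hypothetical JSON-LD payload, as it might sit inside a
# <script type="application/ld+json"> element of a chapter page.
EMBEDDED = """
{
  "@context": "https://schema.org",
  "@type": "Book",
  "@id": "https://example.com/moby-dick/",
  "name": "Moby-Dick",
  "author": {"@type": "Person", "name": "Herman Melville"}
}
"""


def describe(jsonld_text, containing_page):
    """Report which resource the embedded metadata describes."""
    data = json.loads(jsonld_text)
    # Simplification (assumption): with no @id, treat the page itself as the subject.
    subject = data.get("@id", containing_page)
    return {
        "subject": subject,
        "name": data.get("name"),
        "about_containing_page": subject == containing_page,
    }


info = describe(EMBEDDED, "https://example.com/moby-dick/c1.html")
print(info)
```

Here the subject is the publication URL even though the JSON-LD is carried by a chapter page, which is the property being claimed for JSON-LD above.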

To go back to the list of pros, we could also say that JSON-LD embedded in HTML is:

I'm not really buying the redundancy arguments (we're not expressing the same information) or the more "natural bridge" one (browsers ignore the vast majority of metadata and links that we would end up using in HTML).

I'd like to hear @BCWalters's opinion on this as well. Now that we have a major browser actively participating in this WG, I think there's a lot of value in what they have to say about this.

RachelComerford commented 6 years ago

There is a business consideration that weighs into the HTML vs JSON question: it is easier and cheaper for me to find HTML coding resources than JSON coding resources, and my team is less likely to follow a standard that is (even more) expensive to maintain. To confirm this, I reached out to our most commonly used vendors. All replied that they would need time to staff up and train JSON developers but that they had plenty of HTML developers on staff.

iherman commented 6 years ago

Thanks @RachelComerford, this is a very important, non-technical point...

BigBlueHat commented 6 years ago

@iherman

the item is, loosely speaking, some sort of a metadata, i.e., expressible via an HTML <meta> or <link> element (e.g., creation modification date or links to an ONIX file)

I'd not limit it to just <meta> and <link>. The growth and widespread usage of data-in-HTML formats (RDFa, Microdata, JSON-LD) show that developers and web publishers do know how to put metadata in their publications and apps, and are already incentivized to do so because search engines. Why not follow suit rather than creating a different, currently unknown, out-of-band location to look for metadata?

BigBlueHat commented 6 years ago

Con: per the HTML standard, the <meta> element's role is to express "document-level metadata" (see html5; emphasis is mine). Using it for expressing metadata for other entities (ie, the WP) is semantically not clean. (In the case of a single HTML file based WP one could argue that the document and the HTML file is the same, which would make it all right.)

Depending on how this is modeled and "gone about", it may be that the "binding" document is indistinguishable from the publication itself.

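Or, alternatively:

  • publication address might be http://example.com/moby-dick/
  • binding document (currently "entry point") is returned upon that request (i.e. index.html per most server defaults), but has its own URL http://example.com/moby-dick/index.html and consequently could have its own metadata (in <meta> or wherever).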

Regardless, this is easily avoidable...so not an implicit "con."

Con: the metadata may be used for various purposes handling the WP instance itself, eg, indexing, bookshelves, etc. Parsing an HTML file to extract the information, though would use a standard toolset, requires a significant effort for the User Agent: parsing the HTML, building the DOM, the CSS DOM, the Accessibility DOM, etc, before giving access to the element. Compared to that, parsing a JSON file into Javascript structures is a breeze.

There's no requirement that a DOM, CSSOM, Accessibility OM, etc. be setup or available when extracting metadata from HTML files. It's possible to get it directly out of the markup without those things.
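As a rough illustration of "directly out of the markup": an event-based parser can report `<title>`, `<meta>` and `<link>` as it scans the text, without ever building a tree. This is only a sketch using Python's standard `html.parser`; the `HeadMetadata` class and the sample document (including the ONIX link) are made up for the example, and it says nothing about what a real user agent would choose to do.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collect title, <meta name=...> and <link rel=...> from the head, event by event."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False
        self.done = False  # set once </head> has been seen

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in a:
            self.meta[a["name"]] = a.get("content", "")
        elif tag == "link" and "rel" in a and "href" in a:
            self.links.append((a["rel"], a["href"]))

    def handle_data(self, data):
        if self._in_title and not self.done:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "head":
            self.done = True  # ignore everything in the body


DOC = """<!DOCTYPE html>
<html lang="en">
<head>
  <title>Moby-Dick</title>
  <meta name="author" content="Herman Melville">
  <link rel="alternate" href="onix.xml">
</head>
<body><p>Call me Ishmael.</p></body>
</html>"""

head = HeadMetadata()
head.feed(DOC)
print(head.title, head.meta, head.links)
```

No DOM, CSSOM or accessibility tree is involved here; whether implementations would actually ship such a second code path is a separate question.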

Additionally, when "browsed to" the browser will provide all those things, and could potentially make that data more easily extract-able by the developer (or within the UI of the browser).

Con: having both a manifest file and some data in some HTML resources complicates implementations that should follow a more complex path to get hold of the infoset item. (Note, however, that this argument has less weight than ease of authoring; there are more authors than implementers...)

Couldn't agree more...but that's not a "con" of an HTML-driven approach to these problems.

If, for instance, all the primary resources are referenced from an HTML-based "binding document" (perhaps through something like a latent-loading <iframe> or a <nav role="doc-toc"> like thing), then the request and processing needs are already defined and taken care of by the browser and the HTTP ecosystem specs (CORS, CSP, etc). However, if they're in the JSON (as noted in #104), there's an unknown relationship with the things stated there and the rest of the request/response processing constraints, browsing contexts, etc (again; hence #104). So...that's ultimately a "vote" for primary resources to be expressed from within the HTML.

Each of the current infoset items are expressible from within an HTML document (see my last comment for a handful of options), and what's needed next is to know how to enhance their expressions as available now such that they are more useful.

Moving such core concepts as the primary resources or redundantly expressing dependencies into a separate "manifest file" is duplication, will cause errors when out-of-sync, does create an over dependence on tooling, and ultimately puts the processing power out of the reach of the publisher/developer and into the hands of the "reading system" developer exclusively.

Consequently, I'd not see our currently defined <nav> processing algorithm as a fallback, but as the expression (or something like it) of the primary resources.
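For illustration only, reading an ordered list of primary resources out of a `<nav>` could look something like the sketch below. It is not the `<nav>` processing algorithm from the draft; the `NavLinks` class and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class NavLinks(HTMLParser):
    """Collect (href, link text) pairs for every <a> found inside a <nav>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <nav> elements we are currently inside
        self.links = []       # results, in document order
        self._current = None  # [href, text] of the <a> being read

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            href = dict(attrs).get("href")
            if href is not None:
                self._current = [href, ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append((self._current[0], self._current[1].strip()))
            self._current = None
        elif tag == "nav" and self.depth > 0:
            self.depth -= 1


DOC = """<body>
  <a href="about.html">About</a>
  <nav>
    <ol>
      <li><a href="c1.html">One</a></li>
      <li><a href="c2.html">Two</a></li>
    </ol>
  </nav>
</body>"""

nav = NavLinks()
nav.feed(DOC)
print(nav.links)
```

The link outside the `<nav>` is deliberately ignored, so the list order is the only statement of reading order in this sketch.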

Ultimately, we'd go through the same process of finding homes for each of the infoset things in the HTML "binding document" (which is clearer than "entry point"), remove them from a/the JSON serialization until we find things that must be expressed in JSON.

tl;dr web publications exist already (built from HTML, JS, CSS, RDFa, etc), so how do we make them better, stronger, faster, more accessible, offline-able, etc.

deborahgu commented 6 years ago

I'd like to make another non-technical point: we should not be creating complex creation systems for publishers. Descriptive metadata, including navigation items, should go in as few files, and as few formats, as is technically possible.

As Ivan said:

Pro: mainly in the case of single HTML file based WP this is a natural way of expressing the information.

If we tell publishers "in order to create a WP, you need to put this infoset data over here in HTML, and this infoset data over there in JSON," we're raising the barrier to entry for anyone who doesn't have a WP-aware authoring tool.

IMO, much better to choose an imperfect design which publishers will actually be able to use than the most perfectest beautifullest awesomest architecture which is a pain for creators.

(I have no horse in the race of actual location and format, and personally I'd be happiest if all the players in this conversation came to a place where they realize that no solution is perfect and all the people disagreeing have valid points. Unfortunately a classic compromise is the worst possible solution, because we really just need to pick one. There is literally no solution on offer without cons; we still have to choose one and move on to the rest of the work.)

iherman commented 6 years ago

@BigBlueHat

@iherman

the item is, loosely speaking, some sort of a metadata, i.e., expressible via an HTML <meta> or <link> element (e.g., creation modification date or links to an ONIX file)

I'd not limit it to just <meta> and <link>. The growth and widespread usage of data-in-HTML formats (RDFa, Microdata, JSON-LD) show that developers and web publishers do know how to put metadata in their publications and apps, and are already incentivized to do so because search engines. Why not follow suit rather than creating a different, currently unknown, out-of-band location to look for metadata?

I know there can be more data than just <meta> or <link>. But I believe the characterization of a "different, currently unknown, out-of-band location to look for metadata" is a bit harsh. Putting metadata into a separate file and linking to it is not a new approach (see, beyond the WAM, the work on Payment Method Manifest). It was also the routine approach to get to metadata before the creation of RDFa, with the metadata stored in different formats, be that Turtle or (God forbid!) RDF/XML. (This was, e.g., the way to refer to CC metadata from an HTML page.)

iherman commented 6 years ago

@BigBlueHat,

Con: per the HTML standard, the <meta> element's role is to express "document-level metadata" (see html5; emphasis is mine). Using it for expressing metadata for other entities (ie, the WP) is semantically not clean. (In the case of a single HTML file based WP one could argue that the document and the HTML file is the same, which would make it all right.)

Depending on how this is modeled and "gone about", it may be that the "binding" document is indistinguishable from the publication itself.

Yes, I agree; this is the case of a single-document WP; this is one of the "Pro" arguments.

Or, alternatively:

  • publication address might be http://example.com/moby-dick/
  • binding document (currently "entry point") is returned upon that request (i.e. index.html per most server defaults), but has its own URL http://example.com/moby-dick/index.html and consequently could have its own metadata (in <meta> or wherever).

Sorry, but I do not agree. The quoted HTML specification does not refer to a URL, it refers to the document itself, whichever path was used to get there. I believe the HTML standard is pretty clear about it. If we use the HTML headers, we should simply accept that we are willfully overstepping the bounds that the HTML standard defines (but I am not sure the rest of the community would accept it, we may face major objections).

Regardless, this is easily avoidable...so not an implicit "con."

I think we have to agree that we disagree on that point.

Con: the metadata may be used for various purposes handling the WP instance itself, eg, indexing, bookshelves, etc. Parsing an HTML file to extract the information, though would use a standard toolset, requires a significant effort for the User Agent: parsing the HTML, building the DOM, the CSS DOM, the Accessibility DOM, etc, before giving access to the element. Compared to that, parsing a JSON file into Javascript structures is a breeze.

There's no requirement that a DOM, CSSOM, Accessibility OM, etc. be setup or available when extracting metadata from HTML files. It's possible to get it directly out of the markup without those things.

This is theoretically correct, but I do not think it is practically true. Any implementation will use one of the many, possibly "built-in" HTML parsers, and all those parsers build up the DOM. I do not think we can expect an implementation to have a different parser that would just look at the syntax or do some other tricks.

Additionally, when "browsed to" the browser will provide all those things, and could potentially make that data more easily extract-able by the developer (or within the UI of the browser).

I am not sure I understand what you mean. Yes, of course, if the UA begins to render, display, etc, the WP, then those data are already there, because they are in the DOM. The "Con" is for the cases when, say, the Reading System or the browser builds up, say, bookshelf, for which a number of Infoset items are necessary.

That being said, if we go along with the idea of finding the manifest file via a <link> element, then the same problem applies. So this may be one of the 'con'-s that we have to live with whatever we do, and we can consider it neutral in our discussions:-)
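A minimal sketch of that discovery step, to show why the HTML still has to be read first: the manifest URL only becomes known after scanning the page for the `<link>`. The `rel` value `publication`, the file name `manifest.json`, and the helper names are hypothetical, since none of this has been decided.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

MANIFEST_REL = "publication"  # hypothetical rel value, not defined by any spec here


class ManifestLinkFinder(HTMLParser):
    """Remember the href of the first <link> whose rel list contains MANIFEST_REL."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.href is None:
            a = dict(attrs)
            rels = (a.get("rel") or "").lower().split()
            if MANIFEST_REL in rels and a.get("href"):
                self.href = a["href"]


def find_manifest_url(html_text, page_url):
    """Return the absolute manifest URL advertised by the page, or None."""
    finder = ManifestLinkFinder()
    finder.feed(html_text)
    return urljoin(page_url, finder.href) if finder.href else None


PAGE = '<html><head><link rel="publication" href="manifest.json"></head><body></body></html>'
url = find_manifest_url(PAGE, "https://example.com/moby-dick/c1.html")
print(url)
```

Only after this step can the JSON be fetched and parsed, so the HTML parsing cost is paid in both designs.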

Con: having both a manifest file and some data in some HTML resources complicates implementations that should follow a more complex path to get hold of the infoset item. (Note, however, that this argument has less weight than ease of authoring; there are more authors than implementers...)

Couldn't agree more...but that's not a "con" of an HTML-driven approach to these problems.

True... except that it remains to be proven that all infoset items can be expressed easily and in a user-friendly manner via the current HTML element set.

To take an example: we did say that the language tag in a content file (ie, an HTML file) is not the same as the language tag for the publication as a whole. In other words, the regular @lang attribute in the HTML file cannot be used as an encoding of the relevant infoset items: it has to be put somewhere else. We will have to define our own definitions for that in some way or other, which will not look very natural in HTML (and hence not very user friendly). We may have similar issues with, say, the title of the WP...
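To illustrate the distinction, here is a sketch in which the publication-level language lives in a JSON structure while a content file carries its own `lang`. The `inLanguage` key is borrowed from schema.org purely as an assumption; the manifest shape and the sample chapter are invented for the example.

```python
import json
from html.parser import HTMLParser


class RootLang(HTMLParser):
    """Pick up the lang attribute of the root <html> element."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


# Hypothetical manifest: the language of the publication as a whole.
MANIFEST = '{"name": "Moby-Dick", "inLanguage": "en"}'

# Hypothetical content file: a French preface inside an English publication.
CHAPTER = '<!DOCTYPE html><html lang="fr"><head><title>Préface</title></head><body></body></html>'

publication_lang = json.loads(MANIFEST)["inLanguage"]

root = RootLang()
root.feed(CHAPTER)
chapter_lang = root.lang

print(publication_lang, chapter_lang)
```

The two values legitimately differ, which is why the `lang` attribute of any one HTML file cannot simply double as the infoset item.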

If, for instance, all the primary resources are referenced from an HTML-based "binding document" (perhaps through something like a latent-loading <iframe> or a <nav role="doc-toc"> like thing), then the request and processing needs are already defined and taken care of by the browser and the HTTP ecosystem specs (CORS, CSP, etc). However, if they're in the JSON (as noted in #104), there's an unknown relationship with the things stated there and the rest of the request/response processing constraints, browsing contexts, etc (again; hence #104). So...that's ultimately a "vote" for primary resources to be expressed from within the HTML.

To be honest, you lost me here; more exactly, I do not see the problem. If we say (as we seem to converge to in #104) that we simply take the browsing context as given, I just do not see the issue accessing the separate JSON file in this browsing context. That information is accessed from the entry point (in its own browsing context, as we seem to converge to in #104), then all the rest is clear: that is the context we are operating in. Let alone the fact that many elements in the infoset (title, authors, etc) are unaffected by the browsing context.

Each of the current infoset items are expressible from within an HTML document (see my last comment for a handful of options), and what's needed next is to know how to enhance their expressions as available now such that they are more useful.

See my comment above. I am absolutely not sure it is as simple, more exactly that the resulting definitions would be clearer and simpler than doing it in JSON.

Note that the experience with RDFa is not really good (alas!), meaning that relying on RDFa may not be that helpful. Authoring RDFa can be a major challenge, and is very opaque for non-RDF-savvy persons (it is sometimes difficult even for people like me; I frequently have to run RDFa+HTML through my own distiller to see what the generated RDF is). Microdata is, maybe, even worse, because there are features that cannot even be expressed in microdata...

iherman commented 6 years ago

Trying to move forward: would the usage of a <script> element alleviate the problems? (See also #122). Here is what this would mean:

What this means is that there is not necessarily a separate file to be authored; all is in the same file; would that alleviate your issues, @deborahgu and @RachelComerford ? It would not necessarily help with the issues of @llemeurfr because today's authoring tools rarely help for the authoring of embedded data. On the other hand, the semantics of the <script> element's content is under our control, ie, we would not violate the HTML spec.

The experience shows that authoring JSON for metadata-like information is simpler than doing it in, say, RDFa, so we would gain that.

Note also DanBri's comment: Schema.org also uses this JSON(-LD) wrapper to extract information.

(An even more radical proposal would be to use the embedded <script> element only. I am not sure I would go that far.)
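A sketch of what reading such an embedded block might involve, assuming (hypothetically) that the data sits in a `<script type="application/ld+json">` element of the entry page. The `EmbeddedJson` class and the sample page are illustrative only.

```python
import json
from html.parser import HTMLParser


class EmbeddedJson(HTMLParser):
    """Parse the content of every <script> element of a given type as JSON."""

    def __init__(self, script_type="application/ld+json"):
        super().__init__()
        self.script_type = script_type
        self._buf = None    # text chunks of the script currently being read
        self.payloads = []  # one parsed object per matching <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == self.script_type:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.payloads.append(json.loads("".join(self._buf)))
            self._buf = None


PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
  <title>Moby-Dick</title>
  <script type="application/ld+json">
  {"@context": "https://schema.org", "@type": "Book", "name": "Moby-Dick"}
  </script>
</head>
<body></body>
</html>"""

reader = EmbeddedJson()
reader.feed(PAGE)
print(reader.payloads)
```

Everything is authored in one file, and the browser does not interpret the content of the `<script>` because of its type.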

danielweck commented 6 years ago

Ivan, +1 to the JSON-in-script / JSON-as-file approach (although I suspect reading system developers would prefer a directly-accessible standalone JSON, as this saves parsing an HTML document and performing an additional fetch request).

dauwhe commented 6 years ago

Sorry, but I do not agree. The quoted HTML specification does not refer to a URL, it refers to the document itself, whichever path was used to get there. I believe the HTML standard is pretty clear about it. If we use the HTML headers, we should simply accept that we are willfully overstepping the bounds that the HTML standard defines (but I am not sure the rest of the community would accept it, we may face major objections).

Consider the following document returned from www.example.com/book/

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Moby-Dick</title>
  <meta name="author" content="Herman Melville">
</head>
<body>
  <nav>
    <ol>
      <li><a href="c1.html">One</a></li>
      <li><a href="c2.html">Two</a></li>
    </ol>
  </nav>
  <iframe id="c1" name="c1" src="c1.html"></iframe>
  <iframe id="c2" name="c2" src="c2.html"></iframe>
</body>
</html>

If c1.html does not have a meta name="author" element, who is the author of c1.html? The content of c1.html is literally a node in the document object of the original URL. Would the answer be different if c1.html was included via object, html imports, or a custom element?

mattgarrish commented 6 years ago

There's no requirement that a DOM, CSSOM, Accessibility OM, etc. be setup or available when extracting metadata from HTML files. It's possible to get it directly out of the markup without those things.

This is theoretically correct, but I do not think it is practically true.

How are these steps avoided? Is the idea that user agents will go through the process of obtaining and processing the manifest before the user makes any decision about whether they even want to initiate the reading experience, and stop rendering the document until a decision is made?

In other words, does an external file really save anything in processing time, except perhaps in the (rare?) situation where a user says to always initiate publications and the link is available in an HTTP header?

llemeurfr commented 6 years ago

@iherman about

Note that the experience with RDFa is not really good (alas!), meaning that relying on RDFa may not be that helpful. Authoring RDFa can be a major challenge, and is very opaque for non-RDF-savvy persons (it is sometimes difficult even for people like me; I frequently have to run RDFa+HTML through my own distiller to see what the generated RDF is). Microdata is, maybe, even worse, because there are features that cannot even be expressed in microdata...

I totally agree with that statement. At allocine.com, we embedded RDFa, then microdata (preferred), in our film / star etc. pages. But it was the work of the technical team, in page templates: certainly not the work of the editorial team. And I'm pretty sure that this is how 99.9% of websites containing RDFa or microdata are constructed.

mattgarrish commented 6 years ago

If c1.html does not have a meta name="author" element, who is the author of c1.html?

It might be simpler to use something like dcterms/schema.org isPartOf/hasPart to associate the fragments than duplicate metadata, but I don't follow the argument that a multi-part document cannot be wholly identified by the first of its resources.
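As a sketch of the hasPart idea: the publication states its metadata once and lists its parts, and a consumer looks up a part's author through that relation. The data and the `author_of` helper are hypothetical, and treating the publication's author as the author of each part is an assumption of this example, not something schema.org guarantees.

```python
import json

# Hypothetical description of the whole publication, with its parts listed once.
PUBLICATION = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Book",
  "@id": "https://www.example.com/book/",
  "name": "Moby-Dick",
  "author": "Herman Melville",
  "hasPart": [
    {"@id": "https://www.example.com/book/c1.html"},
    {"@id": "https://www.example.com/book/c2.html"}
  ]
}
""")


def author_of(resource_url, publication):
    """Return the publication's author for the publication itself or any listed part."""
    parts = {part["@id"] for part in publication.get("hasPart", [])}
    if resource_url == publication["@id"] or resource_url in parts:
        # Assumption: parts inherit the author of the whole.
        return publication.get("author")
    return None


c1_author = author_of("https://www.example.com/book/c1.html", PUBLICATION)
other_author = author_of("https://www.example.com/elsewhere.html", PUBLICATION)
print(c1_author, other_author)
```

Nothing is duplicated into c1.html or c2.html; the association is carried entirely by the description of the whole.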

iherman commented 6 years ago

@dauwhe (referring to https://github.com/w3c/wpub/issues/193#issuecomment-388031101) great questions...

I am not sure, and I do not think the HTML spec clearly says anything about this case. However, looking at the HTML spec, a document within an iframe has its own Document element (and own context), so my gut feeling is that, in your example, the author of the iframe-d content would be unknown. It is probably the same with object. The import case is even less clear, the current draft does not really say anything about Document elements or contexts (is that work still alive, b.t.w.?).

RachelComerford commented 6 years ago

@iherman... to be honest, I don't understand the solution?

Trying to move forward: would the usage of a
