w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

Only allow embedded manifests #327

Closed BigBlueHat closed 5 years ago

BigBlueHat commented 6 years ago

Many of the recent issues and discussions have been around the interplay between the primary entry page and the potentially external manifest. Much of the lifecycle and canonicalization content exists primarily to consolidate the various scenarios possible when externalizing the manifest.

Proposal: require that the manifest MUST be included in the primary entry page which is itself returned in response to a request for the publication's canonical address.

The result is that when you get a publication address, you get (in a single response) the means to understand and use the publication.

HadrienGardeur commented 6 years ago

Sorry but this makes very little sense to me.

Having to deal with HTML instead of JSON directly adds an overhead that you don't always want/need. That's the case for instance with the audiobook example that I've created and I also believe that in the context of EPUB4, there's no good reason to force an "entry page".

The lifecycle is better off when we simplify it and I personally hope that we'll end up in a state where we need to fetch/parse as few additional HTML documents as possible.

iherman commented 6 years ago

I do not have strong feelings either way at this point. However, in all fairness, I do not believe

Much of the lifecycle and canonicalization content exists primarily to consolidate the various scenarios possible when externalizing the manifest.

is correct. The embedding or not does create some discrepancies, but, e.g., the canonicalization's main goal is to make the life of authors easier by not enforcing a strict set of structures in the JSON-LD part. This is true regardless of embedding or not.

I can see that for many publication types the embedded version makes more sense; this is obviously the case for single page WP-s, and even for WP-s with very traditional content on the Web, ie, a collection of HTML content pages. On the other hand, I agree that in other cases, eg, for audio books but maybe for mangas, too, a separate manifest file makes more sense. I do not think that, at this point, we can make this type of decision.

There may be best practices issues, though. I would certainly not expect separate manifest files to be used for scholarly articles, for example.

BigBlueHat commented 6 years ago

Schema.org processors (Google, Bing, etc) only consume embedded JSON-LD data blocks, so (as mentioned earlier) "If we want any SEO value from the contents of the manifest, it MUST be embedded in the entry page."

Also...

Having to deal with HTML instead of JSON directly adds an overhead that you don't always want/need.

That overhead is par for the course if discovery of the manifest URL happens via a <link> tag in the HTML returned from the publication address.

Embedding it in the publication address's response HTML has the added bonus of avoiding a secondary fetch and potential failure scenarios around that.

GarthConboy commented 6 years ago

I think there will need to be some tuning of the concept of "Primary Entry Page" (and ability to embed the manifest) in the Audio Book profile of (P)WP. All resources may want to be Audio. Though, this take would seem to only impact the packaged version.

BigBlueHat commented 6 years ago

This also solves for #321 fwiw. The origin and browsing context (see #104) of the publication would be created by the publication's address following the process described in https://www.w3.org/TR/html/browsers.html#ref-for-completely-loaded%E2%91%A4:

  1. Set the origin of document:
    • If the new browsing context has a creator browsing context, then the origin of document is the creator origin.
    • Otherwise, the origin of document is a unique opaque origin assigned when the new browsing context is created.

Putting the manifest within the document returned from the publication address removes the vagueness around which thing creates the origin: is it the publication address or the manifest?

If we pick "publication address" which returns the "binding" for the publication in a single request, then we have (thanks to the Open Web Platform) full definitions (created by others) around the creation and provision of origin and browsing context and opening up the use of CORS, CSP, ServiceWorkers, etc.

If it's the manifest (and not the address), we are charting brand new territory around security, authority, ownership, and distributed content.

iherman commented 6 years ago

@BigBlueHat

If it's the manifest (and not the address), we are charting brand new territory around security, authority, ownership, and distributed content.

I think this needs some clarification. One can access/fetch resources on the Web that may access other resources. Fetching an SVG file as a standalone resource comes to my mind. Why would a standalone JSON-LD file be different?

dauwhe commented 6 years ago

I think this needs some clarification. One can access/fetch resources on the Web that may access other resources. Fetching an SVG file as a standalone resource comes to my mind. Why would a standalone JSON-LD file be different?

We agree that a WP URL must resolve to an HTML page. I see embedding the manifest there as a simplification. We avoid issues of the manifest being on a different origin than the entry page. We avoid issues of being unable to fetch the manifest from the entry page. We provide a natural context for the publication.
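
To make the "different origin" concern concrete, here is a minimal Python sketch of a same-origin comparison between an entry page and a linked manifest. It is a simplification of the HTML origin definition (scheme, host, port only), and the `example.org` URLs are hypothetical, not taken from the spec.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes this sketch cares about (an assumption;
# other schemes simply get a port of None).
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return a simplified (scheme, host, port) origin tuple for a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def same_origin(a, b):
    """True when both URLs share scheme, host, and (defaulted) port."""
    return origin(a) == origin(b)


# Hypothetical addresses for illustration only.
ENTRY_PAGE_URL = "https://example.org/moby-dick/"
SAME_HOST_MANIFEST = "https://example.org:443/moby-dick/wpub.json"
OTHER_HOST_MANIFEST = "https://cdn.example.org/moby-dick/wpub.json"
```

With an embedded manifest this check never arises, because the manifest shares the entry page's URL; with an external manifest a user agent would have to decide what to do when `same_origin` is false.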

iherman commented 6 years ago

@dauwhe I understand that an embedded manifest simplifies a number of considerations. However, what I was asking is why a separate manifest file would mean “charting brand new territory around security, authority, ownership, and distributed content.”

BigBlueHat commented 6 years ago

Unlike SVG and HTML, JSON(-LD) has no defined rendering or scripting semantics or affordances in today's browsers. Consequently, we chose HTML as the required format to be returned from a publication address--because it has known/defined security, authority, ownership, and even distributed content (see CORS, CSP, etc) affordances and (deliberate) limitations.

JSON(-LD) comes with none of these affordances, does not display in a browser, and therefore we chose for it not to be the immediate thing returned when requesting a publication address.

However, if an external manifest is seen as the "brains" of the publication, then implementers will ignore the entry page (as seen in many demos already) and lean exclusively on the manifest for their definition of the "publication." That, in turn, leads to the manifest URL becoming the actual publication URL for UAs that ignore the entry page or use it only initially to discover the manifest URL.

If the manifest URL is construable as the publication's actual URL, then the origin, ownership, CSP, CORS, of that URL becomes the more defining one...and not the publication URL (which ultimately is either conceptually overwritten by the manifest URL or is of nearly no value other than identification and redirection).

iherman commented 6 years ago

Thanks for the clarification, @BigBlueHat, I now understand what problems you are referring to. And yes, these are probably real issues to consider.

However... I have the impression you refer to a problem which does not correspond to the current issue. The only thing the current issue asks is whether having a separate JSON-LD document is allowed. It does not address the problem whether the (HTML) primary entry page is (or is not) the entry point to the Web Publication. Your arguments do not invalidate the situation where there is a primary entry page, that page is the resource returned on the HTTP request, but that page includes a link to a manifest rather than embedding one. I can see that in some types of workflows using a separate JSON-LD file may be more convenient; to give a simple example, there are a bunch of good text editors that can edit JSON and would shout at you if you made a syntax error, whereas no HTML editor (to my knowledge) would do that for the embedded case. I.e., for a more complex manifest I may prefer to author it in a separate file.

Considering the issue whether the primary entry page is the only entry point or not, the current draft is indeed a bit vague in this respect. It says:

The primary entry page is a key [HTML] document required of every Web Publication. It represents the preferred starting resource for discovery of the Web Publication and enables discovery of the manifest.

The key term here is "preferred" as opposed to, say, "required".

I would therefore suggest we have two different issues and we should treat them separately:

  1. Can the manifest appear in a separate file, or must it be embedded in the primary entry page (this issue)?
  2. Is the primary entry page the required entry page to the publication, or is it "just" the preferred one?

If, for the sake of arguments, the answer to (2) is that the primary entry page must be the required entry page, would you still be opposed to allow the manifest to be in a separate file?

GarthConboy commented 6 years ago

For packaged audiobooks (which doesn't really need to impact WP proper) we likely won't want an entry page at all, and that will require a standalone manifest file.

iherman commented 6 years ago

@GarthConboy

(which doesn't really need to impact WP proper)

This may be unavoidable, we shall see, but I must admit I do not like the idea that a packaged audiobook (or any publication type derived from WP), when unpacked, becomes an invalid WP.

BigBlueHat commented 5 years ago

For packaged audiobooks (which doesn't really need to impact WP proper) we likely won't want an entry page at all, and that will require a standalone manifest file.

I'd keep the entry page regardless. It means you could still ship an audiobook-focused WP to the Web and potentially use it (or navigate it) in the browser immediately (regardless of "WP support").

HadrienGardeur commented 5 years ago

I'd keep the entry page regardless. It means you could still ship an audiobook-focused WP to the Web and potentially use it (or navigate it) in the browser immediately (regardless of "WP support").

That's only true if the packaging format is supported by browsers as well.

iherman commented 5 years ago

@HadrienGardeur, not necessarily. I, as a reader, may get hold of a packaged wpub (actually, audio or not), may decide to unzip it (or use whatever program is necessary) on my machine and use it via localhost in my browser. This may be handy if I do not have any reader software installed on my machine.

llemeurfr commented 5 years ago

For the sake of consistency, I'm ok to keep an entry page, which can hold a pleasant presentation of the audiopub + the js code which will act as a polyfill in web browsers.

But this discussion has nothing to do with the title of this issue.

llemeurfr commented 5 years ago

To come back to the original issue, I propose that any CSS and JS is only allowed in an HTML page as embedded CSS and script, disallowing .css and .js external files. This will save a large volume of secondary fetch and potential failure scenarios around that. It will also solve the issue of CSS and JS not displayed natively in a browser, and any vagueness about the origin of these things.

BigBlueHat commented 5 years ago

To come back to the original issue, I propose that any CSS and JS is only allowed in an HTML page as embedded CSS and script, disallowing .css and .js external files.

Yeah...that's not the original issue. 😉 Feel free to write up a separate one, though! I'm curious to hear more.

I'm working on a more concrete proposal for the "always embed the manifest in the entry page" approach. The primary aim is to get the "binding" and metadata of a publication back in a single request of the publication address.

So... GET /publication-address/ results in HTML+JSON-LD which defines the "binding" and metadata.

More complete proposal forthcoming.
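
As a rough illustration of what "a single request" would mean for a consumer, here is a minimal Python sketch that pulls the manifest out of the HTML returned from the publication address. It uses only the standard library; the class and function names are hypothetical, and the sample page is a shortened stand-in, not a conforming Web Publication.

```python
import json
from html.parser import HTMLParser


class ManifestExtractor(HTMLParser):
    """Collects the text of the first <script type="application/ld+json..."> block."""

    def __init__(self):
        super().__init__()
        self._in_manifest = False
        self._chunks = []
        self.manifest_text = None

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        is_json_ld = script_type.startswith("application/ld+json")
        if tag == "script" and is_json_ld and self.manifest_text is None:
            self._in_manifest = True

    def handle_data(self, data):
        # Script content arrives here as raw text.
        if self._in_manifest:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_manifest:
            self._in_manifest = False
            self.manifest_text = "".join(self._chunks)


def extract_embedded_manifest(html_text):
    """Return the embedded manifest as a dict, or None if the page has none."""
    parser = ManifestExtractor()
    parser.feed(html_text)
    parser.close()
    if parser.manifest_text is None:
        return None
    return json.loads(parser.manifest_text)


# Shortened sample entry page (illustrative only).
ENTRY_PAGE = """<html>
  <script type="application/ld+json;profile=.../tr/wpub">
  {"title": "demo", "readingOrder": ["chapter5.html"]}
  </script>
</html>"""

manifest = extract_embedded_manifest(ENTRY_PAGE)
```

No second fetch is involved: whatever retrieved the HTML already holds the "binding" once `extract_embedded_manifest` returns.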

HadrienGardeur commented 5 years ago

@HadrienGardeur, not necessarily. I, as a reader, may get hold of a packaged wpub (actually, audio or not), may decide to unzip it (or use whatever program is necessary) on my machine and use it via localhost in my browser. This may be handy if I do not have any reader software installed on my machine.

@iherman I don't expect readers to unpackage a ZIP and run their own HTTP server. Only the geekiest ones would ever do that.

If the package is detected as a ZIP, they might unzip it and they would end up with audio files and a JSON manifest (optionally a cover and a TOC in HTML as well). That's very similar to what users are already getting today when they buy an audiobook and every OS or browser would be capable of reading such audio files without additional software.

Users would be much more likely to simply listen to those audio files directly than open index.html out of all the available files and listen to the audiobook in their browser. You're also implying by the way, that the package would not only contain an entry page, but also Javascript to handle the whole UX. Do you really expect every packaged audiobook to embed its own Web App as well?

(Plus you'd need to have something that works with file://, support for which is very inconsistent across browsers).

BigBlueHat commented 5 years ago

  2. Is the primary entry page the required [entry] page to the publication, or whether it is "just" the preferred one.

If, for the sake of arguments, the answer to (2) is that the primary entry page must be the required entry page, would you still be opposed to allow the manifest to be in a separate file?

@iherman I think the confusion may stem from the term "entry page." This proposal would certainly still allow (and encourage) linking directly to a document that is within the publication (i.e. listed in the reading order). It would also still allow that "child" resource to express that it is part of a specific publication (vs. pointing at a manifest which would then point to the publication's address...which would return markup which would then point back to the manifest...).

The proposal is to always embed the manifest, so that the publication address returns all the information needed to build the publication. Consequently, a related change would be to make rel="publication" point to the "publication address." This would remove the requirement for the entry page to (confusingly) contain a rel="publication" link because it would more clearly be the Web Publication's "binding" because it would always contain the manifest.

Here are three scenarios (2 current and 1 proposed)...

current embedded manifest

GET /moby-dick/

<html>
  <link rel="publication" href="#wpub">
  <script type="application/ld+json;profile=.../tr/wpub" id="wpub">
  {"@context" : ["https://schema.org","https://www.w3.org/ns/wp-context"],
   "title": "demo",
   "readingOrder": ["chapter5.html"]}
  </script>
</html>

GET /moby-dick/chapter5.html

<html>
  <link rel="publication" href="/moby-dick/#wpub">
</html>

current external manifest

GET /moby-dick/

<html>
  <link rel="publication" href="/moby-dick/wpub.json">
</html>

GET /moby-dick/wpub.json

  {"@context" : ["https://schema.org","https://www.w3.org/ns/wp-context"],
   "title": "demo",
   "readingOrder": ["chapter5.html"]}

GET /moby-dick/chapter5.html

<html>
  <link rel="publication" href="/moby-dick/wpub.json">
</html>

proposed always-embedded manifest

GET /moby-dick/

<html>
  <script type="application/ld+json;profile=.../tr/wpub">
  {"@context" : ["https://schema.org","https://www.w3.org/ns/wp-context"],
   "title": "demo",
   "readingOrder": ["chapter5.html"]}
  </script>
</html>

GET /moby-dick/chapter5.html

<html>
  <link rel="publication" href="/moby-dick/">
</html>

Implementations must handle both of the current examples, which leaves the response to a publication address in a strange state where its purpose is unclear, and where the JSON document really becomes the "brain"...but that brain might be in a couple of places and MUST always be hunted for prior to rendering/reading. It also makes the publication address seem useless or its purpose confusing (as it doesn't directly resolve to the publication's binding).
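
To show the two branches an implementation has to cover today, here is a minimal Python sketch of the "hunt" for the manifest from a fetched page. It is not the lifecycle algorithm from the draft; the function name, the return labels ("embedded", "fetch", "none"), and the `example.org` base URL are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PublicationLinkParser(HTMLParser):
    """Records the first rel="publication" link and the text of scripts that have an id."""

    def __init__(self):
        super().__init__()
        self.publication_href = None
        self.scripts_by_id = {}
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").split()
        if tag == "link" and "publication" in rel_values:
            if self.publication_href is None:
                self.publication_href = attrs.get("href")
        elif tag == "script" and attrs.get("id"):
            self._current_id = attrs["id"]
            self.scripts_by_id[self._current_id] = ""

    def handle_data(self, data):
        if self._current_id is not None:
            self.scripts_by_id[self._current_id] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._current_id = None


def locate_manifest(html_text, page_url):
    """Return ("embedded", dict), ("fetch", absolute_url), or ("none", None)."""
    parser = PublicationLinkParser()
    parser.feed(html_text)
    parser.close()
    if parser.publication_href is None:
        return ("none", None)
    absolute = urljoin(page_url, parser.publication_href)
    target, fragment = urldefrag(absolute)
    same_page = target == urldefrag(page_url)[0]
    if same_page and fragment in parser.scripts_by_id:
        # Branch 1: the link points at a script block in this very page.
        return ("embedded", json.loads(parser.scripts_by_id[fragment]))
    # Branch 2: the manifest lives elsewhere and needs another request.
    return ("fetch", absolute)


EMBEDDED_PAGE = """<html>
  <link rel="publication" href="#wpub">
  <script type="application/ld+json;profile=.../tr/wpub" id="wpub">
  {"title": "demo", "readingOrder": ["chapter5.html"]}
  </script>
</html>"""

EXTERNAL_PAGE = """<html>
  <link rel="publication" href="/moby-dick/wpub.json">
</html>"""

BASE = "https://example.org/moby-dick/"  # hypothetical publication address
```

Under the proposed always-embedded approach the "fetch" branch would only ever lead to the publication address itself, never to a standalone JSON file.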

The proposed solution sees the HTML returned from the publication's address as the binding document of the publication. It has the advantage that it can be shipped today and requires minimal "discovery" steps to find a potentially detached "brain."

This "always embedded" approach also plays nicely with our use of Schema.org as it provides the actual SEO value for publications (because SEO bots currently only read metadata out of the HTML). Consequently, all Web Publications would then get the expected SEO value from our use of Schema.org terms.

To summarize, I've answered "Rachel's 5 questions" which I hope makes the proposed benefits and reasoning clearer.

5 Q's

  1. What problem are you trying to solve?

    • simplify the process of discovering the publication (and its manifest)
    • provide immediate SEO value from our use of Schema.org terminology
    • relate all publication assets to the publication address (rather than the manifest URL)
      • i.e. point rel="publication" at the publication address
  2. What solution are you proposing?

    • require the manifest always be embedded in the HTML returned from the publication address (aka "the entry page")
  3. Why do you believe this is the ideal solution?

    • URLs which return an HTML page containing the manifest are more clearly Web Publications
      • vs. finding a <link>, GETing OR extracting the manifest, determining if the current URL is the top-level url in the manifest
    • SEO bots consume JSON-LD in HTML and relate it to the URLs from which it was retrieved
    • JSON-LD metadata and "binding" are immediately available to JavaScript running within the entry page
      • i.e. removes possibly failed request overhead
  4. What alternatives did you consider?

    • the current discovery solution--where the manifest can be in or out of the publication
    • requiring the manifest always be stored as a standalone JSON file
  5. What is your back up plan?

    • leave the situation as is and focus best practice documentation around the embedded case

iherman commented 5 years ago

@bigbluehat,

first of all, your examples in https://github.com/w3c/wpub/issues/327#issuecomment-446744246 are not entirely correct. At this point there is no obligation for a publication to add a reference to the manifest in all resources. I.e.,

GET /moby-dick/chapter5.html

<html>
  <link rel="publication" href="/moby-dick/#wpub">
</html>

is not necessarily true. This fact does reduce the complexity you refer to: a linkage to the manifest must happen at one place only, namely in the Primary Entry Page (PEP).

I think the only clear argument in favor of the original proposal is the reference to SEO, i.e., that no current schema.org processors we know about handle JSON-LD data put into a separate file. And I do find this argument very compelling. I.e., I would be perfectly fine saying that embedding the manifest into the PEP is RECOMMENDED.

But whether it is a MUST: I do not find your arguments more convincing than before. I realize that @llemeurfr's comment was meant to be sarcastic but I could indeed say that, if I create a Web page which heavily relies on complex scripts for processing (say, a gmail page), the "brain" of that Web Application, as you put it, is in that JS code. Nevertheless, no one is saying that the JavaScript code MUST be embedded in the Web page instead of being referenced externally via a <script src> element. There may be pragmatic reasons why the author wants to accept the extra complexity represented by that extra link. I do not see why we would have to be more restrictive than what is already done on the Web.

Also note that the Web App Manifest people decided to go exactly the opposite way: the only way you can assign a WAM to a page is via an external link. If one created some sort of a Web App + Web Publication (which could be a reasonable setup for a complex educational environment, for example) then it is a strange situation for an author to have to use two, completely different mechanisms side-by-side.

My personal opinion on the issues raised in the thread is:

  1. There MUST be a PEP in HTML, i.e., that is where the game begins. The arguments you put forward about CORS, origin, and all that complex conglomerate of issues are very compelling, and relying on the HTML PEP saves us a whole lot of trouble. It would also make it easier to make a publication more "simple browser aware".
  2. It is RECOMMENDED to embed the manifest into the PEP, primarily due to the current restrictions by schema.org processors.

mattgarrish commented 5 years ago

the only way you can assign a WAM to a page is via an external link

No, they offer choice just like we currently do. The only difference is that they chose to allow data: URLs in link elements instead of embedding via script.

It is RECOMMENDED to embed the manifest into the PEP, primarily due to the current restrictions by schema.org processors.

Should we write a specification based on current practicalities? This sounds like a best practice, as what happens if Google and friends start crawling the linked manifests?

What worries me about requiring embedding is the potential effects on authoring. It's much simpler to maintain an external data set, like WAM or the OPF, that is independent of the resources that make up the publication when you have more than just a simple one-page publication. It doesn't require digging into a resource to extract the data, parsing it into a data structure to modify, and then reversing the process plus wiping the old data to put it back. It also seems like it has the potential to make resource sharing more difficult.
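
The authoring round trip described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the regular expression is a fragile stand-in for real HTML parsing, both function names are hypothetical, and the sample page is shortened.

```python
import json
import re

# Naive pattern for the embedded manifest block; real tooling would use an
# HTML parser rather than a regular expression.
MANIFEST_BLOCK = re.compile(
    r'(<script\b[^>]*\btype="application/ld\+json[^"]*"[^>]*>)(.*?)(</script>)',
    re.DOTALL | re.IGNORECASE,
)


def update_embedded_manifest(html_text, changes):
    """Dig the manifest out of the page, modify it, and splice it back in."""
    match = MANIFEST_BLOCK.search(html_text)
    if match is None:
        raise ValueError("no embedded manifest found")
    manifest = json.loads(match.group(2))           # extract and parse
    manifest.update(changes)                        # modify
    new_json = json.dumps(manifest, indent=2)       # serialize again
    # Wipe the old data and put the new data back between the script tags.
    return html_text[: match.start(2)] + "\n" + new_json + "\n" + html_text[match.end(2):]


def update_external_manifest(json_text, changes):
    """The external case: parse, modify, serialize; no surrounding document involved."""
    manifest = json.loads(json_text)
    manifest.update(changes)
    return json.dumps(manifest, indent=2)


PAGE = """<html>
  <script type="application/ld+json;profile=.../tr/wpub">
  {"title": "demo", "readingOrder": ["chapter5.html"]}
  </script>
</html>"""

updated_page = update_embedded_manifest(PAGE, {"title": "Moby-Dick"})
```

The embedded variant has to locate and preserve the surrounding markup on every edit, which is the extra authoring cost being pointed at here.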

If something breaks by having the manifest external, then it's a whole different story, but without a compelling reason like that to limit the location I don't think this is something for the spec to enforce.

iherman commented 5 years ago

No, they offer choice just like we currently do. The only difference is that they chose to allow data: URLs in link elements instead of embedding via script.

Hm. I have not seen this. The only trace I found is:

Developers need to be aware of the security considerations [...] in relation to making data: a valid source for the purpose of "inlining" a manifest. Doing so can enable XSS attacks by allowing a manifest to be included directly in the document itself; this is best avoided completely.

This is certainly not the same as what we are proposing.

iherman commented 5 years ago

If something breaks by having the manifest external, then it's a whole different story, but without a compelling reason like that to limit the location I don't think this is something for the spec to enforce.

👍 to that.

mattgarrish commented 5 years ago

Hm. I have not seen this. The only trace I found is:

Right, I don't think it's fully endorsed, but they haven't ruled it out, either. I recall we went over this earlier on when we were looking at this issue wrt WAM. This comment is a couple of years old, but confirms it was still allowed then: https://github.com/w3c/manifest/issues/534#issuecomment-265154234

Further back, they were taking a wait and see approach: https://github.com/w3c/manifest/issues/91

I believe that's why the comment you found remains.

llemeurfr commented 5 years ago

Sorry to have been sarcastic with my previous comment. It was indeed meant to point out that JS and CSS can be either embedded or referenced, and the Web works very well, thanks; I see no reason why we could not do the same with our JSON manifest.

One thing @BigBlueHat is pointing at and indeed is unbalanced in the current spec is that the PEP is "where the game begins", as @iherman states; but documents included in a publication (optionally) point to the json manifest. The json manifest may be embedded in the PEP, therefore the publication link has a dual form (to json or to html).

If the final consensus in the WG is that the PEP is mandatory (let's remember that the Audiopub TF does not 100% agree with this), then it would be much clearer to have publication links point to the PEP, always.

The approach of a UA discovering a document with one or more publication links would then be to show a snippet of the PEP (maybe using the manifest to extract metadata from it) and ask the user if they want to browse one of the publications listed here.

iherman commented 5 years ago

If the final consensus in the WG is that the PEP is mandatory (let's remember that the Audiopub TF does not 100% agree with this), then it would be much clearer to have publication links point to the PEP, always.

You are right that the @rel value "publication" may be ambiguous. We may have to have two @rel values: "publication", used to link to the PEP and, say, "wpm" (or something more verbose like "web-publication-manifest", or something between the two) used to link to the manifest itself.

HadrienGardeur commented 5 years ago

I think the only clear argument in favor of the original proposal is the reference to SEO, i.e., that no current schema.org processors we know about handle JSON-LD data put into a separate file

It's also worth pointing out that the same SEO benefits can be achieved with an external manifest: all you need to do is add JS code that will inject the manifest in your page. This is fully supported by Google for example.

In many cases, the "marketing" page and the content itself live on two completely different systems (with different domains or sub-domains as well) and it's much easier to link to a separate manifest rather than embed it.

For example, at Feedbooks the primary use case for WP would be in allowing users to read or listen to samples. Our "marketing" page would be our product details page, which is handled by our main Rails app. Anything content related (covers, EPUB or resources in a publication) is handled by a different app written in Go and hosted on another domain. That separation between content and app serving the marketing/product page is fairly common, I've seen something similar many times.

BigBlueHat commented 5 years ago

At this point there is no obligation for a publication to add a reference to the manifest in all resources. I.e.,

GET /moby-dick/chapter5.html

<html>
  <link rel="publication" href="/moby-dick/#wpub">
</html>

is not necessarily true. This fact does reduce the complexity you refer to: a linkage to the manifest must happen at one place only, namely in the Primary Entry Page (PEP).

@iherman, I realize there's no requirement for "chapter5" to point back to the publication it's inside of. However, when one does, it would make more sense (architecturally) to point to the publication's address rather than its "configuration file"--i.e. the manifest. Combining the two into a single unit would seem to alleviate our frequent confusions around the role of the PEP.

BigBlueHat commented 5 years ago

all you need to do is add JS code that will inject the manifest in your page.

This would effectively introduce the manifest into the Web Publication a second time, though, wouldn't it? Making discovery a bit of a cat and mouse game.

In many cases, the "marketing" page and the content itself live on two completely different systems (with different domains or sub-domains as well) and it's much easier to link to a separate manifest rather than embed it.

Not sure I follow the use case here. Are you thinking the marketing page would link directly to the manifest vs. the publication address?

The separation of marketing from content-serving apps wouldn't affect the embedding of the manifest--given that (as currently defined) one must access the publication address, do discovery, and then use the manifest.

iherman commented 5 years ago

This issue was discussed in a meeting.

iherman commented 5 years ago

@mattgarrish will you take this? Or should I?

iherman commented 5 years ago

@mattgarrish scrap my previous question; it is the PR #412.

mattgarrish commented 5 years ago

Okay, just getting back up to speed so always happy to see things done... :)