w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

We need a section of the document that explicitly defines the bounds of a publication #205

Closed lrosenthol closed 6 years ago

lrosenthol commented 6 years ago

We talk about the bounds of the publication but we never explicitly define what it means, where it comes from and what a UA is to do with it (and in what specific use cases).

wareid commented 6 years ago

Speaking as a reading system, we would expect the bounds of a document to be everything an author/publisher considers the document: all of the required content (text, images, video, etc.). This can include external links to other resources (e.g., a textbook referring to a website), but those links are not essential to the document and therefore would not be listed in a file list of any kind. Should the external links fail in any way (offline, out of date links, armageddon, etc.), the core document is not diminished. I should add that I think publishers will really self-define the bounds of their documents (they're used to it); we should just specify that if they do not include something in the infoset, they should know that they are defining it as unnecessary to the core document.

iherman commented 6 years ago

I think the tension is between

  1. giving reading systems enough information if they do offlining/caching/packaging, and
  2. reducing the amount of redundancy for authors, i.e., not forcing the author to list references that they already list as part of the content itself (e.g., CSS files)

Note that, elsewhere in the current draft, we've already made some steps along (2), namely for the title of the document (that can be extracted from the <title> of the primary entry page) or the Table of Contents.

(2) may suggest defining some sort of automatism based on, say, the content files themselves; we discussed (in Toronto) that CSS, JavaScript, and image files might be good candidates for such automatism: i.e., the User Agent would automatically include these resources in, essentially, the 'resources' infoset item without the need to list them in the corresponding manifest item. However, one of the main problems with this is its indiscriminate nature: it would include, e.g., CSS files or JS libraries that are not even under the author's control.

I can see three different approaches that we could follow, in an increasing level of complexity.

All CSS, Font, Javascript, JPG, PNG, GIF, and SVG(Z) additional resources, that are referred to from a resource listed in the Default Reading Order, are automatically considered to be part of the "list of resources" infoset item if:

  1. (simplest version) the URL of the additional resource is a relative URL with the URL of the primary entry page as its 'base'

  2. (moderate complexity) the URL of the additional resource is relative to a URL listed in a (new) scope manifest item, which may itself be a relative URL with the URL of the primary entry page as its 'base'. A missing scope falls back to version (1) above.

  3. (full complexity) the URL of the additional resource matches a URI Template (RFC 6570) defined in the manifest via a separate template manifest item. A missing template falls back to version (1) above. (See #67, a very old issue about templates.)

(Ie, (1) is always available, (2) or (3) means some additional possibilities for the authors.)
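A minimal sketch of how versions (1) and (2) might be checked, under stated assumptions: "relative to the primary entry page" is read here as "resolves to a URL under the entry page's directory", scopes are treated as simple URL prefixes, and the function name `in_bounds` and the example.org URLs are purely illustrative (none of this comes from the draft).

```python
from urllib.parse import urljoin


def in_bounds(resource_url, entry_page_url, scopes=None):
    """Return True if a referenced resource falls within the publication.

    Assumption: a scope is a URL prefix, itself resolved against the
    primary entry page. Scopes should end in '/' so that '/book' does
    not accidentally match '/bookstore'.
    """
    if scopes:
        # Version (2): explicit scopes listed in the manifest.
        prefixes = [urljoin(entry_page_url, scope) for scope in scopes]
    else:
        # Version (1): fall back to the directory of the primary entry page.
        prefixes = [urljoin(entry_page_url, ".")]
    absolute = urljoin(entry_page_url, resource_url)
    return any(absolute.startswith(prefix) for prefix in prefixes)


entry = "https://example.org/book/index.html"
print(in_bounds("css/style.css", entry))                   # local file
print(in_bounds("https://cdn.example.com/lib.js", entry))  # third-party library
print(in_bounds("../common/base.css", entry, scopes=["../common/"]))
```

In this sketch a declared scope replaces the default rather than extending it, which is only one way to read "a missing scope falls back to version (1)".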

For example, in the "Single Document Example" there is no need to list the local CSS and JS files in the list of resources, but some standard reference (in that case) to "central" CSS file or logos should be listed explicitly for the purpose of, say, packaging.

Personally, I'm mildly in favour of (2) above, but not sure it is worth the trouble. As I mentioned in #67, (3) may be for a future release...

bduga commented 6 years ago

@iherman For case 1, does that mean an author would have to provide base in every html file other than the entry page? What about references from non-html files, like css?

iherman commented 6 years ago

@bduga

does that mean an author would have to provide base in every html file other than the entry page?

Not sure, to be honest. That would be a solution, but it is obviously a drag for the author. That may be a good argument for the second approach, actually: by adding a list of URLs as scope(s) the author of the metadata can have a finer control and avoid the side-effects of having a separate base statement in each content file.

What about references from non-html files, like css?

You mean, for example, a css file imported by another css file? Apart from being a pain for the User Agent, I guess that can be covered, too: the imported css file has, eventually, its own URL that can be compared against the manifest for scope or the URL of the primary entry page... But you are right that this should have been mentioned in the proposal.
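To illustrate the point about nested references, here is a rough sketch of how a User Agent might collect the URLs referenced from a CSS file and resolve them against that file's own URL, so they can then be compared against a scope or the primary entry page. The regular expression is a deliberate simplification (it is not a CSS parser), and `css_references` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

# Simplified pattern: matches @import "x.css" and url(...) references only.
_CSS_REF = re.compile(
    r"""@import\s+["']([^"']+)["']|url\(\s*["']?([^"')]+?)["']?\s*\)"""
)


def css_references(css_text, css_url):
    """Return absolute URLs referenced from a stylesheet.

    Each reference is resolved against the URL of the stylesheet itself,
    not against the HTML page that loaded it.
    """
    refs = []
    for match in _CSS_REF.finditer(css_text):
        ref = match.group(1) or match.group(2)
        refs.append(urljoin(css_url, ref))
    return refs


css = '@import "base.css";\nbody { background: url(../img/bg.png); }'
for url in css_references(css, "https://example.org/book/css/main.css"):
    print(url)
```

Each resolved URL can then be tested the same way as a reference found directly in an HTML file.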

BigBlueHat commented 6 years ago

We talk about the bounds of the publication but we never explicitly define what it means, where it comes from and what a UA is to do with it (and in what specific use cases).

Still not sure anything above this comment answers @lrosenthol's questions.

"bounds of the publication"

We keep getting caught up on the "where does it come from?" rather than the more foundational "meaning" and "usage" questions. Let's zoom back out.

TzviyaSiegman commented 6 years ago

I think it would benefit us to look at prior art for the way offlining etc is done on the Web today. Even if we do not use tools like Service Workers, we should look at how SW approaches this. Service Workers defines a scope to focus what should be offlined, which is similar to defining the bounds of the WP.

https://w3c.github.io/ServiceWorker/#dfn-service-worker-registration

A service worker registration of an identical scope url when one already exists in the user agent causes the existing service worker registration to be replaced.

iherman commented 6 years ago

@BigBlueHat, also related to what @TzviyaSiegman just said: the question, for me, is very pragmatic. When I create, say, the manifest for the W3C document that is in the example, should I add to the list of resources all the CSS files that are used by the document or not? And this is obviously related to packaging and/or offlining/caching/whatever. (For EPUB 3, the answer is 'yes', I should do it.)

What is the UA supposed to do with it (and in what specific use cases)?

I do not really see what the issue with this is: offlining/caching/packaging is a pretty clear example of what the UA is supposed to do with it. Of course, if a particular UA does not do any of these, it can probably ignore the "bounds". Because the author would not have to do too much, that is not really an issue for her. But if the UA does those things, e.g., if the WP is packaged into an EPUB4, then we have to define very clearly what is within the bounds, which is the way I interpret the original question of @lrosenthol.

TzviyaSiegman commented 6 years ago

@iherman In the world of SW, no I do not add all CSS files, etc. I list the URL of the document to be cached. If this works for Service Workers, it should work for whatever offlining we are using.

iherman commented 6 years ago

@TzviyaSiegman I am not sure I understand... a script using service workers should know which CSS files to offline, shouldn't it? I don't think that is done automatically by service workers...

iherman commented 6 years ago

@TzviyaSiegman look at the service worker script of https://hpbn.co. It lists all the assets, including svg images or css files, explicitly...

TzviyaSiegman commented 6 years ago

thanks @iherman - someone gave me incorrect information.

dauwhe commented 6 years ago

I am not sure I understand... a script using service workers should know which CSS files to offline, shouldn't it? I don't think that is done automatically by service workers...

Much depends on how the service worker script was written. Most examples I've seen explicitly list the resources to be cached, including CSS files and fonts. It's also possible to write a service worker that would automatically cache resources associated with a particular HTML resource, as @BigBlueHat has done. Service workers do nothing automatically—they are just a low-level tool that developers can use to create a caching strategy.

iherman commented 6 years ago

@dauwhe

It's also possible to write a service worker that would automatically cache resources associated with a particular HTML resource

Indeed, that is what a User Agent would do. However, unless we precisely define what resources should be associated with a particular resource for the purpose of the UA, we may have problems with interoperability. Hence my proposal to define more precisely what should be part of the association (clever UA-s may decide to go beyond that, but that is a different issue).

css-meeting-bot commented 6 years ago

The Working Group just discussed Github issue 205.

The full IRC log of that discussion:
<wolfgang> Topic: Github issue 205
<tzviya> github topic https://github.com/w3c/wpub/issues/205
<wolfgang> tzviya: how to offline a publication
<ivan> github topic: https://github.com/w3c/wpub/issues/205
<wolfgang> tzviya: how to define the bounds of a publication?
<wolfgang> ... how would offlining work?
<Hadrien> q+
<timCole> q+
<tzviya> ack garth
<tzviya> ack Hadrien
<duga> q+
<dkaplan3> q+
<tzviya> ack timCole
<wolfgang> hadrien: packaging or caching? 10 different ways of caching? we will never be able to say how it works
<wolfgang> tim: one of the challenges raised - publishers should be able to caching a whole wp - what do you want to cache (ref to parts)
<tzviya> ack duga
<Hadrien> +1 to what Tim said
<tzviya> +1 to brady
<ivan> +1 to brady
<ivan> q+
<wolfgang> brady: taking a complete wp offline - not equal to caching
<tzviya> ack dkaplan
<wolfgang> dkaplan: caching or packaging is part of implementation while offlining is an affordance
<tzviya> ack ivan
<wolfgang> ... premature technical issue (caching or packaging)
<wolfgang> ivan: whatever we do, we need to consider the bounds if we want to offline/cache/package - at the moment it's very vague - we have readingOrder and resources, but what about images, etc.
<dkaplan3> q+
<wolfgang> ... how can author get a level of control what is inside the wp when offlined/cached, etc.
<tzviya> ack dkaplan
<George> 6q+
<tzviya> q+ George
<tzviya> q+
<tzviya> ack George
<wolfgang> dkaplan: my concern is that every time this topic came up, the difference between caching vs. packing came up - we are talking about the affordance of offlining
<wolfgang> george: wp should have a mechanism to create an epub 4 for this wp - to offline a pub when a student goes home at night, not the same as a product for sale
<tzviya> ack tzv
<ivan> regrets+ matt
<wolfgang> tzviya: we need to know the bounds of the publication - hard to make this assessment - focus on the technical issues

BigBlueHat commented 6 years ago

There are two things going on in most ServiceWorkers: catching fetches (dictated by origin and scope) and populating a cache/storage from which to (potentially) return values.

Populating the cache/storage is what we've primarily been discussing.

It can be done exhaustively by populating the cache from a predefined list.

Or it can be done progressively by catching the fetches as they happen and populating the cache from there: https://github.com/dauwhe/html-first/blob/gh-pages/sw.js#L21-L23

In either scenario, the question is about cache/storage populating and how much the UA needs to know when in order to properly populate that storage for the right scenarios.

Consequently, we'll benefit most from defining explicit scenarios (i.e. "reader wants the whole publication" or "reader wants chapter 4" or "reader wants video 1 on page 3") and then building what's needed for each/all of those.
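The two population strategies mentioned above can be contrasted in a small simulation. This is not service worker code: it is a plain Python stand-in where `fetch` is any callable that returns a response body, and `SimulatedCache` is a hypothetical name used only for illustration.

```python
class SimulatedCache:
    """Toy model of cache population; not an actual Service Worker API."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: url -> response body
        self.store = {}

    def precache(self, urls):
        """Exhaustive strategy: populate from a predefined list up front."""
        for url in urls:
            self.store[url] = self._fetch(url)

    def fetch_and_cache(self, url):
        """Progressive strategy: populate as requests actually happen."""
        if url not in self.store:
            self.store[url] = self._fetch(url)
        return self.store[url]


def fake_fetch(url):
    # Stand-in for a network request.
    return "body of " + url


cache = SimulatedCache(fake_fetch)
cache.precache(["index.html", "style.css"])  # "reader wants the whole publication"
cache.fetch_and_cache("chapter4.html")       # "reader wants chapter 4"
print(sorted(cache.store))
```

With the exhaustive strategy the UA has to know the full list in advance; with the progressive one it only ever holds what the reader has already visited.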

iherman commented 6 years ago

@BigBlueHat what you describe about how user agents can do what they do in terms of caching or anything similar is perfectly fine. Describing the various scenarios is important (and is a partial answer to the original question of @lrosenthol) and should be done alongside the affordances' section.

However, at this moment we simply do not say which resources we are talking about. The only thing I am interested in here is to define, in an interoperable way, which resources come into the picture in the first place.

More exactly: we do have the list of resources and the resources in the default reading order. The question that we MUST answer is: are these to be considered as an exhaustive list for caching (or whatever similar operations, I do not care of the details right now) or, for example, search (eg, search into SVG files)?

  1. Answering 'yes' is a consistent answer, and this is the equivalent of EPUB. It is a pain for authors but one might say that tools can generate those lists, so it is not such a big deal. On the other hand, it makes life of UA-s very easy.
  2. Answering 'no' (which is, essentially, the case today because this is left open) leads to the question of 'What else then?'.

If the author wants to provide a WP that is prepared for various scenarios in an interoperable manner, then she must know the answer to these questions. This is not the case today.

All I did in https://github.com/w3c/wpub/issues/205#issuecomment-401766036 was to propose some possible answers to these questions: the WP would consist, in terms of offlining/caching/packaging/whatever, but also in terms of search and other possible features, of the resources in the reading order, the extra resource list, plus whatever is in https://github.com/w3c/wpub/issues/205#issuecomment-401766036 (modulo some comments of @bduga in https://github.com/w3c/wpub/issues/205#issuecomment-401806313). It strikes me as providing a balance between the author's ease of producing a WP and providing an exhaustive set of information.

(An oft quoted fact: what WP brings to the table, as a concept, is the fact of talking about a collection of resources as one conceptual unit. As a minimum we should be clear what this collection consists of...)

HadrienGardeur commented 6 years ago

More exactly: we do have the list of resources and the resources in the default reading order. The question that we MUST answer is: are these to be considered as an exhaustive list for caching (or whatever similar operations, I do not care of the details right now) or, for example, search (eg, search into SVG files)?

This shouldn't be affordance specific (caching), but yes I believe that these two lists taken together are the only real bounds for the publication.

We can discover additional resources through other means, but we can't know the intent of the author for them.
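Read literally, "these two lists taken together" could be computed as below. This is only a sketch: the member names `readingOrder` and `resources` are the ones used in this thread, but the assumption that entries are either plain URL strings or objects with a `url` key, and the helper name `publication_bounds`, are illustrative guesses rather than anything the draft mandates.

```python
from urllib.parse import urljoin


def publication_bounds(manifest, manifest_url):
    """Return the set of absolute URLs the manifest declares.

    Assumption: each entry is a URL string or a dict with a 'url' key,
    and relative URLs are resolved against the manifest's own URL.
    """
    urls = set()
    for member in ("readingOrder", "resources"):
        for entry in manifest.get(member, []):
            href = entry["url"] if isinstance(entry, dict) else entry
            urls.add(urljoin(manifest_url, href))
    return urls


manifest = {
    "readingOrder": ["index.html", {"url": "chapter2.html"}],
    "resources": ["css/style.css"],
}
for url in sorted(publication_bounds(manifest, "https://example.org/book/manifest.json")):
    print(url)
```

Anything a UA discovers beyond this set (for instance by crawling references) would, under this reading, sit outside the declared bounds.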

For caching specifically:

I think that this is the most that we can do. Anything more than a "network then cache" policy could interfere with the expected behaviour of the publication and we can't require all UAs to prerender all resources in the background (this is very CPU/RAM intensive and should be decided based on the device being used).
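One possible reading of a "network then cache" policy, sketched in plain Python rather than as a real service worker: try the network first, remember what came back, and fall back to the stored copy only when the network fails. Whether this is exactly the policy intended here is an assumption; `network_then_cache`, `fetch`, and the dict-based cache are illustrative names.

```python
def network_then_cache(url, fetch, cache):
    """Network-first lookup with a cache fallback.

    `fetch` is any callable url -> body that raises OSError (for example
    ConnectionError) when the network is unavailable; `cache` is a dict.
    """
    try:
        body = fetch(url)
    except OSError:
        if url in cache:
            return cache[url]  # offline, but we have seen this before
        raise                  # offline and never cached: nothing to serve
    cache[url] = body          # online: refresh the stored copy
    return body


def online(url):
    return "fresh copy of " + url


def offline(url):
    raise ConnectionError("no network")


cache = {}
print(network_then_cache("index.html", online, cache))
print(network_then_cache("index.html", offline, cache))
```

Because the network response always wins when available, this kind of policy leaves the server's own freshness decisions largely intact, at the cost of only covering resources the reader has already loaded once.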

Packaging is a separate issue, but if we adopt ZIP for EPUB4 this will require quite a lot of processing on the UA's part in order to rewrite references to various resources in the reading order and list of resources. We might want to keep packaging on the side until Web Packaging is ready for primetime.

iherman commented 6 years ago

@HadrienGardeur just to be very specific and see if I understand your intention.

If I create a WP out of (say) our own W3C WP draft (something that a recently added JavaScript extension to respec already does), and my intention is to be able to read the draft on the plane (via some suitable WP extension in my browser or some other additional service), am I then supposed to list all the CSS and image files that are in the "common" subdirectory of our specification in the "resources" array in the manifest?

I realize this is doable, but it is nevertheless a drag for the author. I also realize this is how EPUB3 has been defined. My intention is, however, to make life easier for those who author such a WP for a, I believe, fairly frequent use case.

HadrienGardeur commented 6 years ago

@iherman not necessarily.

If the UA implements all the things that I've listed above for caching, it'll work offline entirely even if you don't include all CSS/JS/images/fonts in the manifest. Search and other affordances though might be limited strictly to what's in the manifest.

iherman commented 6 years ago

@HadrienGardeur

If the UA implements all the things that I've listed above for caching, it'll work offline entirely even if you don't include all CSS/JS/images/fonts in the manifest.

You mean "it will work" because it can be displayed without any styling, or it will work with all the styling because the UA gathers all the CSS and image files? I presume, if the latter, this is not something done by some magic (ie, by some system code) but because the UA has this all encoded. Am I missing something?

If the UA does the gathering, it must have some of its own, ad-hoc policies, which may become an issue in interoperability. How does it know which files, referred to from a top level content, should be cached and which one should not?

HadrienGardeur commented 6 years ago

@iherman it will work with styling as well because the UA:

For the policy, that's why I recommended using a simple "network then cache" policy for the Service Worker, to avoid as much as possible interfering with the HTTP headers in each resource's response.

iherman commented 6 years ago

@HadrienGardeur if this indeed works that smoothly, I am fine with this and we can drop (at least this part of) the issue, though some form of an (informal) description of this may be useful in the draft.

However, looking ahead to EPUB4 which is, in my view, simply a packaged version of WP, this means that the tool for the packaging itself will have to do/simulate those policies (or the author has to be much more explicit if packaging is also a goal). Again, this can obviously be done, but it may make the tool more complex. (I am thinking in terms of something like ZIP, not in terms of Web Packaging, which may still be way down the line...)

HadrienGardeur commented 6 years ago

@iherman well it's not always simple...

We can't expect the UA to always prerender everything in the background by default for instance, this is too CPU/RAM expensive.

Based on the device and the browser, we'll also have various limitations for the size of our cache. It's very likely that large resources (audio/video) won't be cached for that reason. There's also a risk that a browser could purge its cache after a period of time.

On the packaging side of things, it would be a much better option for "long term" storage of publications that you'd like to read offline. That said, I don't think it's doable with ZIP, or at least it will be very complicated.

Packaging is IMO only viable once Web Packaging is available.

There's a good reason why I'm always dividing things into caching/packaging, they impact the user experience considerably, it's not just a technical issue.

iherman commented 6 years ago

@HadrienGardeur, o.k. But what should then be, in your view, the answer to the original question (in this respect) of the issue? And what should be a reasonable strategy for an author when creating a manifest with the list of resources? This is still blurry to me...

As for ZIP: I may be wrong, but I think that the community will vote for ZIP for EPUB4. I am not sure Web Packaging will be mature enough with enough tools around to base PWP on it (at least in the lifetime of this Working Group). Alas!, I would add.

HadrienGardeur commented 6 years ago

@iherman if you want to be sure that your resources will be cached, you must include them in the list of resources.

If not, there's always a risk that they won't be available offline.

It's also important to point out that large resources may not be cached anyway, no matter if you include them in the list of resources or not.

For packaging in a ZIP, we would either need to:

Both of these solutions will require a lot of work and I'm not sure they would be possible on all platforms.

I really think that we can't reasonably expect to be able to package a WP without Web Packaging being widely available in browsers.

GarthConboy commented 6 years ago

I have been leaning more toward Zip for EPUB4. I don't find the URL issues above to be insurmountable. But, that's a bridge we can burn somewhat later.

HadrienGardeur commented 6 years ago

@GarthConboy I'm leaning towards ZIP for EPUB4 as well, but I still think that this impacts our ability to create a PWP from a WP.

Web Packaging is really designed for this specific use case, since it truly extends how HTTP normally works in this context.

For the two options with ZIP that I listed above:

dauwhe commented 6 years ago

We talk about the bounds of the publication but we never explicitly define what it means, where it comes from and what a UA is to do with it

Emphasis mine.

What happens if a user clicks a link in a web publication that points outside the web publication? Presumably that means leaving the publication mode (see also #276). Is "publication mode" a top-level browsing context? Do we need to define things the way that WAM defines navigating beyond the scope of a web app?

If the URL of the resource being loaded in the navigation is not within scope of the navigation scope of the application context's manifest, then the user agent MUST behave as if the application context is not allowed to navigate. This provides the ability for the user agent to perform the navigation in a different browsing context, or in a different user agent entirely. If during the handle redirects step of HTML's navigate algorithm the redirect URL is not within scope of the navigation scope of the application context's manifest, abort HTML's navigation algorithm with a SecurityError.

mattgarrish commented 6 years ago

We started to consider this when we were looking at the reading order and resource list and had this prose for external resources:

If a user agent encounters a resource that it cannot locate in the resource list, it MUST treat the resource as external to the Web Publication (e.g., it might alert the user before loading, open the resource in a new window, or unload the current Web Publication and resume normal Web browsing).

But according to a note in the document, the text was pulled during the last f2f.
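For illustration, the check implied by the withdrawn prose could look roughly like this: resolve the link target, drop any fragment, and see whether the result is one of the publication's declared resources. What the UA then does with an external target (warn, open a new window, leave the publication) is left open, as in the quoted text. `is_external` and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def is_external(target, current_url, bounds):
    """Return True if following `target` would leave the publication.

    `bounds` is a set of absolute URLs taken from the manifest; fragments
    are ignored so that 'ch2.html#sec1' still counts as 'ch2.html'.
    """
    absolute, _fragment = urldefrag(urljoin(current_url, target))
    return absolute not in bounds


bounds = {
    "https://example.org/book/index.html",
    "https://example.org/book/ch2.html",
}
here = "https://example.org/book/index.html"
print(is_external("ch2.html#sec1", here, bounds))
print(is_external("https://en.wikipedia.org/wiki/EPUB", here, bounds))
```

This differs from a prefix-based navigation scope: a same-directory file that is not listed would still count as external here.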

iherman commented 6 years ago

This issue will be discussed at the F2F; I thought I would share a specific example that may help our discussion. This example is not a traditional book, it is a scholarly article. (@atyposh and @TzviyaSiegman will appreciate...).

Look at https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2006229. It is a bona fide scientific article. Note that the article (what we would call a WP) is not the Web site you see; the site contains other items, including (at least on my screen) an advertisement for Leica and a CFP for another journal. In our terminology it is a reading system that displays the publication, and which also has a number of other user interface goodies (e.g., download the paper in PDF...).

The publication contains the HTML text, but also links to a number of other resources: if you scroll down there are figures with links to larger versions and to a PowerPoint slide and, actually, clicking on a figure shows a separate panel that allows zooming into an image (i.e., I suspect those panels are separate resources with some JavaScript). There are also data referred to from the paper, e.g., at https://doi.org/10.1371/journal.pbio.2006229.s017 (this is an Excel sheet). In my book, all of these are part of the publication.

Obviously, I want to read this paper and look at the data and zoomed images offline. (Note that the reader mode, at least in Mozilla, is very crude for anything beyond the pure text, it is not appropriate for a thorough read. Nor is the PDF version, for that matter.) But I do not necessarily want the Leica advertisement, the CFP, or even the download link to PDF. So the 'boundaries' must be clearly established. Some questions/comments:

There may be other good questions... I found the example interesting.

JayPanoz commented 6 years ago

@iherman At the core of your wonderings, that’s probably where web origin policies (same-origin, CORS, etc.) are very likely to come into play, at least in browsers – if that can answer some questions.

I’d say taking a look at reading modes is not necessarily the best idea there, because they have other goals, e.g. stripping ads, JS, CSS etc., and put a lot of heuristics in place to achieve those goals.

We’re going back to the opaque origin issue. To put it simply, if it’s opaque it’s probably out of bounds. But once again that’s for browser vendors/user agents to confirm – you can’t really tell what they will do under the existing policies.

lrosenthol commented 6 years ago

I think we need to start by understanding who determines the boundaries of the WP. IMO, they are defined by the author of the WP! With that being the case, then what you as the reader want (no ads, etc.) doesn't matter and we don't need to make any decisions. We simply need to allow the author a way to define them.

It might be an extra feature of a UA to offer you a choice of which things to take offline (instead of all things) - but the starting point isn't the reader.


iherman commented 6 years ago

This issue was discussed in a meeting.