conceptually, this means that the bounds of the resources are limited to the reading order (there's no way to indicate that a resource is part of the publication but not in the reading order anymore)
The reading order (however it's created/expressed) points to the primary resources, which in turn state what resources they depend upon. So, the way you would "indicate that a resource is part of the publication" is by referencing it from a primary resource (i.e. `<img src="">`, `<video>`, `<link rel="stylesheet">`).
The default consumption of the publication within a browser "just works" (as this is how browsers do things now--there's no "resource list" for a web site you visit, just pages referencing their dependencies).
The caching/offlining scenario can currently be handled by setting up a ServiceWorker and requesting the primary resources (currently via `<iframe>`s) to populate a publication cache--as can be seen in this copy of Moby-Dick. However, long term, the process would likely be handled similarly to what's described in the Resource Hints spec.
In either case, anything not in the reading order and not referenced from within a primary resource would not be considered part of the publication.
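For illustration, a minimal sketch of that cache-population step (the cache name, URLs, and hard-coded reading order array here are assumptions, not from any spec):

```js
// sw.js -- populate a publication cache from the primary resources.
// Dependencies would be cached later, as the pages themselves request them.
const CACHE = 'publication-v1'; // hypothetical cache name
const readingOrder = ['/moby-dick/c001.html', '/moby-dick/c002.html'];

self.addEventListener('install', (event) => {
  // cache.addAll() fetches each URL and stores the response.
  event.waitUntil(
    caches.open(CACHE).then((cache) => cache.addAll(readingOrder))
  );
});
```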
this could trigger the download of very large resources (HD videos)
One instance of handling "very large resources" on the Web is the `<source>` element. It describes dimensions, encodings, etc., and leaves it to the browser to determine (from its usage context) what the best resource for the scenario would be. Using that (and imaginable things like it), publishers would have the option to create multi-modal publications, leaving the determination of "very large" to the browser within the usage context--vs. the publisher publishing multiple publications based on various "sizes."
Additionally, there is an in-progress spec for adding "HTTP Client Hints". These headers would again allow browsers to make determinations about resource loading based on a combination of the headers and the usage scenario.
Point being, determining "very large" is best left to the device (and the software on it), based on its contextual knowledge of device space, screen size, etc.
or resources that are useless for rendering (analytics scripts) that would otherwise have been excluded by the author when caching and/or packaging a publication
Caching processes can be limited (currently) via ServiceWorker scope, such that requests outside of that scope (in my understanding) are not handled by the Handle Fetch routine. Presumably, something like that would be in place to limit what gets cached.
Alternatively, more descriptive approaches such as `Cache-Control` headers, Content Security Policy, or `<iframe sandbox>` (depending on the scenario), etc. could be used to further prevent offlining stuff the publication doesn't need (or can't use in an offline context).
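A minimal sketch of such a scope-style limit inside a ServiceWorker fetch handler (the path prefix is an assumption for illustration):

```js
// sw.js -- only answer requests inside the publication's URL prefix;
// anything else falls through to the browser's default network handling.
const SCOPE = '/moby-dick/'; // hypothetical scope prefix

self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (!url.pathname.startsWith(SCOPE)) return; // out of scope: not handled

  event.respondWith(
    caches.match(event.request).then((hit) => hit || fetch(event.request))
  );
});
```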
Lastly, I envision the packaging process to be somewhat similar to caching. In the case of Google's Web Packaging format, the whole thing is stored as a set of HTTP exchanges--so presumably you'd have to make all those requests to gather the request/response pairs to store in the HTTP exchange bundle. Internally, the format is very similar to the HTTP Archive format--which can be output from a browser's dev console.
Additionally, other webby formats like MHTML (currently supported in Chrome) and the TAG's upgrade of MHTML also use HTTP headers to express `Content-Type`, `Content-Location` (the URL of the contained item on the Web), etc., such that building one of those would look similar--i.e. recording the response from the Web into the package.
However, until Packaged Web Publications is farther along there's no immediate requirement (that I know of) to address packaging concerns explicitly in the Web Publication design and architecture.
to cache or package a publication, you need to render every single resource from the reading order, which is going to be very slow and CPU+memory intensive (we've experimented with background rendering in Readium-2 and limit things to only 3 resources)
There's no requirement to "render" anything--just to make the related HTTP requests and cache (or potentially package) the responses.
Additionally, since the dependencies are expressed from each primary resource (i.e. their relationship is known), then the UA could potentially offer the user the option to cache only part of the publication.
Whereas, if the list of resources is exhaustive and contains both primary resources and dependencies in a single list (with no stated relationship between them), then the UA could not offer that option because there'd be no way to determine that relationship from the exhaustive list.
not all UAs may be able to intercept network requests to cache or package them, this would exclude native mobile apps for example from supporting WPs properly
In as much as I'm expecting any supporting Web Publication UAs to be Web-connected (at least at first request of the publication address), then all such UAs would have the ability to: a) retrieve the resources as needed (per HTML on the Web) and b) store them in a cache (or package) to display should the network become unavailable.
UAs won't be able to do intelligent preloading on their own (for instance by loading fonts in cache in advance)
The entry page could `prefetch` or `preload` dependencies if that optimization is desired.
This is another case where the intention of the author/publisher would be expressed more clearly than in an exhaustive list of primary resources and dependencies. In the exhaustive list case there's no expression of which items are of greater importance (or larger size, etc.). However, in this more contextual case the desire to prefetch/preload is expressible, and expressible throughout the publication (i.e. prefetching a font from chapter4.html that's needed for the rest of the publication).
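For example (URLs hypothetical), the same hint can be written as markup in the entry page or injected from script:

```js
// Equivalent to <link rel="preload" as="font" ...> in the entry page:
// descriptively hint a font that later chapters depend on.
const hint = document.createElement('link');
hint.rel = 'preload';
hint.as = 'font';
hint.href = '/moby-dick/fonts/body.woff2'; // illustrative URL
hint.crossOrigin = 'anonymous'; // font preloads must be CORS-mode requests
document.head.appendChild(hint);
```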
The reading order (however it's created/expressed) points to the primary resources, which in turn state what resources they depend upon. So, the way you would "indicate that a resource is part of the publication" is by referencing it from a primary resource (i.e. `<img src="">`, `<video>`, `<link rel="stylesheet">`).
This is only true for some of the resources; others may be behind an `<a>` element (non-linear resources in EPUB) or could be dynamically fetched using JS. Simply pre-rendering and catching network requests to cache them using a Service Worker won't work in such cases.
One instance of handling "very large resources" on the Web is the `<source>` element.
This opens the door to a different set of problems, related mostly to packaging. There are widespread differences, for example, in the audio/video formats supported by various browsers on the Web. By packaging a publication the way you describe using browser A, I could end up with a package that won't work properly on browser B.
(I know that you're mostly ignoring the packaging use case for now, but this impacts our technical decisions as well).
In such situations (HD videos available in multiple formats to support various browsers), an author might prefer to completely avoid caching/packaging such resources. The problem with your approach is that there's no way for the author to indicate that a resource shouldn't be fetched and cached/packaged aside from HTTP headers (I would also need to double check if such headers are handled properly if the resource is cached by a SW, plus HTTP headers only handle caching and not packaging anyway).
Alternatively, the author might prefer to have all versions of a resource packaged, to provide an optimal experience on every device. That use case is also not supported by your scenario.
Caching processes can be limited (currently) via ServiceWorker scope, such that requests outside of that scope (in my understanding) are not handled by the Handle Fetch routine. Presumably, something like that would be in place to limit what gets cached.
Are you suggesting that to provide caching properly, authors of WPs will need to write their own Service Worker?
There's no requirement to "render" anything--just to make the related HTTP requests and cache (or potentially package) the responses.
But to make such HTTP requests and cache them, you rely on prerendering in the background and on a Service Worker that caches resources upon a successful fetch request.
If you can't do background prerendering (for obvious performance reasons) or Service Workers are unavailable (current webview on Android), then the alternative option is to parse every single HTML document to extract a list of resources. As I've pointed out before, this is both slow and CPU+memory intensive.
Additionally, since the dependencies are expressed from each primary resource (i.e. their relationship is known), then the UA could potentially offer the user the option to cache only part of the publication.
I don't think that this is currently listed under our use cases.
Additionally, there is an in-progress spec for adding "HTTP Client Hints". These headers would again allow browsers to make determinations about resource loading based on a combination of the headers and the usage scenario.
Whilst this is correct, the practical reality is that 99% of authors/publishers/etc. will have no way of influencing the HTTP headers returned by a server to the clients. I do not think we should base our specs on such headers to avoid pulling in possibly large files.
A resource list is not only relevant in terms of managing caches, packaging, etc. There are affordances that rely on these, too: e.g., a client-side search should not be required to search through all references in the content, but only through those that constitute the Web Publication.
Beyond all the efficiency, etc., issues: In more generic terms, a WP is actually defined by the list of resources that it contains: that is what makes it a WP in the first place, as opposed to an average Web page. It determines the scope of various metadata, for example. "Just" relying on extracting all the links in the entry page and declaring them to be the list of resources is going against the very definition of a WP.
This is only true for some of the resources; others may be behind an `<a>` element (non-linear resources in EPUB) or could be dynamically fetched using JS. Simply pre-rendering and catching network requests to cache them using a Service Worker won't work in such cases.
This scenario is why I'm an advocate for keeping HTML Imports for descriptive (non-JS-based prescriptive) references. Additionally, things like the `prefetch` Resource Hints in `<link>` headers could also be used to descriptively reference these resources.
Essentially, the underlying premise is to reference resources as close to their actual use as possible, and in such a way that requests for those resources can be optimized from an understanding of their use. As you noted, that is currently unclear if one simply uses an unrefined `<a>` or hidden fetch code in a JS script--i.e. the relationship between the document (or script) and the resource is either unclear or unknowable (without processing a script).
By packaging a publication the way you describe using browser A, I could end up with a package that won't work properly on browser B.
(I know that you're mostly ignoring the packaging use case for now, but this impacts our technical decisions as well).
Much of this will depend on how and what does the packaging. If one were attempting to package a Web page now which referenced multiple videos (or images or audio) of varying size and quality, it would be up to the packaging software (and the package format it intends to output) which of those resources were added to the package, based on similar heuristics used by browsers at request time.
The expression of the available options is provided in context and with a refined expression which includes sizes, etc. which can be used by a packaging tool (or browser) to make those determinations given its intended output.
A singular "resource list" will either lack that information (as wpub does now), or the addition of such information will ultimately look like aggregating all these in-context HTML expressions into that list--which then will look rather similar to lumping all the HTML references (with the `sizes`, `srcset`, and `media` style information) into a single file.
Are you suggesting that to provide caching properly, authors of WPs will need to write their own Service Worker?
Not long term, no. It's what's currently available and (consequently) can be used to inform our thinking when we look toward getting this work into the browsers. The important part of that comment was that "requests outside of that scope (in my understanding) are not handled by the Handle Fetch routine. Presumably, something like that would be in place to limit what gets cached."
Point being that we may also need (regardless of how we express the structural properties) something similar to the `scope` property as seen in ServiceWorkers (for Fetch handling) and in Web App Manifest (for navigation limiting). In both those cases, that's how the "edge" of a Progressive Web App is being defined.
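For reference, the ServiceWorker version of that `scope` property looks like this (the publication path is assumed for illustration):

```js
// Register a worker whose control is limited to the publication's prefix;
// pages outside /moby-dick/ are never controlled by this worker.
navigator.serviceWorker
  .register('/moby-dick/sw.js', { scope: '/moby-dick/' })
  .then((reg) => console.log('publication edge:', reg.scope));
```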
the alternative option is to parse every single HTML document to extract a list of resources.
True. In order to "gather" the entire publication (into a cache, etc.), each primary resource's references must be considered/processed/parsed.
The exhaustive "resource list" approach doesn't guarantee that one can avoid that, however. The exhaustive list also raises concerns of things potentially being (or becoming) out of sync during the publication's history (i.e. downloading resources one doesn't need, or not downloading resources one does need).
Ultimately, to make the exhaustive list, something somewhere is going to have to parse every single primary resource, gather its dependencies, and put them in that list. That "something" might be a human at a keyboard, or it might be some Python script using BeautifulSoup, or it might be a browser-based editor of some kind. Regardless, in the case of the exhaustive list, any reference made from any primary resource which is needed for the experience of that primary resource MUST be recorded in that exhaustive list.
If something isn't recorded in the exhaustive list, then...what happens? Invalid cache? Fallback to the network (i.e. use the Web)?
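As a sketch of what that "something" could look like as a script (the element/attribute coverage here is deliberately partial, just to show the shape of the work):

```js
// Build an exhaustive list by parsing each primary resource and
// collecting the references it makes. Coverage is illustrative only.
async function gatherList(readingOrder) {
  const found = new Set(readingOrder);
  for (const pageUrl of readingOrder) {
    const html = await (await fetch(pageUrl)).text();
    const doc = new DOMParser().parseFromString(html, 'text/html');
    const refs = doc.querySelectorAll(
      'img[src], script[src], link[rel="stylesheet"][href]'
    );
    for (const el of refs) {
      const raw = el.getAttribute('src') || el.getAttribute('href');
      found.add(new URL(raw, pageUrl).href); // resolve against the page URL
    }
  }
  return [...found];
}
```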
I don't think that this is currently listed under our use cases.
I'd been seeing this as a potential application of Random Access to Content--especially for massive works like textbooks where only a few chapters/sections may be needed offline at a time.
In the exhaustive list case, the entire publication would have to be available/considered/cached before Random Access could be provided.
In the case of a textbook from which a student only needs/wants to offline "Chapter 4" (and its dependencies), it's not clear how the UAs would provide that Random Access request without either: a) the overhead of caching the whole publication, or b) ignoring the exhaustive list and taking the "gathering" approach.
Ultimately the primary resources themselves become authorities over their own content and dependencies.
Whilst this is correct, the practical reality is that 99% of authors/publishers/etc. will have no way of influencing the HTTP headers returned by a server to the clients. I do not think we should base our specs on such headers to avoid pulling in possibly large files.
Let's not guess at statistics. :wink: I also wasn't suggesting we "base our specs on such headers," just that they exist (or may/will) and subsequently will be considered by the browser when making any of these requests (see Content Security Policy for one such example).
Consequently, we shouldn't ignore their existence--as any deployment of these things will ultimately have to work within them, and conversely should make use of them wherever it makes sense for our use cases.
Any spec we bring to a browser vendor will be considered in light of the existing (and potential) Web Platform specifications. HTTP Headers are (increasingly) a big part of that puzzle.
A resource list is not only relevant in terms of managing caches, packaging, etc. There are affordances that rely on these, too: e.g., a client-side search should not be required to search through all references in the content, but only through those that constitute the Web Publication.
This is what the "binding" is about. It provides a linear progression of resources which are "bound" into a (new) thing called a Web Publication. Those resources can then be (in some lovely future): searched, offlined, etc.
A resource list will be built in either of these cases. It's just the "how" and the "when" that's in question--afaict.
"Just" relying on extracting all the links in the entry page and declare them to be the list of resources is going against the very definition of a WP.
I've never suggested we "[extract] all the links in the entry page," but rather narrowed them (as our spec currently does) to a defined area of expression (currently `<nav role="doc-toc">` in our fallback reading order system). Given that's currently in our spec as a way to create the reading order, I don't think we can say it's "going against the very definition of a WP."
However, we can certainly say that the approach of gathering the resources vs. listing them exhaustively are different approaches to the same problem--both of which have their consequences (good and ill).
@BigBlueHat I am just trying to understand. Is your proposal that, instead of listing the resources in the JSON part, you would use something like

```html
<nav role="doc-something">
  <ol>
    <li><a href="resource1"></a></li>
    <li><a href="resource2"></a></li>
  </ol>
</nav>
```

and that (and only that!) defines the list of resources?
And what if resource2 needs a script or references an image? Each of which would need to be in said list of required resources... we can't require a crawl to generate a resource list. We wouldn't know where to stop. I think a non-explicit or non-exhaustive resource list should be considered a non-starter.
@iherman as long as everyone's clear about the "something like" part--meaning, there's room for (and should be more!) exploration here. But in short, yes.
The proposal is that HTML is the best place to put resource relationships because the mechanics of doing that are already defined (and we can build up from those), that they'll be used regardless (see the failure scenarios mentioned above), and moving these structural properties into JSON brings a less webby model into play upon the Web.
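For illustration, a sketch of how a UA might read the reading order back out of such a nav (the role value reuses `doc-toc` from the current draft purely as an example):

```js
// Extract the list of primary resources from the entry page's nav.
const nav = document.querySelector('nav[role="doc-toc"]');
const readingOrder = [...nav.querySelectorAll('a[href]')]
  .map((a) => a.href); // .href is already resolved to an absolute URL
```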
And what if resource2 needs a script or references an image?
Then (as happens now) a request for resource2 (either directly or via some process) retrieves those dependencies.
Each of which would need to be in said list of required resources...
In the exhaustive list case, they would have to be listed in both the exhaustive list and referenced from the primary resource which depends on them.
we can't require a crawl to generate a resource list.
There's no "crawling" (at least not in the open ended since used here). There are two things:
The "gathering" of All The Things (if/when needed for a use case), would happen by requesting the primary resource(s) needed and then requesting their resources--which is how loading a Web page works now. If one were to collect a specific list of Web pages together (no crawling! just that list) right now, this is exactly the process one would use.
We wouldn't know where to stop.
We do know where to stop. We only get the primary resources and their referenced dependencies. We don't "crawl" random, inline `<a>` tags, etc.
For instance, if the Moby-Dick reading order referenced an "About the Author" page which in turn linked (inline) to the Herman Melville Wikipedia page, there would be no expectation that the "gathering" process should collect that Wikipedia page--since that Wikipedia page was not included in the reading order.
I think a non-explicit or non-exhaustive resource list should be considered a non-starter.
The list of primary resources is explicit and exhaustive--in that it defines the primary boundary of the publication in a linear fashion. The dependencies are the thing in question.
Defining an exhaustive list which contains both primary resources and dependencies (or even just a list of dependencies) seems likely to:
Those would be some (at least) of my "cons" for the exhaustive list approach (echoing the "cons" listed earlier for the "gathering" approach).
If there are specific scenarios of use not yet expressed here, it'd be great to hear more about what's informing the "non-starter" thoughts. As yet, all of these thoughts/comments assume consumption of a Web Publication via a Web browser (in the broadest sense of the term).
I just wanted to add a clarifying point to this discussion about why, from my perspective, some of the contributors in this and other tickets seem to be talking past each other. It might help clarify the points of contention.
(Obligatory disclaimer: I take no stance on the following dichotomy.)
As I said, I take no stance on either of the sides in this dichotomy, except to say that, as Ivan pointed out above, we actually have documented affordances and use cases. If we think of the two sides in their most extreme representation (which nobody actually is advocating for) as "a WP is just an exploded EPUB viewable in a browser" vs. "a WP is just a collection of HTML pages with some metadata and chapter navigation", both of those would fail to serve our use cases and affordances completely. (As I think everyone would agree!)
I don't think there's a fundamental conflict between these two points of view. But I do think it's worth clarifying them, because they are the source of so many of these contentious issues.
Just as a little experiment, I made an EPUB that did not list secondary resources (CSS, images, etc.) in the manifest. This is obviously illegal, and the EPUB rightly failed validation. But it worked perfectly in the three reading systems that I tried. Of course this doesn't prove anything. But the entire web operates without such lists. Web applications don't need such a list. Having such a list involves costs, and we shouldn't immediately assume that the costs are negligible, or that the benefits are so obvious they don't need to be stated.
As much as I love @dauwhe ... such an EPUB would not work on, say, Google Play Books. And I view this resource list as required for packaging, or for that matter off-lining, so the fact that it works on the Web or in some EPUB RSes doesn't sell me. :-)
On the bright side, you inspired me to install Google Play Books! And yes, it won't open my little experiment. Does it run epubcheck on upload?
Now I have to go write a demo of packaging without an explicit resource list :)
@dauwhe yes, we do run epubcheck on upload. But that's just early detection... we wouldn't serve un-manifested resources.
"Now I have to go write a demo of packaging without an explicit resource list" – be prepared to let it run for awhile, as you might end up with the whole Web! :-)
I think it's also worth pointing out again that the "list of resources" is an optional infoset item.
It's up to the author to decide what should or shouldn't be listed in there. Unlike EPUB, there won't be validation involved for that list (EPUB requires authors to list every resource from the package in `<manifest>`).
By including an item in that list, you're simply making sure that the UA will eventually cache/package it along with the resources listed in the reading order.
If you don't want to be bothered by providing such a separate list, that's fine. But you can't expect the UA to be able to guess everything for you in that case, because there are many reasons that this could fail.
"Now I have to go write a demo of packaging without an explicit resource list" – be prepared to let it run for awhile, as you might end up with the whole Web! :-)
That's not how it's ever worked, @GarthConboy, so I'd appreciate it not being restated as if it were a possibility. It's only causing fears and concerns about things that don't happen. Thanks.
The technical response posted earlier hopefully makes it clear that the approach which I've proposed is not in danger of crawling the whole Web: https://github.com/w3c/wpub/issues/198#issuecomment-389571549
By including an item in that list, you're simply making sure that the UA will eventually cache/package it along with the resources listed in the reading order.
If you don't want to be bothered by providing such a separate list, that's fine. But you can't expect the UA to be able to guess everything for you in that case, because there are many reasons that this could fail.
Wait...so it's not exhaustive? The resource list is currently defined as "all resources":
The resource list enumerates all resources that are used in the processing and rendering of a Web Publication (i.e., that are within its bounds).
If it is instead merely additive (i.e. explicitly stating that the UA not forget to get that resource), then I've far less of an issue with it--however, there are still concerns (for me) around moving structural/resource-loading semantics out of the HTML (which is by definition for such constructions) and into a new place.
@HadrienGardeur could you clarify whether you believe an exhaustive list is a requirement, or that there's simply a need for something to reference dependencies which should "not be forgotten" (or perhaps a list of things that "must not be gotten" 😉)?
Thanks.
It's always exhaustive in the sense that this list + the reading order taken together are responsible for establishing the boundaries of the publication.
But you certainly don't have to list all the resources in there (for instance, you would definitely avoid including your Google Analytics script or similar resources that won't work offline+packaged), it's up to the author to decide which resources are part of the publication or not.
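As a hypothetical illustration of that split (the property names are one plausible JSON spelling of the draft's "default reading order" and "resource list" infoset items; the values are made up):

```js
// Illustrative manifest shape: the two lists together bound the WP.
const manifest = {
  readingOrder: ['c001.html', 'c002.html'],
  resources: [
    'css/book.css',
    'fonts/body.woff2'
    // no analytics script here -- deliberately left outside the bounds
  ]
};
```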
so I'd appreciate it not being restated as if it were a possibility
But it is a possibility. The reading order is much less stringent in its requirements right now than the resource list, so you can't rule out that hyperlinked resources are part of the publication if the resource list becomes optional. Of course, any sensible user agent wouldn't actually crawl them. I think the point is more that this obscures the bounds of the publication.
I recall we already went round and round this problem back when we were having fun with primary and secondary resources and their relation to the reading order and resource list. Is it that hard to find a compromise between the two extremes? You must list all primary/top-level resources in the resource list, but should list all dependencies if you don't want to risk a user agent failing to properly cache (or whatever) your publication. You should also list any resources that might not be easily determined by inspection (script-necessary files, etc.). And maybe flag complete/incomplete lists so user agents know which need processing. Sort of a return to: https://github.com/w3c/wpub/issues/22#issuecomment-321081579
The answer is probably different for EPUB, where listing all resources becomes a requirement and probably not an unreasonable ask.
@mattgarrish what you expressed in https://github.com/w3c/wpub/issues/22#issuecomment-321081579 is spot on, and describes precisely what I seem to have been failing to express. Thank you, Matt! 😄
What Matt said in https://github.com/w3c/wpub/issues/22#issuecomment-321081579 also matches what I'm seeing in @iherman's examples, where he's only referenced resources which were not already listed in the spine/ToC.
Additionally, I understand now (based on the comment above and a quick read on non-linear content in EPUB) that the "crawling" concerns were probably related to the following text from the EPUB 3.2 spec (and similar text in its predecessors):
Authors MUST provide a means of accessing all non-linear content (e.g., hyperlinks in the content or from the EPUB Navigation Document).
With that as the backstory, I now understand @GarthConboy's concerns. 😃
Indeed... I was trying not to type the dreaded `linear="no"`, but that's what I had in mind. :-) Also content required/referenced by included scripts, which is largely unknowable.
We may, I think, conflate three problems (at least they are mixed in my mind):
These affect the "offline" aspect of the WP (whether offline temporarily/locally via a cache or a package) but also affordances.
The relevant section in the current draft seems to give an answer to (1). Is there any reason to re-open that draft, or should it be taken as granted for now? Has there been any new evidence that would warrant reopening (1)? I do not think so.
Ie, we should concentrate on (2) and (3). Note that the draft also says:
If a user agent encounters a resource that it cannot locate in the resource list, it MUST treat the resource as external to the Web Publication.
which means that there may be references in the primary document that are not part of the final resource list (as it should be, imho), which makes (2) above fairly unclear.
Indeed. I think it's pretty clear that there is a required list of Primary Resources (default reading order). It seems to me that we need to fully decide whether a WP MUST be offline-able/package-able, or if it MAY be (as in, up to the author). The section of the current draft that @iherman pointed to above implies this is a MAY ("it is strongly RECOMMENDED to provide a comprehensive list of all of the Web Publication's constituent resources"), which seems to imply that supplying the list of secondary resources may be optional. I tend to lean more to a MUST. But, if we affirm this, one way or another, we can decide the Secondary Resource list question.
The Working Group just discussed https://github.com/w3c/wpub/issues/198.
Comments from WG call:
-- Provision of such an exhaustive resource list (resources beyond those in the required default reading order) is, indeed, RECOMMENDED (per current spec). And this likely should be considered settled.
-- Some agreement that if the exhaustive list of resources is provided, it should somehow not duplicate the default reading order -- it should contain only those additional resources, to avoid duplication.
-- It was pointed out that publications can be created such that no such exhaustive resource list is required for the WP to be offline-ed or packaged (all resources bundled in with those in the default reading order). But, such an exhaustive list may be required for full/correct offline-ing or packaging.
Garth's proposal for this issue remains:
"Is an exhaustive "resource list" required to create a Web Publication? No. Such an an exhaustive list may be needed to make the WP fully offline-able or package-able. But, an exhaustive list of resources required (beyond the primary reading order) is not required, as it is up to the author whether a WP is fully offline-able or package-able." (from the RECOMMENDED above)."
@deborahgu said that we need to talk more about how such a list would be used. I think this would be very helpful!
I'm also wondering if there's some confusion about what such a list makes possible. HTML has APIs to get a list of images (`document.images`) or stylesheets (`document.styleSheets`) associated with any HTML document. So if we have the default reading order, we can get all the images and stylesheets associated with those documents. Several of us have built example WPs that are packageable and/or cacheable without an exhaustive list.
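For example, from any document in the reading order (a sketch; error handling omitted):

```js
// Enumerate a loaded document's image and stylesheet dependencies
// using the standard DOM collections mentioned above.
for (const img of document.images) {
  console.log(img.currentSrc || img.src);
}
for (const sheet of document.styleSheets) {
  console.log(sheet.href); // null for inline <style> elements
}
```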
EPUB had an exhaustive list of all those things, partly to help with validation, and partly so that a reading system could learn something about the resources without parsing the HTML documents. If we're making arguments about that, we should be explicit about them. And we should remember that in the early days of ebooks, many reading systems were stunningly underpowered.
We've also talked about how the exhaustive list defines the content of a WP. I'd like to see more detail about that. If I'm reading a WP, and encounter an image that isn't listed in the exhaustive list, what happens? If I then cache the WP, what happens? If I package it, what happens? If I click a link in an HTML document in the default reading order, which leads to an HTML document on the same origin which is not part of the default reading order, what happens?
I would assume the primary purpose of the resource list is so that the reading system can establish when a user is within the scope of the publication. You don't need the secondary resources for that, but you also can't rely on the reading order in all cases we've discussed.
If you don't have a list of all the primary resources somehow between those lists, the publication would get "exited" whenever you navigate to an unlisted primary resource. The secondary resources are inconsequential to this process. It doesn't matter whether what is loading within the page is listed, as we're not going to change HTML rendering/security/etc.
As far as offlining and packaging go, having all the secondary resources would speed the process up, and be more accurate in some cases, but why put this requirement exclusively on authors when user agents are capable of performing the step?
I suspect that web publication authoring tools are going to give a complete list of resources and make the issue moot in a large number of cases, but I also think the simpler it is to create a web publication, no matter your choice of tools, the better.
I sort of question whether a user agent should ever rely on authors to get the list of needed resources correct, or should always be inspecting the primary resources to determine what is needed and might be missing. If we want reliability, depending on authors is not the best idea.
In response to @dauwhe 's:
I'm also wondering if there's some confusion about what such a list makes possible. HTML has APIs to get a list of images (document.images) or stylesheets (document.styleSheets) associated with any HTML document. So if we have the default reading order, we can get all the images and stylesheets associated with those documents. Several of us have built example WPs that are packageable and/or cacheable without an exhaustive list.
and @mattgarrish 's:
As far as offlining and packaging go, having all the secondary resources would speed the process up, and be more accurate in some cases, but why put this requirement exclusively on authors when user agents are capable of performing the step?
I think it's clear that UAs/RSes can, for simple publications, suss out the list needed to cache/offline/package some WPs with a light crawl of the resources from the default reading order list – finding the CSS, images, and scripts referenced and including (only) those.
However, there are clearly cases where that can't be done – e.g., "required" content that is not in the default reading order (`linear="no"` content, for example footnote or endnote content) or resources required by included scripts. So, I get back to: such an exhaustive list (beyond the default reading order) SHOULD or COULD be provided... per my suggested resolution at the end of my previous comment.
@mattgarrish I agree that a list from the author of "supporting" resources such as fonts, CSS, and images is perhaps not terribly useful. Or at least not required for many use cases.
The interesting question is HTML that's not in the default reading order, but is linked to from the publication. If it's not part of the publication, then it's just like any other web link. But what if it's part of the publication? EPUB hasn't really solved this problem—just say `linear=no` during a call and listen to the groans.
What does it mean to be part of the publication but outside of the default reading order? This would certainly affect some affordances. "previous" and "next" controls would presumably be disabled. How would one return to the linear reading experience? Display such content as a modal? Provide a link back to where you opened the resource?
But what if it's part of the publication?
That's why I don't think we can rely on the reading order being a useful listing of primary resources. It's also why the current prose only requires one document in the reading order. Not because some publications will only be one resource, but because people have expressed a desire to create publications that don't offer a linear progression by default but rely on other means, like following links.
What does it mean to be part of the publication but outside of the default reading order?
It has to mean nothing more special than that there is no automatic path forward, but it's one of a variety of things about bringing an epub reading experience to the web that is thoroughly confusing to me, too. If you go back to a linear document, and it assumes a different next document, what does that do to the browsing history? Assuming new tabs get spawned would make a mess of the link-based model.
I'm going to repeat what I said during our last call:
Among some of these affordances and expectations:
As we can see, caching and packaging are only two affordances among many others.
It has to mean nothing more special than that there is no automatic path forward, but it's one of a variety of things about bringing an epub reading experience to the web that is thoroughly confusing to me, too. If you go back to a linear document, and it assumes a different next document, what does that do to the browsing history? Assuming new tabs get spawned would make a mess of the link-based model.
I think we can ask such questions to the Edge team (cc @BCWalters ) since they've already addressed those issues for their reading mode (which covers resources from the Web, EPUB and PDF).
As seen in their presentation in Berlin, the reading mode has its own affordance for moving forward/backward in the reading order, but they also keep the URL bar and the back button in there.
@dauwhe
I sort of question whether a user agent should ever rely on authors to get the list of needed resources correct, or should always be inspecting the primary resources to determine what is needed and might be missing. If we want reliability, depending on authors is not the best idea.
I must admit I continue to be wary of the approach whereby the UA would crawl through all primary resources to gather the list of all resources. The existence of `document.images` or `document.styleSheets` obviously helps, but that does not solve everything. E.g., this would require parsing each CSS file to see if there are font references (I would suspect that `document.styleSheets` includes the stylesheets imported via an `@import` in CSS), parsing the HTML resources looking for videos, JS files, CSV files or other datasets (used by some interactive content in the document), etc. How would the UA know what to include and what not to include?
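To make that cost concrete, here is roughly what the CSS side of such a crawl involves (a sketch; note that `cssRules` throws for cross-origin stylesheets loaded without CORS):

```js
// Walk a stylesheet for @import targets and font references.
function cssDependencies(sheet, out = []) {
  for (const rule of sheet.cssRules) {
    if (rule instanceof CSSImportRule) {
      out.push(rule.href);
      cssDependencies(rule.styleSheet, out); // recurse into the import
    } else if (rule instanceof CSSFontFaceRule) {
      out.push(rule.style.getPropertyValue('src')); // raw url(...) text
    }
  }
  return out;
}
```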
We may of course decide to include some sort of an "exclusion" list rather than an "inclusion" list. I.e., instead of listing what the list of resources includes, we may require listing what is excluded when crawling the resources. However, we always have to see what the consequences of bad authoring would be, and forgetting to explicitly exclude something may lead to the addition of a full DNA dataset of several GBs to the WP… So I am not sure that would be a good idea either.
Alternatively, we may make strong use of `document.images` or `document.styleSheets` and not require the author to include JS files and images in a resource list. That may simplify the authoring of many simple WPs.
Another thing that worries me is the time it takes. While I realize that UAs operate in much better environments than in the early EPUB2 days, fetching and parsing a whole series of HTML content just to gather what would be part of the WP is still a significant effort; parsing an HTML document doesn't only mean parsing the syntax (which is significant already) but also building 2-3 different "trees" (DOM, CSS, Accessibility…).
I must admit I continue to be wary of the approach whereby the UA would crawl through all primary resources to gather the list of all resources.
Since that quote was from me, I'll respond by saying it's not an approach I would take. But I'm not averse to there being a process whereby the user agent does the work, with a clear caveat emptor for anyone who takes advantage of it.
I also don't think a complete list of supporting resources is necessary if, as Hadrien says, you don't want those affordances. But we haven't addressed how an author specifies what a user agent can do with a publication. We'll need explicit metadata at some point.
More what I was wondering, though, is whether user agents are going to crawl the resources regardless of any stated completeness to discover whether there are any missing resources. If they have to do this for some, will they do it for all? We don't forbid this anywhere, but we also implicitly accept that if resources aren't listed they won't be put in caches or packages. Should we have both global no-offlining and per-resource no-offlining instructions, so that we don't forgo completeness of the resource list to achieve an unrelated need, and so user agents don't try to put these resources back in the list?
I would assume the primary purpose of the resource list is so that the reading system can establish when a user is within the scope of the publication.
Can you expand on what you mean by "within the scope of a publication?" In other specs, "navigation scope" is limited by a path prefix such that any navigation request made that does not contain that prefix causes the navigation to happen elsewhere (i.e. outside).
If you don't have a list of all the primary resources somehow between those lists, the publication would get "exited" whenever you navigate to an unlisted primary resource.
It sounds like you're wanting/expecting a similar sort of experience scenario where navigation is somehow limited within some view/container, and any navigation that happens must match a list of URLs or it opens elsewhere. Is that correct?
It sounds like you're wanting/expecting a similar sort of experience scenario where navigation is somehow limited within some view/container, and any navigation that happens must match a list of URLs or it opens elsewhere. Is that correct?
In EPUB, a number of reading apps handle things similarly.
Here's how iBooks handles resources:
I'm not suggesting that this is how we should handle things as well, but it's worth knowing about the current behaviour for EPUB. I haven't tested Edge in such detail yet, but I imagine that they also have their own take on this.
More what I was wondering, though, is whether user agents are going to crawl the resources regardless of any stated completeness to discover whether there are any missing resources. If they have to do this for some, will they do it for all? We don't forbid this anywhere, but we also implicitly accept that if resources aren't listed they won't be put in caches or packages. Should we have both global no-offlining and per-resource no-offlining instructions, so that we don't forgo completeness of the resource list to achieve an unrelated need, and so user agents don't try to put these resources back in the list?
I'm not a fan of having "no-caching" or "no-packaging" directives in the manifest. They can be just as easily ignored and this breaks some of the promises of WP.
You're right that user agents may attempt to crawl resources and various reading modes (plus dedicated services like Pocket) probably have their own heuristics for that.
Crawling is not necessarily a bad thing though; it mostly depends on what you do with those resources. If you simply preload them and then respect their caching directives, that's mostly harmless (although the process of crawling could be quite CPU/memory intensive). If you start caching them heavily and ignore HTTP headers, that's a completely different story.
In general, I think that UA should only deploy a proxy with a "network then cache" policy on resources listed in the default reading order or the list of resources. Everything else should at best trigger a GET request in the background and nothing more.
We haven't discussed yet how UAs should handle caching, mostly because we keep getting lost discussing "offlining".
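A sketch of that "network then cache" policy limited to the publication's bounds (the cache name and the idea of a prebuilt `BOUNDS` set are assumptions for illustration):

```js
// sw.js -- network first for listed resources, cache as offline fallback;
// unlisted requests are left entirely alone.
const BOUNDS = new Set([/* reading order + resource list URLs */]);

self.addEventListener('fetch', (event) => {
  if (!BOUNDS.has(event.request.url)) return; // outside the publication

  event.respondWith(
    fetch(event.request)
      .then((resp) => {
        const copy = resp.clone(); // response bodies are one-shot streams
        caches.open('wp-cache').then((c) => c.put(event.request, copy));
        return resp;
      })
      .catch(() => caches.match(event.request)) // offline: serve the cache
  );
});
```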
Looking back at the thread, I am a little bit worried that we are complicating things too much. The model whereby it is recommended that the author explicitly lists the resources (that are not in the reading order anyway) as part of the manifest seems simple, and I do not see any major downside to it. Whether the UA does offlining/caching/whatever with those is not for the author to care about; it is up to the UA. The affordances, whenever appropriate, have a clear scope with those resources. And that is it...
The only easy expansion of this may be what I said earlier:
Alternatively, we may make strong use of `document.images` or `document.styleSheets` and not require the author to include JS files and images in a resource list.
Meaning that these are supposed to be used (ie, considered to be part of the resources) by the UA automatically. (Provided that CPU/memory requirement of crawling is acceptable.)
-1 to the idea of mentioning `document.images` or `document.styleSheets` at all.
While we can't stop UAs from gathering resources in the background, we shouldn't push this forward as an acceptable alternative.
I think we need to be wary of specifying RS implementation details.
I'll posit an (only slightly tuned) proposal for closing this issue:
An exhaustive "resource list" is not required to create a Web Publication. Such an an exhaustive list may be needed to make the WP fully offline-able or package-able or to enable provision of other affordances. Providing such a list of required resources beyond the default reading order is RECOMMENDED. However, it is ultimately up to the author whether a WP is fully offline-able or package-able or provides a RS/UA sufficient data to enable all desired affordances.
[Note, this is not really my preferred solution, but I think I have been convinced that changing the RECOMMENDED to REQUIRED is likely just not practical.]
It sounds like you're wanting/expecting a similar sort of experience scenario where navigation is somehow limited within some view/container, and any navigation that happens must match a list of URLs or it opens elsewhere. Is that correct?
How the publication is visually manifested is more a secondary consideration. What I'm concerned with here is the idea of there being a publication state that transcends the resources.
It doesn't matter if that state actually persists in the background of the user agent and is checked against a list of URLs as each new resource is requested, or whether it is unloaded and reloaded with each resource and re-checked. Whatever the case, the state needs to be grounded in a concrete list of resources. Otherwise, what stops me from spoofing your publication simply by putting a manifest link into any malicious document I feel like? Without the bounds, the centre cannot hold.
I'm not a fan of having "no-caching" or "no-packaging" directives in the manifest.
No, didn't mean to imply I'm a fan. And maybe I've missed where the discussion now is. I thought I saw that omission from the resource list was an acceptable practice for not caching/packaging resources, but maybe I was seeing things. If so, though, that would make validation and authoring of web publications much harder.
Similarly, no-cache would not be an effective means of declaring what to package/not package. Caching and packaging are not congruous concepts. I may not want a resource cached between page views, but I still need it in a package just as I need it to view the page properly. If we want a way to exclude resources, we may need to have something more explicit.
I thought we had a use case that would provide a way for authors to indicate that they don't want their publications offline-able or package-able? (Personally, I think offline should always be on the table.) It wouldn't provide any measure security against those actions being taken, of course, but at least conforming user agents would be expected to respect them.
Maybe we are approaching consensus? From the perspective of a UA, the relevant defining characteristic of a WP for this discussion is that it may be (typically will be) an aggregation of several HTML documents, media files, etc., rather than just a single HTML document or file. The reading order and, when necessary, a non-exhaustive list of essential 'non-linear' resources must be enumerated sufficiently to establish the boundary of a WP.
On the other hand it sounds like we do not want an 'exhaustive' list of all resources (are we agreed about this yet?). In fact, given the ample number of specs that tell a UA how to deal with (e.g., how to cache, stream, etc.) HTML documents, media files, etc. I would prefer that WP authors be discouraged from enumerating all the CSS, JS, embedded images, codec files etc. needed to cache, package, render or otherwise process a WP - leave this work that UAs already know how to do to the UAs. I don't think any performance advantage is worth the risk of the manifest in the entry point document getting out of sync with the links in the files that comprise the WP (imagine you change the name or location of a CSS or JS file). I suspect that some or most UAs would ignore the inclusion of such items in a WP manifest list anyway and rely on what they find when they open each constituent file of the WP.
However, regardless of our consensus on this point, I do have a couple of concerns left:
Cache-Control: Given that the WP's 'entry' point will typically be an HTML document and will certainly be retrievable via HTTP, how should a UA interpret any HTTP Cache-Control headers? My thought would be that we specify that the UA assume any such Cache-Control headers apply both to the entry point document itself and to the WP as a whole. This avoids the need for us to regenerate and specify a separate Cache-Control mechanism for WPs. Or is this too much of a stretch of the definition of this HTTP header?
Nesting. Consider a journal volume as a WP. Presumably its reading order is a list of issues. Each issue is then also a WP itself, with a reading order which is a list of articles. Do we expect that the UA wanting to cache the volume-level WP will discover, by opening each issue WP, that it needs to also cache all the articles in each issue? Or should/must the journal-level WP include a listing of all the articles in the volume, e.g., in its reading order? (Which again seems an invitation to inconsistencies between the reading order specified in the volume WP and the reading orders specified in the issue WPs.)
(Admin comment!)
Notwithstanding the (genuine!) issues raised by @tcole3 in https://github.com/w3c/wpub/issues/198#issuecomment-391166490, I have the impression (see also the comment of @GarthConboy https://github.com/w3c/wpub/issues/198#issuecomment-391047369) that we have a consensus on the original question of the issue. The answer being "no", and it also looks like the current text in the draft stands as is. I would therefore propose to close this issue with no further action.
Except that... there is one problem that is pending and needs further discussion, but it is not exactly the issue as asked here. I would therefore also propose to open a new issue: "how should the infoset item 'resource list' be expressed in the WP manifest?". There were several different approaches listed here on whether it is in the manifest, fully in HTML, or partially here and there... That should be decided, and I do not believe we have consensus on that technical problem.
This was another topic which surfaced during the #193 discussions.
@HadrienGardeur posed some concerns to a "dependency gathering" approach proposed by @BigBlueHat.
There are several potential consumption scenarios which should be considered:
Current:
Future:
@HadrienGardeur's concerns about the gathering process are below...