Is an exhaustive "resource list" required to create a Web Publication?

w3c / wpub

W3C Web Publications

https://w3c.github.io/wpub/

Other

78 stars 19 forks source link

Is an exhaustive "resource list" required to create a Web Publication? #198

Closed BigBlueHat closed 6 years ago

BigBlueHat commented 6 years ago

This was another topic which surfaced during the #193 discussions.

@HadrienGardeur posed some concerns to a "dependency gathering" approach proposed by @BigBlueHat.

There are several potential consumption scenarios which should be considered:

Current:

reading through a Web browser (i.e. follow a link to a publication address)
caching and/or offlining the publication in a browser
Future:
creation of a packaged publication (see https://github.com/w3c/pwpub/)

@HadrienGardeur's concerns to the gathering process are below...

conceptually, this means that the bounds of the resources are limited to the reading order (there's no way to indicate that a resource is part of the publication but not in the reading order anymore)

this could trigger the download of very large resources (HD videos) or resources that are useless for rendering (analytics scripts) that would otherwise have been excluded by the author when caching and/or packaging a publication

to cache or package a publication, you need to render every single resource from the reading order, which is going to be very slow and CPU+memory intensive (we've experimented with background rendering in Readium-2 and limit things to only 3 resources)

not all UAs may be able to intercept network requests to cache or package them, this would exclude native mobile apps for example from supporting WPs properly

UAs won't be able to do intelligent preloading on their own (for instance by loading fonts in cache in advance)

BigBlueHat commented 6 years ago

conceptually, this means that the bounds of the resources are limited to the reading order (there's no way to indicate that a resource is part of the publication but not in the reading order anymore)

The reading order (however it's created/expressed) points to the primary resources, which in turn state what resources they depend upon. so, the way you would "indicate that a resource is part of the publication" is by referencing it from a primary resource (i.e. <img src="">, <video>..., <link rel="stylesheet">).

The default consumption of the publication within a browser "just works" (as this is how the browser do things now--there's no "resource list" for a web site you visit, just pages referencing their dependencies).

The caching/offlining scenario can currently be handled by setting up a ServiceWorker and requesting the primary resources (currently via <iframe>s) to populate a publication cache--as can be seen in this copy of Moby-Dick. However, long term, the process would likely be handled similarly to what's described in the Resource Hints spec.

In either case, anything not in the reading order and not referenced from within a primary resource would not be considered part of the publication.

BigBlueHat commented 6 years ago

this could trigger the download of very large resources (HD videos)

One instance of handling "very large resources" on the Web is the <source> element. It describes dimensions, encodings, etc, and leaves it to the browser to determine (from it's usage context) what the best resource for the scenario would be. Using that (and imaginable things like it), publishers would have the option to create multi-modal publications leaving the determination of "very large" to the browser to determine from within the usage context--vs. the publisher publishing multiple publications based on various "sizes."

Additionally, there is an in-progress spec for adding "HTTP Client Hints". These headers would again allow browsers to make determinations about resource loading based on a combination of the headers and the usage scenario.

Point being, determining "very large" is best kept up to the device (and software on it) based on the contextual knowledge of device space, screen size, etc.

or resources that are useless for rendering (analytics scripts) that would otherwise have been excluded by the author when caching and/or packaging a publication

Caching processes can be limited (currently) via ServiceWorker scope such that requests outside of that scope (in my understanding) is not handled by Handle Fetch routine. Presumably, something like that would be in place to limit what gets cached.

Alternatively, more descriptive approaches such as Cache-Control headers, Content Security Policy, or <iframe sandbox> (depending on the scenario), etc. could be used to further prevent offlining stuff the publication doesn't need (or can't use in an offline context).

Lastly, I envision the packaging process to be somewhat similar to caching. In the case of Google's Web Packaging format, the whole thing is stored as a set of HTTP exchanges--so presumably you'd have to make all those requests to gather the request/response pairs to store in the HTTP exchange bundle. Internally, the format is very similar to the HTTP Archive format--which can be output from a browsers dev console.

Additionally, other webby formats like MHTML (current supported in Chrome) and the TAG's upgrade of MHTML also use HTTP headers to express Content-Type, Content-Location (the URL of the contained item on the Web), etc, such that building one of those would look similar--i.e. recording the response from the Web into the package.

However, until Packaged Web Publications is farther along there's no immediate requirement (that I know of) to address packaging concerns explicitly in the Web Publication design and architecture.

BigBlueHat commented 6 years ago

to cache or package a publication, you need to render every single resource from the reading order, which is going to be very slow and CPU+memory intensive (we've experimented with background rendering in Readium-2 and limit things to only 3 resources)

There's no requirement to "render" anything--just to make the related HTTP requests and cache (or potentially package) the responses.

Additionally, since the dependencies are expressed from each primary resource (i.e. their relationship is known), then the UA could potentially offer the user the option to cache only part of the publication.

Whereas, if the list of resources is exhaustive and contains both primary resources and dependencies in a single list (with no stated relationship between them), then the UA could not offer that option because there'd be no way to determine that relationship from the exhaustive list.

BigBlueHat commented 6 years ago

not all UAs may be able to intercept network requests to cache or package them, this would exclude native mobile apps for example from supporting WPs properly

In as much as I'm expecting any supporting Web Publication UAs to be Web-connected (at least at first request of the publication address), then all such UAs would have the ability to: a) retrieve the resources as needed (per HTML on the Web) and b) store them in a cache (or package) to display should the network become unavailable.

BigBlueHat commented 6 years ago

UAs won't be able to do intelligent preloading on their own (for instance by loading fonts in cache in advance)

The entry page could prefetch or preload dependencies if that optimization is desired.

This is another case where the intention of the author/publisher would be expressed more clearly than an exhaustive list of primary resources and dependencies. In the exhaustive list case there's no expression which items are of greater importance (or larger size, etc). However, in this more contextual case the desire to prefetch/preload are expressible and expressible throughout the publication (i.e. prefetching a font from chapter4.html that's needed for the rest of the publication).

HadrienGardeur commented 6 years ago

The reading order (however it's created/expressed) points to the primary resources, which in turn state what resources they depend upon. so, the way you would "indicate that a resource is part of the publication" is by referencing it from a primary resource (i.e. ,

This is only true for some of the resources, others may be behind an <a> element (non-linear resources in EPUB) or could be dynamically fetched using JS. Simply pre-rendering and catching network requests to cache them using a Service Worker won't work in such cases.

One instance of handling "very large resources" on the Web is the element. It describes dimensions, encodings, etc, and leaves it to the browser to determine (from it's usage context) what the best resource for the scenario would be. Using that (and imaginable things like it), publishers would have the option to create multi-modal publications leaving the determination of "very large" to the browser to determine from within the usage context--vs. the publisher publishing multiple publications based on various "sizes."

This opens the door to a different set of problems related mostly to packaging. There are widespread differences for example in audio/video formats supported by various browser on the Web. By packaging a publication the way you describe using browser A, I could end up with a package that won't work properly on browser B.

(I know that you're mostly ignoring the packaging use case for now, but this impacts our technical decisions as well).

In such situations (HD videos available in multiple formats to support various browsers), an author might prefer to completely avoid caching/packaging such resources. The problem with your approach is that there's no way for the author to indicate that a resource shouldn't be fetched and cached/packaged aside from HTTP headers (I would also need to double check if such headers are handled properly if the resource is cached by a SW, plus HTTP headers only handle caching and not packaging anyway).

Alternatively, the author might prefer to have all versions of a resource packaged, to provide an optimal experience on every device. That use case is also not supported by your scenario.

Caching processes can be limited (currently) via ServiceWorker scope such that requests outside of that scope (in my understanding) is not handled by Handle Fetch routine. Presumably, something like that would be in place to limit what gets cached.

Are you suggesting that to provide caching properly, authors of WPs will need to write their own Service Worker?

There's no requirement to "render" anything--just to make the related HTTP requests and cache (or potentially package) the responses.

But to achieve such HTTP requests and cache, you rely on prerendering in the background and on a Service Worker that caches resources upon a successful fetch request.

If you can't do background prerendering (for obvious performance reasons) or Service Workers are unavailable (current webview on Android), then the alternative option is to parse every single HTML document to extract a list of resources. As I've pointed out before, this is both:

very heavy and also CPU/memory expensive
unlikely to cover all the ressources that could be useful for a publication's resource

Additionally, since the dependencies are expressed from each primary resource (i.e. their relationship is known), then the UA could potentially offer the user the option to cache only part of the publication.

I don't think that this is currently listed under our use cases.

iherman commented 6 years ago

Additionally, there is an in-progress spec for adding "HTTP Client Hints". These headers would again allow browsers to make determinations about resource loading based on a combination of the headers and the usage scenario.

whilst this is correct, the practical reality is that 99% of authors/publishers/etc. will have no way of influencing the HTTP headers returned by a server to the clients. I do not think we should base our specs on such headers to avoid pulling in possibly large files.

iherman commented 6 years ago

A resource list is not only relevant in terms of managing caches, packaging, etc. There are affordances that rely on these, too: e.g., a client side search should not be required to search through all references in a content, but only through those that constitute the Web Publication.

Beyond all the efficiency, etc, issues: In more generic terms, a WP is actually defined by the list of resources that it contains: that is what makes it a WP in the first place, as opposed to an average Web page. It determines the scope of various metadata, for example. "Just" relying on extracting all the links in the entry page and declare them to be the list of resources is going against the very definition of a WP.

BigBlueHat commented 6 years ago

This is only true for some of the resources, others may be behind an element (non-linear resources in EPUB) or could be dynamically fetched using JS. Simply pre-rendering and catching network requests to cache them using a Service Worker won't work in such cases.

This scenario is why I'm an advocate for keeping HTML Imports for descriptive (non-JS-based prescriptive) references. Additionally, things like the prefetch Resource Hints in <link> headers could also be used to descriptively reference these resources.

Essentially, the underlying premise is to reference resources as close to their actual use as possible, and in such a way that requests for those resource can be optimized from an understanding of their use. As you noted, that is currently unclear if one simply uses an unrefined <a> or hidden fetch code in a JS script--i.e. the relationship between the document (or script) is either unclear or unknowable (without processing a script).

By packaging a publication the way you describe using browser A, I could end up with a package that won't work properly on browser B.

(I know that you're mostly ignoring the packaging use case for now, but this impacts our technical decisions as well).

Much of this will depend on how and what does the packaging. If using one were attempting to package a Web page now which referenced multiple videos (or images or audio) of varying size and quality, it would be up to the packaging software (and the package format it intends to output) which of those resources were added to the package based on similar heuristics used by browsers at request time.

The expression of the available options is provided in context and with a refined expression which includes sizes, etc. which can be used by a packaging tool (or browser) to make those determinations given its intended output.

A singular "resource list" will either lack that information (as wpub does now) or the addition of such information will ultimately look like aggregating all these in-context HTML expressions into that list--which then ultimately will look rather similar to lumping all the HTML references (with the sizes, srcset, and media style information) into a single file.

Are you suggesting that to provide caching properly, authors of WPs will need to write their own Service Worker?

Not long term, no. It's what's currently available and (consequently) can be used to inform our thinking when we look toward getting this work into the browsers. The important part of that comment was that "requests outside of that scope (in my understanding) is not handled by Handle Fetch routine. Presumably, something like that would be in place to limit what gets cached."

Point being that we may also need (regardless of how we express the structural properties) something similar to the scope property as seen in ServiceWorkers (for Fetch handling) and in Web App Manifest (for navigation limiting). In both those cases, that's how the "edge" of a Progressive Web App is being defined.

the alternative option is to parse every single HTML document to extract a list of resources.

True. In order to "gather" the entire publication, each primary resource's references must be considered/processed/parsed in order to retrieve the entire publication (into cache, etc).

The exhaustive "resource list" approach doesn't guarantee that one can avoid that, however. The exhaustive list also raises concerns of things potentially being (or becoming) out of sync during the publications history (i.e downloading resources one doesn't need or not downloading resources one does need).

Ultimately to make the exhaustive list, something somewhere is going to have parse every single primary resource, gather it's dependencies, and put them in that list. That "something" might be a human at a keyboard, or it might be some Python script using BeautifulSoup, or it might be a browser-based editor of some kind. Regardless, in the case of the exhaustive list, any reference made from any primary resource which is needed for the experience of that primary resource MUST be recorded in that exhaustive list.

If something isn't recorded in the exhaustive list, then...what happens? Invalid cache? Fallback to the network (i.e. use the Web)?

I don't think that this is currently listed under our use cases.

I'd been seeing this as a potential application of Random Access to Content--especially for massive works like textbooks where only a few chapters/sections may be needed offline at a time.

In the exhaustive list case, the entire publication would have to be available/considered/cached before Random Access could be provided.

In the case of a textbook from which a student only needs/wants to offline "Chapter 4" (and it's dependencies), it's not clear how the UAs would provide that that Random Access request without either: a) the overhead of caching the whole publication b) or without ignoring the exhaustive list and taking the "gathering" approach.

Ultimately the primary resources themselves become authorities over their own content and dependencies.

BigBlueHat commented 6 years ago

whilst this is correct, the practical reality is that 99% of authors/publishers/etc. will have no way of influencing the HTTP headers returned by a server to the clients. I do not think we should base our specs on such headers to avoid pulling in possibly large files.

Let's not guess at statistics. :wink: I also wasn't suggesting we "base our specs on such headers" just that they exist (or may/will) and subsequently will be considered by the browser when making any of these requests (see Content Security Policy for one such example).

Consequently, we shouldn't ignore their existence--as any deployment of these things will ultimately have to work within them, and conversely should make use of them wherever it makes sense for our use cases.

Any spec we bring to a browser vendor will be considered in light of the existing (and potential) Web Platform specifications. HTTP Headers are (increasingly) a big part of that puzzle.

BigBlueHat commented 6 years ago

A resource list is not only relevant in terms of managing caches, packaging, etc. There are affordances that rely on these, too: e.g., a client side search should not be required to search through all references in a content, but only through those that constitute the Web Publication.

This is what the "binding" is about. It provides a linear progression of resources which are "bound" into a (new) thing called a Web Publication. Those resources can then be (in some lovely future): searched, offlined, etc.

A resource list will be built in either of these cases. It's just the "how" and the "when" that's in question--afaict.

"Just" relying on extracting all the links in the entry page and declare them to be the list of resources is going against the very definition of a WP.

I've never suggested we "[extract] all the links in the entry page," but rather narrowed them (as our spec currently does) to a defined area of expression (currently <nav role="doc-toc"> in our fallback reading order system). Given that's currently in our spec as a way to create the reading order, I don't think we can say it's "going against the very definition of a WP."

However, we can certainly say that the approach of gathering the resources vs. listing them exhaustively are different approaches to the same problem--both of which have their consequences (good and ill).

iherman commented 6 years ago

@BigBlueHat I just try to understand. Is your proposal that, instead of listing the resources in the JSON part, you would use something like

<nav role="doc-something">
 <li><a href="resource1"></a></li>
 <li><a href="resource2"></a></li>
</nav>

and that (and only that!) defines the list of resources?

GarthConboy commented 6 years ago

And what if resource2 needs a script or references an image? Each of which would need to be in said list of required resources... we can't require a crawl to generate a resource list. We wouldn't know where to stop. I think a non-explicit or non-exhaustive resource list should be considered a non-starter.

BigBlueHat commented 6 years ago

@iherman as long as everyone's clear about the "something like" part--meaning, there's room for (and should be more!) exploration here. But in short, yes.

The proposal is that HTML is the best place to put resource relationships because the mechanics of doing that are already defined (and we can build up from those), that they'll be used regardless (see the failure scenarios mentioned above), and moving these structural properties into JSON brings a less webby model into play upon the Web.

BigBlueHat commented 6 years ago

And what if resource2 needs a script or references an image?

Then (as happens now) a request for resource2 (either directly or via some process) retrieves those dependencies.

Each of which would need to be in said list of required resources...

In the exhaustive list case, they would have to be listed in both the exhaustive list and referenced from the primary resource which depends on them.

we can't require a crawl to generate a resource list.

There's no "crawling" (at least not in the open ended since used here). There are two things:

primary resources -- referenced from the reading order
dependencies -- reference from the primary resource which depends upon them

The "gathering" of All The Things (if/when needed for a use case), would happen by requesting the primary resource(s) needed and then requesting their resources--which is how loading a Web page works now. If one were to collect a specific list of Web pages together (no crawling! just that list) right now, this is exactly the process one would use.

We wouldn't know where to stop.

We do know where to stop. We only get the primary resources and their referenced dependencies. We don't "crawl" random, inline <a> tags, etc.

For instance, if the Moby-Dick reading order reference an "About the Author" page which in turn linked (inline) to the Herman Melville Wikipedia page, there would be no expectation that the "gathering" process should collect that Wikipedia page--since that Wikipedia page was not included in the reading order.

I think a non-explicit or non-exhaustive resource list should be considered a non-starter.

The list of primary resource is explicit and exhaustive--in that it defines the primary boundary of the publication in a linear fashion. The dependencies are the thing in question.

Defining an exhaustive list which contains both primary resources and dependencies (or even just a list of dependencies) seems likely to:

fall out of sync with the publication
does require duplicative effort and maintenance overhead
does not sufficiently model things in a way that UAs might strategically handle narrower requests (i.e. caching just "Chapter 4")
nor does it clearly provide the semantics to avoid the "large file" scenarios expressed earlier (as at least currently, we've not discussed a "resource list" which would contain the same level of "expressiveness" as HTML currently has on hand).

Those would be (some at least) of my "cons" for the exhaustive list approach (echoing the "cons" listed earlier of the "gathering" approach).

If there are specific scenarios of use not yet expressed here, it'd be great to hear more about what's informing the "non-starter" thoughts. As yet, all of these thoughts/comments assume consumption of a Web Publication via a Web browser (in the broadest since of the term).

deborahgu commented 6 years ago

I just wanted to add a clarifying point of this discussion about why, to my perspective, some of the contributors in this and other tickets seem to be talking past each other. It might help clarify the points of contention.

(Obligatory disclaimer: I take no stance on the following dichotomy.)

@BigBlueHat and others are fundamentally seeing a WP as a way of representing a publication on the web, with the only two starting points being "the content of the publication" and "the web."
- They are not using previous digital representations of publication in HTML (i.e. EPUB) as a jumping off point.
- They are not thinking of PWP (packaged WP) as a product which will be created in a pipeline from WP.
- They do not see an equivalency between WP and unpackaged EPUB 3 (or 4).
@HadrienGardeur and others are fundamentally seeing a WP as a way of representing a publication on the web, starting with our existing fundamentals
- They are using EPUB as a jumping off point
- They do think of PWP as a product in the same production pipeline as WP
- They do think of a connection between WP and unpackaged EPUB 3 (4).

As I said, I take no stance on either of the sides in this dichotomy, except to say that, as Ivan pointed about above, we actually have documented affordances and use cases. If we think of the two sides in their most extreme representation (which nobody actually is advocating for) as "a WP is just an exploded EPUB viewable in a browser" vs. "a WP is just a collection of HTML pages with some metadata and chapter navigation", both of those would fail to serve our use cases and affordances completely. (As I think everyone would agree!)

I don't think there's a fundamental conflict between these two points of view. But I do think it's worth clarifying them, because they are the source of so many of these contentious issues.

dauwhe commented 6 years ago

Just as a little experiment, I made an EPUB that did not list secondary resources—CSS, images, etc— in the manifest. This is obviously illegal, and the EPUB rightly failed validation. But it worked perfectly in the three reading systems that I tried. Of course this doesn't prove anything. But the entire web operates without such lists. Web applications don't need such a list. Having such a list involves costs, and we shouldn't immediately assume that the costs are negligible, or that the benefits are so obvious they don't need to be stated.

GarthConboy commented 6 years ago

As much as I love @dauwhe ... such an EPUB would not work on, say, Google Play Books. And, I view this resource list as required for packaging, or for that matter off-lining, so the fact that it works on the Web or some EPUB RS, doesn't sell me. :-)

dauwhe commented 6 years ago

On the bright side, you inspired me to install Google Play Books! And yes, it won't open my little experiment. Does it run epubcheck on upload?

Now I have to go write a demo of packaging without an explicit resource list :)

GarthConboy commented 6 years ago

@dauwhe yes we do run epubcheck on upload. But that just an early detection... we wouldn't serve un-manifested resources.

"Now I have to go write a demo of packaging without an explicit resource list" – be prepared to let it run for awhile, as you might end up with the whole Web! :-)

HadrienGardeur commented 6 years ago

I think it's also worth pointing out again that the "list of resources" is an optional infoset item.

It's up to the author to decide what should or shouldn't be listed in there. Unlike EPUB, there won't be validation involved for that list (EPUB requires authors to list every resource from the package in <manifest>). By including an item in that list, you're simply making sure that the UA will eventually cache/package it along with the resources listed in the reading order.

If you don't want to be bothered by providing such a separate list, that's fine. But you can't expect the UA to be able to guess everything for you in that case, because there are many reasons that this could fail.

BigBlueHat commented 6 years ago

"Now I have to go write a demo of packaging without an explicit resource list" – be prepared to let it run for awhile, as you might end up with the whole Web! :-)

That's not how it's ever worked, @GarthConboy, so I'd appreciate it not being restated as if it were a possibility. It's only causing fears and concerns about things that don't happen. Thanks.

The technical response posted earlier hopefully makes it clear that the approach which I've proposed is not in danger of crawling the whole Web: https://github.com/w3c/wpub/issues/198#issuecomment-389571549

BigBlueHat commented 6 years ago

By including an item in that list, you're simply making sure that the UA will eventually cache/package it along with the resources listed in the reading order.

If you don't want to be bothered by providing such a separate list, that's fine. But you can't expect the UA to be able to guess everything for you in that case, because there are many reasons that this could fail.

Wait...so it's not exhaustive? The resource list is currently defined as "all resources":

The resource list enumerates all resources that are used in the processing and rendering of a Web Publication (i.e., that are within its bounds).

If it is instead merely additive (i.e. explicitly stating that the UA not forget to get that resource), then I've far less of an issue with it--however there's still concerns (for me) around moving structural/resource-loading semantics out of the HTML (which is by definition for such constructions) and into a new place.

@HadrienGardeur could you clarify whether you believe an exhaustive list is a requirement, or that there's simply a need for something to reference dependencies which should "not be forgotten" (or perhaps a list of things that "must not be gotten" 😉)?

Thanks.

HadrienGardeur commented 6 years ago

It's always exhaustive in the sense that this list + the reading order taken together are responsible for establishing the boundaries of the publication.

But you certainly don't have to list all the resources in there (for instance, you would definitely avoid including your Google Analytics script or similar resources that won't work offline+packaged), it's up to the author to decide which resources are part of the publication or not.

mattgarrish commented 6 years ago

so I'd appreciate it not being restated as if it were a possibility

But it is a possibility. The reading order is much less stringent in its requirements right now than the resource list, so you can't rule out that hyperlinked resources are part of the publication if the resource list becomes optional. Of course, any sensible user agent wouldn't actually crawl them. I think the point is more that this obscures the bounds of the publication.

I recall we already went round and round this problem back when we were having fun with primary and secondary resources and their relation to the reading order and resource list. Is it that hard to find a compromise between the two extremes: you must list all primary/top-level resources in the resource list, but should list all dependencies if you don't want to risk a user agent failing to properly cache (or whatever) your publication. You should also list any resources that might not be easily determined by inspection (script-necessary files, etc.). And maybe flag complete/incomplete lists so user agents know which need processing. Sort of a return to: https://github.com/w3c/wpub/issues/22#issuecomment-321081579

The answer is probably different for EPUB, where listing all resources becomes a requirement and probably not an unreasonable ask.

BigBlueHat commented 6 years ago

@mattgarrish what you expressed in https://github.com/w3c/wpub/issues/22#issuecomment-321081579 is spot on, and describes precisely what I seem to have been failing to express. Thank you, Matt! 😄

What Matt said in https://github.com/w3c/wpub/issues/22#issuecomment-321081579 also matches what I'm seeing in @iherman's examples where he's only referenced resources which were not already listed in the spine/ToC:

https://github.com/w3c/wpub/blob/master/experiments/manifest_script/mobydick.html#L22-L28 https://github.com/w3c/wpub/blob/master/experiments/separate_manifest/mobydick.json#L30-L37

Additionally, I understand now (based on the comment above and a quick read on non-linear content in EPUB), that the "crawling" concerns were probably related to the following text from the EPUB 3.2 spec (and similar text in it's predecessors):

Authors MUST provide a means of accessing all non-linear content (e.g., hyperlinks in the content or from the EPUB Navigation Document).

With that as the backstory, I now understand @GarthConboy's concerns. 😃

GarthConboy commented 6 years ago

Indeed... I was trying not to type the dreaded linear="no", but that's what I had in mind. :-) Also content required/referenced by included scripts, which is largely unknowable.

iherman commented 6 years ago

We may, I think, conflate three problems (at least they are mixed in my mind):

What should the information set include explicitly for the set of "secondary" resources? By "secondary" I mean not the primary (for the sake of simplicity, HTML) content, but CSS/JS files, auxiliary data files related to a scientific discourse, images, videos, etc.
How should these information set items be collected: either by listing them as explicit information items using some serialization or by crawling through the primary resources and get hold of the links that are destined to be part of the final information set.
if the items are to be listed explicitly, whether this "list" is in some type of HTML element (like https://github.com/w3c/wpub/issues/198#issuecomment-389552429) or as part of a JSON serialization.

These affect the "offline" aspect of the WP (whether offline temporarily/locally via a cache or a package) but also affordances.

The relevant section in the current draft seems to give an answer to (1). Is there any reason to re-open that draft, or should be taken as granted for now? Has there been any new evidences that would warrant to reopen (1)? I do not think so.

Ie, we should concentrate on (2) and (3). Note that the draft also says:

If a user agent encounters a resource that it cannot locate in the resource list, it MUST treat the resource as external to the Web Publication.

which means that there may be references in the primary document that are not part of the final resource list (as it should be, imho), which makes (2) above fairly unclear.

GarthConboy commented 6 years ago

Indeed. I think it's pretty clear that there is a required list of Primary Resources (default reading order). It seems to me that we need to fully decide whether a WP MUST be offline-able/package-able, or if it MAY be (as in up to the author). The section of the current draft than @iherman pointed to above implies this is a MAY ("it is strongly RECOMMENDED to provide a comprehensive list of all of the Web Publication's constituent resources"), which seems to imply that supplying the list of secondary resources may be optional. I tend to lean more to a MUST. But, if we affirm this, one way or anther, we can decide Secondary Resource list question.

css-meeting-bot commented 6 years ago

The Working Group just discussed https://github.com/w3c/wpub/issues/198.

The full IRC log of that discussion

<dauwhe> Topic: https://github.com/w3c/wpub/issues/198
<dauwhe> Github: https://github.com/w3c/wpub/issues/198
<dauwhe> garth: there's a primary reading order, but what about the rest of the issues. Must they specified fully?
<dauwhe> ... there are some comments in there that I like, from Ivan
<dauwhe> ... the list of secondary resoources should be those needed for offlining
<duga> q+
<dauwhe> ... the Q is whether that list of secondary resources is required. MUST all web pubs be offlinable/packageable?
<josh> q+
<dauwhe> ... so the secondary list is author-optional?
<garth> https://w3c.github.io/wpub/#wp-resource-list
<dauwhe> garth: it does have the statement that it is strongly recommended to supply a list of all resources
<dauwhe> ... that's a may, not a must.
<garth> q?
<dauwhe> ... if that's really what we mean that drives the issue
<dauwhe> duga: we can discuss this again, but this exact question has been asked
<Hadrien> I think the current draft is fine
<dauwhe> ... the very clear answer from the group is that not all publications can be cached/offlined
<dauwhe> garth: I don't love it, I can live with it
<Hadrien> q+
<dauwhe> ack duga
<dkaplan31> q+
<ivan> q+
<dauwhe> duga: I'm not necessarily in the camp, but I asked the q and the answer was clear
<garth> ack josh
<dauwhe> josh: I'm not entirely clear how the requirement for a resource list factors into packagability
<dauwhe> ... even a non-offlineable publication could have a list of constituent pieces
<dauwhe> ... but I wanted to weigh in that I feel strongly that WP 'may" be packagable
<dauwhe> ... there are a lot of publications where it would be impractical to package
<dauwhe> ... and we don't want things to be automatically packageable
<garth> q?
<dauwhe> ... we want to tag things as unpackagable, possibly with a license
<dauwhe> ... in terms of offlinable, I feel less strongly
<dauwhe> ... it might be a worthwile challenge to say they must be offlinable
<dauwhe> .... what that means to me is that they are desgined so that if you have minimal bandwidth, then a minimal amout of data to start the publication, and it doesn't lock up if you go into the proverbial train tunnel
<dauwhe> ... perhaps without videos etc
<dauwhe> ... but at least it doesn't "lock up" when it loses connection
<dauwhe> garth: let me answer the first thing
<dauwhe> ... my view is that the exhaustive list of secondary resources is required to make things offlinable
<dauwhe> ... or packaged
<dauwhe> ... if that list is missing
<garth> q?
<dauwhe> ... then you're going thru the list of primary, then web browsers do know how to get the associated resources
<dauwhe> Hadrien: the current spec language is fine
<dauwhe> ... I don't agree with what you said
<dauwhe> ... it's possible to have publications with everything in html
<dauwhe> ... with CSS inline, images as base64
<dauwhe> ... so I don't think we should tie ability to offline to a list of resources
<dauwhe> garth: then you don't need a list of secondary resources
<dauwhe> ... I agree
<garth> q?
<garth> ack Hadrien
<ivan> ack Hadrien
<dauwhe> Hadrien: basically the list of resources is not what is going to indicate if a WP is packagable
<dauwhe> ... you can always package or cache the primary resources
<dauwhe> ... but if your WP depends on JS, css, etc, and you don't include those in the list of resources, than this can affect the quality of the experience
<dauwhe> ... this is not similar for all publications
<dauwhe> ... some will heavily rely on JS, and if you don't include JS they will break
<dauwhe> dkaplan31: I want to say a variant
<dauwhe> ... this is a controversial thing
<dauwhe> ... we shouldn't just say let's agree
<dauwhe> ... we are conflating too many things
<dauwhe> ... it's hard to differentiate between packaging and offlining
<dauwhe> ... packaging is not the same as rights management
<dauwhe> ... packaging and piracy are separate
<dauwhe> ... it's important to understand what secondary resources are
<dauwhe> ... in some cases ALL CSS and JS are necessary for the publication
<dauwhe> ... if they are necessary they're not secondary
<dauwhe> ... if the publication is not usuable without the resource, then there's an arguement that it's not secondary
<garth> q?
<duga> q+
<garth> ack dkaplan
<dauwhe> ivan: I won after all :)
<garth> ack ivan
<dauwhe> ... we have to be careful to distinguish between offline and packaging
<dauwhe> ... these are two different things
<dauwhe> ... we may have to have yet another entry in our infoset which says does the author allow offlining or packaging
<dauwhe> ... if we decide there are non-offlinable WPs, we must state this, we cannot deduce this from magic
<dkaplan31> so to clarify:
<dkaplan31> 1. Let's have an up/down vote on language if we're deciding today, not a quick silence = consent
<dkaplan31> 2. Packaging != offlining != rights management
<dkaplan31> 3. If a resource is necessary, it's not secondary. If it's necessary, it needs to be listed. If it's not necessary, it doesn't need to be listed.
<dauwhe> ... I would prefer we say every WP is at the minimum offlineable
<dauwhe> ... and the author should say this explicitly
<dkaplan31> I agree with Ivan, re: offlineable. +1
<Bill_Kasdorf> Note that there is a difference between "allowing" offlining and "enabling" offlining.
<dauwhe> ... the list controls what goes into the offline version
<timCole> q+
<dauwhe> ivan: I think every web publication is offlineable, but it's up to the author to say what's in the offline version
<dauwhe> garth: I think with the language that the full list is recommended but not required
<dauwhe> ... that means the pub may or not be fully offlineable
<dauwhe> ivan: what do you mean?
<dkaplan31> I don't agree with that, Garth.
<ivan> q+
<dauwhe> garth: if such a list of 2ndary resources isn't there, then offline would get primary resources plus their direct links, which might not be enough
<garth> ack duga
<josh> +1 (please God, no DRM)
<ivan> +1 to duga
<dauwhe> duga: I want to remind people that DRM is out of scope, and we shouldn't worry about it
<dkaplan31> duga++
<laudrain> +1
<dauwhe> timCole: we did have a converstation that relates to offlinability and caching
<timCole> https://github.com/w3c/wpub/issues/183
<dauwhe> timCole: that doesn't get all the way to DRM, but in browsers you might say that something shouldn't be cached because it changes too quickly
<dauwhe> ... we also haven't defined what offlining means
<garth> q?
<ivan> ack timCole
<garth> ack timCole
<dauwhe> ... so I don't agree with Ivan that everything should be techncially offlinable, that might be too much
<dauwhe> garth: I'm gonna paste something in that Ivan might disagree with
<garth> “Is an exhaustive "resource list" required to create a Web Publication?
<garth> No. Such an an exhaustive list may be needed to make the WP fully offline-able or package-able. But, an exhaustive list of resources required (beyond the primary reading order) is not required, as it is up to the author whether a WP is fully offline-able or package-able.”
<dauwhe> ... the issue: is an exhaustive list required? I'll propose above as resolution
<dauwhe> ivan: you said something that worries me, garth
<dauwhe> ... you said you take the primary resources offline, and the CSS and etc... that is pandora's box
<dauwhe> ... there are things I didn't mention explicity that are offlined
<dkaplan31> q+
<dauwhe> garth: I'm for requiring the list
<garth> q?
<dauwhe> ivan: I don't think we need a decision
<duga> q+
<dauwhe> garth: the spec says the list isn't required
<dauwhe> ivan: let's not go into linguistic analysis
<garth> q+ Hadrien
<garth> ack ivan
<dauwhe> ... it just meant that the resource list may be a selective list that are used by WP but selected by what can go into a cache and what can't
<dauwhe> ... this is not easy
<ivan> ack dkaplan
<dauwhe> dkaplan31: i agree with ivan
<dauwhe> ... we are not stating about packaging and offline
<dauwhe> ... this is a Q about resource list
<dauwhe> ... offline and packaging aren't in the scope of the next 11 min
<dauwhe> ... but it's legitimate to say what that list of resoruces will enable
<dauwhe> ... what is in the resources defined what could be cached, what could be offlined, what could be packaged, what could be preloaded
<dauwhe> ... here are affordances provided by a such a list. if a resource isn' tin the list, then these things aren't possible
<dauwhe> ... this is just a minimum requirement
<garth> ack duga
<dauwhe> duga: +1 to understanding what this list is for
<dauwhe> ... if CSS is not listed in 2ndary list, am I forbidden from downloading it?
<dauwhe> ... one reason we didn't require resources to be in this list is scripting
<jbuehler> +1 dauwhe beyond scope of next 10 min - I need to think about this more myself
<garth> q?
<dauwhe> ... where it might not be possible to determine what resources are used by script
<garth> ack Hadrien
<bigbluehat> q+ to ask for clarity around how exhaustiveness (related to the "why do we need this list?" questions)
<dauwhe> Hadrien: a comment on terminolgy
<dauwhe> ... i think primary and secondary are confusing
<dauwhe> ... we're talking about reading order and list of resources
<dauwhe> ... we need to be careful
<dauwhe> ... talkign about offlining is confusing.
<dauwhe> ... there are many ways to do that.
<dauwhe> ... we should talking about caching and packaging
<dauwhe> ... packaging is a way of offlining, too
<dauwhe> ... I think default reading order and resources are two lists where we ahve expectation of user agents
<dauwhe> ... we expect UA to do something more, put them in package, to have a proxy to intercept requests
<garth> q?
<dauwhe> ... this is what we should be discussing
<dauwhe> ... what the UA should be doing that the web doesn't do now
<dauwhe> ... everyhting else should be like the web
<dauwhe> ... if we have image with cache-control headers, it might still work offline because of caching even when it isn't in the list
<dauwhe> q?
<timCole> +1 Hadrien
<garth> https://www.w3.org/publishing/groups/publ-wg/Meetings/Minutes/2018/2018-05-07-pwg.html
<bigbluehat> +1 to minutes
<dauwhe> minutes approved
<ivan> Resolved: last meeting's minutes approved
<garth> q?
<dauwhe> ack bigbluehat
<Zakim> bigbluehat, you wanted to ask for clarity around how exhaustiveness (related to the "why do we need this list?" questions)
<garth> ack bigbluehat
<dauwhe> bigbluehat: the Q that came up as to what H said, why do we need the list...
<dauwhe> ... working out the scenarios for its use
<dauwhe> ... currently it's recommended, but now we say that it has to include all the primary resources, and we've doubled everything
<dauwhe> ... we also need to define exhaustive
<dauwhe> ... and we haven't repeated existing components
<dauwhe> ... it would be good to work thru the "why is this here" stuff
<dauwhe> garth: I'm giving up on my fantasy of closing this issue
<dauwhe> ... we do have agreement that the primary reading order is required
<dauwhe> s/primary/default/
<bigbluehat> +1 to not restating stuff
<garth> q?
<dauwhe> ... I would envision the resource list is stuff beyond the default the reading order, so we don't restate stuff
<bigbluehat> (as in not restating stuff in the resource list)
<Hadrien> +1 to avoid redundancy between default reading order and list of resources
<ivan> +1, too
<laudrain> +1

GarthConboy commented 6 years ago

Comments from WG call:

-- Provision of such an exhaustive resource list (resources beyond those in the required default reading order) is, indeed, RECOMMENDED (per current spec). And this likely should be considered settled.

-- Some agreement around that if the exhaustive list of resources is provided, it should somehow not duplicate the default reading order -- it should be those additional resources, to avoid duplication.

-- It was pointed out that publications can be created such that no such exhaustive resource list is required for the WP to be offline-ed or packaged (all resources bundled in with those in the default reading order). But, such an exhaustive list may be required for full/correct offline-ing or packaging.

Garth's proposal for this issue remains:

"Is an exhaustive "resource list" required to create a Web Publication? No. Such an an exhaustive list may be needed to make the WP fully offline-able or package-able. But, an exhaustive list of resources required (beyond the primary reading order) is not required, as it is up to the author whether a WP is fully offline-able or package-able." (from the RECOMMENDED above)."

dauwhe commented 6 years ago

@deborahgu said that we need to talk more about how such a list would be used. I think this would be very helpful!

I'm also wondering if there's some confusion about what such a list makes possible. HTML has APIs to get a list of images (document.images) or stylesheets (document.styleSheets) associated with any HTML document. So if we have the default reading order, we can get all the images and stylesheets associated with those documents. Several of us have built example WPs that are packageable and/or cacheable without an exhaustive list.

EPUB had an exhaustive list of all those things, partly to help with validation, and partly so that a reading system could learn something about the resources without parsing the HTML documents. If we're making arguments about that, we should be explicit about them. And we should remember that in the early days of ebooks, many reading systems were stunningly underpowered.

We've also talked about how the exhaustive list defines the content of an WP. I'd like to see more detail about that. If I'm reading a WP, and encounter an image that isn't listed in the exhaustive list, what happens? If I then cache the WP, what happens? If I package it, what happens? If I click a link in an HTML document in the default reading order, which leads to a HTML document on the same origin which is not part of the default reading order, what happens?

mattgarrish commented 6 years ago

I would assume the primary purpose of the resource list is so that the reading system can establish when a user is within the scope of the publication. You don't need the secondary resources for that, but you also can't rely on the reading order in all cases we've discussed.

If you don't have a list of all the primary resources somehow between those lists, the publication would get "exited" whenever you navigate to an unlisted primary resource. The secondary resources are inconsequential to this process. It doesn't matter whether what is loading within the page is listed, as we're not going to change HTML rendering/security/etc.

As far as offlining and packaging go, having all the secondary resources would speed the process up, and be more accurate in some cases, but why put this requirement exclusively on authors when user agents are capable of performing the step?

I suspect that web publication authoring tools are going to give a complete list of resources and make the issue moot in a large number of cases, but I also think the simpler it is to create a web publication, no matter your choice of tools, the better.

I sort of question whether a user agent should ever rely on authors to get the list of needed resources correct, or should always be inspecting the primary resources to determine what is needed and might be missing. If we want reliability, depending on authors is not the best idea.

GarthConboy commented 6 years ago

In response to @dauwhe 's:

I'm also wondering if there's some confusion about what such a list makes possible. HTML has APIs to get a list of images (document.images) or stylesheets (document.styleSheets) associated with any HTML document. So if we have the default reading order, we can get all the images and stylesheets associated with those documents. Several of us have built example WPs that are packageable and/or cacheable without an exhaustive list.

and @mattgarrish 's:

As far as offlining and packaging go, having all the secondary resources would speed the process up, and be more accurate in some cases, but why put this requirement exclusively on authors when user agents are capable of performing the step?

I think it's clear that UA/RS' can, for simple publications, suss out list needed to cache/offline/package some WP's with a light crawl of the resources from the default reading order list – finding the CSS, images, and scripts referenced and including (only) those.

However, there are clearly cases where that can't be done – e.g., "required" content that is not in the default reading order (linear="no" content, for example foot or end notes content) or resources required by included scripts. So, I get back to such an exhaustive list (beyond the default reading order) SHOULD or COULD be provided... per my suggested resolution at the end of my previous comment.

dauwhe commented 6 years ago

@mattgarrish I agree that a list from the author of "supporting" resources such as fonts, CSS, and images is perhaps not terribly useful. Or at least not required for many use cases.

The interesting question is HTML that's not in the default reading order, but is linked to from the publication. If it's not part of the publication, then it's just like any other web link. But what if it's part of the publication? EPUB hasn't really solved this problem—just say linear=no during a call and listen to the groans.

What does it mean to be part of the publication but outside of the default reading order? This would certainly affect some affordances. "previous" and "next" controls would presumably be disabled. How would one return to the linear reading experience? Display such content as a modal? Provide a link back to where you opened the resource?

mattgarrish commented 6 years ago

But what if it's part of the publication?

That's why I don't think we can rely on the reading order being a useful listing of primary resources. It's also why the current prose only requires one document in the reading order. Not because some publications will only be one resource, but because people have expressed a desire to create publications that don't offer a linear progression by default but rely on other means, like following links.

What does it mean to be part of the publication but outside of the default reading order?

It has to mean nothing more special than that there is no automatic path forward, but it's one of a variety of things about bringing an epub reading experience to the web that is thoroughly confusing to me, too. If you go back to a linear document, and it assumes a different next document, what does that do to the browsing history? Assuming new tabs get spawned would make a mess of the link-based model.

HadrienGardeur commented 6 years ago

I'm going to repeat what I said during our last call:

from an infoset perspective, the default reading order and list of resources are both meant to establish the bounds of the publication (publication resources), all other resources are external to the publication
both lists are also meant to enable affordances in the UA and we have additional expectations from the UA about the resources included in both lists

Among some of these affordances and expectations:

to enable offline reading, we can expect the UA to cache both lists, ideally behind a proxy (could be a Service Worker) with a "network first, then cache" policy
when packaging is requested, the UA will fetch all resources in both lists and add them in a single package
when the user triggers a search, the UA will look for specific keywords in these resources or index them ahead of time to enable such keyword search
when viewing a resource included in the default reading order, the UA will provide affordances to move forward or backward in the reading order

As we can see, caching and packaging are only two affordances among many others.

HadrienGardeur commented 6 years ago

It has to mean nothing more special than that there is no automatic path forward, but it's one of a variety of things about bringing an epub reading experience to the web that is thoroughly confusing to me, too. If you go back to a linear document, and it assumes a different next document, what does that do to the browsing history? Assuming new tabs get spawned would make a mess of the link-based model.

I think we can ask such questions to the Edge team (cc @BCWalters ) since they've already addressed those issues for their reading mode (which covers resources from the Web, EPUB and PDF).

As seen in their presentation in Berlin, the reading mode has its own affordance for moving forward/backward in the reading order but they also keep the URL bar and the back button in there as well.

iherman commented 6 years ago

@dauwhe

I sort of question whether a user agent should ever rely on authors to get the list of needed resources correct, or should always be inspecting the primary resources to determine what is needed and might be missing. If we want reliability, depending on authors is not the best idea.

I must admit I continue to be wary of the approach whereby the UA would crawl through all primary resources to gather the list of all resources. The existence of document.images or document.styleSheets obviously helps, but that does not solve everything. E.g., this would require parse each CSS files to see if there are font references (I would suspect that document.styleSheets includes the stylesheets imported via a @import in CSS), parse the HTML resources looking for videos, JS files, CSV files or other datasets (used by some interactive content in the document), etc. How would the UA know what to include and what not to include?

We may of course decide to include some sort of an "exclusion" list rather than an "inclusion" lists. Ie, instead of listing what the list of resources includes, we may require to list what is excluded when crawling the resources. However, we always have to see what the consequences of bad authoring would mean, and forgetting to explicitly exclude something may lead the addition of a full DNA dataset of several GB-s in the WP… Ie, I am not sure that should be a good idea either.

Alternatively, we may make a strong use of document.images or document.styleSheets and we would not require the author to include JS files and images in a resource list. That may simplify the authoring of many simple WP-s.

Another thing that worries me is the time it takes. While I realize that UA-s operate in much better environments than in the early EPUB2 days, fetching and parsing a whole series of HTML content just to gather what would be part of the WP is still a significant effort; parsing an HTML doesn't only mean parsing the syntax (which is significant already) but also building 2-3 different "trees" (DOM, CSS, Accessibility…).

mattgarrish commented 6 years ago

I must admit I continue to be wary of the approach whereby the UA would crawl through all primary resources to gather the list of all resources.

Since that quote was from me, I'll respond by saying it's not an approach I would take. But I'm not averse to there being a process whereby the user agent does the work, with a clear caveat emptor for anyone who takes advantage of it.

I also don't think a complete list of supporting resources is necessary if, as Hadrien says, you don't want those affordances. But we haven't addressed how an author specifies what a user agent can do with a publication. We'll need explicit metadata at some point.

More what I was wondering, though, is whether user agents are going to crawl the resources regardless of any stated completeness to discover whether there are any missing resources. If they have to do this for some, will they do it for all? We don't forbid this anywhere, but we also implicitly accept that if resources aren't listed they won't be put in caches or packages. Should we have both a global no-offlining and per-resource no-offlining instructions so that we don't forgo completeness of the resource list to achieve an unrelated need, and so user agents don't try to put these resources back in the list?

BigBlueHat commented 6 years ago

I would assume the primary purpose of the resource list is so that the reading system can establish when a user is within the scope of the publication.

Can you expand on what you mean by "within the scope of a publication?" In other specs, "navigation scope" is limited by a path prefix such that any navigation request made that does not contain that prefix causes the navigation to happen elsewhere (i.e. outside).

If you don't have a list of all the primary resources somehow between those lists, the publication would get "exited" whenever you navigate to an unlisted primary resource.

It sounds like you're wanting/expecting a similar sort of experience scenario where navigation is somehow limited within some view/container, and any navigation that happens must match a list of URLs or it opens elsewhere. Is that correct?

HadrienGardeur commented 6 years ago

It sounds like you're wanting/expecting a similar sort of experience scenario where navigation is somehow limited within some view/container, and any navigation that happens must match a list of URLs or it opens elsewhere. Is that correct?

In EPUB, a number of reading apps handle things similarly.

Here's how iBooks handles resources:

resources included in the reading order are displayed full screen with affordances to move forward/backward in addition to pagination.
resources that are part of the publication but not in the reading order ("non linear resources") are displayed in a modal view and they're not paginated
resources that are not part of the publication are displayed on a separate screen, within a webview that can be closed to go back to the publication

I'm not suggesting that this is how we should handle things as well, but it's worth knowing about the current behaviour for EPUB. I haven't tested Edge in such details yet, but I imagine that they also have their own take on this.

HadrienGardeur commented 6 years ago

More what I was wondering, though, is whether user agents are going to crawl the resources regardless of any stated completeness to discover whether there are any missing resources. If they have to do this for some, will they do it for all? We don't forbid this anywhere, but we also implicitly accept that if resources aren't listed they won't be put in caches or packages. Should we have both a global no-offlining and per-resource no-offlining instructions so that we don't forgo completeness of the resource list to achieve an unrelated need, and so user agents don't try to put these resources back in the list?

I'm not a fan of having "no-caching" or "no-packaging" directives in the manifest. They can be just as easily ignored and this breaks some of the promises of WP.

You're right that user agents may attempt to crawl resources and various reading modes (plus dedicated services like Pocket) probably have their own heuristics for that.

Crawling is not necessarily a bad thing though, it mostly depends what you do with those resources. If you simply preload them and then respect their caching directives, that's mostly harmless (although the process of crawling could be quite CPU/memory intensive). If you start caching them heavily and ignore HTTP headers, that's a completely different story.

In general, I think that UA should only deploy a proxy with a "network then cache" policy on resources listed in the default reading order or the list of resources. Everything else should at best trigger a GET request in the background and nothing more.

We haven't discussed yet how UAs should handle caching, mostly because we keep getting lost discussing "offlining".

iherman commented 6 years ago

Looking back at the thread, I am a little bit worried that we are complicating things too much. The simple model, whereby it is recommended that the author explicitly lists the resources (that are not in the reading order anyway) as part of the manifest seems to be simple and I do not see any major downside to it. Whether the UA does offlining/caching/whatever with those is not for the author to care about; it is up to the UA. The affordances, whenever appropriate, have a clear scope with those resources. And that is it...

The only easy expansion of this may be what I said earlier:

Alternatively, we may make a strong use of document.images or document.styleSheets and we would not require the author to include JS files and images in a resource list.

Meaning that these are supposed to be used (ie, considered to be part of the resources) by the UA automatically. (Provided that CPU/memory requirement of crawling is acceptable.)

HadrienGardeur commented 6 years ago

-1 to the idea of mentioning document.images or document.styleSheets at all

While we can't stop UAs from gathering resources in the background, we shouldn't push this forward as an acceptable alternative.

GarthConboy commented 6 years ago

I think we need to be wary of specifying RS implementation details.

I'll posit an (only slightly tuned) proposal for closing this issue:

An exhaustive "resource list" is not required to create a Web Publication. Such an an exhaustive list may be needed to make the WP fully offline-able or package-able or to enable provision of other affordances. Providing such a list of required resources beyond the default reading order is RECOMMENDED. However, it is ultimately up to the author whether a WP is fully offline-able or package-able or provides a RS/UA sufficient data to enable all desired affordances.

[Note, this is not really my preferred solution, but I think I have been convinced that changing the RECOMMENDED to REQUIRED is likely just not practical.]

mattgarrish commented 6 years ago

It sounds like you're wanting/expecting a similar sort of experience scenario where navigation is somehow limited within some view/container, and any navigation that happens must match a list of URLs or it opens elsewhere. Is that correct?

How the publication is visually manifested is more a secondary consideration. What I'm concerned with here is the idea of there being a publication state that transcends the resources.

It doesn't matter if that state actually persists in the background of the user agent and is checked against a list of URLs as each new resource is requested, or whether it is unloaded and reloaded with each resource and re-checked. Whatever the case, the state needs to be grounded in a concrete list of resources. Otherwise, what stops me from spoofing your publication simply by putting a manifest link into any malicious document I feel like? Without the bounds, the centre cannot hold.

mattgarrish commented 6 years ago

I'm not a fan of having "no-caching" or "no-packaging" directives in the manifest.

No, didn't mean to imply I'm a fan. And maybe I've missed where the discussion now is. I thought I saw that omission from the resource list was an acceptable practice for not caching/packaging resources, but maybe I was seeing things. If so, though, that would make validation and authoring of web publications much harder.

Similarly, no-cache would not be an effective means of declaring what to package/not package. Caching and packaging are not congruous concepts. I may not want a resource cached between page views, but I still need it in a package just as I need it to view the page properly. If we want a way to exclude resources, we may need to have something more explicit.

I thought we had a use case that would provide a way for authors to indicate that they don't want their publications offline-able or package-able? (Personally, I think offline should always be on the table.) It wouldn't provide any measure security against those actions being taken, of course, but at least conforming user agents would be expected to respect them.

tcole3 commented 6 years ago

Maybe we are approaching consensus? From the perspective of a UA, the relevant defining characteristic of a WP for this discussion is that it may be (typically will be) an aggregation of several HTML documents, media files, etc., rather than just a single HTML document or file. The reading order and when necessary a non-exhaustive list of essential 'non-linear' resources must be enumerated sufficient to establish the boundary of a WP.

On the other hand it sounds like we do not want an 'exhaustive' list of all resources (are we agreed about this yet?). In fact, given the ample number of specs that tell a UA how to deal with (e.g., how to cache, stream, etc.) HTML documents, media files, etc. I would prefer that WP authors be discouraged from enumerating all the CSS, JS, embedded images, codec files etc. needed to cache, package, render or otherwise process a WP - leave this work that UAs already know how to do to the UAs. I don't think any performance advantage is worth the risk of the manifest in the entry point document getting out of sync with the links in the files that comprise the WP (imagine you change the name or location of a CSS or JS file). I suspect that some or most UAs would ignore the inclusion of such items in a WP manifest list anyway and rely on what they find when they open each constituent file of the WP.

However, regardless of our consensus on this point, I do have a couple of concerns left:

Cache-Control: Given that the WP's 'entry' point will typically be an HTML document and will certainly be retrievable via HTTP, how should a UA interpret any HTTP Cache-Control headers? My thought would be that we specify that the UA assume any such Cache-Control headers apply both to the entry point document itself and to the WP as a whole. This avoids the need for us to regenerate and specify a separate Cache-Control mechanism for WPs. Or is this too much of a stretch of the definition of this HTTP header?
Nesting. Consider a journal volume as a WP. Presumably it's reading order is a list of issues. Each issue is then also a WP itself with a reading order which is a list of articles. Do we expect that the UA wanting to cache the volume-level WP will discover by opening each issue WP that it needs to also cache all the articles in each issue? Or should/must the journal-level WP include a listing of all the articles in the volume, e.g., in its reading order? (Which again seems an invitation to inconsistencies between the reading order specified in the volume WP and the reading orders specified in the issue WPs.)

iherman commented 6 years ago

(Admin comment!)

Notwithstanding the (genuine!) issues raised by @tcole3 in https://github.com/w3c/wpub/issues/198#issuecomment-391166490, I have the impression (see also the comment of @GarthConboy https://github.com/w3c/wpub/issues/198#issuecomment-391047369) that we have a consensus on the original question of the issue. The answer being "no", and it also looks like the current text in the draft stands as is. I would therefore propose to close this issue with no further action.

Except that... there is one problem that is pending and needs further discussion, but it is not exactly on the issue as asked here. I would therefore also propose to propose an issue "how should the infoset item 'resource list' be expressed in the WP manifest?". There were several different approaches listed here on whether it is in the manifest, it is fully in HTML, partially here and there... That should be decided and I do not believe we have consensus on that technical problem.

w3c / wpub

Is an exhaustive "resource list" required to create a Web Publication? #198

Current:

Future: