single resource unit - Githubissues

marcoscaceres commented 8 years ago

Req. 6: It should be possible to create and distribute a PWP as a uniquely identified single resource unit.

A PWP, no matter how many pieces, must be distributable to readers or consumers as a single unit for distribution so that users can consume the necessary content that is identified by the PWP.

The use cases don't justify packaging. The first use case even alludes to using a different format (PDF) :/. The interconnectedness of the whole is not determined by the container, but the relationships between the resources that make up a text.

iherman commented 8 years ago

... and it was not intended to refer to packaging (only). So the text can indeed be misunderstood.

Proposed reformulation:

Req. 6: It should be possible to create and distribute a PWP as a uniquely identified logical unit, whether as a single resource or otherwise

sideshowbarker commented 8 years ago

The use cases don't justify packaging.

Yes, the use cases stated here could be addressed with existing Web technologies and without the need for any new packaging format.

Req. 6: It should be possible to create and distribute a PWP as a uniquely identified single resource unit.

A URL on the Web is clearly a “uniquely identified single resource unit”. So the use cases stated here can be met simply by putting a document on the Web and sharing the URL with whoever needs it. There is no more need for “distribution” separate from that than there is for the billions of other existing multiple-part documents on the Web.

This is another section that seems to be starting from an assumption a new packaging format is needed (and that users are required to use off-Web “reading systems” of some kind instead of just accessing documents over the actual Web), instead of starting just from actual user needs.

iherman commented 8 years ago

@sideshowbarker the document does not start from an assumption that a new packaging format is needed. Can we put this assumption aside, please?

Yes, the Web provides the URL. What is not clear is what that URL returns when it is dereferenced.

sideshowbarker commented 8 years ago

@sideshowbarker the document does not start from an assumption that a new packaging format is needed. Can we put this assumption aside, please?

No, we can’t, actually. The entire document is titled Portable Web Publications Use Cases and Requirements and includes dozens of explicit references to “packaged publication”, “packaging”, “packed” and “unpacked”, etc.

That is what I mean by “new packaging format”, and that is what I continue to object to. I do not mean it assumes use of ZIP or anything else. I mean instead that it assumes Web documents need to be “packaged” into some kind of “Portable Web Publication” format that is different from just publishing them at a URL on the Web, and that Web documents need to be “packed” and “unpacked” in order to meet the user needs.

Yes, the Web provides the URL. What is not clear is what that URL returns when it is dereferenced.

I do not know what you mean by that. The user use cases described in this document can all be addressed by a normal Web-runtime user agent serving a Web document to the user in the same way such documents are already served to users (including with existing standard features of the Web runtime such as Service Workers).

There is no need for ambiguity around that. There is specifically no need to assume that a user agent instead needs to serve some kind of new packaged form of a Web publication to a user in order to address the use cases.

marcoscaceres commented 8 years ago

@sideshowbarker is raising the same concerns I had here: https://github.com/w3c/dpub-pwp/issues/21

"The spec is a little hand-wavy about required resources and fonts, etc. but it doesn't prove that current web technologies don't already do everything described."

Trimming use cases already met by the Web might be really good - because it will distill this whole effort down into a couple of work items. So far, all I've identified as missing is:

full text search of resources.
providing a ToC.
Maybe require some special handling to say this is a "book"

I've yet to go past section 5.

BorisAnthony commented 8 years ago

Barring all the expectations of the representatives of the publishing industry (to be distinguished from the "role of a publisher", which could be anyone and who wouldn't expect such things as rigidly defined data structures and content control mechanism), yes! I agree with you totally. :)

sideshowbarker commented 8 years ago

Trimming use cases already met by the Web might be really good

Yeah, I think that would be a better way to re-formulate/re-scope this document. Otherwise, as-is, its scope seems unbounded, if the intention is to exhaustively list all the use cases that people have for reading Web publications and the various contexts in which people want to read Web publications.

because it will distill this whole effort down into a couple of work items. So far, all I've identified as missing is:

full text search of resources

Yeah, I agree that seems like something specific that we could build a new feature around for the Web runtime (even if it is just a convenience feature, given that we may already have the primitives needed for Web developers to construct full-text search of cached resources at least).
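
To illustrate the "primitives may already exist" point, here is a minimal TypeScript sketch of full-text search over already-cached resources. It assumes the plain text of each cached resource has already been extracted (for example from Cache API responses, which is not shown). The names `CachedText`, `buildIndex` and `searchIndex` are hypothetical and are not platform APIs, and the tokenizer only handles ASCII letters and digits.

```typescript
// Hypothetical shape: plain text already extracted from one cached resource.
interface CachedText {
  url: string;
  text: string;
}

// Build an inverted index: lowercase word -> set of URLs containing that word.
// Simplification: words are runs of ASCII letters and digits only.
function buildIndex(resources: CachedText[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const { url, text } of resources) {
    for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
      if (word === "") continue;
      let urls = index.get(word);
      if (urls === undefined) {
        urls = new Set<string>();
        index.set(word, urls);
      }
      urls.add(url);
    }
  }
  return index;
}

// Return the URLs that contain every word of the query, sorted alphabetically.
function searchIndex(index: Map<string, Set<string>>, query: string): string[] {
  const words = query.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w !== "");
  if (words.length === 0) return [];
  let hits: string[] = Array.from(index.get(words[0]) ?? []);
  for (const word of words.slice(1)) {
    const urls = index.get(word) ?? new Set<string>();
    hits = hits.filter((u) => urls.has(u));
  }
  return hits.sort();
}
```

A real convenience feature would also need ranking, language-aware tokenization and incremental updates; the sketch only shows that the basic lookup can be built in page script today.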

[will post separate comments about the other “missing” items @marcoscaceres listed]

sideshowbarker commented 8 years ago

providing a ToC

We have talked for years about providing ToC generation as a native part of the platform, including a declarative <toc> element (and various alternative names), but there has never been enough interest from either Web developers in having it or browser-engine implementors in providing it; never enough interest for anybody to make it a priority.

Yet we still have countless sites on the Web with multi-document ToCs. So clearly it’s possible for developers to create such ToCs without needing to have a native mechanism for it. And from a user use-case POV, users just want ToCs and do not care how the ToCs are created (e.g., through some native <toc> declarative support vs the way Web ToCs are generated now and always have been).

So what would be needed here is new information that provides some justification that’s compelling enough to convince browser-engine implementors to prioritize the need for native ToC generation a lot higher than they have in the past.

Incidentally, a few years ago I personally proposed that we create an Element.outline method for programmatically generating an Outline object of any element subtree (so if you called it on the body element, it would use the HTML outline algorithm to generate an Outline object for the entire Document, from which a Web developer could then write code to generate a ToC or whatever).

But there were no indications of interest from browser-engine implementors for that Element.outline proposal, and not really any from Web developers either.

All that said though, that was of course for the case of generating an Outline/ToC for a single Document rather than a collection of Document objects (e.g., a book). So maybe the case of needing to do it for a collection of Document objects is what’s needed to justify adding a new platform feature. And I think that leads to the question of whether we need a new object in the platform for representing such collections of Document objects.
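
As a rough sketch of the multi-document case, the TypeScript below nests headings from several documents into one ToC. It assumes the headings of each document have already been collected (for example with `querySelectorAll` on each page) and it nests by heading level only, which is a simplification and not the HTML outline algorithm. `HeadingInfo`, `DocHeadings`, `TocEntry` and `buildCollectionToc` are hypothetical names, not the proposed Element.outline API.

```typescript
// Hypothetical input: headings already extracted from each document.
interface HeadingInfo {
  level: number; // 1 for h1, 2 for h2, ...
  text: string;
  id: string;
}

interface DocHeadings {
  url: string;
  headings: HeadingInfo[];
}

interface TocEntry {
  title: string;
  href: string;
  children: TocEntry[];
}

// Build one nested ToC across an ordered collection of documents.
function buildCollectionToc(docs: DocHeadings[]): TocEntry[] {
  const roots: TocEntry[] = [];
  for (const doc of docs) {
    // Stack of currently open entries; it is reset for every document.
    const stack: { level: number; entry: TocEntry }[] = [];
    for (const h of doc.headings) {
      const entry: TocEntry = { title: h.text, href: `${doc.url}#${h.id}`, children: [] };
      // Close every open entry at the same or a deeper level.
      while (stack.length > 0 && stack[stack.length - 1].level >= h.level) {
        stack.pop();
      }
      if (stack.length === 0) {
        roots.push(entry);
      } else {
        stack[stack.length - 1].entry.children.push(entry);
      }
      stack.push({ level: h.level, entry });
    }
  }
  return roots;
}
```

The only collection-specific inputs are the list of documents and their order, which is exactly the information a single Document does not carry.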

sideshowbarker commented 8 years ago

Maybe require some special handling to say this is a "book"

So to me that is the main new thing here that I think might be interesting to browser-engine implementors and Web developers.

Specifically, I mean given that we already have Document as a DOM object for representing a Web page/document and programmatically interacting with it, to me what seems interesting is the question of whether we can/should add a new object for representing a collection of Document objects and the structure of those Document objects in relation to one another.

Such a “collection of documents” object is certainly something the Web runtime does not yet have.

I don’t know how a (client-side) object of that kind could be exposed/constructed for a collection of documents in the not-already-cached case, but it’s easy to imagine how it could be for the already-cached (with Service Workers) case.
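
To make the idea more concrete, here is a purely hypothetical TypeScript sketch of what a "collection of documents" object could look like from script. Nothing like this exists in the platform; the class name `DocumentCollectionSketch`, its members, and the `CollectionMember` shape are all assumptions made for illustration.

```typescript
// Hypothetical: one entry per document in the collection, in reading order.
interface CollectionMember {
  url: string;
  title: string;
}

// Hypothetical object representing an ordered collection of documents.
class DocumentCollectionSketch {
  readonly title: string;
  private readonly members: CollectionMember[];

  constructor(title: string, members: CollectionMember[]) {
    this.title = title;
    this.members = [...members]; // copy so later changes to the input do not leak in
  }

  // Number of documents in the collection.
  get length(): number {
    return this.members.length;
  }

  // Member at the given position, or undefined when out of range.
  item(index: number): CollectionMember | undefined {
    return this.members[index];
  }

  // Position of the document with this URL, or -1 if it is not a member.
  indexOf(url: string): number {
    return this.members.findIndex((m) => m.url === url);
  }

  // Whether the URL identifies a document in the collection.
  contains(url: string): boolean {
    return this.indexOf(url) !== -1;
  }
}
```

In the already-cached case, the member list could be derived from whatever the page's service worker has stored; how it would be populated otherwise is the open question raised above.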

iherman commented 8 years ago

@sideshowbarker : we are converging... what you call 'an object for representing a collection of Document objects' is, essentially, what we are looking for when we are talking about a 'single unit' for a book. And such a thing should have a URI to identify it and we should find out what such a URI returns if dereferenced.

sideshowbarker commented 8 years ago

@sideshowbarker : we are converging... what you call 'an object for representing a collection of Document objects' is, essentially, what we are looking for when we are talking about a 'single unit' for a book

If that is the case then I think the editors might consider adding an explicit requirement that states something about “The platform should have an object/API exposed in JavaScript for representing a collection of documents as a single unit and interacting with it as such.”

And such a thing should have a URI to identify it and we should find out what such a URI returns if dereferenced.

I think that we could have a requirement for a programmatic DocumentCollection (or whatever) object without needing to have any kind of new requirement for a URI to represent it. The two needs are orthogonal. The DocumentCollection object is something that could be exposed to JavaScript code without the need of a URI to represent it. And a URI for a book could be provided without the need for it to dereference into anything more special than a ToC (or front matter + ToC or whatever) as a single document.

From the assets embedded in the document at that URI, the underlying Web application could just dynamically assemble the collection of documents into a single unit (for use offline or whatever)—even without the need for a DocumentCollection object to directly expose it as such.

I think that’s also the reason why we don’t necessarily need any new markup (a <package> element or whatever) to declaratively represent collections of documents, nor necessarily any kind of manifest. At least not for the case where we assume the collection of documents has already been fetched and cached.

So there still seems to be disagreement about whether we need some explicit standard packaging mechanism in order to make a collection of Web documents available as a single unit. I think we do not. I think using Service Workers and other existing standard features of the platform we can have a user agent automatically do even all the (pre)fetching and caching necessary to make a document collection available as such, without anything new beyond that necessarily being required.
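
A minimal TypeScript sketch of the (pre)fetching idea: given the entry document (for example the ToC page) and the URLs it links to, decide which ones to precache. The function name `planPrecache`, the prefix-based scope check, and the cache name `"book-v1"` are assumptions made for illustration. The trailing comment shows how such a plan could be handed to the existing Cache API from a service worker's install event.

```typescript
// Decide what to precache: the entry page plus every linked resource that
// falls inside the given scope, with duplicates removed and order preserved.
function planPrecache(entryUrl: string, linkedUrls: string[], scope: string): string[] {
  const seen = new Set<string>();
  const plan: string[] = [];
  for (const url of [entryUrl, ...linkedUrls]) {
    // Simplification: scope membership is a plain string-prefix check.
    if (!url.startsWith(scope) || seen.has(url)) continue;
    seen.add(url);
    plan.push(url);
  }
  return plan;
}

// Inside a service worker, the plan could then be cached roughly like this
// ("book-v1" is a made-up cache name):
//
//   self.addEventListener("install", (event) => {
//     event.waitUntil(caches.open("book-v1").then((cache) => cache.addAll(plan)));
//   });
```

Extracting the linked URLs from the entry document is left out; the point is only that the caching step itself needs nothing beyond what service workers already offer.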

iherman commented 8 years ago

So there still seems to be disagreement about whether we need some explicit standard packaging mechanism in order to make a collection of Web documents available as a single unit. I think we do not. I think using Service Workers and other existing standard features of the platform we can have a user agent automatically do even all the (pre)fetching and caching necessary to make a document collection available as such, without anything new beyond that necessarily being required.

I do not want to use the word packaging. It is too loaded. But yes, maybe there is a disagreement: I/we believe that the concept of a document collection is not only a JavaScript API level concept, but authors/publishers should be able to 'declare' this information by somehow enumerating what is and what is not part of the collection, and that 'information' should have a unique ID, distinct from the constituent documents' URI-s. Creating this collection (which may be as simple as adding it to a standard Web Manifest and use the manifest's URI as an identification) is what we meant by a number of use cases.
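
As a sketch of what such a declaration might look like, the TypeScript below models a manifest whose own URL identifies the collection. `name`, `start_url` and `scope` are existing Web App Manifest members; the `resources` member, the interface name, the helper function and all `example.org` URLs are hypothetical and only illustrate the "enumerate what is and is not part of the collection" idea.

```typescript
// A subset of Web App Manifest members plus one HYPOTHETICAL member.
interface CollectionManifestSketch {
  name: string;
  start_url: string;
  scope: string;
  resources: string[]; // hypothetical: not defined by the Web App Manifest spec
}

// The manifest's own URL serves as the identifier of the collection.
const bookManifestUrl = "https://example.org/book/manifest.json";

const bookManifest: CollectionManifestSketch = {
  name: "An Example Book",
  start_url: "https://example.org/book/index.html",
  scope: "https://example.org/book/",
  resources: [
    "https://example.org/book/index.html",
    "https://example.org/book/ch1.html",
    "https://example.org/book/ch2.html",
  ],
};

// A document is part of the collection only if the manifest enumerates it.
function isPartOfCollection(manifest: CollectionManifestSketch, url: string): boolean {
  return manifest.resources.indexOf(url) !== -1;
}
```

Note that the identifier (`bookManifestUrl`) is distinct from every constituent document's URL, which is the property asked for above.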

lrosenthol commented 8 years ago

@sideshowbarker PWP ABSOLUTELY DOES presuppose that there is a packaged (potentially self-contained) set of resources that can be distributed in an 'ad-hoc' manner (external to the web, such as file sharing services or even via USB key). This is a key requirement for PWP.

dauwhe commented 8 years ago

@marcoscaceres wrote:

Trimming use cases already met by the Web might be really good - because it will distill this whole effort down into a couple of work items. So far, all I've identified as missing is:

full text search of resources.
providing a ToC.
Maybe require some special handling to say this is a "book"

My list is slightly longer, since I'm spelling out the "special handling," and making no judgments on whether the open web platform user agent runtime (OWPUAR) can meet the requirements:)

I think "collection of documents" is one of the crucial points in the entire enterprise. So as a content creator, I want to:

Define the sequence of documents in the collection
Assign metadata about the collection of documents viewed as a whole, in addition to metadata about the documents themselves.

As a reader, I want to:

Access all the content in a linear fashion through a simple user interface.
Stay oriented within the collection. How much have I read? how much more? What section am I in?
Stop reading, and return to that same point when I resume reading.
Read whether or not I have an internet connection.
Have permanent, irrevocable access to the collection (if not borrowing or renting).
Easily change the appearance of the documents to suit my needs and preferences.
Search within the collection.
Annotate the collection, or documents within the collection.

Perhaps all we need is a manifest with scope and metadata, rel=prev/next/contents, a service worker, and a UI mode optimized for long-form reading.
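
A small TypeScript sketch of the sequence-related items in that list, assuming the ordered list of document URLs is already known (for example from a manifest). It derives the targets for rel=prev/next/contents links and a coarse "how much have I read" figure. `NavLinks`, `navLinksFor` and `readingProgress` are hypothetical names.

```typescript
// Link targets for one document in the sequence.
interface NavLinks {
  contents: string;
  prev?: string;
  next?: string;
}

// Derive prev/next/contents targets for the current document, or undefined
// when the document is not part of the sequence.
function navLinksFor(sequence: string[], contentsUrl: string, current: string): NavLinks | undefined {
  const i = sequence.indexOf(current);
  if (i === -1) return undefined;
  const links: NavLinks = { contents: contentsUrl };
  if (i > 0) links.prev = sequence[i - 1];
  if (i < sequence.length - 1) links.next = sequence[i + 1];
  return links;
}

// Coarse progress: fraction of documents up to and including the current one.
// Returns 0 when the document is not part of the sequence.
function readingProgress(sequence: string[], current: string): number {
  const i = sequence.indexOf(current);
  return i === -1 ? 0 : (i + 1) / sequence.length;
}
```

Progress here is counted per document; a reading UI would more likely weight it by document length and position within the document.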

lrosenthol commented 8 years ago

@dauwhe don't forget that a reader may also wish to create their own collections, based on content from other individual documents and/or collections.

also, your #6 item there isn't specific to collections - that would be general for any document (I assume). (and would also depend on whether the author/publisher allows it)

marcoscaceres commented 8 years ago

Agreeing with @lrosenthol in that the things @dauwhe lists would be generally applicable to any web page or collection thereof. For instance, substitute "collection" for "news website" or whatever.

I'm also interested in the "Assign metadata about the collection of documents viewed as a whole, in addition to metadata about the documents themselves."

What metadata and for the benefit of who? Be great to tease that out in the document.

sideshowbarker commented 8 years ago

I do not want to use the word packaging. It is too loaded.

I think you’re right. So I propose replacing it throughout with something that’s not loaded, like “a document collection” or “a collection of documents”—preferably where document is a cross-reference to the definition of the term in DOM standard.

But yes, maybe there is a disagreement: I/we believe that the concept of a document collection is not only a JavaScript API level concept, but authors/publishers should be able to 'declare' this information by somehow enumerating what is and what is not part of the collection, and that 'information' should have a unique ID, distinct from the constituent documents' URI-s. Creating this collection (which may be as simple as adding it to a standard Web Manifest and use the manifest's URI as an identification) is what we meant by a number of use cases.

I think all of what is described in the paragraph above is still covered by the term “a collection of documents”, without any need to refer to it instead as a “package” or a “packaged publication”.

In other words, any desire for “packaging” as such is orthogonal to the requirement for UAs to handle a collection of documents as a single unit; a UA can handle a collection of documents as a single unit without the documents needing to be explicitly packaged together in some form.

iherman commented 8 years ago

There is clearly need for a cleanup of the terminology. This is all the more needed because we are at the intersection of two communities which do use the term in a different way. To keep to the example of "package": the EPUB3 document uses this term for what (I believe) you and I would call "manifest", ie, a bunch of information (metadata like author and unique ID, content of a book, etc).

Ie, +1 to a cleaned up terminology in the document, hopefully we can start doing so at TPAC

Cc: @TzviyaSiegman @hlflanagan @GarthConboy

lrosenthol commented 8 years ago

@sideshowbarker I agree that we need to find some terms that work. Let me clarify our usage.

For PWP:

A package is about a file format that can be distributed in an ad-hoc manner (eg. EPUB, ZIP, PDF, etc.).
A collection is a higher-level construct (not sure what it looks like technically yet) that enables the collation of one or more PWP (either packaged or not) into a logical grouping.

Defining document is definitely harder and not necessarily something we will all agree on. @iherman definitely sounds like work for TPAC...

TzviyaSiegman commented 8 years ago

TBD if PWP is a file format or a concept fulfilled by existing tools (like service workers). Let's not make announcements about implementations before we discuss them.

GarthConboy commented 8 years ago

PWP == Portable Web Publication. If the name was just WP, we'd be having different conversations, likely solely around CSS rendering fidelity and perhaps pagination.

To me the leading "P" implies that the WP can somehow leave the confines of the Web, which leads to a manifestation as a single unit, and packaging. The exact method of such TBD.

marcoscaceres commented 8 years ago

To me the leading "P" implies that the WP can somehow leave the confines of the Web, which leads to a manifestation as a single unit, and packaging. The exact method of such TBD.

But this also applies equally to any web application. Like if I install a "progressive web app" (PWA), the portability requirement is exactly the same:

if I have a tablet and a phone, I want my PWAs to be installed and synch'ed on both.
even when not a PWA, I want my bookmarks, or even tabs, to be synch'ed across my devices (which browsers already do).

Thus, there is nothing special about "Web Publications" when compared to other classes of web applications when it comes to portability as a requirement.

The "P" can be safely dropped.

lrosenthol commented 8 years ago

@marcoscaceres As mentioned in numerous other threads here - you can't email or put on a USB key (for example) a "PWA". Therefore, it is NOT exactly the same.

marcoscaceres commented 8 years ago

@marcoscaceres As mentioned in numerous other threads here - you can't email or put on a USB key (for example) a "PWA". Therefore, it is NOT exactly the same.

It's still contested if that should be a requirement or not. USB keys could go the way of the floppy disk, CD-ROM, DVD, etc. in the next few years. Historically, the USB key requirement (i.e., transferable as a single unit in some kind of package) seems weak at best.

GarthConboy commented 8 years ago

I do think one needs to consider that 100% of current EPUB usage is of the packaged model. Publishers need to "snapshot" a book/publication and deliver a packaged instantiation to retail channels for ingest, and numerous "Reading Systems" deliver this packaged format to users for rendering.

While transition to universality of "it's just a URL on the Web" could well happen, an immediate transition won't. So, to follow what Tzviya proposed on another thread, it seems logical to have this effort be "P,WP" concurrently -- the WP portion the actual content eventually expected to be rendered directly by browsers, and the P portion being the packaging format to be used for single unit interchange, but perhaps not natively supported by browsers.

marcoscaceres commented 8 years ago

the WP portion the actual content eventually expected to be rendered directly by browsers, and the P portion being the packaging format to be used for single unit interchange, but perhaps not natively supported by browsers.

Ok, but this begs the question: what's wrong with EPUB and the existing formats? I bought an e-book today, and it came in a Zip file that contained: .epub, .pdf, and a .mobi!... do we really need to create another packaging format? Are we not just doing https://xkcd.com/927/

GarthConboy commented 8 years ago

@marcoscaceres agree, we may well not need a 15th!

baldurbjarnason commented 8 years ago

Ok, but this begs the question: what's wrong with EPUB and the existing formats? I bought an e-book today, and it came in a Zip file that contained: .epub, .pdf, and a .mobi!... do we really need to create another packaging format?

Three core issues with ePub as currently implemented:

JavaScript is largely crippled in ePub, in many cases in the name of security (that whole "the web's security model doesn't cover portable documents" thing), but also because the behaviour of many JS APIs is completely unspecified in a paginated context.
Epub reading apps are incredibly buggy with frequent regressions between versions. The safely usable subset of the web stack in ePub is incredibly small. They also have no developer features to speak of and no developer documentation, which means ebook producers are forced to fumble around in the dark.
Most consumers never experience ePub, only ePub-derivatives, i.e. proprietary DRM-wrapped ePubs or custom formats converted from ePub (Amazon's formats and, up until recently, Kobo's). These derived formats can deviate even further from standard web behaviours.

Standardising a packaged PWP and getting browsers to implement them would solve the first two problems for publishers and they would love it if you solved it for them for free.

The ultimate cause of these problems is that the ecosystem chose very expensive solutions (a new packaged format, partially incompatible with the web, with a bunch of new features, requiring a new security model, and unspecified rendering) but has very little money and resources. If ebooks had taken over the world and had ended up with a 50% market share, flushing the ebook ecosystem with cash, none of this would be an issue and we wouldn't be having this conversation.

marcoscaceres commented 8 years ago

Standardising a packaged PWP and getting browsers to implement them would solve the first two problems

It doesn't follow that those problems would be solved if they have not already been solved by vendors of ebook readers. That is, there is no reason to believe browser vendors would do any better - or wouldn't lose interest in maintaining ebook software.

From a browser vendor's perspective: we are not interested in solving those problems for a particular class of publication (especially not for zip packaged things) - but the publishing industry is free to leverage the fantastic cross-browser interoperability, rich developer tools, extensive set of APIs, and web's security model that browsers already provide (so, in this sense, yes - browsers solve problems 1 and 2 - and we can negotiate on additional requirements, so long as they benefit the end users of the Web to the benefit of all).

for publishers and they would love it if you solved it for them for free.

We can solve those problems (already have and continue to!), but it comes with the tradeoff of ditching hopes of getting the packaging thing supported in browsers. That's not something Mozilla would be interested in supporting, ever.

However, others may continue to attempt to support the packaging thing... and suffer the downsides that @baldurbjarnason so kindly highlighted to the detriment of end-users.

We, who don't go down the packaging route, will be over here making awesome web-app based publications, with awesome dev tools, and funky new features coming down the pipe every 6 weeks. We won't have to pay any Big Corporation a pile of cash for crap developer/editing tools either, and we will have full control over the publication flow 🎶

mac2net commented 8 years ago

But Mozilla does now and will continue to support that pesky old File URL in the vacuum.

A web archive is not really an alternative as it is just a dump rather than something like a 5DOC which is a site owner defined abstraction of content. In one of the more complex samples, a 5DOC is 4.6 mb decompressed, 2.3 mb compressed (what we can't seem to convince you to offer 🤔) while a webarchive is 15.5 mb decompressed and 9.8mb compressed. And the web archive fails to reproduce the functionality of the page.

Also, the tools I use to make 5DOCS were developed from free open source WordPress code (the REAL WP👌) as well as Javascript libraries hosted right here on GitHub, but y'all can pretend I'm a big corporation and pay me tons of cash if you want 👍🏻.

It would be 5DOC's intention to continue using free wizbang tools from Mozilla, etc. Offline has the potential to at least double the scope of HTML.

lrosenthol commented 8 years ago

Updated the use cases in this section (in this branch) to better reflect reasons why this is needed for PWP.

TzviyaSiegman commented 8 years ago

New version of document separates Web Publication and Packaging into separate sections. The concept of single unit for WP is addressed in http://w3c.github.io/dpub-pwp-ucr/index.html#single-package.

w3c / dpub-pwp-ucr

single resource unit #99