w3c / pwpub

W3C packaged Web Publications
https://w3c.github.io/pwpub/

Should the mimetype file be kept in the package? #32

Closed llemeurfr closed 5 years ago

llemeurfr commented 5 years ago

The current proposal (PR #30) removes the mimetype file (a.k.a signature file) from the package. In https://github.com/w3c/pwpub/pull/30#issuecomment-452088783 it is suggested to keep it.

As a reminder, this file comes from the ODF specification (OCF is heavily inspired by ODF) and is used to provide so-called magic numbers to operating systems (as an alternative to file extensions). A web search didn't turn up any mention of an OS using the OCF magic number. On the other hand, many current EPUB reading systems stop reading an EPUB file if the mimetype file is absent from the package.

Let's discuss the pros of having this mimetype file in the new package for package processors (B2B processing nodes and reading systems).

iherman commented 5 years ago

many current EPUB reading systems stop reading an EPUB file if the mimetype file is absent in the package.

What I would like to understand is whether they do it because that is what the EPUB spec requires, and they are simply doing due diligence, or whether they find a real benefit in doing so...

mattgarrish commented 5 years ago

whether they do it because that is what the EPUB spec requires

The epub spec doesn't get into error recovery details like this. If the mimetype isn't found where it's supposed to be, the reading system can still attempt to load the file.

I would imagine that those reading systems that stop processing do so out of some combination of an abundance of caution and a desire not to waste time processing a file that won't turn out to be an EPUB.

Without a magic number or the entry page/manifest link, what tells the processor that the json file actually is a packaged (publication|audiobook|xyz), though? I don't believe we've defined anything terribly unique in the manifest, the closest thing perhaps being the context but is that immutable? If it relies on finding a piece of schema.org metadata, I would think that's going to be unreliable.
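To make the question concrete, here is a minimal sketch of what "identify by context" could look like for a processor. The context URL and the function name `declares_context` are assumptions for illustration, not something the drafts guarantee to be immutable.

```python
import json
from typing import Any

# Assumption: a context URL identifying Web Publication manifests.
# Replace it with whatever value the specification finally fixes.
WPUB_CONTEXT = "https://www.w3.org/ns/wp-context"


def declares_context(manifest_text: str, context_url: str = WPUB_CONTEXT) -> bool:
    """Return True if the JSON manifest lists context_url in its @context."""
    try:
        manifest: Any = json.loads(manifest_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(manifest, dict):
        return False
    context = manifest.get("@context", [])
    # @context may be a single string or a list of strings and objects.
    if isinstance(context, str):
        context = [context]
    if not isinstance(context, list):
        return False
    return context_url in context
```

Note that the check only works after the manifest has been located and parsed, which is exactly the cost a byte-level signature avoids.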

iherman commented 5 years ago

@mattgarrish

Without a magic number or the entry page/manifest link, what tells the processor that the json file actually is a packaged (publication|audiobook|xyz), though?

True enough. But the current OCF Lite draft does include a requirement of a separate JSON-LD file or an HTML entry page, both with a well specified name. Isn't this enough as a starting point?

HadrienGardeur commented 5 years ago

True enough. But the current OCF Lite draft does include a requirement of a separate JSON-LD file or an HTML entry page, both with a well specified name. Isn't this enough as a starting point?

I don't think that having index.html in a ZIP is unique enough to assert that you have a packaged publication.

llemeurfr commented 5 years ago

? I don't believe we've defined anything terribly unique in the manifest, the closest thing perhaps being the context but is that immutable?

Most desktop environments depend on a file extension (mapped internally to a mime-type) to infer the default app that will open a file. And the mime-type is the clue on the Web.

I'm not aware of any great success for type detection via magic numbers at the beginning of files. I used image magic numbers a long time ago in developments based on satellite streams, not files.

@mattgarrish, the OCF mimetype file can be used as a file in the zip (as a reading system I open the zip, open the mimetype file and check its content) or as a magic number (as a reading system I read the first X bytes of the file and check their value). I'm pretty certain that reading systems use the first solution, and therefore having this package type identifier in the first uncompressed file of the package is not useful.
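The first solution (open the zip, read the mimetype entry) could be sketched as follows. This is only an illustration using Python's standard `zipfile` module; `read_mimetype_entry` and `demo_package` are hypothetical names, and the demo deliberately places the entry last to show that position does not matter for this approach.

```python
import io
import zipfile
from typing import Optional


def read_mimetype_entry(package: bytes) -> Optional[str]:
    """Open the ZIP and return the content of its `mimetype` entry, if any."""
    try:
        with zipfile.ZipFile(io.BytesIO(package)) as zf:
            return zf.read("mimetype").decode("ascii").strip()
    except (zipfile.BadZipFile, KeyError, UnicodeDecodeError):
        # Not a ZIP, no such entry, or non-ASCII content.
        return None


def demo_package(with_mimetype: bool) -> bytes:
    """Build a tiny in-memory package; entry order ignores the OCF rules."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("index.html", "<!doctype html><title>Demo</title>")
        if with_mimetype:
            zf.writestr("mimetype", "application/epub+zip")
    return buf.getvalue()
```

Because the lookup goes through the ZIP central directory, it succeeds wherever the entry sits and whether or not it is compressed.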

So, if we decide we need a better package type identifier than a file extension and a mime-type, we can easily create one in the manifest (but the JSON-LD context may be ok as you said).

So we should decide in order:

iherman commented 5 years ago

if no, is the JSON-LD context in the manifest sufficient?

Looking at the general case (ie, not only audio books) I can foresee situations where having a separate JSON-LD manifest might become a drag. E.g., we can think about a scholarly article, which typically consists of a single HTML page (plus a number of additional resources), which means that using the embedded manifest becomes the natural approach. (Packaging the scholarly article for offline usage makes a lot of sense, replacing the PDF dumps that we typically have today.)

mattgarrish commented 5 years ago

if we decide we need a better package type identifier than a file extension and a mime-type, we can easily create one in the manifest

This is my question, yes. A packaged epub can be identified by its media type and extension, and yet some percentage of reading systems still look for the assurance of the mimetype.

A filename in a zip container isn't terribly reliable, so I'm asking if there should be something more concrete in the manifest that establishes that the json represents a publication for those who will still want something unique to look for.

If the entry page is the uniquely-identified file, and contains a link when the manifest is external, then you have an explicitly authored statement of what is in the container. I'm not advocating that's what should be done, as I already went over by email, but if you go the option of json without html, I'm just not sure what serves in its place. Maybe nothing, of course, but it seems like some more assurance is wanted otherwise why would anyone stop processing an EPUB?

iherman commented 5 years ago

How is the mime type file used? I mean

Because if the goal is to have something for the former, then the presence (or not) of a manifest.json file (or, depending on the outcome of #16, the presence of an index.html or manifest.json) can play the same role...

mattgarrish commented 5 years ago

The location and content of the mimetype file are its defining features. You don't have to know anything about the file or process it -- you can just look at a specific byte offset to discover whether you have an EPUB.

There's no specific reason why a zip file with a file called index.html in it, or even manifest.json, is a web publication. It's highly unreliable unless you dig deeper for something more unique.

iherman commented 5 years ago

@mattgarrish o.k., but what I would like to understand is what reading systems (RSs) use this for.

If they are willing to go a little bit into the content, then there are possibilities to determine if it is a bona fide packaged WPUB:

In a way, these are part of the checks the RS would have to do no matter what, and it may just be enough.

Our goal should be to simplify the life of authors/publishers: adding that separate mimetype file is obviously a (of course minor) drag...

mattgarrish commented 5 years ago

In a way, these are part of the checks the RS would have to do no matter what, and it may just be enough.

I haven't argued for the mimetype, though, only that we're somewhat lacking in a clear identifier of what this particular zip package is in the absence of it and a manifest link. I don't like the reliance on file names, especially such generic ones.

But, in any case, what we lose is the quick identification/rejection of content. That's the trade-off of not having something like the mimetype. It's not a critical piece of information in EPUB, as you can also check that you have an EPUB via the reference to a package document in the container.xml file. It just forces everyone (the wider ecosystem, not just reading systems) to do all this parsing, too.
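For comparison, the EPUB fallback mentioned above (checking container.xml for a package document reference) looks roughly like this. It is a sketch with hypothetical function names; the namespace and the `full-path` attribute are those defined by OCF.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

OCF_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def find_package_document(package: bytes) -> Optional[str]:
    """Return the full-path of the first rootfile in META-INF/container.xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(package)) as zf:
            root = ET.fromstring(zf.read("META-INF/container.xml"))
    except (zipfile.BadZipFile, KeyError, ET.ParseError):
        return None
    rootfile = root.find(f"{{{OCF_NS}}}rootfiles/{{{OCF_NS}}}rootfile")
    return None if rootfile is None else rootfile.get("full-path")


def demo_epub_like() -> bytes:
    """Build an in-memory ZIP that contains only a container.xml file."""
    xml = (
        f'<container version="1.0" xmlns="{OCF_NS}"><rootfiles>'
        '<rootfile full-path="EPUB/package.opf" '
        'media-type="application/oebps-package+xml"/>'
        "</rootfiles></container>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/container.xml", xml)
    return buf.getvalue()
```

This needs a ZIP reader and an XML parser, which is the "all this parsing" every tool in the ecosystem would have to do without a signature file.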

llemeurfr commented 5 years ago

In any case, it seems that the Readium SDK(s) (the original and the new one) don't check the magic number, or even the presence of the mime-type file in the zip.

Note also that OOXML, also a zip container, doesn't seem to spec any magic number.

GarthConboy commented 5 years ago

I lean toward keeping the mimetype file (with new content). But can live with either way.

It does seem as though not having to dig through the manifest to see what you have would be a feature, and it would give audiobookcheck something to initially verify. :-)

llemeurfr commented 5 years ago

@GarthConboy there are two decisions to take: mime-type file y/n; and if yes positioned as the first content in the zip (= magic number) or undefined y/n. What is your take on the second part?

GarthConboy commented 5 years ago

I tend to think the case for the first "yes" is diminished if the second isn't also "yes".

geoffjukes commented 5 years ago

I found out that the Blackstone ePub team were not even aware of the 'magic number' mimetype requirement. Unless there is a specific use-case, I don't see why we would need a separate file outside of a well-defined manifest.

GarthConboy commented 5 years ago

FWIW, epubcheck clearly checks for this, so if you're getting a clean readout, you're doing it "correctly."

wareid commented 5 years ago

The confusion might be in referring to the mimetype as "magic numbers" ;).

I asked how we use the mimetype, and it's just for validation: if the mimetype is present, it's an EPUB, so open the renderer. But we don't need it. I think a combination of "type" and the contents of the manifest should be clear enough. If the UA recognizes the content, why shouldn't it display it?

GarthConboy commented 5 years ago

It takes much more processing to find the manifest and look at its contents, than to just look at bytes 30 to 61 and see "mimetypeapplication/audiopub+zip" (or some such) and know you have an audiobook, as works similarly for EPUB:

  0: PK 0x03 0x04, 30: mimetype, 38: application/audiopub+zip

Again, not going to fall on my sword on this one, but there are some benefits.
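The byte-offset check described in this comment can be sketched in a few lines. The media type `application/audiopub+zip` is the hypothetical value from the comment ("or some such"), `sniff` and `build` are made-up names, and the check assumes the first local header carries no extra field, as OCF requires for EPUB.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature at offset 0
NAME_OFFSET = 30                  # the fixed part of a local header is 30 bytes


def sniff(package: bytes, media_type: str) -> bool:
    """Check the signature, then 'mimetype' + media type starting at byte 30."""
    expected = b"mimetype" + media_type.encode("ascii")
    return (
        package[:4] == ZIP_LOCAL_HEADER
        and package[NAME_OFFSET:NAME_OFFSET + len(expected)] == expected
    )


def build(media_type: str, mimetype_first: bool) -> bytes:
    """Build a demo package with the stored mimetype entry first or last."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        if mimetype_first:
            zf.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
        zf.writestr("manifest.json", "{}")
        if not mimetype_first:
            zf.writestr("mimetype", media_type, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()
```

`sniff` never parses the archive, so it also works on the first bytes of a stream. The price is that `build` has to control entry order and compression, which a generic "zip this folder" command does not do.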

iherman commented 5 years ago

My problem with "positioned as the first content in the zip (= magic number)" is that if WPUB, and the packaged version thereof, becomes more widespread, creation of a LWP out of a WPUB should be very easy for a lambda user. If I have the document on my machine, I would 'just' want to say "zip" on the sub-directory and have it (it is a built-in command on a Mac, for example: right-click on a folder icon and say "archive"). With a special position (possibly uncompressed, as in EPUB now), this becomes impossible. Essentially: creating an LWP should be easily doable not only by big publishers, but by everyone.

Case in point: I took a flight yesterday, and there was a spec draft I wanted to read on the plane. I would have liked to create an LWP with a click of that spec (provided it has the manifest, but suppose it has, like all the WPUB specs have it now, generated automatically).

GarthConboy commented 5 years ago

Like I said, I don't feel too strongly, and your (Ivan's) statement is the valid counterpoint.

dauwhe commented 5 years ago

creation of a LWP out of a WPUB should be very easy for a lambda user

What's a lambda user?

Are we now using LWP instead of PWP? Or is LWP a special case of PWP? I see lots of possible confusion here.

iherman commented 5 years ago

LWP: lightweight package, ie, the one we are talking about... (we need a good abbreviation...)

lambda user: none of us;-). and none of the established publishers.



mattgarrish commented 5 years ago

This is going a bit cross-topic, but has any consideration been given to a crossover between the mimetype and container.xml files as a way of working around the one index.html, one manifest.jsonld approach?

In other words, to get out of the bind we're getting into of enforced file naming and limiting each directory to only one web publication, what if there was a specially-located and named file that instead identified the resource that contains the manifest?

So by way of example, the package requires a file called manifest at the root that contains the relative path to the manifest.

This file would only have to live in the package, and would be meaningless outside it. It then wouldn't matter if you exploded web pubs on top of each other, as overwriting the file (if it's extracted at all) would make no difference to the consumption of the publication. A user agent could simply regenerate the file whenever a packaged form is needed.

The only time the file might come into play is if you wanted to package publication(s) manually, and stored them all together in one directory, in which case you'd just have to manage updating and zipping the file with the right contents manually.
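A rough sketch of how a processor might consume such a pointer file. Everything here is hypothetical: the root entry name `manifest` follows the example above, while the function names, the demo layout and the path checks are assumptions made for illustration.

```python
import io
import json
import posixpath
import zipfile
from typing import Any, Optional

POINTER_NAME = "manifest"  # hypothetical root entry holding a relative path


def load_manifest_via_pointer(package: bytes) -> Optional[Any]:
    """Read the pointer entry, then load the JSON manifest it refers to."""
    try:
        with zipfile.ZipFile(io.BytesIO(package)) as zf:
            target = posixpath.normpath(zf.read(POINTER_NAME).decode("utf-8").strip())
            # Refuse absolute paths and paths that escape the package root.
            if target.startswith("/") or target == ".." or target.startswith("../"):
                return None
            return json.loads(zf.read(target))
    except (zipfile.BadZipFile, KeyError, ValueError):
        # Not a ZIP, missing entry, undecodable text, or invalid JSON.
        return None


def demo(pointer: str) -> bytes:
    """Build a demo package whose pointer entry contains `pointer`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(POINTER_NAME, pointer + "\n")
        zf.writestr("pub/book.jsonld", json.dumps({"name": "Demo"}))
    return buf.getvalue()
```

With this layout the manifest can have any name and live in any directory, and several publications can share a folder, since only the pointer entry is fixed.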

iherman commented 5 years ago

This issue was discussed in a meeting.