w3c / epub-specs

Shared workspace for EPUB 3 specifications.

consider specifying how EPUB interacts with the MIME sniffing standard #2491

Open rdeltour opened 1 year ago

rdeltour commented 1 year ago

The MIME Sniffing standard is quite central to how HTML defines the loading of resources.

Specifically, in the "Determining the type of a resource" section, HTML says that the Content-Type metadata and computed MIME type of a resource must be obtained in a manner consistent with the requirements of MIME Sniffing.

In turn, MIME Sniffing says that to handle a resource, a user agent must keep track of (among other things) a supplied MIME type, determined by the supplied MIME type detection algorithm. That algorithm looks at various cases to detect the supplied MIME type: if the resource is retrieved via HTTP, or from the file system, or via another protocol.
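The protocol-dependent cases described above can be sketched roughly as follows. This is a simplified illustration, not the normative algorithm: the `RetrievedResource` class and its field names are hypothetical, and the sketch skips details such as validating that the value is a well-formed MIME type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RetrievedResource:
    """Hypothetical model of a fetched resource; not a type from any spec."""
    protocol: str                                  # "http", "file", or anything else
    content_type_headers: List[str] = field(default_factory=list)
    filesystem_type: Optional[str] = None          # type reported by the file system
    protocol_type: Optional[str] = None            # type reported by another protocol


def supplied_mime_type(resource: RetrievedResource) -> Optional[str]:
    """Return the supplied MIME type, or None when it is undefined."""
    if resource.protocol == "http":
        # Over HTTP, the last Content-Type header is the one that counts.
        headers = resource.content_type_headers
        return headers[-1] if headers else None
    if resource.protocol == "file":
        # From the file system, use whatever type the file system provides.
        return resource.filesystem_type
    # Any other protocol supplies the type itself, if it supplies one at all.
    return resource.protocol_type
```

A reading system that serves container resources through its own internal scheme lands in the last branch, which is where the open question in this issue sits.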

EPUB sits a bit in-between all this, since it does not specify how resources are loaded from the OCF container. Some RS will serve them over HTTP, some as files, some possibly with another protocol.

Are the MIME types defined in the package document meant to be the authoritative source of how a reading system MUST detect a resource’s type? Or are they only informative content used for type support processing (via the fallback mechanism)?

EPUB could say something along these lines, in the RS spec:

Reading systems MUST (SHOULD?) ensure that resources retrieved from an OCF ZIP container have a supplied MIME type [MIMESNIFF] equal to the type of the resource defined in the corresponding package document item or link element, if any is found.

maybe somewhere in the OCF ZIP container section. Or in its own "Determining the type of a resource" section, à la HTML, with a similar language.
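For illustration, looking up the declared type of a resource from the package document could look like the sketch below. The sample package document and the `declared_types` helper are made up for this example, and only manifest `item` elements are handled (not `link` elements).

```python
import xml.etree.ElementTree as ET
from typing import Dict

OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical, heavily trimmed package document used only for this example.
PACKAGE_DOC = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="cover" href="cover.png" media-type="image/png"/>
    <item id="test" href="test.foreign" media-type="image/vnd.epub+test"/>
  </manifest>
</package>"""


def declared_types(package_xml: str) -> Dict[str, str]:
    """Map each manifest item's href to the media type the author declared."""
    root = ET.fromstring(package_xml)
    types: Dict[str, str] = {}
    for item in root.iter(f"{{{OPF_NS}}}item"):
        # href values are relative to the package document's location.
        types[item.get("href")] = item.get("media-type")
    return types
```

With this map, `declared_types(PACKAGE_DOC).get("test.foreign")` gives the value a reading system would use as the supplied MIME type under the proposal above.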

This is testable when scripting is available, by using the fetch() API to load an in-container resource of a custom unknown-to-the-RS MIME type (e.g. image/vnd.epub+test), and verifying that the Content-Type header of the fetch Response object is the one declared in the package document.

iherman commented 1 year ago

[My obnoxious admin hat put on] If we do that, we introduce a new normative statement to the CR. This means, if I am not mistaken, that a new CR snapshot should be issued, which would also trigger a comment period of at least 28 days (see process doc). In practice, that means we cannot move to PR within those 28 days.

This is not a problem per se and, actually, I would think that issuing a CR snapshot with all the changes we have done on the CR would be a proper thing to do, but that means we most probably would have to ask for a charter extension. (Our charter runs out end of February.)

Just saying...

iherman commented 1 year ago

[My obnoxious admin hat put down] I must admit I was not familiar with the WHATWG MIME Sniffing document.

Are the MIME types defined in the package document meant to be the authoritative source of how a reading system MUST detect a resource’s type? Or are they only informative content used for type support processing (via the fallback mechanism)?

I am not familiar with history, so @mattgarrish @bduga should know better, but I suspect that the goal was more on the informative side. But adding a normative statement of the sort you propose may indeed improve the spec...

mattgarrish commented 1 year ago

I am not familiar with history, so @mattgarrish @bduga should know better, but I suspect that the goal was more on the informative side.

I can't say I recall ever discussing what a reading system has to do with the media types in the manifest, but my memory only goes back to 2011. I'd always assumed they were informative, since they're easily faked. They're meant, as @rdeltour says, for things like checking that CMTs are used, and fallbacks provided when not.

It's been more of a security consideration that the resources in the package may not be what the manifest says they are, but defining how to ensure that hasn't been attempted. If we can leverage what the WHATWG spec defines then it doesn't seem like a bad thing to add, but maybe only as a recommendation since we're late to the game on this.

bduga commented 1 year ago

I am a little worried about adding any conformance statements here. I expect quite a few existing EPUBs will be considered broken, since these often do not match. Sometimes it is even unavoidable, for instance where the MIME type has drifted over the years (e.g., fonts). I am not sure what is gained by checking these. I guess fallback chain handling would be a little more reliable, but since that is fairly under-implemented as it is, I don't see that as a compelling argument. My initial feeling is to leave this as-is. Are there any strong reasons for wanting this change?

rdeltour commented 1 year ago

are there any strong reasons for wanting this change?

One reason would be better interoperability between reading systems.

Another is that there seem to be security considerations involved. HTML has this warning:

It is imperative that the rules in MIME Sniffing be followed exactly. When a user agent uses different heuristics for content type detection than the server expects, security problems can occur. For more details, see MIME Sniffing. [MIMESNIFF]

But this is beyond my area of expertise…

rdeltour commented 1 year ago

To be clear, I'm not aware of any concrete problem that fixing this would solve.
I went down that spec rabbit hole when trying to answer the question "what is the actual MIME type of a resource?", and noticed EPUB isn't very specific here. Basically, the RS is free to decide what MIME type it retrieves resources with, as it is both the "server" and the "UA".

bduga commented 1 year ago

I am not sure this would really help with interop. I would at least want a concrete example of an interop problem before making a change here. And even if we wanted to require (or suggest) that reading systems make sure the MIME type given in the package doc actually matches the MIME type of the resource, how do we do that? Won't they have to apply the MIME sniffing algorithm? Which in turn means having a proper supplied type.

iherman commented 1 year ago

What about adding more or less the same text as the one quoted for HTML to the security consideration section of the RS? That would not be a conformance issue for existing RS-s, but would draw attention to possible issues.

mattgarrish commented 1 year ago

What about adding more or less the same text as the one quoted for HTML to the security consideration section of the RS

We switched everything in that section to normative recommendations, so would we do the same here?

It seems to fall under the malicious content category, so we could just add it as an example if we want to avoid normative statements:

Content processors — defined as entities that handle the ingestion of EPUB content for distribution, display, or sale — also need to be aware of the potential risks in ingestion. It is advised that content processors check publications for malicious content on ingestion, in addition to the validation steps that usually occur. This could include running virus scans, validating external links and remote resources, verifying the media types of resources listed in the manifest (e.g., using the techniques defined in [[?MIMESNIFF]]), and other precautions.
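As a rough idea of what "verifying the media types of resources listed in the manifest" could mean at ingestion, the sketch below compares declared types against a byte-signature check. It is only an assumption about how a content processor might do this: the function names and the demo container are hypothetical, and the two signatures shown are a tiny subset of what [MIMESNIFF] defines.

```python
import io
import zipfile
from typing import Dict, List, Optional, Tuple

# Small illustrative subset of image signatures; a real check would cover more.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def sniff(header: bytes) -> Optional[str]:
    """Return the type matching the leading bytes, or None if nothing matches."""
    for signature, mime in SIGNATURES:
        if header.startswith(signature):
            return mime
    return None


def find_mismatches(container: zipfile.ZipFile,
                    declared: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """List (path, declared type, sniffed type) where the two disagree."""
    mismatches = []
    for path, declared_type in declared.items():
        with container.open(path) as resource:
            header = resource.read(16)
        sniffed = sniff(header)
        # Only report when sniffing found a type and it contradicts the manifest.
        if sniffed is not None and sniffed != declared_type:
            mismatches.append((path, declared_type, sniffed))
    return mismatches


def build_demo_container() -> zipfile.ZipFile:
    """Build an in-memory ZIP with one honest PNG and one JPEG named .png."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("EPUB/cover.png", b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
        archive.writestr("EPUB/photo.png", b"\xff\xd8\xff\xe0" + b"\x00" * 8)
    buffer.seek(0)
    return zipfile.ZipFile(buffer)
```

Calling `find_mismatches(build_demo_container(), {"EPUB/cover.png": "image/png", "EPUB/photo.png": "image/png"})` flags only the second entry, whose bytes look like JPEG.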

iherman commented 1 year ago

We switched everything in that section to normative recommendations, so would we do the same here?

yes, but all statements are SHOULD-s. No MUST-s.

That being said, I am perfectly happy with the extension of the text you propose, too.

rdeltour commented 1 year ago

FWIW, I ran a little experiment with the attached EPUB (mimetypes.zip).

It contains:

The single content document has the following markup:

<img id="img" src="jpeg.foreign" alt=""/>
<picture>
  <source srcset="jpeg.foreign" type="image/vnd.epubtest"/>
  <img src="png.png" alt=""/>
</picture>

I tested in both Apple Books and Thorium.

Not so surprisingly, the img element shows the JPEG image: the UA applies the rules for sniffing images specifically, likely with an unknown supplied MIME type (that's the part that is underspecified and RS-dependent), and executes byte pattern sniffing to correctly identify the content as JPEG. The picture element shows the PNG image: the UA processes the type attribute of the first source child and does not recognize the type, then falls back to the img sibling.
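A simplified sketch of the rules for sniffing images specifically, as described in MIME Sniffing, is given below. Only three byte signatures are included, the function names are illustrative, and the supplied type passed in is exactly the RS-dependent input discussed in this issue.

```python
from typing import Optional

# Illustrative subset of the image signatures listed in MIME Sniffing.
IMAGE_SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def is_xml_mime_type(mime: str) -> bool:
    """True for text/xml, application/xml, and any type whose subtype ends in +xml."""
    essence = mime.split(";")[0].strip().lower()
    return essence.endswith("+xml") or essence in ("text/xml", "application/xml")


def match_image_type(header: bytes) -> Optional[str]:
    """Return the image type whose signature starts the header, if any."""
    for signature, mime in IMAGE_SIGNATURES:
        if header.startswith(signature):
            return mime
    return None


def sniff_image(supplied: Optional[str], header: bytes) -> Optional[str]:
    """Compute the MIME type of a resource loaded in an image context."""
    # An XML supplied type is kept as-is, without looking at the bytes.
    if supplied is not None and is_xml_mime_type(supplied):
        return supplied
    # Otherwise a recognized byte signature wins over the supplied type.
    matched = match_image_type(header)
    return matched if matched is not None else supplied
```

With an undefined supplied type, `sniff_image(None, b"\xff\xd8\xff\xe0")` yields `image/jpeg`, which is consistent with the img element rendering in the experiment above.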

I'm wondering what happens when the computed MIME type is not dependent on byte pattern sniffing, but is defined as the supplied MIME type? That's the case in a script context. Although I'm not sure if/how the Fetch API works in an interoperable manner in EPUB…

Also, note that the RS spec says:

A reading system that does not support the MIME media type [rfc2046] of a given publication resource MUST traverse the manifest fallback chain [epub-33] until it identifies a supported publication resource to use in place of the unsupported resource.

Fortunately (?) this is a bit vague, in that "the MIME media type of a given publication resource" is not defined. I always thought that was the type declared in the package document. If that were true, rendering the fakely-foreign JPEG in the first img element of my example would be a violation of the spec.

But in practice I think "the MIME media type of a publication resource" is more the computed MIME type of the resource (which depends on how the RS serves the resource and defines the supplied MIME type of the resource). If the RS detects the resource as JPEG and renders it, then no problem. That's what happens in the two RS I tested.
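The difference between the two readings can be shown with a small fallback-chain sketch. Everything here is hypothetical: the manifest entries, the set of supported types, and the assumption that the foreign-typed item is declared with a PNG fallback.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical manifest: item id -> declared media type and optional fallback id.
MANIFEST: Dict[str, Dict[str, Optional[str]]] = {
    "jpeg-foreign": {"media-type": "image/vnd.epubtest", "fallback": "png"},
    "png": {"media-type": "image/png", "fallback": None},
}
SUPPORTED: Set[str] = {"image/jpeg", "image/png"}

# Hypothetical result of sniffing each resource's bytes.
COMPUTED: Dict[str, str] = {"jpeg-foreign": "image/jpeg", "png": "image/png"}


def resolve(item_id: str,
            manifest: Dict[str, Dict[str, Optional[str]]],
            supported: Set[str],
            type_of: Callable[[str], Optional[str]]) -> Optional[str]:
    """Walk the fallback chain until an item with a supported type is found."""
    seen: Set[str] = set()
    current: Optional[str] = item_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against circular fallback chains
        if type_of(current) in supported:
            return current
        current = manifest[current].get("fallback")
    return None


def declared_type(item_id: str) -> Optional[str]:
    """Interpretation 1: the type is what the package document declares."""
    return MANIFEST[item_id]["media-type"]


def computed_type(item_id: str) -> Optional[str]:
    """Interpretation 2: the type is what the RS computes for the resource."""
    return COMPUTED.get(item_id)
```

Under the declared-type reading, `resolve("jpeg-foreign", MANIFEST, SUPPORTED, declared_type)` returns `"png"`; under the computed-type reading it returns `"jpeg-foreign"`, so the same content renders differently depending on which definition an RS adopts.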

This is subtle, but I think at least adding a note explaining that (and how it is RS-dependent) would be helpful.

[edited for clarity / typos]

rdeltour commented 1 year ago

@bduga

And even if we wanted to require (or suggest) that reading systems make sure the MIME type given in the package doc actually matches the MIME type of the resource, how do we do that?

The idea is not so much to require RS to compute a MIME type equal to the one in the package doc, but rather to better define or describe what an RS does to determine the computed MIME type of a resource.

Most of it is defined in MIME Sniffing. The RS-dependent part is the supplied MIME type of the resource, which consequently depends on how the RS serves the resource:

All that to say that a lot depends on how the RS actually handles resources internally.

I'm not saying that we should enforce one particular way. I'm fine with it being RS-dependent as long as we do not identify a concrete interoperability issue.

But:

Won't they have to apply the MIME sniffing algorithm? Which in turn means having a proper supplied type.

Yes, they're required (by HTML) to apply the sniffing algorithms, and yes they need a proper supplied type. That was precisely my point: currently, that supplied type is not well defined in EPUB.

mattgarrish commented 1 year ago

Looking at this again, I was thinking the original prose was requiring reading systems to verify the media type, but if all this is asking is that reading systems set the supplied media type for resources, could we say:

Reading systems SHOULD set the MIME media type [[rfc2046]] for resources retrieved from the OCF container (i.e., their supplied media types [[?mimesniff]]) to the MIME media type defined in their corresponding package document item or link elements [[epub-33]].

NOTE This specification does not specify how reading systems set media types for resources. Refer to the supplied MIME type detection algorithm [[mimesniff]] for more information about the methods used by web user agents to sniff media types.

Does that still capture the essence of what you're after, @rdeltour, without being too specific about how it's done?

bduga commented 1 year ago

I am fine with making this explicit, but I don't think the history of this property, both the intended use and the actual use, merits recommending its use in the sniffing algorithm. If anything, I would argue for the exact opposite - specify that Reading Systems SHOULD NOT use the mime type as the supplied media type for mimesniff. And maybe add some text around what we intend this to be used for (fallbacks? Anything else?).

rdeltour commented 1 year ago

Looking at this again, I was thinking the original prose was requiring reading systems to verify the media type, but if all this is asking is that reading systems set the supplied media type for resources (…) Does that still capture the essence of what you're after, @rdeltour, without being too specific about how it's done?

Let me try to clarify: what I am after is understanding and hopefully specifying how the MIME Sniffing standard works in EPUB, which implies how the supplied MIME type detection algorithm is applied in an EPUB context.

The key is that the algorithm is basically a switch statement based on the protocol used to retrieve the resource. In EPUB, that protocol is implementation specific.

So in your proposal @mattgarrish:

Reading systems SHOULD set the MIME media type [[rfc2046]] for resources retrieved from the OCF container (i.e., their supplied media types [[?mimesniff]]) to the MIME media type defined in their corresponding package document item or link elements [[epub-33]].

I don't think we can even say that a reading system can set a MIME type, nor that a resource MIME type is its supplied media type. That would be a violation of the MIME Sniffing standard, which is normatively referenced by HTML. Conformance to MIME Sniffing means the MIME Sniffing algorithms are applied, which means the computed MIME type is determined from a combination of byte sniffing and the protocol-dependent switch statement mentioned above.

As far as I can tell (but I'm not sure), browser engines conform to HTML/Fetch/MIME Sniffing in how they handle resource loading, in an interoperable manner.

In an EPUB context, one could argue OCF-processing is defining another protocol (point 4 in the supplied MIME type detection algorithm), which would allow us to say EPUB defines the supplied MIME type as the one authored in the package document. But in hindsight, I think that would be a stretch, and it probably does not match what current RS are doing, as @bduga suggests.

So all in all, to summarize, what I am after is:

rdeltour commented 1 year ago

Here's a proposal:

Create a new section in 3. Publication resource processing, before all other sections, which would say something like:

3.1 Determining the type of a resource

When processing EPUB resources, Reading systems MUST determine the MIME type of resources in a manner consistent with the requirements of MIME Sniffing [MIMESNIFF].

Note: The MIME Sniffing standard [MIMESNIFF] specifies how web user agents compute the MIME type of resources based on a combination of content-sniffing algorithms and protocol-defined MIME type metadata. As EPUB does not specify the protocol with which resources are fetched from the OCF container, the supplied MIME type of a container resource, used in the MIME type computation algorithms, is implementation specific.

rdeltour commented 1 year ago

(the note above could also add, informatively:

Reading systems may use the MIME types provided by the author in the package document item elements as the supplied MIME type of the resource when applying the [MIMESNIFF] algorithms.

although I'm not confident it matches the reality of current implementations, nor that it is particularly helpful to implementing or understanding the specification).

mattgarrish commented 1 year ago

MUST ... in a manner consistent with ...

Is there a way to make this more precise? "such that the result is consistent with"?

There are (deliberately obtuse) ways of reading "consistent with" that can make the requirement opaque (e.g., if I follow my own algorithm that leads to a different result, is that "consistent" since they both require following algorithms?).

bduga commented 1 year ago

Do we really need to add anything, though? Presumably any resources loaded by the UA of the reading system will implement mimesniff, since it is required via html. Though, I suppose knowing that you have an html document in the first place requires knowing the mime type. Ugh. This is a bit of a rat's nest - do reading systems need to run the mimesniff algorithm at ingestion time, to make sure the epub is valid? So, for instance, if a spine item claims to be html, but applying mimesniff calculates the type as json, should the reading system never even try to display the spine item?

mattgarrish commented 1 year ago

Do we really need to add anything, though?

Given where we are in the revision, I'm fine if we want to defer the issue. Last minute additions have a way of needing to be fixed later. But I'll leave it to the chairs to decide.

iherman commented 1 year ago

(My administrative comment...)

Do we really need to add anything, though?

Given where we are in the revision, I'm fine if we want to defer the issue. Last minute additions have a way of needing to be fixed later. But I'll leave it to the chairs to decide.

That would be my option. We are at the point when we want to seriously look at the implementation reports and, hopefully, move ahead to PR and then to a Rec. Adopting https://github.com/w3c/epub-specs/issues/2491#issuecomment-1340092302 (which, content-wise, sounds o.k. to my non-expert eyes) means that:

  1. We have to republish a CR snapshot, adding at least another month to our schedule
  2. We have to create a set of tests
  3. We have to get implementers (who may have already finalized their implementation reports) to add some extra round of testing

We will have to discuss how we will proceed with EPUB 3.3 maintenance. One option is that the document will be turned into a "living standard", i.e., a simple maintenance WG may make such changes, possibly one by one, to republish new versions. I think we do not run into any danger of interoperability today if we defer this issue to those times...

cc @shiestyle @wareid @dauwhe

rdeltour commented 1 year ago

@mattgarrish

MUST ... in a manner consistent with ...

Is there a way to make this more precise?

I've taken this language from HTML.

@bduga

Do we really need to add anything, though? Presumably any resources loaded by the UA of the reading system will implement mimesniff, since it is required via html. Though, I suppose knowing that you have an html document in the first place requires knowing the mime type. Ugh. This is a bit of a rat's nest - do reading systems need to run the mimesniff algorithm at ingestion time, to make sure the epub is valid? So, for instance, if a spine item claims to be html, but applying mimesniff calculates the type as json, should the reading system never even try to display the spine item?

These are all good questions, and it is similar questions that led me to open this issue…

The fact that EPUB does not specify how resources are loaded from the container is the real issue here. I think we nailed something to be clarified in EPUB 3.4 😁

Do we really need to add anything, though?

Given where we are in the revision, I'm fine if we want to defer the issue. Last minute additions have a way of needing to be fixed later. But I'll leave it to the chairs to decide.

Fine by me. Can we still add an informative note, in the spirit of the one I proposed above? I think it would be relevant to acknowledge the issue, and clarify that it is currently implementation specific.

mattgarrish commented 1 year ago

I've taken this language from HTML.

Which says you must "obtain and interpret". That's the precision lacking here. "Determine" speaks to process, not outcome.

iherman commented 1 year ago

The issue was discussed in a meeting on 2022-12-08

List of resolutions:

Resolution #1: Defer issue 2491 until evidence of an issue is found.

### 2. consider specifying how EPUB interacts with the MIME sniffing standard (issue epub-specs#2491)

_See github issue [epub-specs#2491](https://github.com/w3c/epub-specs/issues/2491)._

**Dave Cramer:** this was pointed out by Romain that as part of the HTML it says you have to determine what kind of resource it is because HTTP headers might be wrong.
… browsers realized that they might have to figure out what resources really are.
… every browser did that differently, which resulted in interop issues.
… so an algorithm was written in a spec.
… epub doesn't currently say how this should happen.
… but it doesn't seem like there's a problem we need to solve here.
… it might be good to clean this up, but adding normative language about this now would delay going to PR.
… we would need tests, could invalidate implementations.
… leaning towards leaving it for now.

**Brady Duga:** agree. There's not a problem.
… it is specified what should happen, because we have UA, and UAs are supposed to use MIMESNIFF per HTML.
… one of the inputs to the algorithm is 'specified MIME type', so question is should that be the value from the manifest?
… and it seems that that value will get ignored anyway, that's not why it's there.
… that manifest value is there for fallbacks.

**Dave Cramer:** given the varieties of approaches that can be taken (e.g. HTTP server, etc.), do we have to mess with that?
… feels like we should not go there unless we're solving an actual interop problem.

**Matt Garrish:** don't like the idea of pointing out that there is no standard for this until we come up with a solution.

**Dave Cramer:** i've been working on epub for a decade and I didn't realize the MIMESNIFF was an issue until it was raised.
… my inclination is not to make any changes now.

**Matt Garrish:** defer it.

**Dave Cramer:** if someone can come up with a test that behaves different in different RS, or finds an example in the real world, definitely bring it up in the next round.

> **Proposed resolution: Defer issue 2491 until evidence of an issue is found.** *(Wendy Reid)*
> *Brady Duga:* +1.
> *Dave Cramer:* +1.
> *Shinya Takami (高見真也):* +1.
> *Wendy Reid:* +1.
> *Matthew Chan:* +1.
> *Matt Garrish:* +1.
> *Toshiaki Koike:* +1.
> *David Hall:* +1.
> ***Resolution #1: Defer issue 2491 until evidence of an issue is found.***
> *Dave Cramer:* Makoto: +1.

rdeltour commented 1 year ago

Dave Cramer: if someone can come up with a test that behaves different in different RS, or finds an example in the real world, definitely bring it up in the next round.

FWIW, I did find some cases where different RS behaved differently.

With the following items declared in the package doc:

<item id="svg-unknown-extension" href="svg.unknown" media-type="image/svg+xml"/>
<item id="svg-foreign" href="svg.foreign" media-type="image/unknown+xml" fallback="png"/>
<item id="png" href="png.png" media-type="image/png"/>

And used in HTML with img elements:

<img src="svg.unknown" alt=""/>
<img src="svg.foreign" alt=""/>

Tested with Apple Books 1.19, Thorium v2.10, ADE v4.0.x and v4.5.11.

| Reading System | SVG with unknown extension | SVG with foreign type |
|---|---|---|
| Apple Books | ✅ rendered | ❌ not rendered |
| Thorium | ✅ rendered | ❌ not rendered |
| ADE 4.0 | ❌ not rendered | ❌ not rendered |
| ADE 4.5 | ✅ rendered | ✅ fallback rendered |

I attached a test EPUB with more cases and examples: mimetypes.zip

I know these are edge cases, unlikely to pop up in real-world content. But it does show that RSs behave differently in how they load resources and handle MIME types.

Matt Garrish: don't like the idea of pointing out that there is no standard for this until we come up with a solution.

I believe an informative note would not hurt. But if it's just me, it won't keep me from sleeping at night :)