"IRI of the Package Document": what is this exactly?

w3c / epub-specs

Shared workspace for EPUB 3 specifications.


"IRI of the Package Document": what is this exactly? #1374

Closed iherman closed 2 years ago

iherman commented 4 years ago

We may not be in a position to provide an absolutely clean definition (so maybe some editorial hand waving would be necessary) but…

§4.2.3.4.2 The item Element says:

Each item element in the manifest identifies a Publication Resource by the IRI [RFC3987] provided in its href attribute. The IRI MAY be absolute or relative. In the case of relative IRIs, the IRI of the Package Document is used as the base when resolving to absolute IRIs. The resulting absolute IRI MUST be unique within the manifest scope.

The intention is clear but what is the "IRI of the Package Document"? After all, the package document is part of a ZIP file, it is not really on the Web, ie, it is not clear what its IRI is.

Can we say something more precise about this?

llemeurfr commented 4 years ago

This issue is why we introduced the notion of "root directory" in the LPF format. We still used URLs (we decided not to use the IRI term) in "Contents within the Package MUST reference these resources [those in the package] via relative-URL strings [url]."

iherman commented 4 years ago

Ah yes, indeed. We may want to get inspired and bring this over to the EPUB spec...

mattgarrish commented 4 years ago

Is there a difference from the OCF abstract container's root directory? Doesn't it already establish the virtual directory structure within the zip?

iherman commented 4 years ago

I think the difference is not in the terms/concept but the surrounding explanation. I find the few extra terms in the LPF spec on "virtual in nature", etc, helpful. And a reference to this from the item element definition may also be helpful.

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-04-01

List of resolutions:

Resolution No. 1: Merge PR 1468

### 2. Clarify base IRI

_See github pull request [#1468](https://github.com/w3c/epub-specs/pull/1468)._

**Dave Cramer:** this is a PR about base IRI in package documents

_See github issue [#1374](https://github.com/w3c/epub-specs/issues/1374), [#1456](https://github.com/w3c/epub-specs/issues/1456)._

**Matt Garrish:** basically all the PR does is define base IRI for package because it wasn't clear how that was to be calculated
… it is defined for container.xml, but for package doc there was just a stray sentence that absolute IRIs are to be calculated from base IRI of document
… Ivan wanted some clarification
… PR pulls out that statement and elaborates on it
… Laurent suggested that maybe we define everything in core docs as paths rather than IRIs
… not sure why we'd want to do that since we already define abstract container to allow us to use IRI language
… how far do we want to get into relative paths vs absolute paths
… can we just clean up what we already have, or do we want to take up this IRI vs path question at this point?

**Dave Cramer:** i'm happy with PR
… worried about Laurent's idea because not sure we want to start talking about paths when all the other specs that we rely upon are already happy about how we define things
… also, this issue didn't come out of a concrete problem with a RS or similar, it came from abstract issue about spec language

**Matt Garrish:** the PR seemed to make Ivan happy
… from the perspective of what we need to describe here, I think we've done enough
… the question about changing to path language leads off into other areas

**Dave Cramer:** i think we should accept the PR, and then if Laurent wants to raise the other question, maybe he can come back with a more detailed rationale

**Matt Garrish:** there was an issue about whether relative IRIs MUST be resolved, but it was only ever the intention that it be possible if you need to do it

> **Proposed resolution: Merge PR 1468** *(Wendy Reid)*
>
> *Dan Lazin:* +1
> *Ben Schroeter:* +1
> *Matt Garrish:* +1
> *Toshiaki Koike:* +1
> *Wendy Reid:* +1
> *Matthew Chan:* +1
> *Masakazu Kitahara:* +1
> *Brady Duga:* +1
> *Shinya Takami (高見真也):* +1
>
> ***Resolution #1: Merge PR 1468***

dauwhe commented 3 years ago

@iherman are you OK closing this now after #1468 ?

iherman commented 3 years ago

Yes :-)

rdeltour commented 3 years ago

Sorry to come late to the party. After looking at an issue in EPUBCheck (see the ref linked above), I'm thinking this is still underspecified.

I think the intention is that:

path-relative-scheme-less-URL strings are resolved relatively to the path of the Package Document in the container
path-absolute-URL strings are resolved relatively to the root of the container
not sure about other kinds of relative URLs.

But our issue is that since the URL of the Package Document is not defined, its path within the container can be described either in the path part of the URL (e.g. https://example.org/moby-dick/EPUB/content.opf), or somewhere else, like in a fragment (e.g. file://epub.zip#path=/EPUB/content.opf). The result of parsing a relative URL based on these URLs will be very different.

Consider the following container content:

├── META-INF
│   └── container.xml
├── EPUB
│   ├── content.opf
│   └── xhtml
│       ├── content.xhtml
│       └── …
└── …

If reading system A expands the container to the local file system, the URL of the package document can be file://path/to/epub/EPUB/content.opf.
If reading system B places it on the Web, the URL can be https://example.org/EPUB/content.opf
If reading system C uses a proprietary URL to describe the Zip internals, the URL can be file://path/to.epub#path=/EPUB/content.opf
If reading system D uses a "jar URL", the URL can be jar:file:/path/to.epub!/EPUB/content.opf

Now, for the following manifest item:

<item href="xhtml/content.xhtml" media-type="application/xhtml+xml" />

Examples A and B would resolve the item URL to file://path/to/epub/EPUB/xhtml/content.xhtml and https://example.org/EPUB/xhtml/content.xhtml. That's expected and reasonable.

But example C would resolve the item URL to file://path/xhtml/content.xhtml. And I cannot tell about example D without rereading carefully the URL parsing algorithm. A cursory read tells me that jar URL is invalid (what if the URL parsing algo returns null?).
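The A, B and C results above can be reproduced with a minimal sketch. It uses Python's `urllib.parse.urljoin`, which follows RFC 3986 rather than the URL Standard's parsing algorithm; the assumption is that the two agree for these particular inputs. Case D is deliberately left out, since the thread does not settle what it resolves to. The `BASES` table and `resolve_all` helper are illustrative names only.

```python
from urllib.parse import urljoin

# Candidate base URLs for EPUB/content.opf, copied from cases A, B and C above.
BASES = {
    "A": "file://path/to/epub/EPUB/content.opf",
    "B": "https://example.org/EPUB/content.opf",
    "C": "file://path/to.epub#path=/EPUB/content.opf",
}

HREF = "xhtml/content.xhtml"  # the manifest item's href


def resolve_all(href: str) -> dict[str, str]:
    """Resolve one manifest href against every candidate base URL."""
    return {name: urljoin(base, href) for name, base in BASES.items()}


if __name__ == "__main__":
    for name, url in resolve_all(HREF).items():
        print(name, url)
```

In case C the container-internal path lives in the fragment, which is discarded during resolution, so the href is resolved against `/to.epub` and the `EPUB` directory is lost.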

In other words, all we have is Schrödinger’s URLs: we can't tell if they’re conforming or non-conforming (i.e. identifying container resources or not) until we open a reading system. 🤔

Shouldn't we reopen this issue?

iherman commented 3 years ago

@rdeltour

But example C would resolve the item URL to file://path/xhtml/content.xhtml.

Shouldn't this be: file://path/to.epub/xhtml/content.xhtml ?

And I cannot tell about without rereading carefully the URL parsing algorithm. A cursory read tells me that jar URL is invalid (what if the URL parsing algo returns null?).

You mean "cannot tell about D", right? (not that I have an answer).

Shouldn't we reopen this issue?

Yes. I will do so (through this comment).

iherman commented 3 years ago

The question I have: are C and D examples of real Reading Systems, or are they imaginary? In other words, can we allow ourselves to say that the RS must behave, conceptually, like A or B (and how they do that is up to them)?

rdeltour commented 3 years ago

example C would resolve the item URL to file://path/xhtml/content.xhtml.

Shouldn't this be: file://path/to.epub/xhtml/content.xhtml ?

No, see the live URL viewer (JSDOM implementation of the URL standard, closely following the spec).

You mean "cannot tell about D", right? (not that I have an answer).

Right. Edited, thx 😊

The question I have: are C and D examples of real Reading Systems, or are they imaginary? In other words, can we allow ourselves to say that the RS must behave, conceptually, like A or B (and how they do that is up to them)?

Yes, I would really like RS folks to chime in! But even if they use something like C and D, I assume they "resolve" like A and B (which is what has been implicitly assumed historically, as far as I'm aware).

mattgarrish commented 3 years ago

Isn't the problem the lack of clarity about the root of the container as in #1681? I opened that because root-relative paths are never going to be consistent, but it's the lack of a common root dir that's the big problem.

The current definition of the OCF root directory explicitly says the root is optional, even if I don't believe this is formalized normatively (but then how to unpack isn't normatively described, either):

an EPUB Reading System may or may not generate a physical root directory for the contents of the OCF Abstract Container if it unzips the contents

If there were consistency here, would it matter what scheme you use to reference/resolve the resources?

How the content is served doesn't matter, in my understanding, since this is only a check of the abstract container's integrity. But this also seems to make this statement problematic:

All relative-URL-with-fragment strings [URL] MUST, after parsing to URL records [URL], identify resources within the OCF Abstract Container

On the one hand, we're talking about an "abstract" container, but then on the other we're parsing for physical URL records. The root directory may have disappeared in the meantime, but we're implicitly assuming it is kept for checking this requirement.

We've run into this problem in the past with multiple renditions not working when shared content is not below the package document(s) in a common content directory, even though the epub is valid. Some reading systems don't create a root to match the zip, so you can't have sibling directories in the zip root, one for each rendition.

All we appear to want to know is a) is the resource inside the zip file, and b) is it not a duplicate of another entry. Resolving to URLs doesn't appear to have any other function.

You can't test a) unless you explicitly create a root that matches the zip root. And b) should sort itself out since the application will have internal consistency in however it then resolves the urls.

So do reading systems verify the manifest entries, or is this only an epubcheck problem to fix? If it's the latter, can't you assume a root directory that is the root of the zip and work from there? If the content doesn't work on any given reading system... well, that's sort of the state of the world now.
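As a rough sketch of what checks (a) and (b) could look like in a validator that assumes the root directory is the root of the zip: the function below works on plain container paths, not URL records. The name `check_manifest` and its arguments are hypothetical, and fragments, percent-encoding and path-absolute hrefs are not handled specially (a path-absolute href is simply reported as not found).

```python
import posixpath


def check_manifest(zip_entries, opf_path, hrefs):
    """Return (missing, duplicates) for relative manifest hrefs.

    Assumes the container root is the zip root. All names are illustrative.
    """
    opf_dir = posixpath.dirname(opf_path)
    entries = set(zip_entries)
    seen = {}  # normalised container path -> first href that named it
    missing, duplicates = [], []
    for href in hrefs:
        # Collapse "." and ".." segments relative to the zip root.
        target = posixpath.normpath(posixpath.join(opf_dir, href))
        # (a) paths that climb above the root, or have no zip entry,
        #     do not identify a resource in the container.
        if target == ".." or target.startswith("../") or target not in entries:
            missing.append(href)
            continue
        # (b) two hrefs normalising to the same entry are duplicates.
        if target in seen:
            duplicates.append((seen[target], href))
        else:
            seen[target] = href
    return missing, duplicates
```

Because the check never leaves the zip's own namespace, the result does not depend on how any particular reading system unpacks or serves the content.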

iherman commented 3 years ago

example C would resolve the item URL to file://path/xhtml/content.xhtml.

Shouldn't this be: file://path/to.epub/xhtml/content.xhtml ?

No, see the live URL viewer (JSDOM implementation of the URL standard, closely following the spec).

Oops, you're right. Sorry for the noise.

rdeltour commented 3 years ago

Isn't the problem the lack of clarity about the root of the container as in #1681? I opened that because root-relative paths are never going to be consistent, but it's the lack of a common root dir that's the big problem.

The problem is that we're making conformity statements based on an undefined object, and the nature of this object can very much change the interpretation of these statements. Specifically, in EPUB RS "3.1 Parsing Relative URLs":

To parse relative-URL-with-fragment strings [URL] in the Package Document, Reading Systems MUST use the URL of the Package Document as the base URL [URL].

This statement is problematic for a validator. And it could be problematic for reading systems too (in which case they would just ignore it).

All we appear to want to know is a) is the resource inside the zip file, and b) is it not a duplicate of another entry. Resolving to URLs doesn't appear to have any other function.

For EPUBCheck, yes. For a reading system: what should happen if the resource is outside the ZIP file? What if it's a duplicate? Do we specify the mandated behavior?

In any case, to specify how a UA assesses (a) and (b), the current wording doesn't work. We might work around the current issue by assigning an arbitrary URL to the container root. I'm not sure.

mattgarrish commented 3 years ago

The problem is that we're making conformity statements based on an undefined object

Right, we let this slip under the radar in the past by not bringing in absolute URLs at all. All we said was that the relative paths had to resolve to resources in the container without explaining at all how that should happen. It intuitively makes sense until you get into the problem of an unzipped epub and an "abstract container" not having a common root dir.

To parse relative-URL-with-fragment strings [URL] in the Package Document, Reading Systems MUST use the URL of the Package Document as the base URL [URL].

This statement is problematic for a validator.

We might work around the current issue by assigning an arbitrary URL to the container root. I'm not sure.

Ya, I'm not saying what we have is helpful at all, but if we were to always assume for validation the root dir is the root of the ocf can't you parse the relative URLs and determine if they are below that directory?

But it's worth questioning why we try to block access this way when putting in an absolute URL that references the file system doesn't raise any concern. You just have to call it a remote resource. Maybe that's all we should require for resources outside the container and leave it to reading systems to similarly determine whether they want to use these?

what should happen if the resource is outside the ZIP file?

Okay, you got me here! I wasn't thinking about the unwritten rules... 😄

But how do we solve this without a fixed path to the package document, and isn't it too late for that? Files inside the container may appear outside the container to reading systems depending on how they unpack them, too.

That's why I'm not sure we can solve this outside of epubcheck, at least not beyond recommendations that accommodate both possibilities. There can only be internal consistency within any given application processing the epub.

What if it's a duplicate?

We tend to tolerate mistakes like this. The caveat to authors is if you don't follow these kinds of rules then bad things will happen. In this case, the reading system may get conflicting information about the resource. It's possible it might break the spine, too, as if you navigate to the resource I imagine it will complicate reading systems looking up which spine item you're in if there's more than one entry that matches the resource.

I guess we could recommend that reading systems ignore duplicate entries, though that wouldn't solve the problem of inaccuracies between the listings.

iherman commented 3 years ago

I am still trying to get around the problem (also for #1681).

Looking at https://github.com/w3c/epub-specs/issues/1374#issuecomment-847978623 from @rdeltour, and using

<item href="/xhtml/content.xhtml" media-type="application/xhtml+xml" />

(ie, a path-absolute-URL) and looking at options A and B I get the URLs file://path/xhtml/content.xhtml and https://example.org/xhtml/content.xhtml, respectively. Which are, again, what one would expect.

What this means is that the A and B cases, i.e.,

the reading system A expands the container to the local file system; or
the reading system B places it on the Web

are consistent with the expectations as well as the URL spec.

Isn't it possible then to say (in our spec) that a Reading System MUST treat these URLs as if its implementation followed one of these two models? This does not mean that it must implement it exactly this way, but it must, when using a different approach, emulate one of the two options. Wouldn't that be enough for the spec? After all, if we consider an EPUB instance as a 'frozen website', those two options do represent a perfectly consistent mental model, so I do not think that approach would feel too strange...

mattgarrish commented 3 years ago

and looking at options A and B I get the URLs file://path/xhtml/content.xhtml and https://example.org/xhtml/content.xhtml, respectively. Which are, again, what one would expect.

How can you be sure this is what you will get when you don't know if the root directory of the epub will be preserved, though?

Nothing says it is wrong to expand it to:

https://example.org/nameOfEPUB/EPUB/xhtml/content.html

because you might also have in the same container:

https://example.org/nameOfEPUB/EPUB2/xhtml/content.html

in which case the / no longer refers to where you think it does.
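A small sketch of this point, assuming two hypothetical ways of placing the same container on the Web. It again uses `urllib.parse.urljoin` (RFC 3986 resolution), and the base URLs for `content.opf` are assumptions built from the examples above.

```python
from urllib.parse import urljoin

HREF = "/xhtml/content.xhtml"  # a path-absolute-URL string

# Hypothetical locations of the package document in two deployments.
AT_HOST_ROOT = "https://example.org/EPUB/content.opf"
UNDER_SUBDIR = "https://example.org/nameOfEPUB/EPUB/content.opf"

# Where the unpacked container would start in the second deployment.
SUBDIR_ROOT = "https://example.org/nameOfEPUB/"


def resolve(base: str, href: str = HREF) -> str:
    """Resolve a manifest href against the given package document URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    print(resolve(AT_HOST_ROOT))
    print(resolve(UNDER_SUBDIR))
    # The leading "/" always means the host root, not the container root.
    print(resolve(UNDER_SUBDIR).startswith(SUBDIR_ROOT))
```

Both bases produce the same host-rooted URL, so in the second deployment the resolved URL falls outside the directory that holds the container.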

iherman commented 3 years ago

Ouch, you are right: for root-relative paths https://github.com/w3c/epub-specs/issues/1374#issuecomment-848662667 does not work. And, having gone through that, I think I agree with the proposal in #1681 that those should be disallowed.

But it does for path-relative-scheme-less-URLs, doesn't it?

mattgarrish commented 3 years ago

But it does for path-relative-scheme-less-URLs, doesn't it?

It all depends on how the EPUB is structured and then unpacked. If you put all your content in a single directory and have the package document at the root, there shouldn't be an issue.

If you refer to files across sibling directories in the root, then you're back in trouble again. For example, if you had this:

├── META-INF
│   └── container.xml
├── EPUB1
│   ├── content.opf
│   └── …
├── EPUB2
│   ├── content.opf
│   └── …
├── shared
│   ├── img
│        └── …

If you have a path from EPUB1 like '../shared/img/photo1.jpg', the file may not be there after the reading system unpacks the EPUB. It may only extract the directory where the first package file is.

(I'm using multiple renditions, but even a single-rendition epub doesn't have to be self-contained in a subdirectory. It's generally the norm, though, because of this problem.)
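To make the failure mode concrete, here is a sketch of a hypothetical reading system that unpacks only the directory containing the package document. The entry list mirrors the named parts of the tree above (elided entries are left out), and `extracted_entries` and `is_reachable` are illustrative helpers, not any real reading system's API.

```python
import posixpath

ZIP_ENTRIES = [
    "META-INF/container.xml",
    "EPUB1/content.opf",
    "EPUB2/content.opf",
    "shared/img/photo1.jpg",
]


def extracted_entries(entries, opf_path):
    """Entries kept by a reader that only unpacks the OPF's own directory."""
    opf_dir = posixpath.dirname(opf_path)
    prefix = opf_dir + "/" if opf_dir else ""  # OPF at the root keeps everything
    return {e for e in entries if e.startswith(prefix)}


def is_reachable(entries, opf_path, href):
    """True if href, resolved from the OPF's directory, names a known entry."""
    target = posixpath.normpath(posixpath.join(posixpath.dirname(opf_path), href))
    return target in entries


if __name__ == "__main__":
    href = "../shared/img/photo1.jpg"
    full = set(ZIP_ENTRIES)
    partial = extracted_entries(ZIP_ENTRIES, "EPUB1/content.opf")
    print(is_reachable(full, "EPUB1/content.opf", href))
    print(is_reachable(partial, "EPUB1/content.opf", href))
```

The same href is valid against the packed container but dangling after the partial extraction.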

And to make things even more fun, if you put the package documents in the root, there probably isn't an issue (but I haven't tested this):

├── META-INF
│   └── container.xml
├── content1.opf
├── content2.opf
├── EPUB1
│   └── …
├── EPUB2
│   └── …
├── shared
│   ├── img
│        └── …

But this is the problem with a requirement that involves checking the zipped content using URL records when we're not completely clear what will be unpacked. Epubcheck can make simplifying assumptions to check validity that reading systems cannot, at least as I understand @rdeltour's concern.

What happens after extraction is technically a separate problem, but it's also one we're quiet on. As @rdeltour has noted, what the reading system determines is available is not going to always be the same as what epubcheck does. But that's true now, too.

So we probably also have a few other issues to look at:

why do we care if relative paths refer to resources outside the container? what is a reading system supposed to do with them?
can we allow relative paths to refer to resources outside the container if they're marked as a remote-resource? (not that this solves the problem of checking if they're outside but makes processing consistent with any remote resource)
what does a reading system do if two manifest entries resolve to the same resource?
do we need stronger guidance on structuring the zip container and/or unpacking it?

rdeltour commented 3 years ago

Right, we let this slip under the radar in the past by not bringing in absolute URLs at all. All we said was that the relative paths had to resolve to resources in the container without explaining at all how that should happen.

Exactly! And my bad for not having spotted that earlier (during the URL PR review or EPUBCheck dev) 😅

Some comments on @mattgarrish’s list of questions:

So we probably also have a few other issues to look at:

why do we care if relative paths refer to resources outside the container? what is a reading system supposed to do with them?

This question is specifically depending on this current issue, which will define how relative URLs must be parsed. Depending on the approach, it could result in URLs that can refer to something outside the container, or not.

It all boils down to what base URL must be used to parse relative URLs. We have two high-level options:

the URL of the Package Document, as defined by the RS: we saw earlier that the result is unpredictable, depending on how the RS locates resources inside the container.
an arbitrary URL: in other words, an "as if" case, like @iherman said.

With the second option, given we already say "as if", we may go further than what @iherman proposed and assign an arbitrary URL to the root of the container. For instance https://epub.example.org/ or https://epub.w3.org/ or whatever. That way, we're sure relative URLs will never go outside the container (they cannot go below the root level). The benefit is that we have an unambiguous way to tell what resources are identified by all relative URLs, and that they always are in the container. The drawback of course is that it is arbitrary, and can give different results compared to how URLs would be parsed if the EPUB container was simply unpacked and served on the Web or local file system (but then, only for path-absolute relative URLs, and for relative URLs going below the root level).
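A minimal sketch of the "arbitrary root URL" idea, assuming `https://epub.example.org/` as the container root (one of the placeholders floated above). It relies on `urllib.parse.urljoin`, which drops excess `..` segments at the root as RFC 3986 describes; `container_path` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Arbitrary URL assigned to the container root, purely "as if".
CONTAINER_ROOT = "https://epub.example.org/"


def container_path(opf_path: str, href: str) -> str:
    """Resolve href as if the container lived at CONTAINER_ROOT.

    Returns the path of the identified resource relative to the root.
    """
    base = urljoin(CONTAINER_ROOT, opf_path)
    resolved = urljoin(base, href)
    if not resolved.startswith(CONTAINER_ROOT):
        # Only an absolute URL with another scheme or host can get here.
        raise ValueError(f"{href!r} is not a container-relative reference")
    return resolved[len(CONTAINER_ROOT):]


if __name__ == "__main__":
    for href in ("xhtml/content.xhtml", "../../../outside.xhtml", "/xhtml/content.xhtml"):
        print(href, "->", container_path("EPUB/content.opf", href))
```

Every relative reference, including path-absolute ones and ones with too many `..` segments, ends up identifying a location under the root, which is the property the proposal is after. The drawback noted above still applies: the answer can differ from what a plain unpack-and-serve deployment would give.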

can we allow relative paths to refer to resources outside the container if they're marked as a remote-resource? (not that this solves the problem of checking if they're outside but makes processing consistent with any remote resource)

I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as remote-resource?

In any case, if we make it so that relative URLs are never parsed to something identifying a location outside the container, this point is moot.

what does a reading system do if two manifest entries resolve to the same resource?

Yeah, good question 😊 (it's typically the kind of issue that was underspecified in previous author-centric EPUB spec, but that is worth some well-specified RS guidance). That can be discussed in a separate issue.

do we need stronger guidance on structuring the zip container and/or unpacking it?

I'm not sure I understand the issue there. Or your concern with the shared directory use case. I guess this is related to when you said earlier "Files inside the container may appear outside the container to reading systems depending on how they unpack them, too.", which I'm not sure I understood either 😅.

For me, the spec is rather clear on this one. It explicitly says that any location descendant from the root directory can be used in the publication. So it's the RS responsibility to keep everything under the root when unpacking the container, no?

Finally, while we're at it, we could add another question:

what to do with absolute file URLs?

file URLs have always been in a weird place in EPUB. Not strictly forbidden, but obviously not a good practice (and often used by mistake, when they happen at all).

Would it be reasonable to say only absolute URLs with a special scheme that is not file are conforming? And we allow RS to still process any non-conforming URL if they want.
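One way such a rule could be checked, as a sketch only: the URL Standard's special schemes are `ftp`, `file`, `http`, `https`, `ws` and `wss`, so the allowed set is those minus `file`. The `classify` function and its return labels are hypothetical, and scheme detection here uses `urllib.parse.urlsplit`, not the URL Standard's parser.

```python
from urllib.parse import urlsplit

# Special schemes from the URL Standard, minus "file".
ALLOWED_SCHEMES = {"ftp", "http", "https", "ws", "wss"}


def classify(href: str) -> str:
    """Label an href as 'relative', 'conforming' or 'non-conforming'."""
    scheme = urlsplit(href).scheme.lower()
    if not scheme:
        return "relative"  # no scheme: handled by relative URL resolution
    return "conforming" if scheme in ALLOWED_SCHEMES else "non-conforming"


if __name__ == "__main__":
    for href in ("xhtml/content.xhtml",
                 "https://example.org/video.mp4",
                 "file:///path/to/video.mp4"):
        print(href, "->", classify(href))
```

A reading system could still choose to process a "non-conforming" URL; the label only reflects what a validator would report.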

@mattgarrish I didn't create new issues for the separate questions we brought, I'll let you decide what you prefer as an editor. But if you'd like me to create these, let me know!

mattgarrish commented 3 years ago

That way, we're sure relative URLs will never go outside the container (they cannot go below the root level).

Right, I was inarticulately trying to suggest that the dev will know what the root corresponds to however they resolve URLs, so they should be able to determine this even though the parsed URL will be detached from the physical zip, but if we want to define a more formal method for checking that works for me.

The process just can't depend on the actual unpacking of the zip as variably done by reading systems, as we know that won't lead to consistent results.

I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as remote-resource?

I can contrive one if you want... 😄

But it's flaky at best and probably unrealistic outside of a very rigid content delivery system and setup (e.g., it might work in an internal documentation system if the author knew of information they could access on any corporate machine).

I just wonder why we explicitly ban these and not, as you say, absolute file urls. Banning the latter also works for me, as it's more a measure of consistency I'm after.

do we need stronger guidance on structuring the zip container and/or unpacking it?

I'm not sure I understand the issue there.

There isn't a requirement to preserve the zip root directory or any descendant content that isn't in the same directory as the package document. You'd think reading systems would preserve it, but past experience in multiple renditions showed that couldn't be relied on. The result is that some reading systems won't let you reach across sibling folders in the zip root because they don't appear to preserve them.

That's why we put this note in the multiple renditions spec: https://www.w3.org/TR/epub-multi-rend-11/#h-note

I know most EPUBs have a single "EPUB" directory where the content is stored, but that's not a requirement. If you don't follow that pattern, and don't have the package document in the root, bad things can happen (i.e., reading systems won't display the content that never got unpacked).

So who's at fault in this scenario? Should the reading system be required to unpack all content and ensure that all content below the root directory is available, even if it doesn't create an extra folder for the root directory? Should authors be more strongly warned not to rely on being able to access across sibling folders that are not below the package document?

But if you'd like me to create these, let me know!

Feel free to have a go at them! I can add this latter issue if you don't want it attributed to you... 😉

iherman commented 3 years ago

If you refer to files across sibling directories in the root, then you're back in trouble again. For example, if you had this:
├── META-INF
│   └── container.xml
├── EPUB1
│   ├── content.opf
│   └── …
├── EPUB2
│   ├── content.opf
│   └── …
├── shared
│   ├── img
│        └── …
If you have a path from EPUB1 like '../shared/img/photo1.jpg', the file may not be there after the reading system unpacks the EPUB. It may only extract the directory where the first package file is.

I guess I do not understand what I am seeing here. Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system? If so, if the file ../shared/img/photo1.jpg is not available, then we have a bug, but why is this dependent on the issue at hand?

mattgarrish commented 3 years ago

Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system?

No, that's the packed EPUB. It seems some reading systems look up the location of the package document and only unpack the directory it's in. So in the above case, any content in /EPUB2 and /shared are not available to the publication in /EPUB1.

That doesn't seem like it should be valid, as the definition only says the root dir is optional to create. But that's also only a definition and we don't say anything about what has to be unpacked or made accessible.

but why is this dependent on the issue at hand?

I only brought that in because you can't rely on checking if a file is "within the abstract container" after the reading system has unpacked the content. Files that were in the zip container are gone at that point.

rdeltour commented 3 years ago

That way, we're sure relative URLs will never go outside the container (they cannot go below the root level).

Right, I was inarticulately trying to suggest that the dev will know what the root corresponds to however they resolve URLs, so they should be able to determine this even though the parsed URL will be detached from the physical zip, but if we want to define a more formal method for checking that works for me.

Yes, exactly. Again the idea is to use spec language to unambiguously define what the relative URLs identify. An RS is of course free to implement that as they please!

The process just can't depend on the actual unpacking of the zip as variably done by reading systems, as we know that won't lead to consistent results.

Sure.

I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as remote-resource?

I can contrive one if you want... 😄

But it's flaky at best and probably unrealistic outside of a very rigid content delivery system and setup (e.g., it might work in an internal documentation system if the author knew of information they could access on any corporate machine).

I just wonder why we explicitly ban these and not, as you say, absolute file urls. Banning the latter also works for me, as it's more a measure of consistency I'm after.

Interesting use case 😊. That use case is feasible even if out-of-container relative URLs are made impossible, as long as file URLs are possible (SHOULD NOT being my recommendation: warn about them, but still allow them for edge cases like this).

do we need stronger guidance on structuring the zip container and/or unpacking it?

I'm not sure I understand the issue there.

There isn't a requirement to preserve the zip root directory or any descendant content that isn't in the same directory as the package document. You'd think reading systems would preserve it, but past experience in multiple renditions showed that couldn't be relied on. The result is that some reading systems won't let you reach across sibling folders in the zip root because they don't appear to preserve them.

I wasn't aware of RS interop issues there. The EPUB container spec says ("File and directory" section):

EPUB Creators MAY locate all other files within the OCF Abstract Container in any location descendant from the Root Directory, provided they are not within the META-INF directory.

which quite unambiguously says this is allowed? And that RS should theoretically handle that fine?

So who's at fault in this scenario? Should the reading system be required to unpack all content and ensure that all content below the root directory is available, even if it doesn't create an extra folder for the root directory? Should authors be more strongly warned not to rely on being able to access across sibling folders that are not below the package document?

In the current spec, that's an RS bug in my book. That said, if we want to better align with the real-world implementation practices, then we can certainly add more restrictions to the current statements!

But if you'd like me to create these, let me know!

Feel free to have a go at them! I can add this latter issue if you don't want it attributed to you… 😉

OK will do! (later today, kid ill at home 😅)

iherman commented 3 years ago

Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system?

No, that's the packed EPUB. It seems some reading systems look up the location of the package document and only unpack the directory it's in. So in the above case, any content in /EPUB2 and /shared are not available to the publication in /EPUB1.

That doesn't seem like it should be valid, as the definition only says the root dir is optional to create. But that's also only a definition and we don't say anything about what has to be unpacked or made accessible.

Yes, that sounds absolutely wrong. We should say that the full ZIP package content should be available. I am actually surprised this is not the case...

mattgarrish commented 3 years ago

I am actually surprised this is not the case...

Ya, I don't know why we didn't log an issue when we were developing the MR spec. I guess it got forgotten after we wrote the note.

which quite unambiguously says this is allowed? And that RS should theoretically handle that fine?

Definitely allowed, but it's the theoretical part that always does us in. The spec doesn't disallow extracting only the directory where the package document is located, and it probably works fine for the vast majority of EPUBs.

Wish I could remember which reading systems we got tripped up by, but in any case we need a proper requirement.

OK will do! (later today, kid ill at home 😅)

I've got a post-vaccine queasy adult at home today, and sadly it's me, so I'm in no better a boat... 🤢

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-06-10

List of resolutions:

Resolution No. 2: Absolute URLs for manifest items should have a special scheme that is not file:, close issue 1688

View the transcript

### 2. URLs and the package document _See github issue [#1681](https://github.com/w3c/epub-specs/issues/1681), [#1374](https://github.com/w3c/epub-specs/issues/1374), [#1688](https://github.com/w3c/epub-specs/issues/1688), [#1686](https://github.com/w3c/epub-specs/issues/1686)._ **Dave Cramer:** this is a bunch of issues that revolve around how you interpret URLs in the package document, especially if they're absolute URLs … came from an issue in epubcheck … and there's also an older issue about what the IRI of the package document is … or what if there are file scheme URLs in the manifest … and what happens if two URLs resolve to the same item in the manifest? **Matt Garrish:** in epubcheck there was a root-relative URL that caused an error, and that spawned all of this … e.g. "/something/thing" … so what is the root of the epub? … to me it doesn't make sense that we even allow these root-relative URLs … the root differs based on the RS … and Romain mentioned that we require that all resources resolve to something inside container, but depending on what RS does, there is even ambiguity about what that even is **Dave Cramer:** in issue 1688 Romain he suggests that manifest items should have one of the special schemes (_**except**_ `file:`) **Matt Garrish:** there are edge cases where file scheme items make sense, but not generally for epub **Dave Cramer:** it goes against epub as a portable format, and the file scheme ties the epub to a specific file system … how much out there does have file URLs on purpose, not by accident? **Matt Garrish:** never heard of one … and they'd end up being remote resources **Dave Cramer:** okay, so what if we just say no file URLs in epub? … what is the risk that we break something? 
… maybe this is something where we try to enforce it and see if anyone complains **Matt Garrish:** most RS probably won't do anything with file URL … probably security concern **Wendy Reid:** depending on platform you might not even be able to access parts of the file system (e.g. iOS apps) **Dave Cramer:** can we start by resolving on this point from 1688? > **Proposed resolution: Absolute URLs for manifest items should have a special scheme that is not `file:`, close issue 1688** *(Wendy Reid)* > *Dave Cramer:* +1 > *Matthew Chan:* +1 > *Matt Garrish:* +1 > *Wendy Reid:* +1 > *Toshiaki Koike:* +1 > *Shinya Takami (高見真也):* +1 > *Ben Schroeter:* +1 > ***Resolution #2: Absolute URLs for manifest items should have a special scheme that is not `file:`, close issue 1688*** **Dan Lazin:** is there a use case for some of these other schemes? Why would you have an FTP in your epub? **Matt Garrish:** if we go too far, do we prevent future stuff? will we have to come back and re-add this in the future? … FTP kind of fits within the web framework … maybe we just leave it to authors to stick with HTTP, HTTPS, etc. **Ben Schroeter:** is the idea that if we disallow file scheme, then we also disallow "slash URLs"? **Matt Garrish:** not sure those are the same … i think 1681 is contingent on us forcing RS to unpack epub in a certain way … otherwise we can't say there is a single consistent root that can be referenced … and we don't tell RS how/where to unpack right now … this kind of came up 5 years ago with multiple rendition, but we left it buried in the discussions we had **Dave Cramer:** what would be the consequences of forbidding root-relative paths? **Matt Garrish:** not sure there are any, because epubcheck had forbidden these until a recent update … we're reasonably safe from backwards compatibility point of view **Dave Cramer:** and this is just for href on manifest? **Matt Garrish:** no, this would be anywhere, e.g. 
in content docs too … all the "../" stuff would still be okay … i proposed somewhere that we say all content must be below the package document … if we could enforce an authoring requirement that made a root, then we could enforce these relative paths … but maybe it's cleaner to just disallow them **Dan Lazin:** do we support the base tag? … and does that have implications for the handling of these issues? **Dave Cramer:** we've been phasing out `xml:base`, it's been forbidden from package file for example **Dan Lazin:** the base tag allows you to define what the relative path is relative to … so if we're allowing or disallowing certain types of URLs, maybe we should take a stance on base too … not sure what stance though **Matt Garrish:** base would force you to have all external resources, right? It exists, but I don't imagine anyone really going there > *Dan Lazin:* [https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base) **Marisa DeMeglio:** there was a resolution a few weeks ago about dumping `xml:base` from the spec **Dave Cramer:** and that's separate from the HTML base element … i think i just want to say no root-relative URLs **Dan Lazin:** if you set base to some website, and then use root-relative URLs, your URLs would appear to be relative, when they are actually absolute … but maybe that's too far of a stretch **Dave Cramer:** but can we really say anything about base because it's part of HTML? **Matt Garrish:** so you must not use root-relative URLs unless you use a base? … but it also applies to SVGs, to the package document... **Dan Lazin:** what was the harm in not banning root-relative? **Matt Garrish:** because the RS might treat zip root as the root, but they could also treat location of package doc as root … so no consistent root **Dan Lazin:** maybe permit it, but use _SHOULD NOT_? … is it acceptable for an author to write an epub for a specific RS? 
… and where it has undefined behavior for other RS … probably acceptable, right? **Dave Cramer:** yes, e.g. with books that only work with iBooks because of scripting support **Matt Garrish:** maybe just a note that root-relative could cause issues if authors use it? **Dave Cramer:** so does that mean that there are epubs that could be built to work in some RS, but expose an interop issue if opened in another RS? **Matt Garrish:** right … usually this happens in epubs that try to go from one folder to a sibling folder … but when all content is below the package document its fine … but we don't specify that right now, only that content must be below the root **Dave Cramer:** not sure what the right course of action is, but maybe we can continue this another time with Romain present **Wendy Reid:** we need RS people here on next call that know exactly what RSes are doing right now **Marisa DeMeglio:** one of the github threads has a sample, but I wasn't able to download it … maybe if we wrote to the mailing list Romain could provide samples … would also love to have a list of epubs that must absolutely continue to work > *Dan Lazin:* I have filed [https://github.com/w3c/epub-specs/issues/1699](https://github.com/w3c/epub-specs/issues/1699) **Matt Garrish:** also, there's not much hand authoring, and most tools will put all the content into one folder … we only ran into an issue with this with multiple renditions, and that hasn't really gone anywhere … so is this maybe more of a theoretical issue
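As a rough illustration of the checks discussed in this meeting, the sketch below classifies a manifest `href` value. It assumes the WHATWG list of special schemes (`ftp`, `file`, `http`, `https`, `ws`, `wss`) and flags `file:` per Resolution #2; the category labels and the `classify_href` helper are hypothetical and do not correspond to epubcheck messages.

```python
from urllib.parse import urlsplit

# WHATWG "special" schemes minus file:, per the resolution above.
ALLOWED_ABSOLUTE_SCHEMES = {"ftp", "http", "https", "ws", "wss"}


def classify_href(href):
    """Classify a manifest href; the labels are illustrative only."""
    parts = urlsplit(href)
    if parts.scheme == "file":
        return "file-scheme"          # ties the EPUB to one file system
    if parts.scheme:
        if parts.scheme in ALLOWED_ABSOLUTE_SCHEMES:
            return "absolute"         # remote resource
        return "absolute-other-scheme"
    if parts.netloc:
        return "scheme-relative"      # "//host/path"
    if parts.path.startswith("/"):
        return "root-relative"        # no consistent root across RSes
    return "relative"


for href in [
    "chapter1.xhtml",
    "../shared/style.css",
    "/something/thing",
    "https://example.org/font.woff",
    "file:///home/user/book/chapter1.xhtml",
]:
    print(f"{href!r:45} {classify_href(href)}")
```

Whether a root-relative value should be an error or only a warning is exactly what the discussion above leaves open, so the sketch only reports the category.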

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-06-18

no resolutions were taken

View the transcript

### 2. What is the relationship between URLs and the package doc (what is home?) _See github issue [#1681](https://github.com/w3c/epub-specs/issues/1681), [#1374](https://github.com/w3c/epub-specs/issues/1374), [#1687](https://github.com/w3c/epub-specs/issues/1687), [#1686](https://github.com/w3c/epub-specs/issues/1686)._ **Wendy Reid:** we started this discussion last week. Core question is: Where is home (given we allow both relative and absolute URLs) in the epub context **Romain Deltour:** we have to keep in mind: 1) what things have to be put in epub core spec, and 2) what are the rules for epub RS spec … the latter is more important because we can say whatever we want in core, but authors may deviate, and then it is up to RS to decide how to react … also, i think we should look into question of what is home first, and that will inform what to do with root-relative URLs **Wendy Reid:** okay, so what is the IRI of the package document then? **Ivan Herman:** we can't really answer what the IRI of the package is, and i'm not sure we should try … rather, what do we expect RS to do conceptually? … the whole epub structure relies on the idea that epub is kind of a frozen website … i think we say this is the conceptual model within which epub exists, and we should not say exactly how RS can do that … just as long as the observable behavior is identical … so as long as after epub is unpacked there is a root that we can refer to, it is fine … and whether this root is the same IRI of the package or not is none of our business **Matt Garrish:** we have 2 issues, 1) are these resources within the container and how do we determine that? 2) what happens when you unpack, and where do these resources go? … so I don't think there can be a consistent root unless we start to enforce these things … inside epub resources can be within the container, but that might not be true once the epub is unpacked … e.g. do you have to unpack everything in the zip? 
Or just whatever is in the epub under the package? **Brady Duga:** so absolute URIs are not allowed, and what relative IRI is interpreted by the language in question (e.g. HTML, or CSS, depending on what type of document it is) … so why do we have to define what root is if we don't allow absolute URIs? **Matt Garrish:** i think the issue is root-relative is still a relative path, so do we have to say "all relative is allowed, except _root_ relative" **Romain Deltour:** even with regular relative URLs, the spec is silent on what happens if the relative URL tries to go below the container root? … and is it possible to look at RSes today and test what they do? **Ivan Herman:** i was surprised to find that some RS don't automatically unpack the whole zip … i thought this was obvious … but then what if there is a relative URL that is not on manifest, but also happens to be in zip? **Matt Garrish:** we have requirement in OCF that all relative resources must resolve to something in container … i don't think that was the issue **Gregorio Pellegrino:** i know that Colibrio streams files out of zip without unzipping **Wendy Reid:** yes, there are more examples of RS doing that beyond that **Ivan Herman:** but conceptually an RS unpacks the whole zip file onto a domain (as if it were a file system). If we do that then all these concepts become clear … but i'm not sure if a streaming based solution meets that conceptual model **Hadrien Gardeur:** streaming from zip is what Readium does by default … unzipping is a problem for DRM. Some expectation that you keep the epub zipped. 
And we've done some optimizations with this in mind **Romain Deltour:** i'm surprised that resources that are not in the same directory tree as the OPF would not be accessible in the epub … going back to the point about defining what should happen conceptually, the spec could say that we define a URL that must be used as the base when resolving relative URLs (e.g., [https://ocf.example.org)](https://ocf.example.org)) > *Ivan Herman:* +1 to romain **Romain Deltour:** this defines unambiguously how relative URLs are to be resolved … and we can say this URL is the root of the OCF … this makes it so that relative URLs cannot go outside of the container … and then RSes know what relative URLs point to **Wendy Reid:** going back to romain's point about testing, there are a variety of ways that RSes handle these URLs … we are especially unsure what happens when files are outside the container … so this is good reason to do some testing **Ivan Herman:** would some sort of conceptual model clash with how things are implemented? **Hadrien Gardeur:** we treat OPF as base, and that seems to work in most cases. Seems to make more sense to us than treating zip as base … but these two are most common implementations **Matt Garrish:** this originally came up in multiple renditions when we had issues referencing across sibling directories … not sure if this is still an obstacle, worth testing **Romain Deltour:** drawback of conceptual solution is that sometimes adding this layer of abstraction makes spec harder to use … so we want to respect people who are actually having to implement it **Wendy Reid:** is the best way forward at this point for us to do some sort of testing? (e.g. 
OPF as base, zip as base, examples of files living outside when OPF is base) **Ivan Herman:** i think we should also test environment where multiple renditions is implemented … if we end up with something that makes multiple renditions impossible, then we should just remove the multiple rendition note **Wendy Reid:** do we know if a functioning implementation of multiple renditions? **Hadrien Gardeur:** barnes and noble were using multiple renditions for newspapers and magazines … not sure if they still use it **Wendy Reid:** okay, so maybe we test on Nook app … okay, so for now we test. Will have to ask Dan and the rest of the testing folk to help … for now we don't have consensus on any sort of language, right?
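Romain's suggestion of a fixed base URL for the container root can be sketched as follows. This is only an illustration under assumptions: `https://ocf.example.org/` is the placeholder root mentioned in the minutes, `resolve_in_container` is a hypothetical helper, and Python's `urljoin` follows RFC 3986 rather than the WHATWG URL parser (the two agree on these inputs).

```python
from urllib.parse import urljoin, urlsplit

# Placeholder root URL for the OCF container, as floated in the discussion.
CONTAINER_ROOT = "https://ocf.example.org/"


def resolve_in_container(reference, document_path):
    """Resolve a relative reference found in the document at
    document_path (a path relative to the container root)."""
    base = urljoin(CONTAINER_ROOT, document_path)
    return urljoin(base, reference)


for ref in ["images/cover.png", "../shared/style.css", "../../../doc.xhtml"]:
    url = resolve_in_container(ref, "EPUB/package.opf")
    print(f"{ref!r:25} -> {url}")

# Excess ".." segments are clamped at the root, so the result keeps the
# container's scheme and host instead of escaping it.
escaped = resolve_in_container("../../../doc.xhtml", "EPUB/package.opf")
print(urlsplit(escaped).netloc)  # ocf.example.org
```

With a root like this, the "OPF as base" and "zip as base" implementations differ only in which directory they map to `/`, which is the variation the proposed testing would need to surface.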

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-07-02

no resolutions were taken

View the transcript

### 2. Are root-relative paths valid? _See github issue [#1681](https://github.com/w3c/epub-specs/issues/1681), [#1374](https://github.com/w3c/epub-specs/issues/1374)._ _See github pull request [#1725](https://github.com/w3c/epub-specs/pull/1725)._ **Dave Cramer:** What more needs to happen or can happen in the spec for root-relative paths? **Ivan Herman:** one problem we need to address is that we have a problem with iBooks and others that rely on Adobe ADE, namely that they rely on a specific way of organizing the files, which is not in the standard. … Matt's test was done according to the standard, but iBooks and others get it wrong. We can either acknowledge that problem as a warning and keep the standard as is (iBooks doesn't conform), or we reverse-engineer and put into standard a restricted version of how files can be organized, in order to conform with iBooks. We need to decide if this will harm current eBooks. … I personally would hate to put restrictions in the standard, but that's just me **Romain Deltour:** the test was done with valid ePub with shared resources - there is still the issue of root-relative URL paths and paths that would go outside the container. I think we need the spec to address that. … some kind of language defining the root is likely necessary. … and review interoperability with reading systems. **John Foliot:** Is an unintended consequence that a publisher would have to create two versions, one for iBooks and another for other reading systems? **Dave Cramer:** I don't see huge problems around interoperability because EPUBs are consistent with folder structure, generally. **Ivan Herman:** Whatever works for iBooks works for others - but there are perfectly valid ePubs that iBooks doesn't take. … As for the questions of Romain, we have decided that path relative URLs shouldn't be used, and paths shouldn't go outside the package. We need to make this clear in the documentation but there is not a fundamental technical problem with this. 
**Romain Deltour:** these are edge cases, we don't see this problem often if ever. … What we have is a recommendation for authors, but we need a recommendation for reading systems on how to process URLs. … How should a reading system deal when authors don't follow recommendation. **Ivan Herman:** it would be helpful to have a clearly-worded proposal for reading systems. Hoping for Romain's help with this. **Dave Cramer:** everyone seems to agree that having `../..` etc. to outside the package is not a good practice. **Hadrien Gardeur:** from a reading system perspective, they need to resolve URIs, and expose the HTML resource (or any resource) to web view. … reading systems have different ways of doing this, but you need to get the web view to do what you want, and how this is achieved can impact what we are discussing. **Ivan Herman:** What precisely should the recommendation in the reading system spec be to cover all implementations? **Hadrien Gardeur:** we don't know how each RS works behind the scenes, we can only speculate. **Ivan Herman:** If we put something in the spec, it's up to RS how to implement … we don't have to define that. … Whatever we do, the author of an EPUB should have a clear mental model of what's happening. The RS implementation is not under the author's influence. If we are saying EPUB is a website in a box, we should be able to clearly define the root, and stop there. **Hadrien Gardeur:** On the web, we don't think about files and root containers. For reading systems, we are deciding how an EPUB behaves. So I'm wary of this conceptual approach. **Dave Cramer:** we are really talking about edge cases here. Hoping that we can build some tests based on the write-up and what we are trying to achieve. … hoping we can get clear enough to cover our edge cases without restricting RS implementation. 
**Hadrien Gardeur:** difficult to test everywhere - gets tricky when you have to consider different CSS, etc **Dave Cramer:** let's get some proposals down with Romain's help, and get Matt to take a look at them, and proceed from there. **Ivan Herman:** Must have a clear statement somewhere whether we intend to restrict EPUB content and define organization of EPUB package.

rdeltour commented 3 years ago

I reframed the issue in #1888, along with a (non-exhaustive) list of possible solutions.

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-10-29

List of resolutions:

Resolution No. 1: Merge PR #1725.

View the transcript

#### 2.3. "IRI of the Package Document": what is this exactly? (issue epub-specs#1374)

_See github issue [epub-specs#1374](https://github.com/w3c/epub-specs/issues/1374)._

> *Dave Cramer:* See [more detailed explanation](https://github.com/w3c/epub-specs/issues/1374#issuecomment-847978623).

**Romain Deltour:** I may summarize. … the big problem is defining how to resolve relative URLs in an EPUB. … most of the URLs we use are relative URLs. … but a URL object is something which is parsed from a URL string. … to make it absolute. … it is done by the parsing algorithm. … Here is an example.

> `parse("doc.xhtml", "https://example.org") == "https://example.org/doc.xhtml"`.

**Romain Deltour:** to use this algorithm we have to know the base URL (https://example.org). … the problem is that our spec doesn't define what the URL of the EPUB is (because it may be used in different locations: online, offline, etc.).

> *Romain Deltour:* e.g., [http://example.org/publisher/mobydick.epub#/EPUB/package.opf](http://example.org/publisher/mobydick.epub#/EPUB/package.opf).

**Romain Deltour:** I'm going to show other examples.

> `parse("doc.xhtml", "http://example.org/publisher/mobydick.epub#/EPUB/package.opf") == "http://example.org/publisher/doc.xhtml"`.

> `parse("../../doc.xhtml", "http://example.org/publisher/mobydick.epub#/EPUB/package.opf") == "http://example.org/doc.xhtml" // ⚠️ OUTSIDE OF CONTAINER`.

**Romain Deltour:** in this case I'm going outside of the EPUB.

> `parse("/doc.xhtml", "http://example.org/publisher/mobydick.epub#/EPUB/package.opf") == "http://example.org/doc.xhtml" // ⚠️ OUTSIDE OF CONTAINER`.

**Romain Deltour:** that's why I think we should define what the base URL is, also for security reasons. … the solution should be unambiguous. … the resulting URL should not go outside the container. … two relative URLs in two different EPUBs should not resolve to the same absolute URL. … the URL of the EPUB should not share the same origin.
… these are the 4 objectives of the ideal solution.

**Ivan Herman:** I remember that one solution may be to consider an EPUB as a localhost (with a unique port). … so the `localhost:port` is what represents the root for the EPUB. … but if the RS works in a streaming way, it may not work (because the EPUB is not decompressed). … and if it goes out of the EPUB, the user gets a 404.

**Romain Deltour:** yes, there are different approaches. One is to use domains, another is to use a custom protocol scheme.

> `parse("/", "epub:/") == "epub:/"`.

> `parse("../../doc.xhtml", "epub:/EPUB/package.opf") == "epub:/doc.xhtml"`.

**Romain Deltour:** I don't know which one is better. … from an RS point of view.

**Ivan Herman:** I think defining a URI scheme for that is not a good idea.

**Romain Deltour:** I don't think we'll come up with a solution that will be used by the end user.

**Brady Duga:** I think there are 4 cases: local URLs, online URLs, jar URLs. … I think it is the last one that is the problem. … isn't it?

**Romain Deltour:** yes, but also referencing resources outside the package.

**Brady Duga:** do we need to tell people how to display URLs inside EPUBs (using fragments)? … I would propose to remove it.

> *Romain Deltour:* somewhat related, a gist from `@annevk` about ZIP URLs (from 8 years ago): [https://gist.github.com/annevk/6174119](https://gist.github.com/annevk/6174119).

**Hadrien Gardeur:** referencing anything outside the archive is problematic, especially for the content document. … I don't think we should get to a specific resolution here, because the RSs have different solutions.

**Romain Deltour:** removing that paragraph about the URL of the package document won't work. … we need to tell people how to build them.

**Romain Deltour:** at a minimum, we should base everything on the assumption that there is a URL for the root of the container. … and we leave it for the reading system to define. … I'm not sure that will work.
… and we are back to the discussion that the RS spec should say something.

_See github issue [epub-specs#1843](https://github.com/w3c/epub-specs/issues/1843)._

**Dan Lazin:** there is another issue: [https://github.com/w3c/epub-specs/issues/1843](https://github.com/w3c/epub-specs/issues/1843). … about URIs for EPUBs. … how do you specify an epub in a CORS/iframe policy? … I don't know how this is managed today. … you need some way to say, hi, I am aware of this epub and it can iframe my content.

**Romain Deltour:** this might not answer entirely. … RS spec says, for scripting, reading system must associate a unique origin to the script. … a similar mechanism could be used to answer that issue about CORS/CSP.

**Dan Lazin:** is it a predictable url?

**Romain Deltour:** this scripting mechanism is only about an origin; it could be an opaque origin, doesn't have to be a url. … opaque origin serializes to null.

**Ivan Herman:** where do we go from here?

**Dave Cramer:** do we ask for help?

**Ivan Herman:** we have tried and failed before. … we have been discussing these things. … if we come up with a concrete proposal. … and then check whether that solution is acceptable to the TAG or whoever. … my knowledge is not good enough to write a proposal.

**Romain Deltour:** I was supposed to come up with a proposal. … I can write a summary of the issue with possible approaches ("paths") to solutions. … I don't know enough about URLs and security to know all the pluses and minuses.

**Ivan Herman:** we can't go to CR with this stuff open. … it's unfortunate that Tess is not around any more, we might ask the TAG. … and the TAG takes time. … we have time pressure.

**Dave Cramer:** could we talk to ping?

**Romain Deltour:** could we liaise with Anne at WHATWG?

**Ivan Herman:** I worry about that.

**Tzviya Siegman:** talking to Tess would be good.

**Ivan Herman:** if we have a proposal that romain can put together. … my first option would be to involve Tess.
**Romain Deltour:** I can summarize the problem statement.

**Laurent Le Meur:** tests will take time. … why don't we just say that path-absolute URLs are illegal. … and just update epubcheck? … to post an error if there's a slash at the beginning of URL.

> *Romain Deltour:* path-absolute URLs are a red herring. the issue is with *any* relative URL really.

**Laurence Zaysser:** could we have a fifth objective, easy to move to web publication?

**Romain Deltour:** it's about any relative urls. Just dealing with path-relative won't solve the issue.

_See github pull request [epub-specs#1725](https://github.com/w3c/epub-specs/pull/1725)._

**Matt Garrish:** we have 1725 PR, which forbids path-absolute URLs. Is there any reason we shouldn't merge that? … should we close that? Or integrate it because it deals with part of the question?

**Wendy Reid:** have we exhausted this?

**Ivan Herman:** to answer matt, that one can go in.

> *Romain Deltour:* +1.

**Ivan Herman:** using root-relative IRIs is a bad idea for something like epub, where the root url is unclear.

> **Proposed resolution: Merge PR #1725.** *(Wendy Reid)*

> *Romain Deltour:* +1.
> *Ben Schroeter:* +1.
> *Ivan Herman:* +1.
> *Gregorio Pellegrino:* +1.
> *Matt Garrish:* +1.
> *Shinya Takami (高見真也):* +1.
> *Dave Cramer:* +1.
> *Brady Duga:* +1.
> *Matthew Chan:* +1.
> *Tzviya Siegman:* +1.
> *John Roque:* +1.
> *Bill Kasdorf:* +1.
> *Wendy Reid:* +1.
> *Toshiaki Koike:* +1.
> *Laurent Le Meur:* +1.
> *Charles LaPierre:* +1.
> *Hadrien Gardeur:* +1.
> *Dan Lazin:* +1.

> ***Resolution #1: Merge PR #1725.***
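The `parse(...)` examples in these minutes can be reproduced with a short sketch. Two assumptions apply: Python's `urllib.parse.urljoin` stands in for the URL parsing algorithm (it implements RFC 3986 resolution, which gives the same answers for these http inputs), and the fragment-based base URL is the illustrative one quoted in the meeting, not anything the spec defines.

```python
from urllib.parse import urljoin

# Location-based base URL from the minutes: the EPUB's own URL plus a
# fragment pointing at the package document.
BASE = "http://example.org/publisher/mobydick.epub#/EPUB/package.opf"
EPUB_URL = "http://example.org/publisher/mobydick.epub"

references = ["doc.xhtml", "../../doc.xhtml", "/doc.xhtml"]
resolved = {ref: urljoin(BASE, ref) for ref in references}

for ref, url in resolved.items():
    print(f"{ref!r:20} -> {url}")

# The fragment of the base is dropped during resolution, so none of the
# results stay "inside" the EPUB's own URL.
inside = [url for url in resolved.values() if url.startswith(EPUB_URL)]
print(inside)  # []
```

Note that even the plain `doc.xhtml` reference lands next to the EPUB file rather than inside it, which is the sense in which the issue concerns any relative URL and not only path-absolute ones.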

iherman commented 3 years ago

(just housekeeping): Is it o.k. if this issue is closed? #1725 has been merged and, thanks to @rdeltour, the remaining (and core) technical solution has been transferred to #1888...

rdeltour commented 3 years ago

Is it o.k. if this issue is closed?

works for me!