Closed iherman closed 2 years ago
This issue is why we introduced the notion of "root directory" in the LPF format. We still used URLs (we decided not to use the term IRI) in "Contents within the Package MUST reference these resources [those in the package] via relative-URL strings [url]."
Ah yes, indeed. We may want to get inspired and bring this over to the EPUB spec...
Is there a difference from the OCF abstract container's root directory? Doesn't it already establish the virtual directory structure within the zip?
I think the difference is not in the terms/concept but the surrounding explanation. I find the few extra terms in the LPF spec on "virtual in nature", etc, helpful. And a reference to this from the item element definition may also be helpful.
The issue was discussed in a meeting on 2021-04-01
List of resolutions:
@iherman are you OK closing this now after #1468 ?
Yes :-)
Sorry to come late to the party. After looking at an issue in EPUBCheck (see the ref linked above), I'm thinking this is still underspecified.
I think the intention is that:
But our issue is that since the URL of the Package Document is not defined, its path within the container can be described either in the path part of the URL (e.g. https://example.org/moby-dick/EPUB/content.opf), or somewhere else, like in a fragment (e.g. file://epub.zip#path=/EPUB/content.opf). The result of parsing a relative URL based on these URLs will be very different.
Consider the following container content:
├── META-INF
│ └── container.xml
├── EPUB
│ ├── content.opf
│ └── xhtml
│ ├── content.xhtml
│ └── …
└── …
Possible URLs of the Package Document (examples A to D):
A. file://path/to/epub/EPUB/content.opf
B. https://example.org/EPUB/content.opf
C. file://path/to.epub#path=/EPUB/content.opf
D. jar:file:/path/to.epub!/EPUB/content.opf
Now, for the following manifest item:
<item href="xhtml/content.xhtml" media-type="application/xhtml+xml" />
Examples A and B would resolve the item URL to file://path/to/epub/EPUB/xhtml/content.xhtml and https://example.org/EPUB/xhtml/content.xhtml. That's expected and reasonable.
But example C would resolve the item URL to file://path/xhtml/content.xhtml. And I cannot tell about example D without carefully rereading the URL parsing algorithm. A cursory read tells me that the jar URL is invalid (what if the URL parsing algo returns null?).
In other words, all we have is Schrödinger’s URLs: we can't tell if they’re conforming or non-conforming (i.e. identifying container resources or not) until we open a reading system. 🤔
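The resolutions for examples A to C can be reproduced with a short sketch. It uses Python's `urllib.parse.urljoin`, which follows RFC 3986 rather than the URL Standard, so it only approximates what a URL Standard parser does. Example D is left out, since a `jar:` base is handled differently by different parsers. The variable names are illustrative.

```python
from urllib.parse import urljoin

# Candidate base URLs for the Package Document (examples A to C above).
PACKAGE_DOCUMENT_URLS = {
    "A": "file://path/to/epub/EPUB/content.opf",
    "B": "https://example.org/EPUB/content.opf",
    "C": "file://path/to.epub#path=/EPUB/content.opf",
}

# The href of the manifest item from the example above.
ITEM_HREF = "xhtml/content.xhtml"

# Resolve the same relative href against each candidate base URL.
resolved = {
    label: urljoin(base, ITEM_HREF)
    for label, base in PACKAGE_DOCUMENT_URLS.items()
}

for label, url in resolved.items():
    print(label, url)
```

In this sketch A and B keep the item under EPUB/, while C drops the container path entirely because that path lives in the fragment of the base URL.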
Shouldn't we reopen this issue?
@rdeltour
But example C would resolve the item URL to file://path/xhtml/content.xhtml.
Shouldn't this be: file://path/to.epub/xhtml/content.xhtml?
And I cannot tell about without rereading carefully the URL parsing algorithm. A cursory read tells me that jar URL is invalid (what if the URL parsing algo returns null?).
You mean "cannot tell about D", right? (not that I have an answer).
Shouldn't we reopen this issue?
Yes. I will do so (through this comment).
The question I have: are C and D examples of real Reading Systems, or are they imaginary? In other words, can we allow ourselves to say that the RS must behave, conceptually, like A or B (and how they do that is up to them)?
example C would resolve the item URL to file://path/xhtml/content.xhtml.
Shouldn't this be: file://path/to.epub/xhtml/content.xhtml?
No, see the live URL viewer (JSDOM implementation of the URL standard, closely following the spec).
You mean "cannot tell about D", right? (not that I have an answer).
Right. Edited, thx 😊
The question I have: are C and D examples of real Reading Systems, or are they imaginary? In other words, can we allow ourselves to say that the RS must behave, conceptually, like A or B (and how they do that is up to them)?
Yes, I would really like RS folks to chime in! But even if they use something like C and D, I assume they "resolve" like A and B (which is what has been implicitly assumed historically, as far as I'm aware).
Isn't the problem the lack of clarity about the root of the container as in #1681? I opened that because root-relative paths are never going to be consistent, but it's the lack of a common root dir that's the big problem.
The current definition of the OCF root directory explicitly says the root is optional, even if I don't believe this is formalized normatively (but then how to unpack isn't normatively described, either):
an EPUB Reading System may or may not generate a physical root directory for the contents of the OCF Abstract Container if it unzips the contents
If there were consistency here, would it matter what scheme you use to reference/resolve the resources?
How the content is served doesn't matter, in my understanding, since this is only a check of the abstract container's integrity. But this also seems to make this statement problematic:
All relative-URL-with-fragment strings [URL] MUST, after parsing to URL records [URL], identify resources within the OCF Abstract Container
On the one hand, we're talking about an "abstract" container, but then on the other we're parsing for physical URL records. The root directory may have disappeared in the meantime, but we're implicitly assuming it is kept for checking this requirement.
We've run into this problem in the past with multiple renditions not working when shared content is not below the package document(s) in a common content directory, even though the epub is valid. Some reading systems don't create a root to match the zip, so you can't have sibling directories in the zip root, one for each rendition.
All we appear to want to know is a) is the resource inside the zip file, and b) is it not a duplicate of another entry. Resolving to URLs doesn't appear to have any other function.
You can't test a) unless you explicitly create a root that matches the zip root. And b) should sort itself out since the application will have internal consistency in however it then resolves the urls.
So do reading systems verify the manifest entries, or is this only an epubcheck problem to fix? If it's the latter, can't you assume a root directory that is the root of the zip and work from there? If the content doesn't work on any given reading system... well, that's sort of the state of the world now.
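A minimal sketch of checks (a) and (b), assuming the root directory is simply the root of the zip and that hrefs are plain path-relative references (no query, fragment, or percent-encoding). The function names and sample data are hypothetical, not taken from EPUBCheck or any reading system.

```python
import posixpath
from collections import Counter


def resolve_in_container(package_path, href):
    """Resolve href against the package document's directory.

    Paths are relative to the container root. Returns None when the
    reference climbs above the root or is root-relative.
    """
    base_dir = posixpath.dirname(package_path)
    candidate = posixpath.normpath(posixpath.join(base_dir, href))
    if candidate == ".." or candidate.startswith("../") or candidate.startswith("/"):
        return None
    return candidate


def check_manifest(zip_entries, package_path, hrefs):
    """Return (hrefs not found in the container, paths listed more than once)."""
    entries = set(zip_entries)
    resolved = [resolve_in_container(package_path, h) for h in hrefs]
    missing = [h for h, r in zip(hrefs, resolved) if r is None or r not in entries]
    counts = Counter(r for r in resolved if r is not None)
    duplicates = sorted(r for r, n in counts.items() if n > 1)
    return missing, duplicates


entries = ["META-INF/container.xml", "EPUB/content.opf", "EPUB/xhtml/content.xhtml"]
hrefs = ["xhtml/content.xhtml", "./xhtml/content.xhtml", "../../outside.xhtml"]
missing, duplicates = check_manifest(entries, "EPUB/content.opf", hrefs)
print(missing, duplicates)
```

Because everything is computed from zip entry names, the result does not depend on how (or whether) the container is unpacked.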
example C would resolve the item URL to file://path/xhtml/content.xhtml.
Shouldn't this be: file://path/to.epub/xhtml/content.xhtml?
No, see the live URL viewer (JSDOM implementation of the URL standard, closely following the spec).
Oops, you're right. Sorry for the noise.
Isn't the problem the lack of clarity about the root of the container as in #1681? I opened that because root-relative paths are never going to be consistent, but it's the lack of a common root dir that's the big problem.
The problem is that we're making conformity statements based on an undefined object, and the nature of this object can very much change the interpretation of these statements. Specifically, in EPUB RS "3.1 Parsing Relative URLs":
To parse relative-URL-with-fragment strings [URL] in the Package Document, Reading Systems MUST use the URL of the Package Document as the base URL [URL].
This statement is problematic for a validator. And it could be problematic for reading systems too (in which case they would just ignore it).
All we appear to want to know is a) is the resource inside the zip file, and b) is it not a duplicate of another entry. Resolving to URLs doesn't appear to have any other function.
For EPUBCheck, yes. For a reading system: what should happen if the resource is outside the ZIP file? What if it's a duplicate? Do we specify the mandated behavior?
In any case, to specify how a UA assesses (a) and (b), the current wording doesn't work. We might work around the current issue by assigning an arbitrary URL to the container root. I'm not sure.
The problem is that we're making conformity statements based on an undefined object
Right, we let this slip under the radar in the past by not bringing in absolute URLs at all. All we said was that the relative paths had to resolve to resources in the container without explaining at all how that should happen. It intuitively makes sense until you get into the problem of an unzipped epub and an "abstract container" not having a common root dir.
To parse relative-URL-with-fragment strings [URL] in the Package Document, Reading Systems MUST use the URL of the Package Document as the base URL [URL].
This statement is problematic for a validator.
We might work around the current issue by assigning an arbitrary URL to the container root. I'm not sure.
Ya, I'm not saying what we have is helpful at all, but if we were to always assume for validation the root dir is the root of the ocf can't you parse the relative URLs and determine if they are below that directory?
But it's worth questioning why we try to block access this way, when putting in an absolute URL that references the file system doesn't raise any concern. You just have to call it a remote resource. Maybe that's all we should require for resources outside the container and leave it to reading systems to similarly determine whether they want to use these?
what should happen if the resource is outside the ZIP file?
Okay, you got me here! I wasn't thinking about the unwritten rules... 😄
But how do we solve this without a fixed path to the package document, and isn't it too late for that? Files inside the container may appear outside the container to reading systems depending on how they unpack them, too.
That's why I'm not sure we can solve this outside of epubcheck, at least not beyond recommendations that accommodate both possibilities. There can only be internal consistency within any given application processing the epub.
What if it's a duplicate?
We tend to tolerate mistakes like this. The caveat to authors is if you don't follow these kinds of rules then bad things will happen. In this case, the reading system may get conflicting information about the resource. It's possible it might break the spine, too, as if you navigate to the resource I imagine it will complicate reading systems looking up which spine item you're in if there's more than one entry that matches the resource.
I guess we could recommend that reading systems ignore duplicate entries, though that wouldn't solve the problem of inaccuracies between the listings.
I am still trying to get around the problem (also for #1681).
Looking at https://github.com/w3c/epub-specs/issues/1374#issuecomment-847978623 from @rdeltour, and using
<item href="/xhtml/content.xhtml" media-type="application/xhtml+xml" />
(i.e., a path-absolute-URL) and looking at options A and B, I get the URLs file://path/xhtml/content.xhtml and https://example.org/xhtml/content.xhtml, respectively. Which are, again, what one would expect.
What this means is that the A and B cases, i.e.,
are consistent with the expectations as well as the URL spec.
Isn't it possible then to say (in our spec) that a Reading System MUST treat these URLs as if its implementation followed one of these two models? This does not mean that it must implement it exactly this way, but it must, when using a different approach, emulate one of the two options. Wouldn't that be enough for the spec? After all, if we consider an EPUB instance as a 'frozen website', those two options do represent a perfectly consistent mental model, so I do not think that approach would feel too strange...
and looking at options A and B I get the URLs file://path/xhtml/content.xhtml and https://example.org/xhtml/content.xhtml, respectively. Which are, again, what one would expect.
How can you be sure this is what you will get when you don't know if the root directory of the epub will be preserved, though?
Nothing says it is wrong to expand it to:
https://example.org/nameOfEPUB/EPUB/xhtml/content.html
because you might also have in the same container:
https://example.org/nameOfEPUB/EPUB2/xhtml/content.html
in which case the / no longer refers to where you think it does.
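To make the point concrete, here is a small sketch using Python's `urllib.parse.urljoin` (RFC 3986 behaviour, not the URL Standard). The two base URLs are hypothetical ways of exposing the same Package Document: one with the container root at the site origin, one with the container root under `nameOfEPUB/`.

```python
from urllib.parse import urljoin

ROOT_RELATIVE_HREF = "/xhtml/content.xhtml"

# Container root served directly at the origin.
root_at_origin = "https://example.org/EPUB/content.opf"
# Container root served one directory down (hypothetical layout).
root_in_subdir = "https://example.org/nameOfEPUB/EPUB/content.opf"

resolved_at_origin = urljoin(root_at_origin, ROOT_RELATIVE_HREF)
resolved_in_subdir = urljoin(root_in_subdir, ROOT_RELATIVE_HREF)

print(resolved_at_origin)
print(resolved_in_subdir)
```

Both calls give the same URL at the origin, so in the second layout the leading `/` points outside the directory that holds the container.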
Ouch, you are right: for root-relative paths https://github.com/w3c/epub-specs/issues/1374#issuecomment-848662667 does not work. And, having gone through that, I think I agree with the proposal in #1681 that those should be disallowed.
But it does for path-relative-scheme-less-URLs, doesn't it?
But it does for path-relative-scheme-less-URLs, doesn't it?
It all depends on how the EPUB is structured and then unpacked. If you put all your content in a single directory and have the package document at the root, there shouldn't be an issue.
If you refer to files across sibling directories in the root, then you're back in trouble again. For example, if you had this:
├── META-INF
│ └── container.xml
├── EPUB1
│ ├── content.opf
│ └── …
├── EPUB2
│ ├── content.opf
│ └── …
├── shared
│ ├── img
│ └── …
If you have a path from EPUB1 like '../shared/img/photo1.jpg', the file may not be there after the reading system unpacks the EPUB. It may only extract the directory where the first package file is.
(I'm using multiple renditions, but even a single-rendition epub doesn't have to be self-contained in a subdirectory. It's generally the norm, though, because of this problem.)
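A rough model of that failure, assuming a reading system that only unpacks the directory holding the package document. The entry list mirrors the tree above; `photo1.jpg` and the function name are made up for illustration.

```python
import posixpath

# Entry names of the packed container (photo1.jpg is an assumed file).
CONTAINER_ENTRIES = [
    "META-INF/container.xml",
    "EPUB1/content.opf",
    "EPUB2/content.opf",
    "shared/img/photo1.jpg",
]


def entries_kept_by_partial_unpack(entries, package_path):
    """Model a reading system that keeps only the package document's directory."""
    package_dir = posixpath.dirname(package_path)
    if not package_dir:
        # Package document at the container root: everything is kept.
        return set(entries)
    return {e for e in entries if e.startswith(package_dir + "/")}


package_path = "EPUB1/content.opf"
target = posixpath.normpath(
    posixpath.join(posixpath.dirname(package_path), "../shared/img/photo1.jpg")
)

in_container = target in CONTAINER_ENTRIES
after_partial_unpack = target in entries_kept_by_partial_unpack(
    CONTAINER_ENTRIES, package_path
)
print(target, in_container, after_partial_unpack)
```

The reference is valid against the packed container but no longer resolves to anything once only EPUB1 has been extracted.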
And to make things even more fun, if you put the package documents in the root, there probably isn't an issue (but I haven't tested this):
├── META-INF
│ └── container.xml
├── content1.opf
├── content2.opf
├── EPUB1
│ └── …
├── EPUB2
│ └── …
├── shared
│ ├── img
│ └── …
But these are the problems of having a requirement that relies on checking the zipped content using URL records when we're not completely clear what will be unpacked. Epubcheck can make easier assumptions to check the validity that reading systems cannot, at least as I understand @rdeltour's concern.
What happens after extraction is technically a separate problem, but it's also one we're quiet on. As @rdeltour has noted, what the reading system determines is available is not going to always be the same as what epubcheck does. But that's true now, too.
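So we probably also have a few other issues to look at:
- why do we care if relative paths refer to resources outside the container? what is a reading system supposed to do with them?
- can we allow relative paths to refer to resources outside the container if they're marked as a remote-resource? (not that this solves the problem of checking if they're outside but makes processing consistent with any remote resource)
- what does a reading system do if two manifest entries resolve to the same resource?
- do we need stronger guidance on structuring the zip container and/or unpacking it?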
Right, we let this slip under the radar in the past by not bringing in absolute URLs at all. All we said was that the relative paths had to resolve to resources in the container without explaining at all how that should happen.
Exactly! And my bad for not having spotted that earlier (during the URL PR review or EPUBCheck dev) 😅
Some comments on @mattgarrish’s list of questions:
So we probably also have a few other issues to look at:
- why do we care if relative paths refer to resources outside the container? what is a reading system supposed to do with them?
This question is specifically depending on this current issue, which will define how relative URLs must be parsed. Depending on the approach, it could result in URLs that can refer to something outside the container, or not.
It all boils down to what base URL must be used to parse relative URLs. We have two high-level options:
With the second option, given we already say "as if", we may go further than what @iherman proposed and assign an arbitrary URL to the root of the container. For instance https://epub.example.org/ or https://epub.w3.org/ or whatever. That way, we're sure relative URLs will never go outside the container (they cannot go below the root level).
The benefit is that we have an unambiguous way to tell what resources are identified by all relative URLs, and that they always are in the container.
The drawback of course is that it is arbitrary, and can give different results compared to how URLs would be parsed if the EPUB container was simply unpacked and served on the Web or local file system (but then, only for path-absolute relative URLs, and for relative URLs going below the root level).
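A sketch of the arbitrary-root idea, using Python's `urllib.parse.urljoin` (RFC 3986 behaviour in Python 3.5 and later, where excess `..` segments are dropped at the root). The root URL and the helper name are assumptions made for illustration only.

```python
from urllib.parse import urljoin

# Arbitrary, made-up URL standing in for the container root.
CONTAINER_ROOT_URL = "https://epub.example.org/"


def resolve_against_container(package_path, href):
    """Resolve href as if the container root were served at CONTAINER_ROOT_URL."""
    base = urljoin(CONTAINER_ROOT_URL, package_path)
    return urljoin(base, href)


inside = resolve_against_container("EPUB/content.opf", "xhtml/content.xhtml")
climbing = resolve_against_container("EPUB/content.opf", "../../../shared/img/photo1.jpg")
root_relative = resolve_against_container("EPUB/content.opf", "/xhtml/content.xhtml")

for url in (inside, climbing, root_relative):
    print(url)
```

Every relative reference, including one with too many `..` segments and a path-absolute one, stays under the assumed root. An href that is itself an absolute URL would of course still point elsewhere.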
- can we allow relative paths to refer to resources outside the container if they're marked as a remote-resource? (not that this solves the problem of checking if they're outside but makes processing consistent with any remote resource)
I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as remote-resource?
In any case, if we make it so that relative URLs are never parsed to something identifying a location outside the container, this point is moot.
- what does a reading system do if two manifest entries resolve to the same resource?
Yeah, good question 😊 (it's typically the kind of issue that was underspecified in previous author-centric EPUB spec, but that is worth some well-specified RS guidance). That can be discussed in a separate issue.
- do we need stronger guidance on structuring the zip container and/or unpacking it?
I'm not sure I understand the issue there. Or your concern with the shared directory use case. I guess this is related to when you said earlier "Files inside the container may appear outside the container to reading systems depending on how they unpack them, too.", which I'm not sure I understood either 😅.
For me, the spec is rather clear on this one. It explicitly says that any location descendant from the root directory can be used in the publication. So it's the RS responsibility to keep everything under the root when unpacking the container, no?
Finally, while we're at it, we could add another question:
- what to do with absolute file URLs?
file URLs have always been in a weird place in EPUB. Not strictly forbidden, but obviously not a good practice (and often used by mistake, when they happen at all).
Would it be reasonable to say only absolute URLs with a special scheme that is not file are conforming?
And we allow RS to still process any non-conforming URL if they want.
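As a sketch of what such a rule could look like in a checker: the set below lists the special schemes defined by the URL Standard, and the function name is hypothetical. It only mirrors the proposal above and is not an existing EPUBCheck rule.

```python
from urllib.parse import urlsplit

# Special schemes as defined by the URL Standard.
SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}


def is_conforming_absolute_url(url):
    """True for absolute URLs whose scheme is special and is not 'file'."""
    scheme = urlsplit(url).scheme.lower()
    return scheme in SPECIAL_SCHEMES and scheme != "file"


print(is_conforming_absolute_url("https://example.org/EPUB/content.opf"))
print(is_conforming_absolute_url("file:///path/to/epub/EPUB/content.opf"))
print(is_conforming_absolute_url("jar:file:/path/to.epub!/EPUB/content.opf"))
```

A reading system could still choose to process the URLs this check rejects, as suggested above.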
@mattgarrish I didn't create new issues for the separate questions we brought, I'll let you decide what you prefer as an editor. But if you'd like me to create these, let me know!
That way, we're sure relative URLs will never go outside the container (they cannot go below the root level).
Right, I was inarticulately trying to suggest that the dev will know what the root corresponds to however they resolve URLs, so they should be able to determine this even though the parsed URL will be detached from the physical zip, but if we want to define a more formal method for checking that works for me.
The process just can't depend on the actual unpacking of the zip as variably done by reading systems, as we know that won't lead to consistent results.
I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as remote-resource?
I can contrive one if you want... 😄
But it's flaky at best and probably unrealistic outside of a very rigid content delivery system and setup (e.g., it might work in an internal documentation system if the author knew of information they could access on any corporate machine).
I just wonder why we explicitly ban these and not, as you say, absolute file urls. Banning the latter also works for me, as it's more a measure of consistency I'm after.
- do we need stronger guidance on structuring the zip container and/or unpacking it?
I'm not sure I understand the issue there.
There isn't a requirement to preserve the zip root directory or any descendant content that isn't in the same directory as the package document. You'd think reading systems would preserve it, but past experience in multiple renditions showed that couldn't be relied on. The result is that some reading systems won't let you reach across sibling folders in the zip root because they don't appear to preserve them.
That's why we put this note in the multiple renditions spec: https://www.w3.org/TR/epub-multi-rend-11/#h-note
I know most EPUBs have a single "EPUB" directory where the content is stored, but that's not a requirement. If you don't follow that pattern, and don't have the package document in the root, bad things can happen (i.e., reading systems won't display the content that never got unpacked).
So who's at fault in this scenario? Should the reading system be required to unpack all content and ensure that all content below the root directory is available, even if it doesn't create an extra folder for the root directory? Should authors be more strongly warned not to rely on being able to access across sibling folders that are not below the package document?
But if you'd like me to create these, let me know!
Feel free to have a go at them! I can add this latter issue if you don't want it attributed to you... 😉
If you refer to files across sibling directories in the root, then you're back in trouble again. For example, if you had this:
├── META-INF
│ └── container.xml
├── EPUB1
│ ├── content.opf
│ └── …
├── EPUB2
│ ├── content.opf
│ └── …
├── shared
│ ├── img
│ └── …
If you have a path from EPUB1 like '../shared/img/photo1.jpg', the file may not be there after the reading system unpacks the EPUB. It may only extract the directory where the first package file is.
I guess I do not understand what I am seeing here. Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system? If so, if the file ../shared/img/photo1.jpg is not available, then we have a bug, but why is this dependent on the issue at hand?
Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system?
No, that's the packed EPUB. It seems some reading systems look up the location of the package document and only unpack the directory it's in. So in the above case, any content in /EPUB2 and /shared are not available to the publication in /EPUB1.
That doesn't seem like it should be valid, as the definition only says the root dir is optional to create. But that's also only a definition and we don't say anything about what has to be unpacked or made accessible.
but why is this dependent on the issue at hand?
I only brought that in because you can't rely on checking if a file is "within the abstract container" after the reading system has unpacked the content. Files that were in the zip container are gone at that point.
That way, we're sure relative URLs will never go outside the container (they cannot go below the root level).
Right, I was inarticulately trying to suggest that the dev will know what the root corresponds to however they resolve URLs, so they should be able to determine this even though the parsed URL will be detached from the physical zip, but if we want to define a more formal method for checking that works for me.
Yes, exactly. Again the idea is to use spec language to unambiguously define what the relative URLs identify. An RS is of course free to implement that as they please!
The process just can't depend on the actual unpacking of the zip as variably done by reading systems, as we know that won't lead to consistent results.
Sure.
I cannot see a use case where an author would intentionally do that, so it might not help much to require marking it as remote-resource?
I can contrive one if you want... 😄
But it's flaky at best and probably unrealistic outside of a very rigid content delivery system and setup (e.g., it might work in an internal documentation system if the author knew of information they could access on any corporate machine).
I just wonder why we explicitly ban these and not, as you say, absolute file urls. Banning the latter also works for me, as it's more a measure of consistency I'm after.
Interesting use case 😊.
That use case is feasible even if out-of-container relative URLs are made impossible, as long as file URLs are possible (SHOULD NOT being my recommendation: warn about them, but still allow them for edge cases like this).
- do we need stronger guidance on structuring the zip container and/or unpacking it?
I'm not sure I understand the issue there.
There isn't a requirement to preserve the zip root directory or any descendant content that isn't in the same directory as the package document. You'd think reading systems would preserve it, but past experience in multiple renditions showed that couldn't be relied on. The result is that some reading systems won't let you reach across sibling folders in the zip root because they don't appear to preserve them.
I wasn't aware of RS interop issues there. The EPUB container spec says ("File and directory" section):
EPUB Creators MAY locate all other files within the OCF Abstract Container in any location descendant from the Root Directory, provided they are not within the META-INF directory.
which quite unambiguously says this is allowed? And that RS should theoretically handle that fine?
So who's at fault in this scenario? Should the reading system be required to unpack all content and ensure that all content below the root directory is available, even if it doesn't create an extra folder for the root directory? Should authors be more strongly warned not to rely on being able to access across sibling folders that are not below the package document?
In the current spec, that's an RS bug in my book. That said, if we want to better align with the real-world implementation practices, then we can certainly add more restrictions to the current statements!
But if you'd like me to create these, let me know!
Feel free to have a go at them! I can add this latter issue if you don't want it attributed to you… 😉
OK will do! (later today, kid ill at home 😅)
Are we seeing the content of the ZIP file, after being unpacked onto the processor's file system?
No, that's the packed EPUB. It seems some reading systems look up the location of the package document and only unpack the directory it's in. So in the above case, any content in /EPUB2 and /shared are not available to the publication in /EPUB1.
That doesn't seem like it should be valid, as the definition only says the root dir is optional to create. But that's also only a definition and we don't say anything about what has to be unpacked or made accessible.
Yes, that sounds absolutely wrong. We should say that the full ZIP package content should be available. I am actually surprised this is not the case...
I am actually surprised this is not the case...
Ya, I don't know why we didn't log an issue when we were developing the MR spec. I guess it got forgotten after we wrote the note.
which quite unambiguously says this is allowed? And that RS should theoretically handle that fine?
Definitely allowed, but it's the theoretical part that always does us in. The spec doesn't disallow extracting only the directory where the package document is located, and it probably works fine for the vast majority of EPUBs.
Wish I could remember which reading systems we got tripped up by, but in any case we need a proper requirement.
OK will do! (later today, kid ill at home 😅)
I've got a post-vaccine queasy adult at home today, and sadly it's me, so I'm in no better a boat... 🤢
The issue was discussed in a meeting on 2021-06-10
List of resolutions:
file:, close issue 1688
The issue was discussed in a meeting on 2021-06-18
The issue was discussed in a meeting on 2021-07-02
I reframed the issue in #1888, along with a (non-exhaustive) list of possible solutions.
The issue was discussed in a meeting on 2021-10-29
List of resolutions:
(just housekeeping): Is it o.k. if this issue is closed? #1725 has been merged and, thanks to @rdeltour, the remaining (and core) technical solution has been transferred to #1888...
Is it o.k. if this issue is closed?
works for me!
We may not be in position of providing an absolutely clean definition (so maybe some editorial hand waving would be necessary) but…
§4.2.3.4.2 The item Element says:
The intention is clear but what is the "IRI of the Package Document"? After all, the package document is part of a ZIP file, it is not really on the Web, ie, it is not clear what its IRI is.
Can we say something more precise about this?