What base URLs to use for URL parsing in EPUB?

w3c / epub-specs

Shared workspace for EPUB 3 specifications.

What base URLs to use for URL parsing in EPUB? #1888

Closed rdeltour closed 2 years ago

rdeltour commented 2 years ago

(or: The Big Mystery of Spooky EPUB Relative URLs 👻🎃)

TL;DR: EPUB 3.3 now normatively references the URL Standard. But URL parsing is ambiguous in some cases, because base URLs are not clearly defined.

⚠️ EPUB references in the problem statement below point to a dated version of the EPUB 3.3 working draft (Oct 29, 2021). Do not copy out of context! 😉

Current situation

In an EPUB, files reference each other via relative URL strings (see Relative URLs, in Open Container Format). In the URL standard, to parse a relative URL string into URL records, the URL parser needs a base URL.

The base URL used to parse a URL string is defined by host languages (like in CSS, or HTML). Typically, it is the URL of the document containing the URL string.

EPUB defines what base URL to use for URL parsing in two cases:

relative URL strings found in documents located in the META-INF directory
relative URL strings in the Package Documents

Parsing a URL in documents located in the `META-INF` directory

For documents in the META-INF directory, URL strings must be parsed using the root directory as the base URL (see Relative URLs, in Open Container Format).

The problem is that Root Directory is not defined as a URL, but quite abstractly as "the base of the OCF Abstract Container". The spec also says the root directory is "virtual in nature". In fact, RS may or may not generate a physical directory for the root directory (see OCF ZIP Container RS processing).

Parsing a URL in the Package Document

For Package Documents, URL strings must be parsed using the URL of the Package Document as the base URL (see Parsing Relative URLs, in Package Documents RS processing).

Here again, the URL of the Package Document is not well-defined. But the spec says (in the same section) that for zipped EPUBs, the URL of the package document is obtained "from the URL of the EPUB Container together with a fragment identifier that specifies the path to Package Document (relative to the Root Directory)".

Problems

The URL of the container’s root directory is undefined

The current specification leaves many questions unanswered:

What is the URL of the root directory? Is it the URL of the ZIP file, or of the extracted directory? Is it constructed from the URL of the ZIP file, and if so, how? Or is it up to the RS to define it?
The RS may generate a physical directory for the container's Root Directory if it unzips the EPUB. What if the RS doesn't unzip the root but only a subdirectory? What if the EPUB is not unzipped as a whole, but streamed on demand?

The current way to obtain the URL of the Package Document is flawed

Parsing a relative URL in the Package Document always results in a URL of a resource outside the container.

Examples:

For instance, for an EPUB mobydick.epub located at https://example.org/acme/mobydick.epub, the URL of the Package Document would be something like https://example.org/acme/mobydick.epub#path=/EPUB/package.opf. This is how a few relative URL strings are then parsed:

#	URL string	Base URL	Resulting URL
1	`nav.xhtml`	`https://example.org/acme/mobydick.epub#path=/EPUB/package.opf`	`https://example.org/acme/nav.xhtml`
2	`nav.xhtml`	`https://example.org/acme/tomsawyer.epub#path=/EPUB/package.opf`	`https://example.org/acme/nav.xhtml`
3	`../video/cat.mp4`	`https://example.org/acme/mobydick.epub#path=/EPUB/package.opf`	`https://example.org/video/cat.mp4`
4	`/secret`	`https://example.org/acme/mobydick.epub#path=/EPUB/package.opf`	`https://example.org/secret`
5	`../../../secret`	`https://example.org/acme/mobydick.epub#path=/EPUB/package.opf`	`https://example.org/secret`

Example 1 shows that the parsed URL of a navigation document identifies a (possibly existing) resource outside the EPUB.
Examples 1 and 2 show that the URLs of two documents from two different EPUBs are parsed into the same URL.
Example 3 shows that a legit relative URL of an in-container video resource is parsed as a URL that:
- may conflict with the URL of another legit remote resource (remote resources are allowed for video content).
- leaks outside the container, and points to a space possibly owned by another publisher.
Examples 4 and 5 show that it is very easy to forge URL strings that are parsed to arbitrary files on a server or file system. This is true not only for path-absolute URL strings like 4, but also for path-relative URL strings like 5.
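
The resolutions in the table above can be reproduced with a short script. This is only a sketch: it uses Python's `urllib.parse.urljoin`, which follows RFC 3986 rather than the URL Standard parser, although the two agree on these simple `https` inputs. The `#path=` fragment syntax is the hypothetical one used in the example, not something defined by the spec.

```python
from urllib.parse import urljoin

# Hypothetical Package Document URL built as "ZIP URL + fragment".
BASE = "https://example.org/acme/mobydick.epub#path=/EPUB/package.opf"


def resolve(url_string: str, base: str = BASE) -> str:
    """Resolve a relative URL string against a base URL."""
    return urljoin(base, url_string)


for url_string in ("nav.xhtml", "../video/cat.mp4", "/secret", "../../../secret"):
    # The fragment of the base plays no role in resolution,
    # so every result lands outside the EPUB file.
    print(f"{url_string!r:20} -> {resolve(url_string)}")
```

The fragment is discarded during resolution, which is why every result ends up next to (or above) the EPUB file instead of inside it.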

To summarize:

the current way Package Document URLs are defined is flawed (potential conflicts between 2 legit URL strings)
the current way Package Document URLs are defined is possibly a security or privacy vulnerability

Possible Solutions

The ideal solution would ensure parsed URLs would be:

unambiguous: the results of parsing two URL strings should not be two identical URLs for one processor and two different URLs for another processor.
- Why? because otherwise it is impossible to tell if an EPUB is conforming (it may be for a processor and not for another)
contained: the result of parsing a relative URL string should not be the URL of a resource outside of the container. At least, a URL string representing a legit in-container resource should not be parsed to a URL of a remote resource.
- Why? To avoid conflicts between publication resources and remote resources. To avoid possible vulnerabilities.
unique: the result of parsing two relative URL strings from two different EPUBs should not be two identical URLs.
- Why? To avoid conflicts within a RS implementation (to be confirmed)
origin-safe: the URLs parsed from two relative URL strings from two different EPUB instances should not be same-origin. If possible, the URLs parsed from two relative URL strings in the same EPUB should be same-origin.
- Why? Resources within the same publication share the same trusted authority; resources within different publications (or copies of the same publication) do not.

Note: the ideal solution might not exist, or might not be practical to use, to implement, or to specify. But the goals listed above may help us evaluate a solution.

Possible solutions will be listed below as individual comments, for easier referencing in the discussion.

Comments and ideas welcome! 😊
I may have missed important things…

rdeltour commented 2 years ago

Solution 0: Current situation

Description

The URL of the root directory is largely undefined, and the URL of the package document is defined, for zipped EPUBs, as the URL of the ZIP plus a fragment.

Examples

See the problem statement above.

Features

unambiguous	contained	unique	origin-safe
No ❌	No ❌	No ❌	No ❌

rdeltour commented 2 years ago

Solution 1: Leave the definition of base URLs up to the Reading System

Description

Remove what the spec currently says about how to obtain the Package URL. Leave it up to Reading Systems to define the base URLs of the Root Directory and the Package Document.

This is only a mild improvement to the current spec. It doesn't prevent the existing flawed approach.

Examples

Anything can happen, depending on the implementation-defined approach.

Features

unambiguous	contained	unique	origin-safe
Not necessarily ❌	Not necessarily ❌	Not necessarily ❌	Not necessarily ❌

rdeltour commented 2 years ago

Solution 2: Use a reserved special URL as the container root's base URL

Description

The URL of the container root is a well-chosen special URL. Possibly, the host can include a string unique to each EPUB instance.

One downside is that parsed URLs look like Web resources, but they are not. If we control the domain, we can ensure it will not conflict with actual resources.

Examples

The URL of the root directory of an EPUB is arbitrarily defined as https://<some-instance-unique-string>.epub.w3.org

#	URL string	Base URL	Resulting URL
1	`/`	`https://1234.epub.w3.org`	`https://1234.epub.w3.org/`
2	`doc.xhtml`	`https://1234.epub.w3.org/EPUB/package.opf`	`https://1234.epub.w3.org/EPUB/doc.xhtml`
3	`doc.xhtml`	`https://4242.epub.w3.org/EPUB/package.opf`	`https://4242.epub.w3.org/EPUB/doc.xhtml`
4	`../../../secret`	`https://1234.epub.w3.org/EPUB/package.opf`	`https://1234.epub.w3.org/secret`
5	`/secret`	`https://1234.epub.w3.org/EPUB/package.opf`	`https://1234.epub.w3.org/secret`

Features

unambiguous	contained	unique	origin-safe
Yes ✅	Yes ✅	Yes (with unique instance ID) ✅	Yes (with unique instance ID) ✅
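
As a rough illustration of the "contained" property, the sketch below resolves URL strings against a per-instance root on the example domain. The function names are hypothetical, `epub.w3.org` is only the example domain from this comment, and `urljoin` (RFC 3986) stands in for the URL Standard parser.

```python
from urllib.parse import urljoin, urlsplit


def root_url(instance_id: str) -> str:
    """Root directory URL for one EPUB instance (example domain only)."""
    return f"https://{instance_id}.epub.w3.org/"


def resolve_in_container(instance_id: str, package_path: str, url_string: str) -> str:
    """Resolve a URL string found in the Package Document."""
    package_url = urljoin(root_url(instance_id), package_path)
    return urljoin(package_url, url_string)


# Excess ".." segments stop at the root of the virtual host,
# so the result cannot leave the instance's own host.
leaked = resolve_in_container("1234", "EPUB/package.opf", "../../../secret")
print(leaked, urlsplit(leaked).hostname)
```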

rdeltour commented 2 years ago

Solution 3: Use a proprietary non-special scheme

Description

Define an EPUB-specific URL scheme (for example epub:). Possibly, use a string unique to each EPUB instance as the host.

One downside is that registering a new scheme may not be a good practice (scheme squatting).

Examples

The URL of the root directory of an EPUB is defined as epub://<some-instance-unique-string>/

#	URL string	Base URL	Resulting URL
1	`/`	`epub://1234/`	`epub://1234/`
2	`doc.xhtml`	`epub://1234/EPUB/package.opf`	`epub://1234/EPUB/doc.xhtml`
3	`doc.xhtml`	`epub://4242/EPUB/package.opf`	`epub://4242/EPUB/doc.xhtml`
4	`../../../secret`	`epub://1234/EPUB/package.opf`	`epub://1234/secret`
5	`/secret`	`epub://1234/EPUB/package.opf`	`epub://1234/secret`

Features

unambiguous	contained	unique	origin-safe
Yes ✅	Yes ✅	Yes (with unique host) ✅	No (non-special URLs have unique opaque origins) ❌

rdeltour commented 2 years ago

Solution 4: use a `file` URL

Description

The URL of the root directory is arbitrarily defined as a file URL (regardless of how the EPUB is accessed). Possibly, use a string unique to each EPUB instance as the host.

One downside is that using a file scheme may not reflect how the resources are accessed internally by the RS. Also, there is a theoretical possibility of conflict with an existing host.

Examples

The URL of the root directory of an EPUB is defined as file://<some-instance-unique-string>/

#	URL string	Base URL	Resulting URL
1	`/`	`file://1234/`	`file://1234/`
2	`doc.xhtml`	`file://1234/EPUB/package.opf`	`file://1234/EPUB/doc.xhtml`
3	`doc.xhtml`	`file://4242/EPUB/package.opf`	`file://4242/EPUB/doc.xhtml`
4	`../../../secret`	`file://1234/EPUB/package.opf`	`file://1234/secret`
5	`/secret`	`file://1234/EPUB/package.opf`	`file://1234/secret`

Features

unambiguous	contained	unique	origin-safe
Yes ✅	Yes ✅	Yes (almost) ✅	Not-defined (unique opaque origin is recommended) ❌

rdeltour commented 2 years ago

Solution 5: use a special syntax for ZIP entries URLs

Description

The URL of container resources is defined as <URL-to-zip-without-fragment>!<path-absolute-relative-URL-of-entry>

This is not a standard practice, but there is precedent (like jar: URLs, although they are based on a custom URL scheme).

Examples

The URL of the root can be for example file:///path/to/mobydick.epub!/ or https://example.org/mobydick.epub!/

#	URL string	Base URL	Resulting URL
1	`/`	`file:///path/to/epub.epub!/`	`file:///`
2	`doc.xhtml`	`file:///path/to/epub.epub!/EPUB/package.opf`	`file:///path/to/epub.epub!/EPUB/doc.xhtml`
3	`doc.xhtml`	`https://example.org/mobydick.epub!/EPUB/package.opf`	`https://example.org/mobydick.epub!/EPUB/doc.xhtml`
4	`../../../secret`	`file:///path/to/epub.epub!/EPUB/package.opf`	`file:///path/secret`
5	`/secret`	`https://example.org/mobydick.epub!/EPUB/package.opf`	`https://example.org/secret`

Features

unambiguous	contained	unique	origin-safe
only for path-relative URL strings to in-container resources ⚠	No ❌	only for path-relative URL strings to in-container resources ⚠	Not always (depending on the EPUB file's URL) ❌
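
A small sketch of why this syntax is not contained: the `!` is just another path character to a URL resolver, so path-absolute strings and excess `..` segments walk straight past it. The helper names are hypothetical, and `urljoin` (RFC 3986) stands in for the URL Standard parser.

```python
from urllib.parse import urljoin


def entry_url(zip_url: str, entry_path: str) -> str:
    """Build '<URL-to-zip-without-fragment>!<path-absolute entry path>'."""
    return zip_url.split("#", 1)[0] + "!" + entry_path


def is_contained(zip_url: str, resolved: str) -> bool:
    """True if the resolved URL still points below '<zip-url>!/'."""
    return resolved.startswith(zip_url + "!/")


ZIP_URL = "https://example.org/mobydick.epub"
package = entry_url(ZIP_URL, "/EPUB/package.opf")

for url_string in ("doc.xhtml", "/secret", "../../../secret"):
    resolved = urljoin(package, url_string)
    print(f"{url_string!r:20} -> {resolved}  contained={is_contained(ZIP_URL, resolved)}")
```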

rdeltour commented 2 years ago

Tip: you can use this standard-conforming live URL viewer to check the result of parsing a URL string with a base URL.

P5music commented 2 years ago

One thing is that usually epub publications have internal metadata that uniquely identify the book but the file of the epub can have any name. Very often it is not the real title but a strange non-standard form like book_title_1st_ed.epub

This is why searching for an ePub publication, even in urls, should be in a query form IMHO.

Also there is the ! redirection symbol that could be useful, as a standard means of showing that some "jump" has to be done from that point, even with a different method of retrieval.

For example, the [chapter01]! inside the epubcfi fragment means that an explicit "jump" has to be done with the methods that are available on the reader or the system, like loading the actual file with the corresponding idref. It also works inside the same XHTML file. I am using an epubcfi library (from epub.js) that has no assertion parsing yet, so maybe I have to "fork it".

I would like to point out the "active" nature of the redirection operation could be useful when changing "domain", like from filesystem to zip, or to package, to chapter, to paragraph. And not all passages have to be present. For example an epub resource could be on the internet, on a filesystem or inside the library of an app. (see also this issue: About URI scheme for ePub reader app with fragment identifiers linking into the publication from external link ). Does it make sense?

rdeltour commented 2 years ago

Solution 6: use a `localhost` HTTP URL with a unique port

Description

The URL of the root directory is defined as an http URL using localhost as the host and a port number unique to the EPUB instance.

To avoid conflicts with registered ports, the port number should be in the range of dynamic ports (from 49152 to 65535).

Is that approach (using a port number as the way to differentiate an EPUB instance) OK or is it not considered good practice?

Examples

The URL of the root directory of an EPUB is defined as http://localhost:<some-instance-unique-port>/

#	URL string	Base URL	Resulting URL
1	`/`	`http://localhost:49152/`	`http://localhost:49152/`
2	`doc.xhtml`	`http://localhost:49152/EPUB/package.opf`	`http://localhost:49152/EPUB/doc.xhtml`
3	`doc.xhtml`	`http://localhost:50505/EPUB/package.opf`	`http://localhost:50505/EPUB/doc.xhtml`
4	`../../../secret`	`http://localhost:49152/EPUB/package.opf`	`http://localhost:49152/secret`
5	`/secret`	`http://localhost:49152/EPUB/package.opf`	`http://localhost:49152/secret`

Features

unambiguous	contained	unique	origin-safe
Yes ✅	Yes ✅	Yes (limited to 16k instances per machine) ✅	Yes ✅
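
To make the "16k instances" limit concrete, here is a minimal sketch of how a Reading System might hand out one dynamic port per EPUB instance. The `PortAllocator` class is hypothetical and no socket is ever opened; the port only serves as a label inside the URL.

```python
# Dynamic/private port range, as given above: 49152 to 65535 inclusive.
DYNAMIC_PORTS = range(49152, 65536)


class PortAllocator:
    """Assigns each EPUB instance its own port in the dynamic range."""

    def __init__(self) -> None:
        self._free = iter(DYNAMIC_PORTS)
        self._by_instance: dict[str, int] = {}

    def root_url(self, instance_id: str) -> str:
        if instance_id not in self._by_instance:
            try:
                self._by_instance[instance_id] = next(self._free)
            except StopIteration:
                raise RuntimeError("no dynamic port left") from None
        return f"http://localhost:{self._by_instance[instance_id]}/"


allocator = PortAllocator()
print(len(DYNAMIC_PORTS))             # number of distinct instances available
print(allocator.root_url("mobydick"))
print(allocator.root_url("tomsawyer"))
```

The range holds 65535 - 49152 + 1 = 16384 ports, which is where the limit in the table comes from.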

rdeltour commented 2 years ago

One thing is that usually epub publications have internal metadata that uniquely identify the book but the file of the epub can have any name. Very often it is not the real title but a strange non-standard form like book_title_1st_ed.epub

This is why searching for an ePub publication, even in urls, should be in a query form IMHO.

It seems this is slightly off topic for the current issue: we're not trying to define what a URL to an existing EPUB should look like. This is not our responsibility. We're trying to identify if we can define the URL of the root directory of the EPUB OCF container in such a way that parsing relative URL strings becomes unambiguous.

Also there is the ! redirection symbol that could be useful, as a standard mean of showing that some "jump" has to be done from that point, even with a different method of retrieval.

I understand this is similar to solution 5 defined above?

I would like to point out the "active" nature of the redirection operation could be useful when changing "domain", like from filesystem to zip, or to package, to chapter, to paragraph. And not all passages have to be present. For example an epub resource could be on the internet, on a filesystem or inside the library of an app. (see also this issue: About URI scheme for ePub reader app with fragment identifiers linking into the publication from external link ). Does it make sense?

I'm not sure I fully understand your proposal. Maybe examples would help? 😊

P5music commented 2 years ago

The problem is that Root Directory is not defined as a URL, but quite abstractly as "the base of the OCF Abstract Container". The spec also says the root directory is "virtual in nature". In fact, RS may or may not generate a physical directory for the root directory (see OCF ZIP Container RS processing).

You seem to want to merge the internal "world" of an ePub publication, with the URL standard, but this can be obtained only with explicit role of the ! redirection symbol, IMHO.

The quoted text clearly states that a unique solution is not available, and it would be preventing a lot of uses in fact.

And also it is not uniquely defined on systems themselves, for example there are https:// or file:/// that can both be used in a browser, or in an application. Or the application can have its own scheme, or it can respond to the epub: scheme.

I think our two issues are not unrelated, and yes it is necessary to create a standard scheme to open the ePub, view or to access its internal content, it seems that we are saying the very same thing. You call it "the base URL", I call it "the scheme". It also seems that my issue is more general as to the possible uses.

The virtual nature of the nodes that are in a URL demands for a generalization.

A book could be identified by an URN but all happens today on the internet or devices so URL does make sense. It seems to demand the introduction of the "virtual nodes" of domain specification with the ! symbol.

Because even if some implementations could yield the internal content when a complete https:// URL is used, still it is the system that decides to allow that "jump", as it was an internet resource access. Maybe that URL form is linear and plain but it would be misleading for other cases.

For example my app does not do that. Also my app is on a device, not on the internet.

eBooks are possibly located in different locations, according to what system can search, deliver or view them.

The redirection symbol ! could be useful to solve the conundrum of different uses of the URLs, different origins and destinations, asking the system to find or open the virtual root that best suits the user's request according to the system features.

A standard form should allow the retrieval of the ePub with definite metadata, like isbn or HTML-encoded title, and epubcfi, even in the case the user just wants to open it in any app that's available on a device.

It's like a sort of search function that is embedded in the URL, because in most cases the user has an app where the ePub can be found in the library.

So the common case would be a legit and simple URL with a scheme that informs the device an app has to be launched to open that ePub at that cfi location, or that it has to be searched on a webservice, or directly opened in a server folder on the internet. These cases have increasingly specific URLs as to host, authority and path.

It seems that there are some differences between the major OSs, like Android and iOS just to mention two, so https:// is sort of universal. "Hyper text transfer protocol" is not far at all from the ePub type of content. And it seems that it is already used as a de-facto standard for interoperability between browsers and apps or to provide a fallback.

But of course there are other simple or complex forms of URL that can be used.

First and foremost the concern about the authority and host parts of an URL should be addressed. It should allow both a specific app and a generic app to handle the request.

I put down many URL forms, for example:

https:///library!/book?title=thetitle&isbn=theisbn&epubcfi=/6/4[chapter01]!/4/2 (would it be handled?)
https://reader!/library!/book?title=thetitle&isbn=theisbn&epubcfi=/6/4[chapter01]!/4/2 (would it be handled?)
https:///library!/open?title=thetitle&isbn=theisbn&epubcfi=/6/4[chapter01]!/4/2 (would it be handled?)

or also:

epub:library!/book?title=thetitle&isbn=theisbn&epubcfi=/6/4[chapter01]!/4/2
epub:library!?title=thetitle&isbn=theisbn&epubcfi=/6/4[chapter01]!/4/2
epub:reader!/library!?title=thetitle&isbn=theisbn&epubcfi=/6/4[chapter01]!/4/2

mattgarrish commented 2 years ago

What if we replace the fragment identifier part with the ZIP path, so we end up with something like:

"When an EPUB Publication is zipped, the base URL of the Package Document is obtained by combining the base URL for the EPUB Container with the ZIP path to Package Document. The resulting absolute URLs obtained by combining the Package Document's base URL with the relative URL references in the Package Document MUST resolve at or below the base URL for the EPUB Container."

Would that still require us to get into schemes? Or am I missing another problem?

rdeltour commented 2 years ago

What if we replace the fragment identifier part with the ZIP path (…) am I missing another problem?

The main problem with this approach is that we enable Schrödinger’s EPUBs, which are both conforming and not-conforming at the same time, until a processor decides how it effectively combines the EPUB URL and package path.

EPUBCheck cannot tell if an EPUB is conforming. An author cannot tell if their EPUB will work in an interoperable manner.

mattgarrish commented 2 years ago

until a processor decides how it effectively combines the EPUB URL and package path

What decision making is left, though? If you treat the path to the package document as a relative URL instead of a fragment identifier then what complicates generating an absolute URL from that?

I get having the EPUB file in the base URL makes a mess, but what if the base URL excludes that on the assumption that you're unpacking that file into the directory where it's currently located?

So from: https://example.org/acme/mobydick.epub

The base URL of the EPUB is: https://example.org/acme/

Then from parsing 'https://example.org/acme/' with 'EPUB/package.opf' you get https://example.org/acme/EPUB/package.opf

From that the base URL of the package document is 'https://example.org/acme/EPUB/'

And with that why can't you continue to check the relative URLs in the package document to ensure that they are at 'https://example.org/acme/' or below?

There's still the problem of the root directory of the EPUB being virtual, but that's something we can only warn reading systems that if they don't maintain it they may allow references to leak outside the EPUB.
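
The check described in this comment could be sketched roughly as follows. The `at_or_below` helper is hypothetical, the prefix test is only one possible reading of "at or below", and `urljoin` (RFC 3986) stands in for the URL Standard parser.

```python
from urllib.parse import urljoin, urlsplit


def at_or_below(url: str, root: str) -> bool:
    """True if `url` has the same scheme and authority as `root` and its
    path starts with the root's path (which must end with '/')."""
    u, r = urlsplit(url), urlsplit(root)
    return (u.scheme, u.netloc) == (r.scheme, r.netloc) and u.path.startswith(r.path)


# Assumed base URL of the EPUB: the directory holding mobydick.epub.
ROOT = "https://example.org/acme/"
PACKAGE = urljoin(ROOT, "EPUB/package.opf")

for url_string in ("nav.xhtml", "../video/cat.mp4", "../../secret"):
    resolved = urljoin(PACKAGE, url_string)
    print(f"{url_string!r:20} -> {resolved}  ok={at_or_below(resolved, ROOT)}")
```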

P5music commented 2 years ago

There's still the problem of the root directory of the EPUB being virtual, but that's something we can only warn reading systems that if they don't maintain it they may allow references to leak outside the EPUB.

This should be enough to dismiss this approach, because you are mixing two "worlds".

Outside an EPUB container there is "nothing", it is not connected with other files that are around in that directory where that ePub and other ones were unpacked. It's wrong IMHO. Because it is not like ../../../ Wrong paths should not be able to "go outside" the ePub folder (leak). Zip files have not something "outside" because that should be like opening and unzipping another file in memory in active manner, when they are not effectively unpacked.

What is the URL of the root directory? Is it the URL of the ZIP file? or extracted directory? or constructed based on the URL of the ZIP file? how? or it's up to the RS to define it?

I think the root directory is just a virtual limit, cannot be combined with an external path, even if some implementations just unpack the folder and then the content is internally accessed as files in a folder. My app does so but it's only internal access to filesystem. You have to use a complete path from the / to the private app folder and then the library and the ePub folder.

But it cannot be a specification. Instead it should be avoided and forbidden. Am I wrong?

This is why I asked for a way to access the ePub content that is modern and flexible with special "virtual node" features for those systems that can be compatible with them.

dauwhe commented 2 years ago

@mattgarrish wrote:

So from: https://example.org/acme/mobydick.epub

The base URL of the EPUB is: https://example.org/acme/

Then from parsing 'https://example.org/acme/' with 'EPUB/package.opf' you get https://example.org/acme/EPUB/package.opf

From that the base URL of the package document is 'https://example.org/acme/EPUB/'

Should I worry that this means that https://example.org/acme/mrsdalloway.epub (and every other EPUB on example.org) would be same-origin with Moby-Dick?

rdeltour commented 2 years ago

What decision making is left, though? If you treat the path to the package document as a relative URL instead of a fragment identifier then what complicates generating a absolute URL from that?

It seems I didn't understand your proposal. I thought it was intentionally open to any kind of combination. But instead you're saying that the URL of the Package Document is the result of applying the URL parser to the path of the Package Document relative to the root directory, with the URL of the EPUB publication as the base URL. Correct?

I can see several issues with that. See below.

I get having the EPUB file in the base URL makes a mess, but what if the base URL excludes that on the assumption that you're unpacking that file into the directory where it's currently located?

So from: https://example.org/acme/mobydick.epub

The base URL of the EPUB is: https://example.org/acme/

Then from parsing 'https://example.org/acme/' with 'EPUB/package.opf' you get https://example.org/acme/EPUB/package.opf

From that the base URL of the package document is 'https://example.org/acme/EPUB/'

OK for the examples.

Just some nitpicking to be perfectly clear: speaking of base URL of a document is not standard terminology. For the URL standard, any URL can be called "base URL" in some context. For the URL parser specifically, base URL is an (optional) argument that is used to parse the input argument. Passing https://example.org/acme/mobydick.epub or https://example.org/acme/ as the base URL argument of the URL parser is functionally equivalent.
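
A quick way to see the equivalence for relative URL strings is the sketch below. It uses Python's `urljoin` (RFC 3986) as a stand-in for the URL Standard parser, and the inputs are arbitrary examples.

```python
from urllib.parse import urljoin

FILE_BASE = "https://example.org/acme/mobydick.epub"
DIR_BASE = "https://example.org/acme/"

# The last path segment of the base is dropped before a relative
# path is appended, so both bases give the same result.
for url_string in ("EPUB/package.opf", "../video/cat.mp4", "/secret"):
    from_file = urljoin(FILE_BASE, url_string)
    from_dir = urljoin(DIR_BASE, url_string)
    print(f"{url_string!r:20} -> {from_file}  same={from_file == from_dir}")
```

One corner case worth keeping in mind: an empty URL string resolves to the base URL itself, so there the two bases do give different results.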

And with that why can't you continue to check the relative URLs in the package document to ensure that are at 'https://example.org/acme/' or below?

One issue is editorial. It may not be easy to formally specify "being at some URL or below".

Another issue is that this approach is not contained (as defined in the problem statement) and can potentially create conflicts.

To take your example with the EPUB at https://example.org/acme/mobydick.epub. Say I happen to have a remote resource located at https://example.org/acme/video/cat.mp4. And a package item with href="../video/cat.mp4". The item is "at https://example.org/acme/ or below". Yet it conflicts with an existing Web resource. We cannot reasonably start requiring that EPUB publications are only located in places that do not contain or serve other resources.

I'm not even considering URLs that leak outside the container. Even if we put some spec work to forbid it, authors will do it so we have to handle that case in the RS spec.

Finally, the approach is not unambiguous (as defined in the problem statement). Suppose your EPUB has a package item with href="../../acme/EPUB/doc.xhtml". If your EPUB is at https://example.org/acme/mobydick.epub, the item URL resolves to https://example.org/acme/EPUB/doc.xhtml which (if I understand correctly) is conforming. But if the EPUB is at https://example.org/epub/mobydick.epub, the item URL also resolves to https://example.org/acme/EPUB/doc.xhtml, but that is then not conforming (not at the publication level or below).
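
The location dependence can be checked with a small sketch. The second location (`https://example.org/epub/mobydick.epub`) is an assumption chosen so that the two `..` steps climb above the EPUB's own directory; the helper names are hypothetical, and `urljoin` (RFC 3986) stands in for the URL Standard parser.

```python
from urllib.parse import urljoin, urlsplit

ITEM_HREF = "../../acme/EPUB/doc.xhtml"


def resolve_item(epub_url: str, href: str = ITEM_HREF) -> tuple[str, bool]:
    """Resolve `href` from EPUB/package.opf and report whether the result
    stays at or below the directory that holds the EPUB file."""
    root = urljoin(epub_url, ".")                  # directory of the EPUB
    package = urljoin(root, "EPUB/package.opf")
    resolved = urljoin(package, href)
    inside = urlsplit(resolved).path.startswith(urlsplit(root).path)
    return resolved, inside


for location in ("https://example.org/acme/mobydick.epub",
                 "https://example.org/epub/mobydick.epub"):
    print(location, "->", resolve_item(location))
```

The same file yields the same resolved URL in both places, but only the first location keeps it "at or below" the publication.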

There's still the problem of the root directory of the EPUB being virtual, but that's something we can only warn reading systems that if they don't maintain it they may allow references to leak outside the EPUB.

In fact, can we even assume that an EPUB publication has a URL? 🤔 (serious question).

rdeltour commented 2 years ago

Should I worry that this means that https://example.org/acme/mrsdalloway.epub (and every other EPUB on example.org) would be same-origin with Moby-Dick?

In fact the approach pretty much shares the characteristics of solution 5 (the "zip URL + !" approach). Two EPUBs can be same-origin indeed.
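
For `http` and `https` URLs, the origin boils down to a (scheme, host, port) tuple, so the same-origin concern can be illustrated with a few lines. This is a simplified sketch: the `origin` helper is hypothetical and ignores opaque origins and other schemes.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Simplified origin tuple for http(s) URLs."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme))


# Two EPUBs served from the same site share an origin...
print(origin("https://example.org/acme/mobydick.epub")
      == origin("https://example.org/acme/mrsdalloway.epub"))
# ...while per-instance hosts or ports (solutions 2 and 6) keep them apart.
print(origin("https://1234.epub.w3.org/") == origin("https://4242.epub.w3.org/"))
print(origin("http://localhost:49152/") == origin("http://localhost:50505/"))
```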

I'd be interested to hear more about what people think about the objectives listed in the problem statement. So I second @dauwhe's question 👀😊.

dauwhe commented 2 years ago

I kind of like the localhost with unique port idea, largely because it also gives us the origin properties we want--an EPUB is same-origin with itself, but cross-origin with all other EPUBs. Since it's in some sense fiction, do we need to worry about only having the 16k ports?

rdeltour commented 2 years ago

@P5music

There's still the problem of the root directory of the EPUB being virtual, but that's something we can only warn reading systems that if they don't maintain it they may allow references to leak outside the EPUB.

This should be enough to dismiss this approach, because you are mixing two "worlds".

Outside an EPUB container there is "nothing", it is not connected with other files that are around in that directory where that ePub and other ones were unpacked. It's wrong IMHO. (…) Am I wrong?

I think you're right: any solution based on the actual URL of the EPUB publication will not be contained and probably not be totally unambiguous (using this issue's terminology).

mattgarrish commented 2 years ago

One issue is editorial. It may not be easy to formally specify "being at some URL or below".

That was only for example. The abstract container is a virtual file system, so it doesn't really matter what scheme/url constitutes the base of the EPUB. You assign whatever base URL you want for the root directory and you can operate on the ZIP paths as though they are relative URLs.

That's how this can't be a conflict:

Say I happen to have a remote resource located at https://example.org/acme/video/cat.mp4. And a package item with href="../video/cat.mp4". The item is "at https://example.org/acme/ or below`.

The video is not in the abstract container. You can't generate URLs for the zipped content and then go looking at things around the zipped file. (If you're going to unzip the content where resources already exist, well then you've made your own mess we can't solve.)

The generated URLs don't correspond to physical resources, so all they can tell you is whether the resource would conceptually fall within an actual file system representation of the abstract container. If you want to know if the resource is in the container, you still have to take the path segment corresponding to the root directory and below and see if there's a matching resource in the ZIP container.
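
The lookup described in the last paragraph might look roughly like this. It is a sketch under assumptions: the root URL is arbitrary, the helper names are hypothetical, the ZIP is built in memory for the example, and `urljoin` (RFC 3986) stands in for the URL Standard parser.

```python
import io
import zipfile
from urllib.parse import unquote, urljoin, urlsplit


def container_path(url: str, root: str):
    """Return the path of `url` relative to `root`, or None if outside it."""
    u, r = urlsplit(url), urlsplit(root)
    if (u.scheme, u.netloc) != (r.scheme, r.netloc) or not u.path.startswith(r.path):
        return None
    return unquote(u.path[len(r.path):])


def in_container(url: str, root: str, archive: zipfile.ZipFile) -> bool:
    """True if the URL maps to an entry that exists in the ZIP container."""
    path = container_path(url, root)
    return path is not None and path in archive.namelist()


# Tiny in-memory stand-in for an EPUB ZIP container.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    writer.writestr("EPUB/package.opf", "<package/>")
    writer.writestr("EPUB/nav.xhtml", "<html/>")
archive = zipfile.ZipFile(buffer)

ROOT = "https://example.org/acme/"   # any base URL chosen for the root directory
PACKAGE = urljoin(ROOT, "EPUB/package.opf")
print(in_container(urljoin(PACKAGE, "nav.xhtml"), ROOT, archive))
print(in_container(urljoin(PACKAGE, "../video/cat.mp4"), ROOT, archive))
```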

rdeltour commented 2 years ago

The abstract container is a virtual file system, so it doesn't really matter what scheme/url constitutes the base of the EPUB. You assign whatever base URL you want for the root directory and you can operate on the ZIP paths as though they are relative URLs.

That's how this can't be a conflict:

Say I happen to have a remote resource located at https://example.org/acme/video/cat.mp4. And a package item with href="../video/cat.mp4". The item is "at https://example.org/acme/ or below`.

The video is not in the abstract container. You can't generate URLs for the zipped content and then go looking at things around the zipped file. (If you're going to unzip the content where resources already exist, well then you've made your own mess we can't solve.)

This is a conflict to evaluate conformance to the spec saying that "EPUB Creators (…) MUST ensure each URL is unique within the manifest scope after resolution to an absolute URL".

Unless we work around it by saying "only for URLs defined as relative URL strings" maybe? but that sounds flimsy. And it doesn't entirely solve the ambiguity (see below).

The generated URLs don't correspond to physical resources, so all they can tell you is whether the resource would conceptually fall within an actual file system representation of the abstract container. If you want to know if the resource is in the container, you still have to take the path segment corresponding to the root directory and below and see if there's a matching resource in the ZIP container.

Yeah, I'm not convinced that works, see the last example in my previous comment with the ../.. URLs, which resolve to in-container resources or not after parsing, depending on the base URL.

That the container is virtual makes me prefer a solution which uses by design a virtual space (like solution 2 (reserved space on w3.org or another safe domain), solution 3 (epub: URL), solution 4 (file: URL with virtual host), or solution 6 (localhost with unique port)).
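To make the dependency on the base URL concrete, here is a small sketch. The two root URLs and the relative URL are hypothetical, and Python's `urljoin` follows RFC 3986 rather than the URL Standard, although the two should agree for simple cases like these:

```python
from urllib.parse import urljoin

def resolves_inside(relative, document_url, root_url):
    """Resolve `relative` against `document_url`; report if it stays under `root_url`."""
    resolved = urljoin(document_url, relative)
    return resolved, resolved.startswith(root_url)

REL = "../../video/cat.mp4"

# Case A (hypothetical): the container root is a path on a web server.
root_a = "https://example.org/acme/mobydick.epub/"
print(resolves_inside(REL, root_a + "EPUB/package.opf", root_a))

# Case B (hypothetical): the container root is the root of its own origin.
root_b = "http://localhost:49152/"
print(resolves_inside(REL, root_b + "EPUB/package.opf", root_b))
```

In case A the same string escapes to `https://example.org/acme/video/cat.mp4`; in case B the extra `..` is absorbed at the root and the result stays in the container.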

P5music commented 2 years ago

I think that the ePub publication is considered a "system" so the folder structure refers to a root, like / on a Linux system.

Base url is root.

Then, connecting the external world URL to the internal ePub system is like connecting two computers. It is like reaching out to a server folder with a web URL, and then finding a sort of symlink to another computer. You do not access it directly, unless the ePub is "mounted". And there is not a default way of doing it, so putting it in the ePub specifications is not good.

Moreover, the fact that the ePub publication is a self-contained "mini-PC" has the only purpose, I think, to help ePub readers to handle the content with a WebView component, like WebKit, that relies on the filesystem.

And it is very likely that an ePub reader can manage the ePub publication as a zipped file, or as an unpacked zipped file.

If the ePub is unpacked it is just for internal convenience, it does not become part of an "official" filesystem structure. It is so in fact, but the user knows nothing about it, and external systems know nothing too, cannot even perform a request, it should be forbidden, the reader should not respond to such requests.

If the ePub is not unpacked it is accessed in memory, special API methods have to be used to intercept the WebView (browser) resource requests, like an image, a css file and so on, because AFAIK the WebView does not access directly the zip as the filesystem.

Considering the EPUB folder at the same level of https://example.org/acme/mobydick.epub is flawed because if acme is a sort of library, a folder where many ePubs are unpacked, and mobydick.epub is a sort of reminder of its EPUB folder, it is possible that other ePubs use the same EPUB name for their corresponding folder, at the same level.

So IMHO there is no need to find a way to connect the ePub "mini-PC" with a global system with the "URL way". URL standard compliance requisites are internal to the ePub "mini-system".

The URLs in ePubs have just not to exceed the root level, readers just should check it and ignore leaking URLs, presenting an error or informative dialog about that occurrence.

I think that the ! symbol is like "mounting" but I think it could be more general and useful, stopping there the path parsing, and starting a sort of query part of the url. (But that is another issue, maybe I am to solve it by myself by asserting my way of doing it with my app, maybe you can also participate in the issue I created)

The zip + ! approach seems to be reasonable here, although this issue seems to be about internal ePub URL validation and not how to have meaningful URLs for apps.

mattgarrish commented 2 years ago

This is a conflict to evaluate conformance to the spec saying that "EPUB Creators (…) MUST ensure each URL is unique within the manifest scope after resolution to an absolute URL".

Oh, that's right. I forgot about that part.

Ya, I've been focusing on interpretation solely within the abstract container (i.e., let developers pick any method that works for what they need to know and how they're obtaining the content).

Hm... I'm coming around to your point now... 😄

iherman commented 2 years ago

First, thanks to @rdeltour for putting all this together!

Just some random thoughts while reading through the proposed alternatives

(Current situation): I think we should forget about it:-) We must, in my view, sort this out.
(Leave it to the RS): as you said yourself, only a mild improvement; we just wash our hands. This is not what interoperability should look like...
(Use a reserved special URL): I have a certain level of problem with creating and using a URL that looks like a full-blown Web URL, but which would be dereferenced to a 404 (in spite of the fact that the URL is never seen in real life). I realize this is not a very strong argument against it, though…
(Use a proprietary scheme): this would require properly registering the epub: scheme with IANA. Apart from the fact that this would be an administrative burden, there may be some issues whether we can produce an official submission using RFC 7595 that IANA would accept (worth looking at that RFC to see what it entails). We could of course go down that line if we do not have any other acceptable solution, which, in my view, is not the case.

Leaving it undefined may be the source of other issues; there may be JS libraries out there that check the URI scheme for validity. This may become a problem if we want to keep to standard and widely used tools out there. Let alone the fact that there is no guarantee that a different community would not formally register an epub: scheme for something totally different.
(Use a file URL): are we sure that all standard tools out there accept to handle file URL-s properly? Isn't it possible that the user may inadvertently use a JS library that might fall on its face? Also, in the current spec we do say that href values SHOULD NOT use a file URI scheme (see §2.3.2.1) which is a bit inconsistent with us using it anyway…
(Use a special syntax): using a special syntax means that we would have to specify it, test it, etc., which may become a pain. After all, the usage of '!' is, afaik, highly non-standard. I see that as a major problem.

(I realize that '!' is used in epubcfi but, at this moment, we do not intend to put epubcfi on a standard track, i.e., that is not relevant here)
(Use a localhost): a bit like with the file approach: are we sure that all standard tools out there work well with localhost? I think the danger may be (even) less than with a file scheme but, nevertheless, we should check.

In a later comment, you say:

That the container is virtual makes me prefer a solution which uses by design a virtual space

Absolutely, +1 to that

(like solution 2 (reserved space on w3.org or another safe domain), solution 3 (epub: URL), solution 4 (file: URL with virtual host), or solution 6 (localhost with unique port)).

As far as I am concerned, I would prefer to drop solution 3, which leaves us with reserved domain, file:, or localhost:. My (mild) preference goes to localhost...

dauwhe commented 2 years ago

(I would propose that we add something roughly like the following to the start of 6.1.3, and change the name of the section:)

URLs and the OCF Abstract Container

In order to explain the behavior of EPUB with respect to URLs and the web security model, we find it useful to imagine that the Root Directory of the OCF Abstract Container has a defined URL. EPUB Reading Systems will not present the contents of an EPUB to users with such a URL; it is merely a concise way of describing a complex set of behaviors.

The URL of the Root Directory is defined as follows:


scheme	`http` or `https`
host	`localhost`
port	a unique dynamic port is assigned to each individual EPUB Publication

This has the following implications:

All the local resources of a given EPUB are same-origin.
All local resources of any other EPUB are not same-origin to the first EPUB.
The URL of the root directory serves as the base URL for all files within the META-INF directory.
The URL of the package file is constructed by finding the relative path from the root directory to the package file and appending it to the URL of the root directory.

Example:


URL of root container	http://localhost:49152/
path to package file	OPS/package.opf
URL of package file	http://localhost:49152/OPS/package.opf
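A minimal sketch of these implications, assuming the port `49152` from the example and a second, hypothetical port `49153` for another EPUB. The `origin` helper is illustrative only, and `urljoin` follows RFC 3986 rather than the URL Standard:

```python
from urllib.parse import urljoin, urlsplit

def origin(url):
    """Return the (scheme, host, port) tuple that defines a URL's origin."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)

root_a = "http://localhost:49152/"  # first EPUB (port from the example)
root_b = "http://localhost:49153/"  # second EPUB (hypothetical port)

# The package file URL is the root URL plus the relative path to the file.
package_a = urljoin(root_a, "OPS/package.opf")
print(package_a)

# Resources of one EPUB share an origin; another EPUB gets a different one.
print(origin(package_a) == origin(urljoin(root_a, "META-INF/container.xml")))
print(origin(package_a) == origin(urljoin(root_b, "OPS/package.opf")))
```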

dlazin commented 2 years ago

I think (?) the dynamic port doesn't help solve https://github.com/w3c/epub-specs/issues/1843. If you are a webserver admin and you want to let a given (known) ebook iframe your site, but prohibit other ebooks from doing so, how can you specify a dynamic origin?

bduga commented 2 years ago

I don't think there are enough ports to have a unique port per book. The number of available dynamic ports on Unix is around 16K or so, so even if this is just local to a single user a large library would fail (and yes, there are some very large libraries out there). Is there even a spec we can reference for dynamic ports?

dauwhe commented 2 years ago

I don't think there are enough ports to have a unique port per book.

I was thinking of this metaphorically. "As if the URL were..." rather than "The URL is...".

dlazin commented 2 years ago

I was assuming that a new port would be assigned when you open a book, and that the assumption we're then making is that you have fewer than 16,000 books simultaneously open. The next time you open the same book, you probably get a different port.

But that still runs into the predictability problem I mentioned above. Also I might be misunderstanding.

bduga commented 2 years ago

I don't think there are enough ports to have a unique port per book.

I was thinking of this metaphorically. "As if the URL were..." rather than "The URL is...".

Leave it to an ebook spec to use metaphor! :) I think though, that this runs into the same issue as xml namespaces. People expect these URLs to work, though perhaps localhost would be enough to dissuade them. On the other hand, maybe not. But if we use ports to localhost in a URL, it seems like people might expect them to be ports to a host in a URL, and be subject to those rules. We could instead not use ports, eg http://localhost-unique-id, or even just http://unique-id. Or maybe http://this-is-not-a-custom-scheme-we-promise-unique-id.

I was assuming that a new port would be assigned when you open a book

That might be fine, but we need to say that. And then we need to define what it means for a book to be "open", which is probably trickier than it sounds, and it already sounds tricky to me.

[Edit to remove things that looked like custom tags]

rdeltour commented 2 years ago

I think defining the container’s URL "as if" is reasonable for interpreting the conformance statements in the core spec.

Specifically, the "as if" approach (like in @dauwhe's proposal / solution 6) allows us to:

make and evaluate statements on URL uniqueness ("all items must be unique")
unambiguously map a URL string to a file in the OCF, or none.
avoid conflicts with absolute URLs (*)

In the RS spec, we could give more leeway on the implementation, as long as the RS must:

consider all container resources as same-origin
consider cross-container resources as not same-origin
resolve relative URL strings to container resources as defined in the core spec

(*) the only difficulty is to be 100% safe with authored absolute URLs. Say I write an EPUB which purposely contains an exhaustive list of localhost URLs with dynamic ports. It is theoretically hard to totally avoid conflicts, unless we forbid "localhost" absolute URLs.

rdeltour commented 2 years ago

@bduga

Is there even a spec we can reference for dynamic ports?

RFC6335
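For reference, RFC 6335 places the Dynamic Ports range at 49152 to 65535, which matches the "around 16K" figure mentioned earlier in the thread. A quick check of the arithmetic (the constant names are just for illustration):

```python
# Dynamic (private) port range as defined by RFC 6335.
DYNAMIC_PORT_MIN = 49152
DYNAMIC_PORT_MAX = 65535

# The range is inclusive at both ends.
dynamic_port_count = DYNAMIC_PORT_MAX - DYNAMIC_PORT_MIN + 1
print(dynamic_port_count)  # 16384
```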

rdeltour commented 2 years ago

@dlazin

I think (?) the dynamic port doesn't help solve #1843.

Correct, as far as I understand.

If you are a webserver admin and you want to let a given (known) ebook iframe your site, but prohibit other ebooks from doing so, how can you specify a dynamic origin?

The issue with non-dynamic origins is that they are no longer unique per instance. If you and I have a copy of the same EPUB, and they share the same origin, does it create vulnerabilities? Let's discuss this in #1843, and possibly come back here if #1843 implies new requirements for the current issue #1888?

bduga commented 2 years ago

Ok, I have stared at it a bit more and it seems reasonable. Maybe we can find a way to emphasize the imaginary nature of these constructs? Maybe "a unique dynamic port is assigned to each individual EPUB Publication" could be "individual EPUB Publications behave as if a unique dynamic port has been assigned to it". Hmm... that doesn't really work. But somehow make explicit that there is no actual port assignment.

rdeltour commented 2 years ago

I made a little EPUB to test the internal URLs used in some JS-supporting readers: my-url.zip (rename the zip extension to epub).

The logic is based on a (very naïve) JavaScript run in a content document (EPUB/content_001.xhtml file in the container). I assume that the URL of the root of the container is obtained from parsing the URL ../ with that document's URL as a base. The script then populates a table with a few URLs, including the package document's URL, but also the absolute root (/) and a possibly leaky URL (../../secret).

I only tested in iBooks (which uses URLs with a custom ibooks-epub: scheme), and Thorium (which uses URLs with a custom httpsr2: scheme).

Unfortunately, none seem to behave like the proposed "as if" case.

That experimentation is kinda flawed and limited, but it may be informative 😊
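For comparison, this is roughly what such a probe would report under the "as if" localhost model. It is a Python analogue rather than the actual test script; the document URL and the `package.opf` location are assumptions, and `urljoin` follows RFC 3986, not the JS URL API:

```python
from urllib.parse import urljoin

def probe_urls(document_url):
    """Resolve a few relative URL strings against a content document's URL."""
    probes = {
        "container root": "../",
        "package document": "package.opf",  # assumed to sit next to the document
        "absolute root": "/",
        "possibly leaky": "../../secret",
    }
    return {label: urljoin(document_url, rel) for label, rel in probes.items()}

# Hypothetical document URL under the "as if" localhost model.
table = probe_urls("http://localhost:49152/EPUB/content_001.xhtml")
for label, url in table.items():
    print(f"{label:18} {url}")
```

Under that model the container root and the absolute root coincide, and the leaky URL lands at `/secret` inside the same origin.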

P5music commented 2 years ago

What follows is just IMHO. I could not read past issues but just browse some referenced ones, I see that many of them revolve around the same problem: the root of the ePub publication.

When creating a modern ePub reader (RS) I think that developers or engineers rely on the powerful features of a filesystem or a WebView (like WebKit). Modern EPUB3 RS do not rely on parsing the paths and resources on their own. So when the ePub directory structure is available the best thing is to rely on the WebView feature to manage relative paths or URLs its natural way. So developers are not parsing paths themselves or checking where the relative paths lead, they just rely on the WebView picking the right file according to the path, as it happens in HTML pages. Once the WebView or the browser is fed with the XHTML page, images inside it, for example, are loaded according to the relative path. If a page is linked by another page in the ePub publication it is by means of normal relative URLs, even in different directories (going up some levels with ../).

So the ePub was created as a mini-website, likely to allow easy creation of RS and to exploit the WebKit or browser features directly and in a straightforward way. But it is also a book, so that has to stop at a certain level.

Connecting with a virtual URL for validating is not compatible with every way the ePub content is handled (unpacking or reading from zip are very different from each other). Not even mentioning the localhost:port URL form (really?).

And leaks are leaks, they are errors, they cannot be avoided by means of creating a complete URL that is validated, or even corrected.

I found this in another thread #1688,

Manifest items can identify resources with absolute URLs. So an EPUB can theoretically use local file system resource with file URLs.

Using file URLs in the manifest is not a good practice, but has never been strictly forbidden (as far as I know). I think it's probably often used by mistake rather than intentionally. But there might be legit use cases, like an internal documentation system (as @mattgarrish pointed out in #1374).

Would it be reasonable to say that absolute URLs SHOULD have a special scheme that is not file? In parallel we can also make it explicit what Reading Systems can/should do with absolute URLs as manifest entries.

You seem to deal both with validating the ePub content and with providing a way of handling the files in practice. I think the two goals should stay distinct. If you try to determine the practical way how the RS works I think you will end up creating constraints because you have not all possible cases in mind.

As it was also said "what's an open ePub?", and what's an archived ePub?, and is its content accessible from outside or has it to be "asked" to the RS or archive to "open" it or "provide" it?

As you can see many uses and cases are possible, so consider just the above mentioned case: if an ePub publication has references to a common documentation system, I think the publication per se is a closed "world" and it cannot reasonably be allowed or persuaded to read content from a filesystem, unless it is a special system but then it is not within the ePub specifications. So it should be possible that still the external documentation is accessible, and web urls like https:// are commonly used.

But if the local system is where the external content has to be accessed, and https:// URLS are not wanted, then here it is where the ! redirection would be useful, having a way of "opening" an ePub that could or could not be available in the local system but could be searched elsewhere as a fallback. So, in addition to relative URLs, some of the URLs in the ePub publication itself could be modern URLs like with a special scheme that is read, for example, on mobile devices, but also from another system in general, or http scheme, with special optional syntax also with a query part and an epubcfi part.

I see that

The asterisk ("*", ASCII 2A hex) and exclamation mark ("!" , ASCII 21 hex) are reserved for use as having special signifiance within specific schemes. from https://www.w3.org/Addressing/URL/4_URI_Recommentations.html

So is the epubcfi fragment invalid when it is used in URLs?

iherman commented 2 years ago

Ok, I have stared at it a bit more and it seems reasonable. Maybe we can find a way to emphasize the imaginary nature of these constructs? Maybe "a unique dynamic port is assigned to each individual EPUB Publication" could be "individual EPUB Publications behave as if a unique dynamic port has been assigned to it". Hmm... that doesn't really work. But somehow make explicit that there is no actual port assignment.

I think this is the main point: it is "imaginary", "virtual", and user-provided scripts should not rely on the existence of those localhost URLs. I do not have an issue with the sentence you propose, but I think we should also add something describing these restrictions on scripts.

iherman commented 2 years ago

I made a little EPUB to test the internal URLs used in some JS-supporting readers: my-url.zip (rename the zip extension to epub).

The logic is based on a (very naïve) JavaScript run in a content document (EPUB/content_001.xhtml file in the container). I assume that the URL of the root of the container is obtained from parsing the URL ../ with that document's URL as a base. The script then populates a table with a few URLs, including the package document's URL, but also the absolute root (/) and a possibly leaky URL (../../secret).

I only tested in iBooks (which uses URLs with a custom ibooks-epub: scheme), and Thorium (which uses URLs with a custom httpsr2: scheme).

Unfortunately, none seem to behave like the proposed "as if" case.

That experimentation is kinda flawed and limited, but it may be informative 😊

@rdeltour, see also my comment in https://github.com/w3c/epub-specs/issues/1888#issuecomment-958829747: I would think it should be made explicit that such scripts may be unpredictable in an EPUB environment. May that be the problem with the test?

For testing, I believe what should be done is to write tests along the lines of the requirement you have put in the issue itself (unambiguous, contained, unique, and origin-safe) (and not relying on scripts). The goal is

to clearly specify what we mean by those, and the description of @dauwhe in https://github.com/w3c/epub-specs/issues/1888#issuecomment-958132163 goes there, with lots of "as if"-s along the way, and
to test these features in terms of their behavior. After all, how a RS system achieves that is none of our business as long as the tests are passed...

rdeltour commented 2 years ago

@rdeltour, see also my comment in #1888 (comment): I would think it should be made explicit that such scripts may be unpredictable in an EPUB environment. May that be the problem with the test?

That little experimentation is limited for sure. But if the two RS I tested do not implement custom URL parsing logic and if they rely on the JS URL API, then at least in those two cases, their implementation does not 100% fit the behavior of the "as if" case.

For testing, I believe what should be done is to write tests along the lines of the requirement you have put in the issue itself (unambiguous, contained, unique, and origin-safe) (and not relying on scripts).

I agree with the principle. I don't know if/how that's practically testable 😊 (especially for the same-origin requirement).

The goal is

to clearly specify what we mean by those, and the description of @dauwhe in What base URLs to use for URL parsing in EPUB? #1888 (comment) goes there, with lots of "as if"-s along the way, and

to test these features in terms of their behavior. After all, how a RS system achieves that is none of our business as long as the tests are passed…

Right.

For (1), let's keep in mind that there's the core and RS spec. All the criteria may not be needed in both specs. We may not need to define or require them in the same place. For instance:

In the core spec, we essentially need to specify how a relative URL string resolves to an OCF file (i.e. solve the unambiguous and contained criteria).
In the RS spec, we can add the origin-safe and unique criteria (if they make sense).

(I'm thinking out loud here, exploring possibilities. I'm not quite sure yet how to best articulate this 😊).

rdeltour commented 2 years ago

@iherman

to clearly specify what we mean by [the requirement you have put in the issue itself]

I'm working on a proposal, stay tuned 😉

rdeltour commented 2 years ago

So, here's a proposal. Blockquotes starting with "📝 Comment:" are not part of the proposal, but my comments.

[In the EPUB 3.3 spec]

1.4 Terminology

📝 Comment:

Rename the term "Path Name" to "Path" (cosmetic change)

Replace the definition of Path Name with the more precise algorithmic definition below (based on Infra):

To get the Path of a file `file` in the OCF Abstract Container:

1. Let `path` be an empty list.
2. Prepend the File Name of `file` to `path`.
3. Let `parent` be the parent directory of `file`.
4. While `parent` is not the Root Directory:
   1. Prepend the File Name of `parent` to `path`.
   2. Set `parent` to the parent directory of `parent`.
5. Return the concatenation of `path` using U+002F SOLIDUS (/).
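A minimal sketch of this algorithm, assuming a toy `Entry` class (not part of any spec) in which only the Root Directory has no parent. It reads the last step of the loop as moving `parent` up one level on each iteration:

```python
class Entry:
    """A file or directory in a toy model of the OCF Abstract Container."""
    def __init__(self, name, parent=None):
        self.name = name      # the File Name
        self.parent = parent  # None only for the Root Directory

def get_path(file):
    """Return the Path of `file`: File Names below the root, joined by '/'."""
    path = [file.name]
    parent = file.parent
    # Walk up until the Root Directory (which has no parent) is reached.
    while parent.parent is not None:
        path.insert(0, parent.name)
        parent = parent.parent
    return "/".join(path)

root = Entry("")
epub = Entry("EPUB", root)
images = Entry("images", epub)
cover = Entry("cover.png", images)
print(get_path(cover))  # EPUB/images/cover.png
```

Note that the Root Directory's own name never appears in the result, so a file directly under the root has a Path equal to its File Name.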

6.1.3 URLs in the OCF Abstract Container

📝 Comment:

Replaces and renames the "Relative URLs for Referencing Other Components" section

I would find it more natural to switch the order of sections 6.1.3 (URLs) and 6.1.2 (File paths and names)

The container root URL is the URL [URL] of the Root Directory. It is implementation specific, but MUST verify the following:

the result of parsing [URL] "/" with the [container root URL]() as base is the container root URL
the result of parsing [URL] ".." with the [container root URL]() as base is the container root URL

The container URL of a file or directory in the OCF Abstract Container is the result of parsing the file's Path with the [container root URL]() as base.
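The two properties can be checked mechanically. The sketch below uses hypothetical helper names and Python's `urljoin`, which implements RFC 3986 resolution rather than the URL Standard parser, so it only approximates the definitions for simple, already URL-safe paths:

```python
from urllib.parse import urljoin

def is_valid_container_root(root_url):
    """Check that "/" and ".." both resolve back to the root URL itself."""
    return (urljoin(root_url, "/") == root_url
            and urljoin(root_url, "..") == root_url)

def container_url(root_url, path):
    """Container URL of a file: its Path resolved against the container root URL."""
    return urljoin(root_url, path)

# A root at the top of its own origin satisfies both properties.
print(is_valid_container_root("http://localhost:49152/"))
# A root nested under a web path does not: "/" escapes to the server root.
print(is_valid_container_root("https://example.org/acme/mobydick.epub/"))
print(container_url("http://localhost:49152/", "EPUB/package.opf"))
```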

📝 Comment:

I'm not sure if these definitions instead belong to the terminology section. (I generally prefer when terms are defined in their topic sections, like in CSS or HTML specs, but it may be a bikeshed vs respec thing and not the point of the current issue ;-)

Applying URL parsing to a File Name may produce validation errors (since a path string is not URL-encoded), but the algorithm will not fail and still return a URL record (if I understand correctly)

In the OCF Abstract Container, when a file uses a URL string to reference another file in the container, the string MUST be a path-relative-scheme-less-URL string, optionally followed by U+0023 (#) and a URL-fragment string.

EXAMPLE 45 copied as-is

NOTE The properties of the [container root URL]() are such that whatever the number of double-dot path segments in a URL string (for example, ../../../secret), it will be parsed to a container URL (and not "leak" outside the container). However, for better interoperability with non-conforming or legacy Reading Systems, EPUB Creators should avoid using more double-dot path segments than needed to reach the target container file.
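An authoring tool or checker could flag such URL strings with a simple count. This is only a sketch: the function name is hypothetical, and it assumes the referencing file is identified by its Path and that the URL string has no query part:

```python
def climbs_above_root(source_path, relative_url):
    """True if `relative_url` has more ".." segments than `source_path` has parent directories."""
    # Number of directories between the Root Directory and the referencing file.
    level = source_path.count("/")
    # Ignore any fragment, then walk the path segments one by one.
    for segment in relative_url.split("#")[0].split("/"):
        if segment == "..":
            level -= 1
            if level < 0:
                return True  # this ".." would step above the Root Directory
        elif segment not in ("", "."):
            level += 1
    return False

print(climbs_above_root("EPUB/content_001.xhtml", "../../secret"))
print(climbs_above_root("EPUB/content_001.xhtml", "../META-INF/container.xml"))
```

The first call reports an excess double-dot segment; the second stays within the container.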

📝 Comment:

we're requiring path-relative-scheme-less-URL strings (+ optional fragment), so that we don't have to explicitly call out the other kind of relative-URL strings (like path-absolute URL strings).

the note on the double-dot-abusing URLs intends to clarify that even if it is not normatively forbidden, it is a bad idea.

I removed the paragraph saying "All relative URLs MUST, after parsing, identify resources within the OCF Abstract Container (i.e., at or below the Root Directory)".

by construction, the resulting URLs cannot leak out of the root directory

I don't know if this was meant to explicitly require that resources exist? If that's the case, then we can reword it along the lines of: "All relative-URL-with-fragment strings MUST, after parsing, be equal to the [container URL]() of an existing file in the OCF Abstract Container."

all the normative text related to META-INF is moved to a subsection of 6.1.5, as it is not related to all container URLs, but to the specific case of those in META-INF.

6.1.5.x Parsing URLs in the `META-INF` directory

📝 Comment:

This is an adaptation of the META-INF text moved from 6.1.3

I prefer to put it in the core spec, because processors other than RS need this: authoring tools, checkers, etc.

To parse a URL string url used in files located in the META-INF directory, apply the URL Parser [URL] to url, with the [container root URL]() as base.

📝 Comment: HTML, for instance, makes provision for passing the character encoding of the document to the URL parser, for legacy reasons. I don't know if we need this, or if we can apply the URL parser directly as above?

EXAMPLE 46

For example, if META-INF/container.xml has the following content:
<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <rootfile full-path="EPUB/Great_Expectations.opf"
            media-type="application/oebps-package+xml" /> 
    </rootfiles>
</container>
then the path EPUB/Great_Expectations.opf is relative to the root directory for the OCF Abstract Container and not relative to the META-INF directory.
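The difference between the two bases can be sketched as follows, assuming a hypothetical container root URL of `http://localhost:49152/` and using `urljoin` (RFC 3986 resolution) as a stand-in for the URL parser:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
    <rootfiles>
        <rootfile full-path="EPUB/Great_Expectations.opf"
            media-type="application/oebps-package+xml" />
    </rootfiles>
</container>"""

NS = {"ocf": "urn:oasis:names:tc:opendocument:xmlns:container"}
ROOT_URL = "http://localhost:49152/"  # hypothetical container root URL

def rootfile_urls(container_xml, base_url):
    """Resolve every rootfile's full-path against `base_url`."""
    tree = ET.fromstring(container_xml)
    rootfiles = tree.findall("ocf:rootfiles/ocf:rootfile", NS)
    return [urljoin(base_url, rf.get("full-path")) for rf in rootfiles]

# Intended behavior: the container root URL is the base.
print(rootfile_urls(CONTAINER_XML, ROOT_URL))
# What would happen if the document's own URL were used as the base instead.
print(rootfile_urls(CONTAINER_XML, ROOT_URL + "META-INF/container.xml"))
```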

📝 Comment: The example description is moved to the comment’s body, so it's not part of the normative content.

2.3.2 Parsing URLs in the Package Document

📝 Comment:

This section replaces and renames section 3.1 "Parsing Relative URLs" of the RS spec.

It is moved to the core spec, because processors other than RS need this: authoring tools, checkers, etc.

To parse a URL string url used in the Package Document, apply the URL Parser [URL] to url, with the [container root URL]() as base.

[In the EPUB Reading Systems 3.3 spec]

3.1 Parsing Relative URLs

📝 Comment: moved to the core spec (see above).

7.1.1 URL of the Root Directory

📝 Comment: This section replaces and renames the "Relative URLs for Referencing Other Components" section

Reading Systems MUST assign a URL [URL] to the Root Directory of the OCF Abstract Container. This URL is called the [container root URL](). The URL itself is implementation specific, but:

the result of parsing [URL] "/" with the [container root URL]() as base MUST be the [container root URL]()
the result of parsing [URL] ".." with the [container root URL]() as base MUST be the [container root URL]()
the origin of the [container root URL] MUST be unique for each EPUB Publication instance
the [container URLs]() of all the files in the OCF Abstract Container MUST be same-origin with one another.

📝 Comment:

The first two bullets should be enough to ensure that the result of URL parsing is unambiguous and contained (as defined in the opening problem statement).

The third bullet probably overrides and replaces the specific origin-related requirement of the Scripting section?

Current reading systems may not respect these origin-related criteria. But I still prefer a MUST over a SHOULD, for security/privacy reasons. Or maybe SHOULD in the general case, and MUST for scripting-supporting RS?

These origin-related criteria may not be sufficient to ensure a safe usage of some APIs (e.g., localStorage). For instance, it doesn't require the same origin to be preserved every time a user opens the same EPUB instance. Or it doesn't say if the uniqueness requirement is at a point in time or stands over time. This is somehow tangential to the current issue (see also the discussion at #1156), but I think it should be clarified.

The localhost + unique port analogy is moved to a note (see below), so its virtual nature is more explicit.

NOTE The required properties of the [container root URL]() are such that it behaves similarly to a URL defined as follows:

URL component	Value
scheme	`http` or `https`
host	`localhost`
port	a dynamic port uniquely assigned to the EPUB instance

For example:

Container File	File Path	URL
Root Directory	empty string	`http://localhost:49152/`
Package Document	`OPS/package.opf`	`http://localhost:49152/OPS/package.opf`

📝 Comment:

The localhost + port solution has limitations. Notably, it cannot be used if we start requiring that the origin is preserved when opening the same EPUB another time; or it cannot guarantee other EPUB instances opened at another time will not be same-origin. I still think it is relevant as an example; do we need to make these limitations explicit? or at least say it is known to be limited? or more strongly say it is a mere analogy and RS can very well use another solution?

Comments warmly welcome, especially from RS folks 😊.

iherman commented 2 years ago

@rdeltour this sounds great to me. I particularly like the way you describe the root in abstract, leaving it to the implementation whether they use localhost or anything else. As a non-implementer it sounds perfect to me as some sort of mental model.

But, as you say, the real answer should come from the RS folks.

Would it help if I attempted to fold that into the spec in the form of a PR, so that people could see the changes as part of the spec (with also a diff file)? I am happy to make an attempt, although I might get it wrong here and there.

CC: @wareid @bduga @HadrienGardeur @hober @llemeurfr @danielweck @fchasen @rickj @mteixeira-wwn

iherman commented 2 years ago

MUST be unique for each EPUB Publication instance

I am not sure if it is clear what "EPUB Publication instance" means. Do we rely on the model whereby a RS makes some sort of (virtual or physical) copies of books it receives? I believe that is what happens with most of the Reading Systems even on a Mac, but I do not know whether this can be considered as a rule. If not, then what happens if two RS-s read the same EPUB file on my disc? I guess these should be considered as separate "instances"...

iherman commented 2 years ago

The third bullet probably overrides and replaces the specific origin-related requirement of the Scripting section?

From a spec point of view, that is true; the new statements make the second bullet item of the second list obsolete. I might still prefer to leave something in the section referring to the new text.

iherman commented 2 years ago

I'm not sure if these definitions instead belong to the terminology section. (I generally prefer when terms are defined in their topic sections, like in CSS or HTML specs, but it may be a bikeshed vs respec thing and not the point of the current issue ;-)

For better or worse, the current spec puts the <def>-s into §1.4 Terminology, so we should probably follow this for consistency's sake.

rdeltour commented 2 years ago

@iherman

Would it help if I attempted to fold that into the spec in the form of a PR, so that people could see the changes as part of the spec (with also a diff file)? I am happy to make an attempt, although I might get it wrong here and there.

Yeah, I wondered that too. I started jotting this down in markdown, with interspersed comments, so it ended up as a comment and not in a PR. Also, I wanted to see if the group agreed with the direction. But feel free to turn that into a PR! (Or I can do it if you prefer, but not until next week).

I am not sure if it is clear what "EPUB Publication instance" means

Right, I agree. I copied that term from the existing "Scripting" section, but this is rather (intentionally?) vague. I think this would be worth clarifying, especially if we want to further think about (and ideally specify) the origin-related requirements.

iherman commented 2 years ago

But feel free to turn that into a PR! (Or I can do it if you prefer, but not until next week).

I am working on it. It may be ready later today, or tomorrow at the latest.
