w3c / epub-specs

Shared workspace for EPUB 3 specifications.

What is origin in epub context? #873

Closed JayPanoz closed 3 years ago

JayPanoz commented 8 years ago

OK so this might be a security issue to some extent.

As far as I know, there’s nothing about “origin” in the EPUB spec.

Why is this an issue?

Because localStorage. See https://html.spec.whatwg.org/multipage/webstorage.html#the-localstorage-attribute and https://html.spec.whatwg.org/multipage/browsers.html#concept-origin

In other words, treating the Reading System as the origin is valid, which means you can retrieve every item stored in the RS and not only the local storage area for one EPUB file.

Now, at the moment, it appears that in some RS you can get items set in other EPUB files. See the following screenshot (width was set in one file; we call getItem from JavaScript in another file):

[Screenshot: getitem]

Here, we actually retrieve the whole storage using a loop (every item set in different files before running the script can be accessed)

[Screenshot, 2016-09-11]
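For readers without the screenshots, the loop described above could look roughly like the sketch below. It is only an illustration: `MemoryStorage` is a hypothetical in-memory stand-in so the snippet runs outside a Reading System, the stored values are made up, and in a real content document the object passed to `dumpStorage` would be `window.localStorage`.

```javascript
// Hypothetical in-memory stand-in for the Web Storage interface, used only so
// this sketch runs outside a browser. It mirrors the members a script would
// use on window.localStorage: length, key(), getItem(), setItem().
class MemoryStorage {
  constructor() {
    this.items = new Map();
  }
  get length() {
    return this.items.size;
  }
  key(index) {
    const keys = Array.from(this.items.keys());
    return index < keys.length ? keys[index] : null;
  }
  getItem(name) {
    return this.items.has(name) ? this.items.get(name) : null;
  }
  setItem(name, value) {
    this.items.set(String(name), String(value));
  }
}

// Reads every key/value pair the storage area exposes, as a script in any
// document sharing that storage area could.
function dumpStorage(storage) {
  const entries = {};
  for (let i = 0; i < storage.length; i++) {
    const name = storage.key(i);
    entries[name] = storage.getItem(name);
  }
  return entries;
}

// One storage area shared by the whole Reading System (the problematic case).
const sharedStorage = new MemoryStorage();
sharedStorage.setItem("width", "600");           // set by a script in one EPUB (made-up value)
sharedStorage.setItem("quiz-progress", "3/10");  // set by a script in another EPUB (made-up value)

// A script in a third EPUB can read both entries.
const visibleToOtherBook = dumpStorage(sharedStorage);
console.log(visibleToOtherBook);
```

If the storage area were scoped per EPUB instead, the same loop would only ever see the entries written by that EPUB.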

I must admit I would be much more comfortable if origin = each EPUB file and not the whole RS.

If someone stored sensitive data in localStorage at some point, you could theoretically access it from another file, and that would be valid per spec.

mattgarrish commented 8 years ago

There is guidance to this effect under the security considerations section of the Content Documents specification:

Reading Systems need to behave as if a unique domain were allocated to each Content Document, as browser-based security relies heavily on document URLs and domains. Adopting this approach will isolate documents from each other and from other Internet domains, thereby limiting access to external URLs, cookies, DOM storage, etc.

and

If a Reading System allows persistent data to be stored, that data needs to be treated as sensitive. Scripts might save persistent data through cookies and DOM storage, but Reading Systems might block such attempts. Reading Systems that do allow data to be stored have to ensure that it is not made available to other unrelated documents (e.g., ones that could have been spoofed). In particular, checking for a matching document identifier (or similar metadata) is not a valid method to control access to persistent data.

http://www.idpf.org/epub/31/spec/epub-contentdocs.html#sec-scripted-content-security

JayPanoz commented 8 years ago

Well, it looks like guidance might not be enough since there was the same issue reported on the Readium repo in August: https://github.com/readium/readium-js-viewer/issues/559 (iframe not sandboxed).

Now, I would understand if anything more were considered “out of scope”.

danielweck commented 8 years ago

Due to various technical constraints depending on target platforms (e.g. cloud / web browser-based reader vs. native app web view), Readium's core "engine" is not always able to implement totally watertight sandboxing (in Readium's case: HTML resources displayed inside iframes). Content Documents are served from different origins, and not always through HTTP (e.g. custom URL protocols on Chrome extension or Electron or Cordova). In some cases, the reading system can only inject "behaviours" such as media overlays playback, highlights/annotations, etc. into EPUB content when both the app and the content are in the same domain. As for LocalStorage, there's also the inverse problem to "everybody can see my data": in some cases the content URLs' domains vary from one reading session to the next (e.g. random HTTP port number), resulting in a user's recorded data not being persistent (e.g. EPUBs that contain scripts that track some sort of activity progress, or that memorise user preferences).
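To make the random-port point concrete: an origin is the combination of scheme, host, and port, so a different port means a different origin and therefore a different localStorage area. A small sketch using the standard `URL` API follows; the addresses, port numbers, and file paths are invented for illustration and are not taken from Readium.

```javascript
// Returns the serialized origin (scheme, host, port) of a URL string.
function originOf(urlString) {
  return new URL(urlString).origin;
}

// The same chapter of the same EPUB, served in two reading sessions on
// different (hypothetical) random local ports.
const sessionOne = originOf("http://127.0.0.1:49152/OEBPS/chapter1.xhtml");
const sessionTwo = originOf("http://127.0.0.1:50321/OEBPS/chapter1.xhtml");

// Another content document of the same EPUB within the first session.
const chapterTwo = originOf("http://127.0.0.1:49152/OEBPS/chapter2.xhtml");

console.log(sessionOne);                 // "http://127.0.0.1:49152"
console.log(sessionOne === sessionTwo);  // false: data stored in session one is not found again
console.log(sessionOne === chapterTwo);  // true: documents in one session share storage
```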

danielweck commented 8 years ago

cc @rkwright

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-02-11

### 2. Origin

_See github issue [#873](https://github.com/w3c/epub-specs/issues/873)._

**Dave Cramer:** i don't feel like we're going to resolve on this today, but i think of this as one of the big issues around epub that we've been deferring
… epub is a heavy user of html and web tech, the web model and scripting revolves around the concept of origins
… but this is still poorly defined in epub.
… we have a statement in the spec:

> Reading Systems need to behave as if a unique domain were allocated to each Content Document, as browser-based security relies heavily on document URLs and domains. Adopting this approach will isolate documents from each other and from other Internet domains, thereby limiting access to external URLs, cookies, DOM storage, etc.

**Dave Cramer:** first of all, i think the statement is problematic. Domain is not the right concept here
… should probably be about origin
… and the statement prevents different content docs in the same epub from using the same local storage
… probably not what we intended
… this is complicated. RS present content docs in really different ways
… iframes, served via file schemes, webviews in mobile OS
… depending on how RS is constructed, the rules could be different
… can also try to define how things should work in epub, but that puts more burden on the RS
… some of this already happens, i.e. the RS object must be injected into the doc in Readium so that scripts can get at it
… wonder what is possible for epub 3.3
… is it possible to be better than we are now? Can we agree on some principles?
… i.e. localStorage should work for all content documents in that epub, and that it should persist across OS

**Ivan Herman:** do we have a kind of proper overview of what happens in RS today?

> *Dave Cramer:* See [Issue comment by Daniel](https://github.com/w3c/epub-specs/issues/873#issuecomment-246765067).

**Ivan Herman:** we should not reinvent the wheel if possible, just standardize or unify on what the practice is

**Dave Cramer:** comment describes some of what happens in one RS, the type of complexity that we face

**Ivan Herman:** is that in line with other RS?

**Dave Cramer:** less certain on that point

**Brady Duga:** preface, i am not an expert in this area
… but it strikes me as weird to talk about origin without understanding transfer protocol
… origin is where you got it from, but where that is could be any number of places
… without knowing that, how do you express what the origin is?
… we say content document because it is in keeping with individual file access for most browsers
… so for most browsers, the content doc is its own origin
… we would like the zip to be the origin, but most browsers don't even open zip files
… the way we (Google) work is that we pull apart the epub
… and then there are also per browser complications (e.g. in Safari)
… on our android client we've also had issues with origin, e.g. where people make fonts apply to same origin
… we mainly try to just fix issues as we break in different browsers
… e.g. we don't use local storage to save annotations
… there's no easy answer to how to make epub fit into the browser security model

**Dave Cramer:** that was very helpful
… in terms of understanding the scope of the problem
… we have this web content, but it ends up in RS that we don't have much information about
… how should we proceed here?
… e.g. keep old language, and see what happens when we go through security and privacy horizontal review?

**Tzviya Siegman:** i'm aware that the fetch API is supplanting CORS
… it's a living standard and i don't know how widely adopted it is, and people are very excited about it

**Dave Cramer:** i think fetch is how things work now
… or at least how people write script now is oriented towards using fetch rather than older APIs

**Brady Duga:** i thought fetch still relies on CORS?
… like, a shared underlying mechanism?

**Dave Cramer:** yes, i think the concepts of origin and cross-origin are still there
… is this the kind of thing we can ask TAG about?
… this is the classic question of how does our tech fit into the larger web ecosystem
… exactly on topic for TAG

**Ivan Herman:** first of all, yes. I think we should ask. Sooner the better.
… when we ask, we have to be careful to be clear that per charter we are not to make radical changes to epub spec
… also, that although epub does allow the use of scripts, it looks at scripts with suspicion
… that may be where we end up
… it will take them a while to reply
… also, my mental model has always been that if i take an epub file, when i unzip it, i consider it as its own website on its own domain (e.g. localhost)
… what is wrong with that mental model
… if under this model, then origin is this localhost address, and things fall into place
… it becomes part of some sort of web with a temporary website
… where does this go wrong?
… as a sort of mental model

**Dave Cramer:** i'm not sure that mental model holds up when we are faced with the task of implementing an RS
… implementation details are RS specific, there are complications to using this model

**Ivan Herman:** at the moment we are silent on script behaviour aside from announcing scripts are present
… this may simply mean that we need to advise on how this model may impact the use of some scripts
… i would expect that most scripts would not be adversely affected

**Dave Cramer:** I think next step is reaching out to TAG
… this group doesn't have the expertise on this subject
… i'll take the lead on that
… reaching out to other members of the W3C

**Tzviya Siegman:** i can also help with that

**Gregorio Pellegrino:** question for ivan, if you open two epubs, do they run on the same host?
… if they do, then one can access the other
… or do you run different services for different open epubs?

**Ivan Herman:** yes, the second

**Brady Duga:** i think short answer, the mental model we're discussing here is right - epub is website in a box
… when you open an epub, it is "served on its own domain" and another epub is its own other domain
… the difficulty is how to specify that in spec appropriate language that will work across all RS

**Ivan Herman:** i said localhost with a different port for each epub, so logically speaking, they would be independent
… apart from that, i agree
… instead of trying to specify all the details, we might want to say these are the consequences of this model in terms of scripting

**Dave Cramer:** i wonder if each epub should be its own opaque origin

**Brady Duga:** the other problem is that the epub often isn't local, and the RS might actually be making calls to a server, and may not have control over what the URL is, what the port is, etc.
… not always true that you have a local file there

**Dave Cramer:** to get the results we want RS may have to do a lot of trickery to disguise how they actually implement things in order to present a unified experience

**George Kerscher:** do we need a normative/descriptive piece in our spec, and be silent on this otherwise
… this seems to minimize the risk of breaking things

**Ivan Herman:** we are pretty silent already
… on the one hand, we'd like to enable more scripting to allow more interactive books
… so we might not want to be completely silent
… and I understand that there are problematic cases under the model, but we should not try to specify all possible cases. There is no end to this.
… if it were the case that the model doesn't fit what most people are doing, that's another thing

**Matt Garrish:** to date we've left it to the RS to decide whether to support scripting, and to what extent

> *Tzviya Siegman:* +1 to mgarrish

**Dave Cramer:** i don't want to go backwards, saying nothing is problematic
… next step is still consulting with TAG

> *Avneesh Singh:* +1 to consult to TAG now, to prevent problems at last minutes

**Ivan Herman:** will you put together something for TAG? Might be worth sharing with some implementers before you go to TAG

> ***Action #1: ask TAG about scripting (Dave Cramer)***

**Tzviya Siegman:** and I can help with that

**Dave Cramer:** maybe we can open an issue in our repo with the problem statement, to collect insight from this WG before we formally contact TAG

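As a rough illustration of the "localhost with a different port for each epub" model from the minutes, a Reading System that serves content itself could keep a table that gives each opened EPUB its own origin and reuses it on later openings. This is only a sketch under assumptions: `OriginAllocator`, the starting port, and the keys are hypothetical, the keys are meant to be assigned by the Reading System (not the EPUB's self-declared identifier, which the spec text quoted earlier says is not a valid access control), and it does not address the web-hosted case raised by Brady Duga.

```javascript
// Hypothetical allocator: gives each opened EPUB its own localhost origin and
// returns the same origin when the same EPUB is opened again.
class OriginAllocator {
  constructor(firstPort) {
    this.nextPort = firstPort;
    this.portByPublication = new Map();
  }

  // publicationKey is assumed to be chosen by the Reading System itself
  // (for example a library entry id), not read from the EPUB's metadata.
  originFor(publicationKey) {
    if (!this.portByPublication.has(publicationKey)) {
      this.portByPublication.set(publicationKey, this.nextPort);
      this.nextPort += 1;
    }
    return "http://127.0.0.1:" + this.portByPublication.get(publicationKey);
  }
}

const allocator = new OriginAllocator(49152);
const bookA = allocator.originFor("library-entry-1");
const bookB = allocator.originFor("library-entry-2");
const bookAReopened = allocator.originFor("library-entry-1");

// Content documents of one EPUB share its origin; other EPUBs do not.
const chapterOrigin = new URL(bookA + "/OEBPS/chapter1.xhtml").origin;

console.log(bookA !== bookB);          // true: the two EPUBs are isolated from each other
console.log(bookA === bookAReopened);  // true: stored data can be found again
console.log(chapterOrigin === bookA);  // true: all documents in the EPUB share storage
```

For storage to persist across application restarts, the table itself would have to be saved somewhere, which this sketch leaves out.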
danielweck commented 3 years ago

My comment above dates back to a few years ago. I wrote a more up to date analysis for Thorium (iframe, sandboxing, origin, etc.): https://github.com/edrlab/thorium-reader/issues/1375

iherman commented 3 years ago

The issue was discussed in a meeting on 2021-02-18

List of resolutions:

### 3. Origin, cont'd

_See github issue [#873](https://github.com/w3c/epub-specs/issues/873), [#1156](https://github.com/w3c/epub-specs/issues/1156)._

**Wendy Reid:** this is continuing from last week's meeting

**Dave Cramer:** i think most of the discussion is in issue 1153
… we've struggled with how to specify scripting in epub
… we've gotten lots of questions from outside the group about how our security model ties in with the security model of the rest of the web
… we have non-normative text in the spec
… leonard has mentioned the concept of security boundaries, with origin being main boundary in Web world
… my opinion is that the text we currently have is wrong
… boundary should be around the epub and not the content document
… e.g. where content documents within an epub want to share a resource
… also, origin is more the concept we're going for, not domain (which is what we currently reference)
… could we say that each epub should be an opaque/unique origin
… even if not particularly testable, at least it is a stake in the ground
… re. how we are trying to fit into the web security model

**Leonard Rosenthol:** the thing that is most problematic is the difference between actually doing this in a browser with a content hosted on a real domain vs doing this on a device (mobile, desktop, etc.)
… in the device scenario the RS can completely control the origin
… the RS sets up the origin
… so your statements about every epub being its own RS can be done on device, but you can't do that on the web
… so controlling scripts within the context of an origin makes sense in the device scenario, but not in the web scenario
… that's the main issue

**Dave Cramer:** i hear you
… in the RS i'm aware of that are web-based, there is a pretty big disconnect between what you see in the URL bar and what is actually happening inside the RS
… is it reasonable to ask the RS to follow a stricter set of rules than would be required by the generic web security model
… say Hachette decided to put all their books on a domain, I think it's reasonable to say that if an RS were to do that they need to architect it so that all books aren't on the same origin as each other
… i see this as adding requirements to implementation if the implementation happens to be web-based

**Leonard Rosenthol:** the problem is that you can't do that
… we tried to do that with sub-origins, but that hasn't been touched since 2017
… never implemented seriously
… never made it through the webapp sec WG
… in your example, all your epubs are originated off the same thing, they would all share the same local storage etc.
… if those are all your books, that's fine, but once that content goes outside, there's no guarantee that books from different publishers won't be able to see each other

**Dave Cramer:** could you solve that problem with different subdomains for each title?

**Leonard Rosenthol:** yes, but only in a world where all the epubs come from the same publisher
… e.g. an epub from Patreon uploaded to dropbox or onedrive
… that book would have access to all of dropbox

**Dave Cramer:** you're kind of creating a non-conforming RS in this example

**Leonard Rosenthol:** that would make all web-based RS non-conforming

**Wendy Reid:** I think dropbox actually does have an ebook reader....

**Leonard Rosenthol:** they're probably taking advantage of no scripting then

**Wendy Reid:** i think the solution that most RS have come to is just to avoid scripting entirely
… easiest way out of the origin problem

**Leonard Rosenthol:** that doesn't solve other things, e.g. referencing
… trying to reference a font or other resource inside that domain as a relative link
… nothing prevents referencing outside the epub at that point (e.g. ../../)
… and assuming this is served via HTTPS, that gives it a lot more privileges than a non-secure URL

**Brady Duga:** this really seems like a scripting issue
… you have to make sure you don't access things you don't own
… but that isn't an origin issue, that's a rights access issue
… the real problem is storing cookies, and then someone else's book accessing it

**Dave Cramer:** Jiminy has real world examples of this sort of stuff
… e.g. an epub in ibooks that goes and finds info about other books
… is there anything in the spec right now that says that's bad?

**Brady Duga:** maybe? It depends on the RS and the content
… e.g. RS for a school, where every student shares every book, that would be okay
… one book might want to check how far a student got in another book
… i.e. not a bad idea in ALL cases

**Leonard Rosenthol:** if, say, you're building your own software and documents, and you control the entire system there's no reason why you wouldn't want to do it that way

**Dave Cramer:** one thing to do is go back to our current language
… do we still want to say that every content document in the same epub should belong to a different domain?

**Leonard Rosenthol:** can probably change that so that each epub is its own origin, like you said earlier

**Matt Garrish:** the original wording came at a time when we were just starting to open epub to scripting
… we were designing it to be as restrictive as possible
… we've tried to dodge this in the past by limiting where scripting is allowed

**Dave Cramer:** to me it feels like a little bit of progress if we relax the current language to say "per epub" instead of "per content document"
… this leaves us vulnerable to intra-epub security issues
… but that really seems like more of an authoring problem than a problem with the spec

**Brady Duga:** right now the spec is more restrictive, but we're already finding examples IRL where RS are not honoring it
… from testing perspective, it's not clear how this would be implemented

**Matt Garrish:** depends where we are going with this
… right now that section is only informative, so that's fine
… if we change the section to be normative, then yes, that might be an issue

**Dave Cramer:** given all that, should we take the baby step of updating the non-normative guidance that the boundary should be "per epub"?
… consensus on this?

> *Leonard Rosenthol:* +1

> *Matt Garrish:* +1

**Brady Duga:** does that include changing from "domain" to "origin"?

**Dave Cramer:** yes, i think so
… stuff about port randomization scares me a little bit as someone who wants to do something useful with scripting

> **Proposed resolution: Update the informative statement in the core specification about origin from "content document" to "EPUB", and "domain" to "origin"** *(Wendy Reid)*

> *Matt Garrish:* +1

> *Matthew Chan:* +1

> *Leonard Rosenthol:* +1

> *Wendy Reid:* +1

> *Brady Duga:* +1

> *Toshiaki Koike:* +1

> *Ben Schroeter:* +1

> ***Resolution #4: Update the informative statement in the core specification about origin from "content document" to "EPUB", and "domain" to "origin"***

**Wendy Reid:** that's everything that was on the agenda tonight

**Dave Cramer:** i think i do have an action item to talk to TAG about the general ideas around epub security

**Wendy Reid:** there is most likely going to be a special session at the business group next week about WCAG3
… silver is going to be presenting to business group about WCAG3
… extending the invitation here
… i will send out meeting details on the mailing list
… WCAG3 calls out epub as a standard several times
… probably worth providing our feedback
… meeting date is Tues 23rd, noon Boston time
… AOB?
… no? Thank you everyone, and thank you leonardr!
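Leonard Rosenthol's point about relative references (e.g. ../../) can be illustrated with standard URL resolution: when several EPUBs are hosted under one origin, a reference that climbs out of one EPUB's folder is still same-origin. The sketch below is only an illustration; the host `reader.example`, the folder layout, and the helper `classifyReference` are all hypothetical.

```javascript
// Resolves a reference found in a content document and reports whether the
// result is same-origin and whether it still lies under the EPUB's root URL.
// publicationRootUrl is assumed to end with "/".
function classifyReference(reference, documentUrl, publicationRootUrl) {
  const resolved = new URL(reference, documentUrl);
  const root = new URL(publicationRootUrl);
  const sameOrigin = resolved.origin === root.origin;
  const insidePublication =
    sameOrigin && resolved.pathname.startsWith(root.pathname);
  return { url: resolved.href, sameOrigin, insidePublication };
}

// Hypothetical layout: every title lives in its own folder on one shared host.
const root = "https://reader.example/books/title-1/";
const chapter = "https://reader.example/books/title-1/OEBPS/chapter1.xhtml";

const fontRef = classifyReference("../fonts/body.woff", chapter, root);
const escapingRef = classifyReference("../../title-2/OEBPS/notes.xhtml", chapter, root);

console.log(fontRef);      // stays inside title-1
console.log(escapingRef);  // same origin, but points into title-2
```

Under this layout the browser's same-origin policy treats both references alike, which is why the minutes discuss drawing the origin boundary around each EPUB.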