w3c / epub-specs

Shared workspace for EPUB 3 specifications.

Information Exposure and Fingerprintability #1872

Closed wareid closed 2 years ago

wareid commented 2 years ago

From the PING review:

What data on user device is revealed and what is the risk of fingerprintability?

This spec appears to define a reading system user agent string epubReadingSystem. Is there still a navigator.userAgent or similar? Is this a replacement or a duplication of that feature? There is also ongoing work to limit or deprecate user agent strings on the Web platform -- to make it an explicit opt-in rather than always disclosed in great detail. At the very least, we need to recommend that user agent strings have entropy that is strictly limited as necessary for debugging and compatibility. And it should be noted that epubReadingSystem reveals information about how the reader is reading the book, potentially back to the author/publisher of the book, unless scripting is more strictly limited.

As every EPUB is considered a separate origin, the threat model here is: can an author/publisher of multiple ebooks learn from these configuration characteristics that the same user is reading both of them? And on information disclosure (perhaps because the ebook publisher already knows the exact customer who purchased that particular copy of that book), does the publisher learn something about the customer's devices or software choices?

iherman commented 2 years ago

The issue was discussed in a meeting on 2021-10-26

View the transcript

#### 1.1. epub and browsable web

_See github issue [#1871](https://github.com/w3c/epub-specs/issues/1871), [#1872](https://github.com/w3c/epub-specs/issues/1872)._

**Wendy Reid:** right now there is not really a relationship between epub and browsable web.
… you can't link from web into a specific part of an epub.
… we are working on CFI but it has never gone into practice.
… you usually cannot get to an epub from the open web without an intermediary.

**Dave Cramer:** I thought npdoty's comment was more about what happens when you go from epub into the open web via link.

**Brady Duga:** i think it goes beyond that.
… also the ability to put scripts into epub, e.g. that tracks progress of reader thru the book that reports this back to the publisher.
… that would bypass the privacy policy of the RS.
… and it would not be clear to the user that this is happening.
… we try to get around that by making interactive content a special separate thing, but there is worry that this can be worked around via script.
… other case is external resources, that when loaded could flag for publisher that progress has been made.
… at Google we do things to make this safer, but what we do isn't part of the spec.
… also not sure how we would spec this.

**Nick Doty:** there are advantages to privacy and interop if you say that "content is proxied, scripts are disabled" etc.

**Samuel Weiler:** npdoty pointed to interop reasons why you would put that into spec, but there are other advantages to documenting those things as well.

**Wendy Reid:** this is not something we've done before, but telling authors and RS what could possibly be done could give ideas for what to do (or NOT to do).

> *Nick Doty:* yeah, I think the starting point is to document the threat model and how it applies to ebooks.
> Currently only privacy of the content author is considered, but useful to consider the privacy of the reader, and privacy from whom.

> *Charles LaPierre:* From George Kerscher: Make sure the Youth case of a teacher or parent wanting to track their students' progress through a title as they read is an important use case here. (both George and I are in another meeting, sorry we couldn't attend this call).

**Ivan Herman:** as one of the co-editors of some of these spec docs, it would be helpful to separate what needs to go into the privacy section from those issues that require a change in the normative spec.
… keeping in mind that anything that goes into the normative spec must be testable.

**Samuel Weiler:** agree, and most of those mitigations should be part of the normative text.
… in general i expect those things to be part of the normative spec.

> *Nick Doty:* right, there is a separate reading system conformance specification, yes?

**Ivan Herman:** not sure how that would be done. There are very few normative statements about what RS should do, and how.

**Wendy Reid:** let's continue on with the recommendations, and we'll sort out which is normative vs informative later.
… so information exposure and fingerprintability is next.
… this is the section where the use-case of a teacher or parent wanting to monitor progress of student comes up.
… epub is used by a number of parts of publishing, from general trade to education and academics.
… here the trade use case will be different from education sphere.
… for educational purposes, use cases may be more invasive than what we'd do at Kobo for example.

**Rick Johnson:** you've got domains of understanding that need to be accomplished. Publisher may have good reason to know what is happening with book, without what is happening to individual users.
… institution may want to know what is happening for a book in a course, but not outside of that.

**Nick Doty:** i was a student and students have privacy interests as well.
… i didn't expect my teachers to have proof of what reading I had done.
… there might be a reason that learners want to share info about their reading habits, but they need transparency and control.
… which would be consistent with both privacy and those use cases.

**Rick Johnson:** right, we need to understand the use cases. There are some competency-based cases where the student gets their credential based on how they interact with the content, and that needs to be measured.

**Tzviya Siegman:** i agree with both positions, and the difference between grad school and first grade. Part of assessing a kindergartener's ability to read is assessing reading pace.
… a tool might be able to do this better than a person, in a less biased way.
… trading off against the need for privacy.

**Aram Zucher-Scharf:** it seems likely that there will be bias in such tools, but more to the point, we've seen student-led movements against this sort of monitoring.

> *Nick Doty:* I'd be happy to help make connections to colleagues working on student privacy. I don't think purchase/ownership is the most important distinction.

**Aram Zucher-Scharf:** they're interested in not being tracked, whether the reading materials are being provided by the institution or not.
… and recognizing that there may be edge cases here.
… but students have actively resisted, especially in the college setting.

**Deborah Kaplan:** i think it's a mistake to get too bogged down in particular use cases.
… there are obviously clear use cases (not collegiate) where tracking is useful, and a user could conceivably consent.
… so #1, should that be part of epub?
… because a RS can always do whatever it wants outside of the framework of the spec.

> *Aram Zucher-Scharf:* See [A good summary of resistance to educational surveillance](https://theconversation.com/online-exam-monitoring-can-invade-privacy-and-erode-trust-at-universities-149335).

**Deborah Kaplan:** can we require that if you do tracking it is disclosed, and can we make it so that it must be blockable.

**Wendy Reid:** I think it's worth looking to make sure that we're not encouraging this, but a lot of this falls onto the RS.

> *Nick Doty:* I'm not aware of having consented to surveillance of my reading habits when I've purchased an ebook, for what it's worth :).

iherman commented 2 years ago

The issue was discussed in a meeting on 2021-10-28

View the transcript

#### 4.2. Information Exposure and Fingerprintability (issue epub-specs#1872)

_See github issue [epub-specs#1872](https://github.com/w3c/epub-specs/issues/1872)._

**Brady Duga:** this section would be important for RS developers too, like what security models to follow under different circumstances.

**Dave Cramer:** other things they brought up - user agent strings, trying to figure out what RS was in use.
… are people giving accurate user agent strings?

> *Wendy Reid:* See [IEEE EPUB Security review](https://www.computer.org/csdl/proceedings-article/sp/2021/893400a247/1mbmHAQitna).

**Wendy Reid:** link to security review of EPUBs, behind an IEEE paywall.

**Dave Cramer:** we need to look into user agent strings and how much information is in them for PING.
… that will be homework.

iherman commented 2 years ago

The issue was discussed in a meeting on 2022-04-08

List of resolutions:

View the transcript

### 1. Close Privacy & Security Issues.

**Dave Cramer:** the TAG has reappeared, making a couple of comments. I am making a PR to mention that when using web APIs which have the most dramatic privacy and security implications (geolocation, push notifications), you should get user consent.

_See github issue [epub-specs#1959](https://github.com/w3c/epub-specs/issues/1959)._

**Dave Cramer:** we have several issues where there was never much discussion in the issue (#1959 for example).
… I think the PR i mentioned earlier would serve to close this issue.
… agree/disagree?

**Ivan Herman:** we had a lot of discussion with PING, good discussions, after which we made extensive additions to answer the issues they raised.
… and we contacted them several times to get their acknowledgement. So at this point we consider these issues closed.
… they have the right to reopen issues if they like.
… Amy from TAG has closed the issue of epub review on the TAG repo, so that is an indication of how they feel.

**Gregorio Pellegrino:** so is this passed? it is okay?

_See github issue [epub-specs#1872](https://github.com/w3c/epub-specs/issues/1872)._

**Ivan Herman:** yes, it is okay.

**Dave Cramer:** risk of exposure and fingerprintability.
… this was raised before we clarified the threat model, can we close this now?

_See github issue [epub-specs#1873](https://github.com/w3c/epub-specs/issues/1873)._

**Dave Cramer:** obfuscation, which we've discussed extensively, followed by updates to the spec docs.

_See github issue [epub-specs#1875](https://github.com/w3c/epub-specs/issues/1875)._

_See github issue [epub-specs#1876](https://github.com/w3c/epub-specs/issues/1876)._

**Dave Cramer:** interactivity, which we've addressed as best we can given that it's ambiguous.
… self-contained packages, this is a case where it's appropriate to close because epub is clear that it is largely self-contained, subject to exceptions enumerated in the spec. Not dramatically impacting privacy.

_See github issue [epub-specs#1957](https://github.com/w3c/epub-specs/issues/1957)._

**Dave Cramer:** we enumerated the threat model, which deals with #1957.

_See github issue [epub-specs#1958](https://github.com/w3c/epub-specs/issues/1958)._

**Dave Cramer:** permission prompts, we're dealing with this, strengthened text.

_See github issue [epub-specs#1959](https://github.com/w3c/epub-specs/issues/1959)._

> **Proposed resolution: Close remaining privacy and security issues.** *(Wendy Reid)*

**Dave Cramer:** broad user expectations issues, which is covered by the other changes we've made.

> *Ivan Herman:* +1.
> *Matthew Chan:* +1.
> *Shinya Takami (高見真也):* +1.
> *Bill Kasdorf:* +1.
> *Dave Cramer:* +7.
> *Wendy Reid:* +1.
> *Matt Garrish:* +1.
> *Murata Makoto:* +1.
> *Dan Lazin:* +1.
> *Charles LaPierre:* +1.
> *Ben Schroeter:* +1.
> *Masakazu Kitahara:* +1.

> ***Resolution #1: Close remaining privacy and security issues.***

> *Ivan Herman:* clap, clap.

**Dave Cramer:** I think the spec is now much more informative/clear about some of these issues, so thanks everyone.

> *GeorgeK:* +1.

npdoty commented 2 years ago

I don't believe this issue is resolved yet. Providing an additional User-Agent string adds substantially to fingerprintability at a time that we are trying to reduce User-Agent entropy. At a minimum, we should note the fingerprinting risk in the rs spec. Normatively, we should be precise about its severity, recommendations to minimize unnecessary entropy, and clarify whether it's necessary in addition to the existing User-Agent string. (The core spec notes the risk and suggests non-normatively to content authors not to use it for tracking purposes.)

danielweck commented 2 years ago

Hello Nick,

> recommendations to minimize unnecessary entropy

Unless I am mistaken, in security / cryptography parlance "low entropy" is synonymous with increased predictability (order), while conversely "high entropy" indicates randomness. Isn't the latter a desirable quality with respect to minimizing fingerprintability / trackability?

npdoty commented 2 years ago

"Entropy" was a bit of technical shorthand here. Researchers in this area refer to entropy as the level of variability of the characteristics about a user or device. To the extent that there is high variability, the characteristics will represent a more unique and stably identifiable fingerprint, which has privacy implications in enabling tracking without transparency or control.

This section of the Mitigating Browser Fingerprinting draft describes entropy and other characteristics of severity of fingerprintability: https://www.w3.org/TR/fingerprinting-guidance/#identifying-fingerprinting-surface-and-evaluating-severity
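
As a rough illustration of "entropy" in this sense, the sketch below computes the Shannon entropy (in bits) of observed values across a population of readers. The two populations are made-up data and the names `entropyBits`, `userAgent` and `rsName` are purely illustrative. It is only meant to show that a second attribute adds identifying information when it varies independently of the first, and adds none when it merely repeats it.

```javascript
// Shannon entropy, in bits, of a list of observed values.
function entropyBits(observations) {
  const counts = new Map();
  for (const value of observations) {
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  let bits = 0;
  for (const count of counts.values()) {
    const p = count / observations.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Entropy of the combined (userAgent, rsName) fingerprint.
function jointEntropyBits(readers) {
  return entropyBits(readers.map((r) => `${r.userAgent}|${r.rsName}`));
}

// Hypothetical population 1: the reading system name is already implied
// by the user agent string, so it adds nothing.
const duplicated = [
  { userAgent: "UA-A", rsName: "ReaderOne" },
  { userAgent: "UA-A", rsName: "ReaderOne" },
  { userAgent: "UA-B", rsName: "ReaderTwo" },
  { userAgent: "UA-B", rsName: "ReaderTwo" },
];

// Hypothetical population 2: several reading systems share one webview,
// so the name distinguishes users that the user agent string does not.
const sharedWebview = [
  { userAgent: "UA-A", rsName: "ReaderOne" },
  { userAgent: "UA-A", rsName: "ReaderTwo" },
  { userAgent: "UA-A", rsName: "ReaderOne" },
  { userAgent: "UA-A", rsName: "ReaderTwo" },
];

const duplicatedUaBits = entropyBits(duplicated.map((r) => r.userAgent));
const duplicatedJointBits = jointEntropyBits(duplicated);
const sharedUaBits = entropyBits(sharedWebview.map((r) => r.userAgent));
const sharedJointBits = jointEntropyBits(sharedWebview);
```

In the first population the joint entropy equals the entropy of the user agent alone. In the second it is higher, which is the case the fingerprinting concern is about.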

danielweck commented 2 years ago

Thank you for the clarification, Nick.

mattgarrish commented 2 years ago

I'm curious if publishers have found the epubReadingSystem object at all useful?

We've talked about deprecating it in the past.

If it's underimplemented in reading systems and underused by publishers, is there any great loss if we move on from it?

iherman commented 2 years ago

> If it's underimplemented in reading systems and underused by publishers, is there any great loss if we move on from it?

Per my first (but not exhaustive) test results, it is implemented in the sense of the CR requirement. Both Apple Books and Thorium implement it afaik.

I cannot answer the underused aspect. Isn't there a more general question about it in browsers, though? What the epubReadingSystem object adds is a minor addition to what browsers already reveal...

mattgarrish commented 2 years ago

> it is implemented in the sense of the CR requirement

Sure, I'm speculating on whether it's realistic to ever see it widely implemented when it's been required to support it for almost a decade already.

We could make it an optional support feature, too, which would at least make it more reasonable to warn about the potential privacy issues that come with it. As it is, requiring support while warning about its security implications sounds contradictory.

iherman commented 2 years ago

> Sure, I'm speculating on whether it's realistic to ever see it widely implemented when it's been required to support it for almost a decade already.
>
> We could make it an optional support feature, too, which would at least make it more reasonable to warn about the potential privacy issues that come with it. As it is, requiring support while warning about its security implications sounds contradictory.

I have no information on whether it is implemented in general or not. Note that it is already optional, in the sense that scripting support is optional in the first place (and the question on whether it is implemented means whether it is present if a RS allows for scripting).

Allowing scripting opens up the flood gates for many potential issues, including fingerprinting through the facilities of the relevant WebView, and epubReadingSystem might just be a minor additional point in the overall picture. My feeling is that having an optional feature "within" an optional feature is slightly over the top...

(But this is not an issue I would lie down in the road for...)

mattgarrish commented 2 years ago

> Note that it is already optional, in the sense that scripting support is optional in the first place

That's not really making it optional, though, since it's only relevant if scripting is supported. If you support scripting, you must support the object.

The question in making any change is whether we tell reading systems to abandon it, which is what deprecating would do since no rendering ever depends on this, or whether we leave it as a feature of the specification but don't require its implementation anymore.

It'd be interesting to do a survey of publishers and see if any use it. That would maybe help shed some light on whether it's a dusty corner of the spec or not.

iherman commented 2 years ago

> > Note that it is already optional, in the sense that scripting support is optional in the first place
>
> That's not really making it optional, though, since it's only relevant if scripting is supported. If you support scripting, you must support the object.

Yes. What we should find out (that is why testing is done...) is how frequent it is to have an RS that supports JavaScript for authors but does not support epubReadingSystem. We may find out that this number is actually very low, i.e., RSs that support JavaScript already support the object. If so, I do not think we should change the spec...

npdoty commented 2 years ago

> Per my first (but not exhaustive) test results, it is implemented in the sense of the CR requirement. Both Apple Books and Thorium implement it afaik.

Does the epubReadingSystem value in those cases duplicate or add to what's in the navigator.userAgent?

There might be less entropy added if it's largely the same entropy as what's in the existing UA string. But in that case, it's not clear what the use is for content authors.

iherman commented 2 years ago

> > Per my first (but not exhaustive) test results, it is implemented in the sense of the CR requirement. Both Apple Books and Thorium implement it afaik.
>
> Does the epubReadingSystem value in those cases duplicate or add to what's in the navigator.userAgent?
>
> There might be less entropy added if it's largely the same entropy as what's in the existing UA string. But in that case, it's not clear what the use is for content authors.

The specification explicitly says:

> This specification extends the Navigator object [html] as follows.

I.e., if I understand your question properly, it adds information.

iherman commented 2 years ago

@npdoty, on your separate comment: an RS is (usually) built on top of a webview system, e.g., a chromium core, and it does not implement a full browser. For those, I would expect that the navigator.userAgent is something like chromium (I am not an expert, so I may very well be wrong). Furthermore, the same webview might be shared among different reading systems. Hence the additional information.

iherman commented 2 years ago

@npdoty do you still believe we should do something about this issue? I am not sure where we are...

mattgarrish commented 2 years ago

Is the question only whether we should enable targeting reading systems via their name/version given that it allows more specific profiling?

If the goal of the object is to allow content to adapt to the capabilities of a reading system, we should only need the feature detection part of the epubReadingSystem object.
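
A minimal sketch of what feature detection without name/version sniffing could look like in a content document. It assumes the `hasFeature()` method and the feature names (`touch-events`, `mouse-events`, `keyboard-events`) as defined for the `epubReadingSystem` object in the EPUB spec. The helper names `readingSystemSupports` and `chooseInteractionMode` are hypothetical, and `nav` stands in for the global `navigator` so the logic can be exercised outside a reading system.

```javascript
// Returns true only when the reading system positively reports support.
// `nav` is the global `navigator` in a real content document.
function readingSystemSupports(nav, feature) {
  const rs = nav ? nav.epubReadingSystem : undefined;
  if (!rs || typeof rs.hasFeature !== "function") {
    return false; // object not exposed by this reading system
  }
  // hasFeature() may return undefined for unrecognized features;
  // anything other than an explicit `true` is treated as unsupported.
  return rs.hasFeature(feature) === true;
}

// Example: adapt the content without ever reading `name` or `version`.
function chooseInteractionMode(nav) {
  if (readingSystemSupports(nav, "touch-events")) return "touch";
  if (readingSystemSupports(nav, "mouse-events")) return "mouse";
  if (readingSystemSupports(nav, "keyboard-events")) return "keyboard";
  return "static";
}
```

In a content document this would be called as `chooseInteractionMode(navigator)`. Because the result depends only on capabilities, it keeps working if a reading system stops exposing its name and version.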

mattgarrish commented 2 years ago

FYI, just running a quick test on Thorium I get:

navigator.userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) EDRLab.ThoriumReader/2.0.0 Chrome/102.0.5005.61 Electron/19.0.1 Safari/537.36

navigator.epubReadingSystem.name: Thorium

navigator.epubReadingSystem.version: 2.0.0

In this case, the name/version aren't telling you much more than you could parse out of the userAgent string. (I don't know if that holds for other implementations.)
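
The comparison above can be sketched in code. This is an illustrative check, not a robust user agent parser: it splits the string into `product/version` tokens with a simple regular expression and asks whether any token already carries the reading system's name and version. The function names are hypothetical. The first string is the Thorium value quoted above, and the second is a made-up generic webview string for contrast.

```javascript
// Extract "product/version" tokens from a user agent string.
// Parenthesised comments such as "(Windows NT 10.0; Win64; x64)" are skipped
// because they contain no slash-separated tokens here.
function parseProductTokens(userAgent) {
  const tokens = [];
  for (const match of userAgent.matchAll(/([^\s\/()]+)\/([^\s()]+)/g)) {
    tokens.push({ product: match[1], version: match[2] });
  }
  return tokens;
}

// True if some token's product contains `name` (case-insensitively)
// and has exactly the given `version`.
function userAgentAlreadyReveals(userAgent, name, version) {
  const needle = name.toLowerCase();
  return parseProductTokens(userAgent).some(
    (t) => t.product.toLowerCase().includes(needle) && t.version === version
  );
}

const thoriumUA =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) " +
  "EDRLab.ThoriumReader/2.0.0 Chrome/102.0.5005.61 Electron/19.0.1 Safari/537.36";

// Made-up string for a reading system that leaves the webview default untouched.
const genericUA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
  "Chrome/102.0.0.0 Safari/537.36";

const redundantForThorium = userAgentAlreadyReveals(thoriumUA, "Thorium", "2.0.0");
const redundantForGeneric = userAgentAlreadyReveals(genericUA, "Thorium", "2.0.0");
```

For the Thorium string the name and version are already recoverable, so the object's properties are redundant there. For the generic string they would be new information.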

iherman commented 2 years ago

Is the content of the userAgent value standardized? Is it expected that the Thorium and its version would appear in it?

@bduga @danielweck

On the other hand... what it also tells me is that the EPUB extension does not add any more fingerprintability surface to what is already there, i.e., this issue may be moot...

bduga commented 2 years ago

I think there is an RFC for UA strings. Also see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

But they are usually spoofable - that is, users can often change their UA strings. There is no requirement Thorium put that information in the UA string, so exposing the name and version COULD be additional information, but isn't necessarily. This is only an issue where scripting is allowed and scripted content is allowed access to the network, which is already pretty bad. That is, given such a RS, fingerprintability is the least of my concerns. That said, these values are terrible and I hope no one uses them, since they suffer from exactly the same issues that have made UA sniffing the nightmare it is. I would be happy to see it go away. What would break I have no idea.

iherman commented 2 years ago

The issue was discussed in a meeting on 2022-07-21

List of resolutions:

View the transcript

### 1. Information exposure and fingerprintability.

_See github issue [epub-specs#1872](https://github.com/w3c/epub-specs/issues/1872)._

**Brady Duga:** adding the name and version number to the `epubReadingSystem` object in js exposes additional information, which allows authors of epubs to gather more information about the user than the user may expect.
… there's been discussion about removing the object entirely, or specific attributes of the object.
… also, whether the `epubReadingSystem` object really exposes more information than `navigator.userAgent`.
… i feel it is dangerous for RS to allow access to network + scripting anyway.
… does anyone use this information if you are publisher of scripted epub content?

> *Murata Makoto:* Hard to believe so.

**Matt Garrish:** Dave said that he had figured out that an epub was being read on an Apple device, but that was the only practical example of it being used.

**Brady Duga:** happy to make it disappear, but we have a backwards compatibility issue maybe?

**Shinya Takami (高見真也):** I will address our JP members in Japanese now.
… [in Japanese].
… okay JP RS vendor, Voyager, says that their plan is to omit the `epubReadingSystem` object, but that this has not been done yet.

**Brady Duga:** do they expose the name and version number attributes?

**Shinya Takami (高見真也):** this is controlled by the browser, they say?

**Brady Duga:** no, we're referring specifically to the `epubReadingSystem` object, not the browser `userAgent` string.
… but all of this is "should" territory, because it is not clear that everyone who supports scripting supports the `epubReadingSystem` object (even though the spec says they should).

> *Brady Duga:* See `navigator.epubReadingSystem.name: Thorium`.

**Shinya Takami (高見真也):** so the `epubReadingSystem` object is provided by the RS, not by the browser?

**Brady Duga:** yes, specifically the `epubReadingSystem` object added by the RS.

**Shinya Takami (高見真也):** [in Japanese...].
… maybe Thorium uses this object, but Voyager does not support it.

> *Murata Makoto:* Great!

**Brady Duga:** okay, good news then? Because there is no objection to deprecating it then.

**Matt Garrish:** epubcheck doesn't detect this object right now, so whether or not we deprecate it probably won't majorly affect the landscape.

**Brady Duga:** and we're talking here specifically about deprecating the name and version attributes, yes?

**Matt Garrish:** yes.

**Brady Duga:** are we concerned that newer RS that omit these attributes could be fingerprinted by their absence?
… i'd be happy to go ahead with deprecation still.

> **Proposed resolution: Deprecate name and version properties on the `epubReadingSystem` object.** *(Brady Duga)*

> *Brady Duga:* +1.
> *Matt Garrish:* +1.
> *Matthew Chan:* +1.
> *Shinya Takami (高見真也):* +1.
> *Toshiaki Koike:* +1.
> *Masakazu Kitahara:* +1.

> ***Resolution #1: Deprecate name and version properties on the `epubReadingSystem` object.***

danielweck commented 2 years ago

> Is the content of the userAgent value standardized? Is it expected that the Thorium and its version would appear in it?

In Thorium's case, the navigator.userAgent string is populated automatically by Electron (which is the Chromium-based cross-platform application framework used by Thorium). Thorium does not use the navigator.userAgent setter ( https://www.electronjs.org/docs/latest/api/web-contents#contentssetuseragentuseragent ).

As for epubReadingSystem.name|version, Thorium will of course continue to inject the now-deprecated / legacy properties, for backward compatibility.
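
For illustration only, here is a sketch of how a reading system might build the object it exposes so that the deprecated identity properties are opt-in. This is not Thorium's actual code: `createEpubReadingSystem`, its options, and the `ExampleReader` identity are all hypothetical. The list of recognised feature names is assumed to match the ones in the EPUB reading system specification and should be checked against it.

```javascript
// Feature names assumed to be those recognised by hasFeature() in the spec.
const KNOWN_FEATURES = new Set([
  "dom-manipulation",
  "layout-changes",
  "touch-events",
  "mouse-events",
  "keyboard-events",
  "spine-scripting",
]);

// Builds the object a reading system would attach as navigator.epubReadingSystem.
// `legacyIdentity` is optional: pass it only to keep the deprecated
// name/version properties for backward compatibility.
function createEpubReadingSystem({ supportedFeatures = [], legacyIdentity } = {}) {
  const supported = new Set(supportedFeatures);
  const readingSystem = {
    hasFeature(feature) {
      if (!KNOWN_FEATURES.has(feature)) return undefined; // unrecognised feature
      return supported.has(feature);
    },
  };
  if (legacyIdentity) {
    readingSystem.name = legacyIdentity.name;
    readingSystem.version = legacyIdentity.version;
  }
  return Object.freeze(readingSystem);
}

// Capability-only object: exposes no identifying strings.
const minimal = createEpubReadingSystem({ supportedFeatures: ["dom-manipulation"] });

// Backward-compatible object: still carries the deprecated properties.
const legacy = createEpubReadingSystem({
  supportedFeatures: ["dom-manipulation"],
  legacyIdentity: { name: "ExampleReader", version: "1.0.0" },
});
```

The design choice is that feature detection keeps working in both variants, so content that relies only on `hasFeature()` is unaffected by whether the legacy properties are present.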

npdoty commented 2 years ago

I think that resolution makes sense, thanks. Feature detection is generally more useful and more future proof than encouraging more UA string parsing, but having these additional potentially duplicative settings may have discouraged future progress.