w3cping / font-anti-fingerprinting

A system for preventing font fingerprinting

Cache on first use #7

Open annevk opened 4 years ago

annevk commented 4 years ago

I believe it's also safe to cache each font at the point where it's first used, as long as the cache never evicts fonts.

FWIW, this doesn't seem acceptable. It's quite easy to imagine someone belonging to a subgroup of sorts that visits only one site with a specific font and no other sites that use the same font.

cc @bholley

jyasskin commented 4 years ago

@annevk How would an attacker measure whether the user has the font cached without causing the font to be cached? If the attack causes the font to be cached, the attacker can no longer distinguish between the user visiting the sensitive site vs the user visiting some attacker.

annevk commented 4 years ago

If the font isn't cached the attacker has to find another avenue of attack for that particular user. It's the other scenario that's worrisome.

tabatkins commented 4 years ago

Jeffrey's point is that the attacker's actual distinguishable categories are "has never seen this font" and "either visits a site using this font, or has been attacked in the past by me or another attacker probing for this font".

If fonts are never evicted, the second category grows without bound over time, and becomes useless noise very quickly.
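To make the self-poisoning property concrete, here is a toy model (illustrative only; the font name is hypothetical and this is not real browser behavior) of a never-evicting cache: the first probe learns one bit, but as a side effect it caches the font, so every later probe, by any attacker, sees the font as present.

```javascript
// Toy model of a never-evicting font cache (illustrative only).
const fontCache = new Set();

// An attacker's probe: observe whether the font was cached, which
// as a side effect loads (and permanently caches) the font.
function probe(font) {
  const wasCached = fontCache.has(font);
  fontCache.add(font); // never evicted
  return wasCached;
}

// First probe distinguishes "has seen this font" from "has not"...
console.log(probe("MinorityLangSans")); // false: user had never seen it
// ...but it poisons itself: every subsequent probe, by this attacker
// or any other, now observes the font as cached.
console.log(probe("MinorityLangSans")); // true
```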

bholley commented 4 years ago

The threat model is basically:

The signal obviously degrades quickly over multiple measurement attempts, but one measurement is enough to do damage.

tabatkins commented 4 years ago

Note that "only used on site X" and our proposal's "aggressively/permanently cache fonts that are high-use in a region" are probably mutually exclusive.

Aggressively caching all fonts cross-domain is definitely attackable. We're proposing to expose a much more limited surface.

caraitto commented 4 years ago

We could also consider requiring a font to be used on a number of unique domains before it could be added to the "high-use" font set, mitigating the impact of this attack.
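A minimal sketch of that mitigation (the threshold value and all names are hypothetical, not part of the proposal): a font only enters the "high-use" set once it has been observed on enough distinct domains, so a single site cannot plant a marker font.

```javascript
// Sketch of a unique-domain threshold for the "high-use" font set.
// The threshold value is purely illustrative.
const UNIQUE_DOMAIN_THRESHOLD = 100;

const usage = new Map(); // font name -> Set of domains using it
const highUseFonts = new Set();

function recordFontUse(font, domain) {
  if (!usage.has(font)) usage.set(font, new Set());
  usage.get(font).add(domain);
  // Promote only after enough *distinct* domains have used the font,
  // so repeated use on one site never qualifies it.
  if (usage.get(font).size >= UNIQUE_DOMAIN_THRESHOLD) {
    highUseFonts.add(font);
  }
}
```

Under this sketch, a font used only on site X never reaches the high-use set, so its cache state can't mark visitors to X.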

bholley commented 4 years ago

A high-use threshold is tricky to reason about, because it's difficult to predict all the ways that subgroups might be identified and put at risk. For example, imagine some font Q that is only used by an ethnic minority. There might be lots of sites serving that group, and thus Q might appear high-use in a global sense, but the presence of Q in cache could still be used to identify members of that group.

There are also intersectional concerns. There could be common fonts Q and R which are used in combination across a sufficiently-large number of sites, but only site X uses Q without R. Detecting that Q is present and R is not can thus be used to determine if the user has visited X.
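In its simplest form (Q, R, and site X are hypothetical), the intersectional test is just a boolean combination of cache states, which is why per-font thresholds don't rule it out:

```javascript
// Toy form of the intersectional attack: Q and R each individually
// clear any "high-use" bar, but only hypothetical site X uses Q
// without R, so the combination still identifies visitors to X.
function likelyVisitedX(cacheHasQ, cacheHasR) {
  return cacheHasQ && !cacheHasR;
}
```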

I recognize that these scenarios seem unlikely individually, but they demonstrate that the attack space is not clear cut and is quite difficult to reason about. For that reason we think it's best to avoid any dependence between the performance characteristics of site X and previous visits to site Y.

pes10k commented 4 years ago

In general, I'm very nervous about this approach. There are all sorts of cases where users intentionally clear caches (private browsing is the most obvious case). Would sites that work in "normal" browsing mode now break in 'private browsing' mode, if whether they work is tied to cache state?

pes10k commented 4 years ago

To second @bholley's point above, couldn't the attacker probe the font cache by advertising the font-family / etc. from a URL they control, serving non-font garbage back from that URL, and watching to see whether the user requests the URL? This would probe, but never fill, the cache.

Additionally, I'll just note that this "privacy by user-distinguishable state" is exactly the same vector that Google pointed out allowed for tracking in ITP.

tabatkins commented 4 years ago

Would sites that work in "normal" browsing mode now break in 'private browsing' mode, if whether they work is tied to cache state?

No, the intention here (apparently not communicated well enough in the explainer currently) is that it's a browser-wide cache. It's essentially "the browser includes a bunch of blessed fonts", but lazily populated, to avoid bloating everyone's browser install with gigabytes of font data they might not need.

couldn't the attacker probe the font cache by advertising the font-family / etc. from a URL they control, serving non-font garbage back from that URL, and watching to see whether the user requests the URL? This would probe, but never fill, the cache.

No, we wouldn't be trusting the font data served by the page. Once a font has been flagged as high-use and worthy of being part of this cache, it'll be in a centrally-managed caching server controlled by the browser vendor. So probing will reliably fill the cache and poison all future probes.
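The contrast between the two designs can be sketched with a toy simulation (illustrative only, not actual browser behavior): if the page supplies the font bytes, garbage bytes let an attacker probe without filling the cache; if a vendor-run server supplies them, every probe fills the cache and poisons later probes.

```javascript
// Toy contrast between a page-served cache (probe-without-fill is
// possible) and a vendor-served cache (every probe fills it).
function makeCache({ vendorServed }) {
  const cached = new Set();
  return function probe(font, bytesAreValid) {
    const wasCached = cached.has(font);
    // With vendor-served fonts, the page's (possibly garbage) bytes
    // are ignored: the real font is fetched and cached regardless.
    if (vendorServed || bytesAreValid) cached.add(font);
    return wasCached;
  };
}

const pageServed = makeCache({ vendorServed: false });
pageServed("Q", false); // probe with garbage bytes...
console.log(pageServed("Q", false)); // false: cache was never filled

const vendorServed = makeCache({ vendorServed: true });
vendorServed("Q", false); // same probe against the central design...
console.log(vendorServed("Q", false)); // true: the probe poisoned itself
```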

Additionally, i'll just note that this "privacy by user-distingiush-able state" is exactly the same vector that google pointed out allowed for tracking in ITP.

I'm not super familiar with the details of that vector. Was it similarly something where the first probe can learn something, but it indelibly poisons all future probes?

pes10k commented 4 years ago

it's a browser-wide cache

Sorry, can you explain further here? If it's a browser cache, but shared across profiles / storage clears / etc, this is going to break all sorts of privacy guarantees users expect.

It seems that, under this proposal, either you introduce information leaks across profiles / certain types of sessions / etc. (a shared mega-cache), or you make "clear storage" break sites that worked before (because you're resetting the cache, or some thresholds, etc.).

centrally-managed caching server controlled

Sorry, I did not catch this at all. How would any of this work in cases where Google / Moz / Brave / etc. doesn't own the license to the font? Or where a region is using a font that the vendor doesn't know about? Or where there are different implementations of the same font face / family?

Also, moving this to a central location doesn't solve many of the leaks (I can also read the cache by looking for timing channels, repaints, etc) and would seem to introduce additional privacy risk (from the vendor, instead of the site).

I'm not super familiar with the details of that vector. Was it similarly something where the first probe can learn something, but it indelibly poisons all future probes?

There are several attacks, but setting aside whether I get to probe once or many times (one is still bad, and, especially when combined with threshold approaches like those described above or a privacy-budget approach, can be semi-arbitrarily expanded), the main point is that basing non-identification goals on user-distinct state is a dangerous direction.

tabatkins commented 4 years ago

Sorry, can you explain further here? If it's a browser cache, but shared across profiles / storage clears / etc, this is going to break all sorts of privacy guarantees users expect.

Like I said slightly later in my comment, essentially it's just "browser has a big cache of common fonts", but lazily populated as needed.

Can you elaborate on what sort of privacy issues you see this causing?

Sorry, I did not catch this at all. How would any of this work in cases where Google / Moz / Brave / etc. doesn't own the license to the font?

Licensing is definitely a sticking point here!

Or the region is using a font that the vendor doesn't know about?

I'm not sure what you mean by this. Tracking font usage so we know what fonts are common and useful to be included has been part of this proposal since it was first published.

Or where there are different implementations of the same font face / family?

I assume one of them would be common enough to trip the tracking. I think it would be incredibly unlikely that two distinct fonts for a minority language are both heavily used but have the exact same name. If that ever happens we can worry about it, but until then this seems like a non-issue for the initial proposal.

would seem to introduce additional privacy risk (from the vendor, instead of the site).

Can you elaborate on this? The proposal has always included a mechanism for tracking which fonts are common in a region; are you objecting to that? Or is there something else?

There are several attacks, but setting aside whether i get to probe once or many times (one is still bad, and especially when combined with threshold approaches like described above or a privacy budget approach, can be semi-arbitrarily expanded), but the main point is that basing non-identification goals on user-distinct state is a dangerous direction.

I get the general point, yes. But I asked for details common to the two proposals for a reason: the "visible once, then forever poisoned" behavior is intentionally part of this proposal specifically to limit the possible damage here. Assembling a user-unique fingerprint from probing data seems impossible here, since the fingerprint will be invalid on the user's second visit; they'll instead look exactly like every other user you've probed.

(And note that it's not that *you* get to probe once; it's that anyone gets to probe once, and then it's ruined for them and everyone else. This is distinct from some of Privacy Budget's struggles, where distinct sites can secretly communicate and aggregate their budgets.)

pes10k commented 4 years ago

Can you elaborate on what sort of privacy issues you see this causing?

The harm here is history leakage (across sites, sessions, or profiles) and potentially cross-site trackability / fingerprintability.

Or the region is using a font that the vendor doesn't know about?

I'm not sure what you mean by this. Tracking font usage so we know what fonts are common and useful to be included has been part of this proposal since it was first published.

The narrow concern here is that some sites request a font by name that users are expected to have, but which the vendor doesn't know about (and so can't serve to the client). This is different from knowing the names of the fonts alone; this proposal would seem to require the vendor to know both the name of the font and the bits / implementation of the font.

Or where there are different implementations of the same font face / family?

I assume one of them would be common enough to trip the tracking… but until then this seems like a non-issue for the initial proposal.

I agree that this particular issue seems unlikely, but I mean it to demonstrate the risks / harms of the vendor effectively keeping a global (over all users, or at least over all users per language) map of font names to bits. It requires a level of centralization (effectively a font registry) that seems undesirable for many reasons (the specific concern I stated being just one).

would seem to introduce additional privacy risk (from the vendor, instead of the site).

Can you elaborate on this? The proposal has always contained a proposal for tracking which fonts are common in a region; are you objecting to that? Or is there something else?

My privacy concern re: the vendor is that this proposal allows vendors to learn about users' browsing patterns by watching the fonts users request from the central vendor server.

I get the general point, yes. But I asked for details common to the two proposals for a reason…

Point taken, but my point is that we should not be engineering privacy protections on known-unsound foundations, even if, in this instance, we can't think of a trivial exploit; if the foundation is wrong, it seems very likely that someone more clever than us (or, at least, more clever than me ;) ) will figure out an exploit later.

Better to design on sound principles (e.g. don't base privacy / non-distinguishability promises on top of user-distinct properties).

jyasskin commented 4 years ago

A lot of the discussion between @pes10k and @tabatkins assumes that we go with #8 to identify fonts by name instead of URL. Many of the concerns are already discussed and possibly mitigated there.

The question of what kind of history leak this might introduce is discussed more in #10. @bholley is suggesting we just make this sort of cache a Web Shared Library, with whatever mitigations that winds up having, including possibly precaching instead of caching on first use.

Was there anything else that I missed?

pes10k commented 4 years ago

I'm not sure I understand how the proposal would address the "uncommon font never referenced via webfont" use case without identifying fonts by name rather than URL; the URL definition seems to rule out that use case automatically, no?

I'd strongly prefer removing this issue (font fingerprinting) from any "Web Shared Library" proposal. I'm not familiar with it, and a Google or DuckDuckGo search for "Web Shared Library" or "… Libraries" seems to bottom out in a few talks at BlinkOn 11 and nowhere else. (If this is incorrect and there is something standards-track I'm missing, I would be grateful for links.)

Moreover, there are many issues raised above that don't seem to be addressed in #10:

  1. cross-session leaks (e.g. how to not break sites when I clear cache / storage / profiles)
  2. the centralization concerns that a WSL / central browser font server raises (both for privacy reasons and otherwise)
  3. the general shakiness / bad-idea-ness of basing privacy protections on user-distinct state
  4. what to do when sites need fonts that the central repository doesn't know about