w3c / csswg-drafts

CSS Working Group Editor Drafts
https://drafts.csswg.org/

[css-fonts] incorporate mitigations for font based fingerprinting #4055

Open pes10k opened 5 years ago

pes10k commented 5 years ago

Font-based fingerprinting is a common, privacy-violating pattern in which websites build semi-identifiers from the uncommon fonts a user has installed. This semi-identifier is then combined with other semi-unique identifiers (hardware configuration, user configuration, viewport size, etc.) to build highly identifying values used for tracking users.

Examples

Some browsers provide some defenses against this privacy violation. Safari, for example, only exposes the default system fonts to web content, and will not use other, uncommon fonts, even if they're installed on the OS. Firefox provides a similar option.

The standard should be modified to protect against / not allow font-based fingerprinting by default, instead of relying on non-standardized, vendor-specific mitigations.

Suggested Mitigation

I suggest having the standard follow Safari's approach, and requiring browsers to only treat the default fonts on the platform as system fonts. A simple (though maybe not the best / most elegant) way of doing this would be to modify section 5.2 in "CSS Fonts Module Level 3" to modify the system font fallback procedure to only return the default platform fonts. Those might be specified per platform, or just as this list: http://www.ampsoft.net/webdesign-l/WindowsMacFonts.html

pes10k commented 5 years ago

If the above approach is appealing, I would be happy to submit a PR to the existing level 3 standard, as well as the level 4 proposal.

svgeesus commented 5 years ago

If the above approach is appealing, I would be happy to submit a PR to the existing level 3 standard, as well as the level 4 proposal.

Appreciated, but please restrict that change to just CSS Fonts 4 which is the focus of current implementation. Errata can be gathered for Fonts 3, but there is no intention to back port all of Fonts 4 to Fonts 3. Instead, Fonts 4 is gradually replacing Fonts 3.

pes10k commented 5 years ago

I see. Is there an expected timeline for Fonts 4? If it's a long way off, would it be valuable to push out a 3.1 for security and privacy purposes (i.e. not all of Fonts 4, just the things where the current spec is being leveraged to harm users)?

svgeesus commented 5 years ago

All browsers are implementing both the Variable fonts and the Color Fonts parts of Fonts 4, plus smaller changes (like font-weight being a number in the range 1 to 999 rather than being a set of number-like tokens 100, 200 etc).

So this is being used now.

pes10k commented 5 years ago

Okie dokie, sounds good. Would a strict, enumerated set of font faces that can act as system fonts be preferred, or would a broader phrase like "fonts provided by the platform by default, and not installed by the platform's user" suffice?

AmeliaBR commented 5 years ago

I don't think we want to list the specific fonts in the spec. The general rule should be that the list of fonts shouldn't provide any more information than can be obtained by other means: e.g., by the combination of browser & OS & preferred language. (I don't know about Mac, but Windows has language specific fonts that are included in the OS but not installed by default unless the user actually uses that language in the OS.)

I would also hope that there would be some exclusion/option for supporting a wider set of fonts for trusted sites.

For the PR: Fonts Level 4 already has a section on Preinstalled Fonts vs User-Installed Fonts, which currently says:

User Agents may choose to ignore User-Installed Fonts for the purpose of the Font Matching Algorithm.

So, the request here is to upgrade that "may" into a "should".

This should probably affect local() references in @font-face, as well as font-family matching. Otherwise, the fingerprinting techniques could be changed to compare a font-face defined as src: local(test-name), url(reference.woff);, where the reference file has a characteristic size that will differ from the true font of that name. (Unfortunately, this means that periodically downloading & installing the most popular Google Fonts will no longer save me on data!)
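
To make the local() probe concrete, here is a minimal Python sketch that models it rather than running in a browser. Everything in it is an assumption for illustration: the `Font` record, the font names, the pixel widths, `REFERENCE_WIDTH`, and the `ignore_user_installed` switch standing in for a user agent that skips user-installed fonts. The modelled page declares `src: local(name), url(reference.woff)` and measures some text; if the measured width differs from the known width of the reference file, the local font must have been used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Font:
    name: str
    advance_width: int    # hypothetical width of a fixed test string, in px
    user_installed: bool  # False means the font shipped with the platform


# Hypothetical reference web font with a deliberately unusual width.
REFERENCE_WIDTH = 777


def resolve_face(local_name, installed, ignore_user_installed):
    """Model `src: local(name), url(reference.woff)`.

    Returns the measured width: the local font's width if the font is
    visible to the page, otherwise the width of the downloaded reference.
    """
    font = installed.get(local_name)
    if font is not None and not (ignore_user_installed and font.user_installed):
        return font.advance_width
    return REFERENCE_WIDTH


def probe(local_name, installed, ignore_user_installed):
    """True if the page can tell that `local_name` is installed locally."""
    width = resolve_face(local_name, installed, ignore_user_installed)
    return width != REFERENCE_WIDTH


# Hypothetical machine: one preinstalled font, one user-installed font.
installed = {
    "System Sans": Font("System Sans", 410, user_installed=False),
    "Rare Display": Font("Rare Display", 530, user_installed=True),
}

leaks_without_mitigation = probe("Rare Display", installed, ignore_user_installed=False)
leaks_with_mitigation = probe("Rare Display", installed, ignore_user_installed=True)
```

In this model, turning the switch on makes the user-installed font indistinguishable from a missing one, while preinstalled fonts still resolve locally.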


Next step: work on an API for full access to all installed fonts as a list! (With an explicit permission prompt, of course, which would also allow those fonts to be used for rendering.) This is essential for document-editing web apps to replace their native versions. Some apps still use Flash just to get this data.

pes10k commented 5 years ago

Hi @AmeliaBR

Thanks for the comments. A couple of comments:

So, the request here is to upgrade that "may" into a "should".

I think must (i.e. "User Agents ~may choose to~ must ignore…") would be the right word. Correctly implementing the standard should make impossible the kinds of privacy violations the current version enables. Similarly, standards should strictly protect user privacy, at least until there is some signal (permission, etc) saying the user granted the site greater privileges.

Re: local(): those are all great points! I wouldn't have thought of that, but that all seems terrific! Thanks for catching my goof :)

Re: permissions: I don't have a strong sense about this (other than that permissions discussions often round down to "users don't like permissions, so just grant access by default"; as long as things don't wind up there!). But for the use case you mentioned, maybe a better norm to push for would be a service worker + site-hosted fonts?

dbaron commented 5 years ago

I don't think a must is viable here without a better solution for addressing language support.

Many languages aren't supported in the default fonts installed on a given operating system. In many cases users can then install fonts that support more languages by choosing to install support for those languages. Presumably the requirement being proposed here would allow web use of all of the default fonts for all languages -- which in turn still exposes a good bit of fingerprinting data (which languages the user has installed fonts for) -- but I think there are still significant languages that those defaults don't cover (with significant variation between operating systems). (It also wouldn't surprise me if the fonts installed on Android devices vary based on carrier/market and aren't consistent within a language, though I'd be happy to be wrong.)

So there's a tradeoff here between one of many active fingerprinting vectors and support for significant numbers of the world's languages. Without clear data that fixing just a part of this active fingerprinting vector (still allowing fingerprinting of which languages are supported by fonts on the system) would make a real dent in ability to do active fingerprinting on the web (which is much easier than passive fingerprinting) -- data that would probably require a project to gather a list of fingerprinting vectors available on the web (with entropy for each item) -- I don't think there's a very clear case for degrading the support for many minority languages on the Web.

pes10k commented 5 years ago

1) Fingerprinting doesn't get solved until you start solving it :) Saying "this isn't the worst vector, so let's not fix it" seems like a surefire way to make sure fingerprinting never gets better.

2) Font-based fingerprinting actually is one of the worst FP methods, though! See the Panopticlick paper / project linked above, the paper "Beauty and the Beast: Diverting modern web browsers to build unique browser fingerprints", and many others (happy to provide links if you like). They all find the same thing: fonts are hugely identifying if you have anything but the default configuration (put differently: if you allow non-system fonts to be used, it will be hugely identifying in the cases where it's useful, and not useful in the cases where it's not identifying).

3) You might consider the statement from PING regarding meta-standards for standards (e.g. ways to fix privacy in web standards). From the third section of the recent PING blog post, Privacy Anti-Patterns In Standards, there being bigger problems elsewhere doesn't obviate the need for standards to address the privacy harm they introduce. (note: I wrote it, but it states the position of the IG)

4) I think @AmeliaBR has the exactly right idea: fonts should give no more information away than "browser & OS & preferred language". So no argument against making the system fonts "the non-user installed fonts for the current system language." So not "all fonts for all languages", but something narrower than that. Would that address the concern?

dbaron commented 5 years ago

  1. many people see passive fingerprinting as not solvable given the web's API surface. Refuting that requires gathering the data from the various sources to see what the state of things is (see below), not just hoping. Agreeing to solve it needs to be a wide consensus, not a bunch of ad hoc and inconsistent decisions made in different working groups to different standards.

  2. many of the papers used flash-based font data, which is much more identifying since it's an ordered list of fonts, not just a set. That's why I'm suggesting that the convincing thing is to maintain a common repository of the state of fingerprinting rather than point to a bunch of papers all/most of which are seriously out of date in various ways, and all of which are incomplete.

  3. (sorry, need a (3) here for consistent numbering)

  4. many users use an OS and browser whose UI doesn't match their preferred language
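
The point in item 2 about ordered lists versus sets can be illustrated with a small calculation. This is only an upper bound under the assumption that every ordering of the same fonts is possible: an ordered list of n distinct fonts can carry at most log2(n!) bits beyond the bare set. The font names and the helper `max_extra_bits_from_order` are hypothetical.

```python
import math


def max_extra_bits_from_order(n_fonts: int) -> float:
    """Upper bound on the extra bits an ordering of n distinct fonts
    can carry on top of the unordered set (log2 of n factorial)."""
    return math.log2(math.factorial(n_fonts))


# Two hypothetical users with the same fonts, reported in different orders.
user_a = ["Font A", "Font B", "Font C"]
user_b = ["Font C", "Font A", "Font B"]

same_as_sets = set(user_a) == set(user_b)  # a set-based probe cannot tell them apart
same_as_lists = user_a == user_b           # an ordered list can

extra_bits_for_three_fonts = max_extra_bits_from_order(3)  # log2(6)
```

How much of that bound is realized in practice depends on how installation order actually varies between users, which this sketch does not model.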

dbaron commented 5 years ago

Oh, I guess I should respond to 3, actually: The Privacy IG isn't the right forum for making tradeoffs between Privacy and other issues; it's going to have an obvious bias. You'd probably get a very different result on a privacy vs. internationalization tradeoff in the Internationalization WG.

litherum commented 5 years ago

Safari, too, has different fonts for different internationalizations. My strategy when implementing this fingerprinting mitigation in Safari wasn't to treat every user the same; that would have made many of our users' lives worse. Instead, my goal was to limit the number of equivalence classes a user could fall into. Before the mitigation, a user could be in a class of one, thereby being uniquely identified. After the mitigation, there are still multiple equivalence classes, but there are only a handful. Each equivalence class has many, many users, thereby significantly reducing the number of bits of entropy.
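
The effect on equivalence classes can be illustrated with a toy calculation. The population below is invented, and `surprisal_bits` is simply log2(population / class size), the number of bits learned by observing that a user falls into a class of that size. This is not Safari's algorithm, only a sketch of what it means to group users by OS and language instead of by their exact font set.

```python
import math
from collections import Counter


def class_sizes(users, key):
    """Group users by what a page can observe and count each group."""
    return Counter(key(user) for user in users)


def surprisal_bits(class_size, population):
    """Bits of identifying information revealed by membership in a class."""
    return math.log2(population / class_size)


# Hypothetical users: (OS, language, extra user-installed fonts).
users = [
    ("macOS", "en", frozenset()),
    ("macOS", "en", frozenset({"Font A"})),
    ("macOS", "en", frozenset({"Font B"})),
    ("macOS", "en", frozenset({"Font A", "Font B"})),
    ("macOS", "ja", frozenset()),
    ("macOS", "ja", frozenset({"Font C"})),
    ("macOS", "ja", frozenset({"Font A"})),
    ("macOS", "ja", frozenset({"Font D"})),
]

# Before: the page sees the full font set, so every user is distinct here.
before = class_sizes(users, key=lambda user: user)
# After: user-installed fonts are hidden, leaving only OS and language.
after = class_sizes(users, key=lambda user: user[:2])

bits_before = surprisal_bits(min(before.values()), len(users))
bits_after = surprisal_bits(min(after.values()), len(users))
```

In this made-up population the smallest class grows from one user to four, and the bits revealed drop accordingly. Real numbers would have to come from measurement.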

litherum commented 5 years ago

Next step: work on an API for full access to all installed fonts as a list!

I would formally object to such an API. It explicitly undoes all the font-based privacy mitigations we've done. Users don't want to see more dialog boxes, and trying to explain the privacy implications of using fonts to a user is difficult. If a website wants to use fancy fonts, it can serve them as web fonts.

FremyCompany commented 5 years ago

Based on @litherum's comments, my two cents here is that we should instead do the following:

User Agents must limit the exposure of system fonts to protect user privacy. The exact mechanism through which this is done is left at the discretion of User Agents.

To achieve this, User Agents should collect telemetry about fonts supported by their users. One way to prevent installed fonts from leaking information about the user would be to cross-reference this telemetry data with their installed languages and operating system version, and not expose to the web the fonts which are not commonly supported in any of the [ OS-Version x Installed Language ] buckets that the user is part of.
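
A minimal sketch of that bucket filter, under stated assumptions: the telemetry layout, the machine counts, the font names, and the 90% `threshold` for "commonly supported" are all invented for illustration, and `exposable_fonts` is a hypothetical helper, not an API of any browser.

```python
def exposable_fonts(installed, user_buckets, telemetry, threshold=0.9):
    """Return the installed fonts that are common in at least one of the
    user's [OS version x installed language] buckets."""
    exposed = set()
    for font in installed:
        for bucket in user_buckets:
            stats = telemetry[bucket]
            share = stats["fonts"].get(font, 0) / stats["machines"]
            if share >= threshold:
                exposed.add(font)
                break  # common in one bucket is enough
    return exposed


# Hypothetical telemetry: machines per bucket and how many have each font.
telemetry = {
    ("OS 10", "en"): {
        "machines": 1000,
        "fonts": {"System Sans": 1000, "Office Serif": 950, "Rare Display": 3},
    },
    ("OS 10", "ja"): {
        "machines": 200,
        "fonts": {"System Sans": 200, "Gothic JP": 198},
    },
}

installed = {"System Sans", "Office Serif", "Gothic JP", "Rare Display"}

bilingual_user = exposable_fonts(
    installed, [("OS 10", "en"), ("OS 10", "ja")], telemetry
)
english_only_user = exposable_fonts(installed, [("OS 10", "en")], telemetry)
```

The rare font is hidden for both users, and the Japanese font is exposed only to the user who belongs to the Japanese bucket, so it reveals nothing beyond the installed language.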

tildelowengrimm commented 5 years ago

Agreeing to solve [passive fingerprinting] needs to be a wide consensus, not a bunch of ad hoc and inconsistent decisions made in different working groups to different standards.

David, I think I have the opposite expectations about what approach we should take. I believe that the only practical way to address passive fingerprinting is standard-by-standard and implementation-by-implementation doing the in-the-weeds work to ensure that passive fingerprinting surface isn't exposed.

But I definitely agree with you about the need for broad consensus. Good news, though: that one's already taken care of! People almost-universally agree that they don't want to be silently tracked across the web. It's not just consensus, it's basically unanimous. Now it's our job to implement that for everyone.

svgeesus commented 5 years ago

@dbaron wrote:

You'd probably get a very different result on a privacy vs. internationalization tradeoff in the Internationalization WG.

Similarly, people coming from a performance optimization perspective (which has substantial, real-world implications especially for those with slow network connections or with pricey, metered bandwidth) would take a different perspective if told "there is no need to download this font since you have it already, but we are going to forbid the browser to say so, and thus force you to download it every time, to enhance your privacy".

dbaron commented 5 years ago

There is not consensus that active fingerprinting is solvable to the point that there won't still be large numbers of unique users; I've seen a number of chrome implementors and tech leads take the position that it is not in discussions on fingerprinting, including in this working group and elsewhere. I'm not convinced either way as to whether it's solvable because I haven't seen anybody put together the data (an up-to-date list of fingerprinting vectors, with data on them and proposed mitigations) that would let me make that judgment.

(edit: fixed typo where I wrote passive when I meant active)

pes10k commented 5 years ago

@dbaron not sure what the suggestion is here. Freeze progress on CSS Fonts 4 until an "up-to-date list of fingerprinting vectors, with data on them and proposed mitigations" is built? It definitely does not seem user-serving to say "we know there is a problem, we know it's significant, but haven't had others propose mitigations for them, so we're going to ship the problem anyway".

Seems way better to fix a problem that we know exists now, and is harming users today. This isn't hypothetical; the current CSS Font v3 spec enables users to be tracked w/o their consent.

As stated before, there are many, many research papers showing this is a problem, as well as many deployed examples in the wild. It is not the case that these papers find no problem in the absence of Flash; the findings are either "not having Flash degrades identifiability some, but it's still identifying" or "we measured w/o Flash, and find it's highly identifying." It's also apparently serious enough that FF and Safari have deployed mitigations.

pes10k commented 5 years ago

@litherum can you say more about Safari's algorithm? How different is it from anonymity sets of "browser & OS & preferred language"? I'm not married to the specific mitigation in the issue text, as long as the standard includes a fix for the problem. Maybe Safari's approach is the way to go!

jasonanovak commented 5 years ago

In terms of the efficacy of font fingerprinting and the entropy it exposes, this paper from INRIA is fairly interesting/helpful. They conducted a real-world study of fingerprinting, including JavaScript-based font probing, and found that fonts were one of the top contributors to fingerprintability.

jasonanovak commented 5 years ago

One additional thought: a concern has been raised about the performance impact of downloading fonts; there's also an interesting performance impact of JS-based font fingerprinting. It takes time/resources for a fingerprinting script to iterate through fonts to determine what a user has installed (for example, fingerprintjs2 has searching for an extended list of fonts as a default-off option because of the performance impact of doing so).

AmeliaBR commented 5 years ago

@jasonanovak I don't think you can fairly compare performance impacts from malicious pages with performance impacts on normal usage. Blocking one fingerprinting script might just provoke the spyware to use another fingerprinting method with even worse performance impacts.

If the primary concern was the performance impact of the current methods for figuring out which fonts a user has, the solution would be to create a proper API for doing so.

tildelowengrimm commented 5 years ago

Agreed — I'm not a fan of a calculus which concludes that fixing a common fingerprinting method has a performance cost on the basis that sites might decide to use a less-performant fingerprinting method instead.

AmeliaBR commented 5 years ago

The performance cost of the fix is that people would end up downloading web fonts that they don't actually need (because they already have the font installed on their system).

E.g., I have most common Google Fonts installed, and one of the reasons I did that was to cut down on web font downloads. If we prevent browsers from using those custom installed fonts, there will be a performance cost to me (more data usage and slower page loading) when visiting sites that use these fonts.

How many people this will affect, and to what degree, I can't say. Some browsers give users the option to turn off web font downloads altogether, which would negate the performance impact but increase the impact on user experience. E.g., turning off web fonts might not be a good solution for people whose pre-installed system fonts don't offer a lot of choice for the languages/scripts they use.

The performance impact of malicious scripts is a separate issue altogether. I was using the example of switching fingerprint methods to emphasize that we can't expect that fixing the fingerprinting vector will have a net performance benefit on malicious sites. Malicious sites generally don't care about user data plans.

jumde commented 5 years ago

If there is a plan to introduce a local-font permission with font-table-access (https://github.com/inexorabletash/font-table-access/#privacy-and-security-considerations), then there is no need to allow non-standard system fonts by default.

svgeesus commented 4 years ago

Privacy Interest Group tracking this

css-meeting-bot commented 4 years ago

The CSS Working Group just discussed mitigations for font based fingerprinting.

The full IRC log of that discussion
<mstange> Topic: mitigations for font based fingerprinting
<mstange> github: https://github.com/w3c/csswg-drafts/issues/4055
<mstange> chris: The issue is that you can pretty much identify individuals based on the set of installed fonts.
<mstange> ... For example, I have all CSS test fonts installed and some fonts for languages I don't spec, and that identifies me uniquely.
<AmeliaBR> s/spec/speak/
<foolip> fantasai, florian, TabAtkins: we're in the #testing meeting debating what your requirements actually are. can we interview you later?
<mstange> ... One proposal was to only report fonts that are the standard fonts for that platform.
<mstange> ... But this would cause you to re-download fonts you already have.
<mstange> ... This consumes unnecessary bandwidth.
<mstange> florian: On some OSes, even the set of default fonts can almost uniquely identify you.
<mstange> myles: It is impossible for the spec to describe the set of default fonts.
<mstange> ... The proposal is to say in the spec that browsers must have some affordances to protect user privacy by having some sort of (?)
<mstange> florian: On the performance vs privacy question, I lean towards privacy. On performance vs internationalization, it's less clear: If you don't have the font for a particular language and can't read the text, that's bad.
<mstange> chris: There is a strong web compat problem here. Things that used to work should not break.
<mstange> florian: When working means look pretty, there's a trade-off. When it means you cannot read it, it's different.
<mstange> myles: WebKit has been doing this for over a year. We discard user-installed fonts.
<mstange> florian: Mongolian without fonts is unreadable.
<mstange> ... When it is readable, removing the fonts breaks it.
<mstange> myles: It's a trade-off.
<mstange> heycam: How did you choose that list of fonts?
<mstange> myles: I commented on the issue.
<heycam> s/heycam/thomas/
<fantasai> It was also pointed out that downloading fonts can cost money in some areas, and this is more likely to be the case in areas which are more likely to use minority languages
<mstange> thomas: Rather than a bespoke list, could we come up with a list that can be updated periodically? Some list that covers languages for i18n use cases, as well as some fonts that are installed on machines.
<fantasai> and which have less money to spend
<mstange> iank_: The information about fonts is queriable by measuring the bounds of boxes, without getting the list of fonts from an API.
<mstange> Rossen_: We will pause the discussion of this issue and unpause it after the break.

css-meeting-bot commented 4 years ago

The CSS Working Group just discussed mitigations for font based fingerprinting.

The full IRC log of that discussion
<emilio> Topic: mitigations for font based fingerprinting
<emilio> github: https://github.com/w3c/csswg-drafts/issues/4055
<emilio> TabAtkins: [introduces the issue]
<emilio> TabAtkins: we expose a lot of PI data on the web
<emilio> ... even if you plug fonts we're probably not below the level where you cannot identify a single user
<emilio> ... to do that you probably need to do software rendering on canvas for example
<emilio> ... so unless somebody comes up with a list of stuff and data
<emilio> ... I think we shouldn't do that
<emilio> ... a bit annoying from a PR standpoint to argue why it doesn't really matter but...
<emilio> myles: our goal is to remove all the sources of fingerprinting on the web
<emilio> ... we should reduce as much as possible
<emilio> TabAtkins: you cannot remove all of them
<emilio> ... no media queries, etc..
<emilio> TabAtkins: unless you could reduce it to 20 you haven't done anything
<emilio> myles: well you're closer to the goal
<emilio> [funny methafores]
<emilio> metaphors*
<Rossen_> q?
<emilio> TabAtkins: going from "individually identify someone" to "individually identify someone" does nothing
<emilio> ... there's a specific threshold we need to reach to do anything
<emilio> ... and nobody can
<emilio> myles: we'll try
<emilio> dino: I really believe we should ask the question for each feature of what the cost is
<emilio> ... I accept what TabAtkins says about the number of bits
<emilio> ... but it's this group's duty to do the cost of the feature vs. the privacy impact
<emilio> florian: cost is breaking the web for minority languages, benefit is not clear yet
<emilio> TabAtkins: w3c has the privacy interest group working on this, if their conclusion is that we can hit this range by doing this
<emilio> ... then happy to
<emilio> plinss: every time we add a bit we make it that much harder, if we throw our hands up in the air then sure, let's add identifiers
<emilio> thomas: There's also ways to alert the user it's being fingerprinted
<Rossen_> q?
<emilio> nmccully: I'm hearing mostly that it's not the right fix. We shouldn't make it worse but...
<leaverou> q+
<emilio> myles: our job is to design CSS APIs and we have to weight pros and cons. We found that font-based fingerprinting is one of the most unique ways users are fingerprinted. We also found that it doesn't affect most users' experience
<Rossen_> ack leaverou
<emilio> ... so pros and cons seem clear here
<dino> emilio: I agree with myles
<emilio> leaverou: Lots of old sites rely on common fonts like Calibri or Cambria installed
<florian> q?
<florian> q+
<emilio> ... also there's a perf impact of always downloading the font since sites tend to use `local()`
<emilio> ???: Are we getting ahead of the game between standards and impls
<fantasai> s/???/glenn/
<dino> s/???/Glenn/
<emilio> myles: the spec can't do much here
<Rossen_> ack flackr
<emilio> myles: we are an standardization, we can't do more that saying in the spec that should have privacy considerations
<Rossen_> ack florian
<emilio> ... but browsers like Safari can and have gone further
<emilio> florian: so you mentioned that you investigated the amount of sites
<emilio> ... that broke or not
<emilio> ... if you're removing language support minority users can't use the web
<emilio> ... also bandwidth may be a concern
<emilio> ... I don't care if sites are slowly slower for californians
<emilio> myles: having philosophical discussions is not particularly useful
<emilio> ... we need a concrete proposal
<emilio> ... and there's nothing to resolve on until there's one
<emilio> ... the spec already says that a UA may or may not scan all fonts in the system
<emilio> Rossen_: out of time

npdoty commented 4 years ago

The performance cost of the fix is that people would end up downloading web fonts that they don't actually need (because they already have the font installed on their system).

E.g., I have most common Google Fonts installed, and one of the reasons I did that was to cut down on web font downloads. If we prevent browsers from using those custom installed fonts, there will be a performance cost to me (more data usage and slower page loading) when visiting sites that use these fonts.

How many people this will affect, and to what degree, I can't say. Some browsers give users the option to turn off web font downloads altogether, which would negate the performance impact but increase the impact on user experience. E.g., turning off web fonts might not be a good solution for people whose pre-installed system fonts don't offer a lot of choice for the languages/scripts they use.

I think it would be useful to know how many people have separately installed many web fonts onto their systems and would get this bandwidth-reduction benefit. It looks like SkyFonts provides a service for that (including citing bandwidth benefits), but it's not really emphasized on the Google Fonts site itself, for example.

But couldn't browsers provide that performance benefit by caching web fonts? It doesn't have to be system-installed, a site can refer to a web font and if the browser has it cached, then the user doesn't incur the bandwidth cost; Google Fonts are typically cached for one year. There are potential privacy implications regarding timing attacks on cached resources as well, but they're not nearly as easy or expansive as accessing the list of fonts, which (sorry to repeat the point) is one of the highest entropy fingerprinting sources available (in the top 3 to 4, depending on some details like platform or the particular dataset).

AmeliaBR commented 4 years ago

But couldn't browsers provide that performance benefit by caching web fonts?

Some benefit, for repeat visits to the same website. For visits to different sites, browsers are switching to a model where the cache of 3rd party resources gets partitioned by the site making the request (to avoid security issues where sites could guess at your browsing history by timing how long it takes to download a resource from that domain). Even without that security enhancement, cross-site caching fails if the site has done anything unique re subsetting the font.
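
The difference between a shared cache and a cache partitioned by requesting site can be sketched as follows. This is a simplified model, not any browser's implementation: the `FontCache` class, the example sites, and the font URL are all hypothetical, and real caches also account for expiry, validation, and more.

```python
class FontCache:
    """Toy HTTP cache that is either shared or keyed by top-level site."""

    def __init__(self, partitioned):
        self.partitioned = partitioned
        self._entries = set()
        self.downloads = 0

    def fetch(self, top_level_site, font_url):
        # A partitioned cache includes the requesting site in the key,
        # so an entry stored for one site is invisible to every other site.
        key = (top_level_site, font_url) if self.partitioned else font_url
        if key in self._entries:
            return "cache hit"
        self._entries.add(key)
        self.downloads += 1
        return "downloaded"


FONT_URL = "https://fonts.example/hypothetical-sans.woff2"

shared = FontCache(partitioned=False)
shared_results = [
    shared.fetch("a.example", FONT_URL),
    shared.fetch("b.example", FONT_URL),  # reuses a.example's copy
]

partitioned = FontCache(partitioned=True)
partitioned_results = [
    partitioned.fetch("a.example", FONT_URL),
    partitioned.fetch("b.example", FONT_URL),  # separate partition, new download
    partitioned.fetch("a.example", FONT_URL),  # repeat visit still benefits
]
```

In the partitioned model the bandwidth saving survives only for repeat visits to the same site, which matches the limitation described above.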

pes10k commented 4 years ago

Glad to hear this was discussed at TPAC. However, I couldn't tell from the IRC notes above what the group decided on for next steps. PING's objection is still the same: the privacy harm enabled by the spec has been demonstrated "in the wild", and so needs some solution in the spec.

What I took away from the IRC conversation is that the group needs further data to decide the correct mitigation. Is this correct? If so, do y'all have a plan for getting that data? Happy to support that effort if possible.

I'm confused by the "Needs Design / Proposal" label though. https://github.com/w3c/csswg-drafts/issues/4055#issuecomment-505279789 is a concrete proposal, no?

dscorbett commented 4 years ago

This will break websites with user-generated content in minority scripts. Maybe browsers should be encouraged to ask users, upon first going to a site that requests a certain installed font, whether to permanently allow that site access to that font, to minimize the disruption.

pes10k commented 4 years ago

@dscorbett can you explain more? These would be sites that expect the visitor to have a non OS provided font, don't have a useful / useable fallback, and don't include / web-font the font they want to use? Can you send some example links?

These are websites that currently break in all WebKit browsers, then? And that would break under the suggestion in https://github.com/w3c/csswg-drafts/issues/4055#issuecomment-505279789?

dscorbett commented 4 years ago

https://www.facebook.com/RohingyaLanguageAcademy/ includes some user-generated content in the Hanifi Rohingya script. Facebook doesn’t distribute a Hanifi Rohingya font, but I can see the text because I have Noto Sans Hanifi Rohingya installed. If the browser skipped that font because it is not a default system font, no one would be able to see the text.

That text is visible in Safari. Have I misunderstood this proposal?

litherum commented 4 years ago

*not all WebKit browsers. Only Safari blocks these user-installed fonts. Regular web views in 3rd party apps need to continue honoring these fonts because it’s common for apps themselves to “install” fonts for the current process and use web content as UI, which should get the font.

tabatkins commented 4 years ago

So I just had a discussion with @jschuh to figure out some further details beyond what I discussed during TPAC. Here's our (Chrome's) current thoughts on the matter:

  1. For the explicit enumeration of local fonts API, we want to only expose system fonts by default.

    • If the user has given explicit signals that they've trusted the site with their identity, such as installing the site as a PWA or the browser detecting that the user has logged into the site, we intend to allow enumeration of all local fonts by default; you can't identify the user much more at that point. Exactly what triggers this condition is up for revision as time goes on.
  2. To support the middle ground of non-trusted sites that still want font data from local fonts, we want an API (<input type=font>, or a DOM API call that pops up a font chooser) that lets the user explicitly choose a single font to expose. (This is akin to <input type=file>; enumerating the user's filesystem is clearly terrible, but letting the user affirmatively provide a single file is clearly totally OK.) We'll pursue this separately as another spec proposal.

  3. For more general font usage, such as in 'font-family', we are not interested in locking down access to only webfonts and local system fonts; there are important usability and a11y concerns, as expressed in this thread and in the TPAC discussions, for allowing pages to use local fonts beyond the system ones.

    However, to deal with bad actors using this access as a back-door for font enumeration/fingerprinting, we're actively working on a Privacy Budget system, in which "how many fonts is this page accessing" will be counted (among many other things). A page cycling thru a large number of fonts will burn thru their budget quickly, at which point further fingerprintable APIs will stop working (or become very noisy/generic) until the user gives explicit permission to continue.

Privacy Budget satisfies my concerns, expressed during TPAC, that trying to solve privacy issues with one-off restrictions won't work, and even with coordinated efforts across the web platform you'd need harmfully-draconian restrictions to have even a chance of protecting privacy. We believe the Privacy Budget framework is the correct way to address fingerprinting concerns going forward.
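
As a rough illustration of the budget idea for fonts only, here is a toy sketch. No Privacy Budget specification is cited in this thread, so every detail is an assumption: the `PrivacyBudget` class, the flat cost of one bit per distinct local font, and the limit of three bits are invented and do not describe Chrome's design.

```python
class PrivacyBudget:
    """Toy per-page budget charged once for each distinct local font used."""

    def __init__(self, limit_bits):
        self.limit_bits = limit_bits
        self.spent_bits = 0.0
        self.granted_fonts = set()

    def request_local_font(self, name, cost_bits=1.0):
        """Return True if the page may use the local font `name`."""
        if name in self.granted_fonts:
            return True  # already paid for, no new information leaks
        if self.spent_bits + cost_bits > self.limit_bits:
            return False  # budget exhausted: the page gets a generic fallback
        self.spent_bits += cost_bits
        self.granted_fonts.add(name)
        return True


# A hypothetical page cycling through ten fonts to enumerate them.
budget = PrivacyBudget(limit_bits=3)
cycling_results = [
    budget.request_local_font(f"Hypothetical Font {i}") for i in range(10)
]

# A hypothetical ordinary page that only uses two local fonts.
ordinary = PrivacyBudget(limit_bits=3)
ordinary_results = [
    ordinary.request_local_font("Body Font"),
    ordinary.request_local_font("Heading Font"),
]
```

A flat per-font cost ignores that a single rare font can be far more identifying than several common ones, so a real design would need per-font costs backed by data.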

yisibl commented 4 years ago

@litherum

If a website wants to use fancy fonts, it can serve them as web fonts.

For CJK-like fonts, there is currently no better way to make them a universal web font. Because these fonts are very large, a font in ttf format usually exceeds 10MB. The current solution is font subsetting, but it has a lot of limitations.

pes10k commented 4 years ago

re @tabatkins and privacy budget

I haven't seen a standard for it, any specifics of thresholds, or empirical observations of it being a useful privacy protection strategy. Further, since a unique font generally puts someone in an extremely small equivalence class by itself (without needing to be combined with other inputs), it's unclear how a privacy budget approach would be useful here.

Put differently, users are being harmed today by this flaw in the font standard. It seems inappropriate to hinge the solution to that problem to something that isn't anywhere close to standardization (i.e. privacy budget).

re @yisibl (https://github.com/yisibl) could the privacy harm be addressed by solving the problems related to font subsetting?

Re @litherum @dscorbett Would be interested to know how Safari handles these cases, as it seems the best (only?) proposal on the table currently is to do what Safari does.

In general, again, neither I nor (I think) anyone on PING is wedded to any particular mitigation, only to the point that there is a deep, privacy-harming flaw in the current spec that needs fixing. I would be very happy to work with the WG to come up with other options if the Safari option doesn't work. But some solution needs to be found (keeping in mind that privacy budget does not seem to be a solution to this problem).

hsivonen commented 4 years ago

For CJK-like fonts, there is currently no better way to make them a universal web font. Because these fonts are very large,

This seems not particularly relevant to whether it's practical for browsers to block the visibility of user-installed fonts to the Web, because CJK fonts are included in the default install of all popular operating systems these days.

hax commented 4 years ago

@hsivonen Not all OSes have good-quality CJK fonts installed; currently HeiTi is the only typeface with good-quality fonts broadly available across all major OSes. So if you need support for other high-quality CJK typefaces (like Song, Fangsong, Kai, etc.), web page authors may rely on user-installed fonts, for example the fonts that come with an MS Office installation.

There are also some business web apps in China (e.g. tax software) that require special fonts to be installed and used in their pages.

pes10k commented 4 years ago

@hax would it suffice to have a browser setting (defaulting to off) to enable this?

(distinct from a per-page permission, for the reasons mentioned in https://github.com/w3c/csswg-drafts/issues/4055#issuecomment-505281057)? This would be similar to the do-not-track setting defined in that standard, but defaulting to off instead of on.

@hax also, can you clarify what happens on these sites when you visit them in Safari, on a local install of OSX? Do they work correctly in Safari b/c OSX installs a category of fonts that (for example) Windows doesn't? Or that these sites don't support Safari / users w/ default fonts?

Would another option be to just have Microsoft systems include the common Office fonts in the set of system fonts they expose (since the number of Office users is likely large enough to preserve useful equivalence classes)?

yisibl commented 4 years ago

I am not against reducing the impact of fonts on fingerprinting, but Safari's approach is arbitrary. Even if a user installs a high-quality CJK font, a page cannot specify it via CSS. As a result, we can only use web fonts, but CJK web fonts face many problems.

The Chrome team offers a lot of advice, and we should go in this direction instead of killing the Web's creativity.

In Safari 12.1.1, whether I use local() to specify a font or directly set font-family to SourceHanSerifCN-Light I can't use my own installed CJK font.

I tried two ways to enable locally installed fonts:

  1. Directly set font-family: "Source Han Serif CN";(思源宋体)
  2. Via local()

Either way, this font cannot be enabled.

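/* Alias "$" to the locally installed font, looked up by its PostScript name. */
@font-face {
  font-family: "$";
  src: local("SourceHanSerifCN-Light");
}

/* Test 1: request the installed font directly by family name. */
.test1 {
  font-family: "Source Han Serif CN", "PingFangSC-Regular";
}

/* Test 2: request it through the local() alias defined above. */
.test2 {
  font-family: "$", "PingFangSC-Regular";
}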

Demo: https://codepen.io/yisi/pen/OJLGoxj

Safari 12.1.1 (falls back to PingFang SC):

[screenshot]

Firefox, Chrome, IE:

[screenshot]

hax commented 4 years ago

@snyderp

Because CJK fonts have been a big issue since the first days of the internet, some front-end developers specify complex font settings to make use of the best-quality fonts that may be available on the user's computer. For example, they may use font-family: Source Han Sans, Source Han Sans SC, Source Han Sans CN, Noto Sans CJK, Noto Sans CJK SC, Hiragino Sans GB, Lantinghei SC, Microsoft Yahei, HYQihei, PingFang SC, STXihei, WenQuanYi Micro Hei. This list includes many HeiTi fonts from different sources: open source fonts, fonts bundled with widely used software (like Office), popular commercial fonts, etc. The simple assumption behind this strategy is that if a user buys or installs a font, it's very likely they want to use it as their default font.

If you ask what will happen when users upgrade to Mojave... Luckily, the CJK HeiTi fonts in OSX/iOS have been good since 2015, so falling back to PingFang SC doesn't seem too bad.
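To make the fallback behaviour concrete, here is a minimal sketch of how a font-family list might resolve with and without a "system fonts only" policy. The function, its parameters, and the example machine (PingFang SC as the only relevant system font, Source Han Sans installed by the user) are assumptions for illustration, not any browser's actual implementation.

```python
def resolve_font(font_family, installed, system_fonts, system_only):
    """Return the first family in the list the browser would use, or None."""
    for part in font_family.split(","):
        name = part.strip().strip("\"'")
        if name not in installed:
            continue  # not on this machine at all
        if system_only and name not in system_fonts:
            continue  # installed by the user, but hidden from the page
        return name
    return None  # the browser would fall back to its default font


# Hypothetical machine: one system CJK font plus one user-installed font.
system_fonts = {"PingFang SC"}
installed = system_fonts | {"Source Han Sans"}
stack = "Source Han Sans, Noto Sans CJK SC, Microsoft Yahei, PingFang SC"

with_local = resolve_font(stack, installed, system_fonts, system_only=False)
system_only = resolve_font(stack, installed, system_fonts, system_only=True)
print(with_local)   # Source Han Sans
print(system_only)  # PingFang SC
```

Under the restrictive policy the page still renders, but with the system font rather than the one the user chose to install.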

But there may still be many cases that will face problems. For example, I have seen some mobile devices and apps offer many CJK fonts for download and installation as an important feature. It would be unacceptable if users could not use these fonts in their browsers/web apps.

Or that these sites don't support Safari / users w/ default fonts?

In the past, many sites/web apps in the China market were only tested on Windows. As developers, we try our best to make web pages/apps compatible with all platforms, but some things are out of our control. For example, tax software needs special fonts installed for legal compliance. We really need a solution for such cases. And the solution should be good enough, or we will eventually end up with instructions like "don't use Safari or new Chrome, please use XXX, YYY, ZZZ browsers (which are based on old versions of Chromium)" 😂

Currently I cannot tell whether a browser setting (or any other method) would be OK for each use case. I just want to provide some background that I believe needs to be taken into consideration.

hsivonen commented 4 years ago

@hax Thank you!

Not all OSes have good-quality CJK fonts installed; currently HeiTi is the only typeface with good-quality fonts broadly available across all major OSes. So if you need support for other high-quality CJK typefaces (like Song, Fangsong, Kai, etc.), web page authors may rely on user-installed fonts, for example the fonts that come with an MS Office installation.

From the privacy perspective, it's problematic that for some systems, there isn't a single font bundle. E.g. an en-US install of Windows 10 does have fonts for Chinese but not the ones you mention, but you don't need to install Office: AFAICT, adding the Simplified Chinese IME to available text input methods adds the fonts DengXian, FangSong, KaiTi, and SimHei.

Adding the Japanese and Traditional Chinese IMEs similarly expands the set of fonts even though the en-US base install already has coverage. (And indeed, for Japanese, the base set is gothic-only with no mincho!)

Sites that involve text input can pretty easily figure out what IME a user is using, so in that sense having the font list correlate with IME doesn't give away more information, but when there's no text input on a site or when the user has added IMEs to the menu but isn't currently using them, being able to detect the full set of IMEs the user keeps available is bad. I don't know how to solve this unless Microsoft changes its disk space vs. privacy considerations when deciding how this stuff works, but as long as browsers expose whatever user-installed fonts to the Web, Microsoft has no incentive to change the privacy properties of the system font configurations.

Some privacy can be traded away for typographic quality by not blocking any font that is bundled with Windows even if the font isn't guaranteed to be present in all configurations of Windows.

(I'd expect a Korean font subsetted to the KS X 1001 set of modern-use syllables to be of reasonable size as WOFF2, so I expect the constraints for site-provided fonts for Korean to be different from Chinese and Japanese.)

The simple assumption behind this strategy is that if a user buys or installs a font, it's very likely they want to use it as their default font.

Surely that bit of user intent can be seen from the user actually taking action to change the browser font prefs in addition to just installing the font.

hax commented 4 years ago

Surely that bit of user intent can be seen from the user actually taking action to change the browser font prefs in addition to just installing the font.

If I understand correctly, if browsers allow users to set installed fonts as default fonts, that could still be used for fingerprinting 😂

Essentially, users may want to install and use fonts on the web platform for various reasons, such as a11y, business requirements, legal compliance, political position, aesthetics, or just expressing personality. How to make the tradeoff between privacy and users' right of choice is a hard problem.

hsivonen commented 4 years ago

If I understand correctly, if browsers allow users to set installed fonts as default fonts, that could still be used for fingerprinting

Of course.

Essentially, users may want to install and use fonts on the web platform for various reasons, such as a11y, business requirements, legal compliance, political position, aesthetics, or just expressing personality. How to make the tradeoff between privacy and users' right of choice is a hard problem.

If users are given choice, we can't protect users who use the ability to make choices from being fingerprinted on those choices.

However, at present there's the problem that people who make no browser configuration changes still get fingerprinted on their non-Web uses of their computer. I think it's worthwhile to protect users who don't change browser font prefs from being fingerprinted on what fonts they've installed for other things that they do on their computer.

pes10k commented 4 years ago

If I understand correctly, if browsers allow users to set installed fonts as default fonts, that could still be used for fingerprinting 😂

This is correct, and why this issue exists. It would be great to have more suggestions for how to solve the problem, instead of privacy ¯\_(ツ)_/¯

(I don't mean you specifically, in any way, but the above thread is heavy on attacking a single suggestion, and light on the WG suggesting solutions to a problem in the WG's spec)

The "browser font prefs" suggestion is not appealing to me (I mean it mostly as a straw man), but the feedback I'm hearing from the WG is that for some subset of users, in some locales, fingerprinting is unavoidable. I'm not ready to throw in the towel yet, but it's there to say "here is at least one option for having the standard be privacy preserving by default, instead of privacy harming by default".

hax commented 4 years ago

@snyderp We all agree privacy is very important. I don't think anyone wants to "be heavy on attacking a single suggestion" in this thread. But as @yisibl and I pointed out, the current Safari solution has a bigger impact on CJK users than on others, because the alternative workaround (using web fonts) has much bigger costs and difficulties for CJK fonts (and considering other factors like cache partitioning, the cost will be even bigger).

@hsivonen

If users are given choice, we can't protect users who use the ability to make choices from being fingerprinted on those choices.

I believe that's the problem the working groups (CSS WG + privacy WG + other related WGs like i18n, etc.) need to work together to figure out.

tabatkins commented 4 years ago

I haven't seen a standard for it, any specifics of thresholds, or empirical observations of it being a useful privacy protection strategy.

I assume you read the explainer I linked? Note that this is also something we're actively working on and developing; it's far from complete so far.

Further, since a unique font generally puts someone in an extremely small equivalence class by itself (without needing to be combined with other inputs), it's unclear how a privacy budget approach would be useful here.

A single unique font probably does, yeah. How do you expect a website to find that single unique font that the user has? If it's highly identifying, that means only a small number of people have it. So either the website is only targeting those handful of people and is thus testing only for that font (interesting case...) or they're testing lots of "unique" fonts to see which small bucket the user falls in. The latter is exactly what the Privacy Budget approach is intended to detect - spamming hundreds or thousands of local font requests looking for the one that highly identifies the user.

Put differently, users are being harmed today by this flaw in the font standard. It seems inappropriate to hinge the solution to that problem to something that isn't anywhere close to standardization (i.e. privacy budget).

And as others have argued in this thread, users will be harmed by the suggestion to restrict local font access to solely system fonts. (And aren't currently harmed by Safari's actions due to the differences in user demographics between browsers.) We need to think about the balance of benefits, harms, and costs of mitigating those harms.

As I argued in TPAC, and Chris Wilson and others at Google argued in their response to PING's charter discussion, the web is chock full of data that can be used for fingerprinting. Any attempt to reduce that, particularly any attempt with significant user-harmful side effects, needs to show that it'll actually reduce the fingerprinting surface to a usefully low level; going from 400 bits to 40 bits of identifying information achieves precisely nothing, since you only need 33 bits to uniquely identify every person on Earth. (And you really want to allow less than 20 bits, to ensure that people are "bucketed" together with at least several thousand others.)
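The arithmetic behind those figures can be checked with a short calculation. This is only a back-of-the-envelope sketch: the population figure is a rough assumption, and the bucket sizes assume the identifying bits split people evenly, which real fingerprints do not.

```python
import math

WORLD_POPULATION = 8_000_000_000  # rough assumption, for illustration only

# Bits needed to give every person a distinct identifier.
bits_to_identify_everyone = math.ceil(math.log2(WORLD_POPULATION))


def average_bucket_size(bits, population=WORLD_POPULATION):
    """People sharing each fingerprint if `bits` of information split them evenly."""
    return population / 2 ** bits


print(bits_to_identify_everyone)       # 33
print(round(average_bucket_size(20)))  # 7629
print(average_bucket_size(40) < 1)     # True
```

At 20 bits each bucket holds several thousand people on average, while at 40 bits the average bucket holds less than one person, so every user is still uniquely identifiable.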

If the PING can show that the sum of their suggested mitigations will reduce fingerprinting surface to 20 bits or less, or at least that there's a believable path to getting under that limit, and that performing all of those mitigations will not harm the web to such an extent that the attack surface just moves elsewhere (such as sites moving to native apps...), then great! That would be an ideal solution, because reducing information wholesale is typically far easier than trying to be clever!

So far, the PING hasn't attempted to show that it's possible to do that. And so far, Chrome's security engineers don't believe it's possible to reasonably do an absolute fingerprinting reduction, either. Thus Privacy Budget, our attempt to dynamically enforce a pay-as-you-go budget that, hopefully, will let us prevent attacks (like scanning the user's local fonts) without harming legitimate uses (like using a handful of local fonts to actually render text).

I think you should do more than dismiss Privacy Budget out-of-hand; it's a serious effort to actually solve fingerprinting across the entire web platform, not an attempt to deflect attention. The math is clear here: this isn't a problem that can be solved with band-aids, and even knowing if your efforts will achieve anything at all requires a serious analysis of the whole attack surface; standard defense-in-depth security intuitions don't apply, at least not with the current state of things.

So, as Chris Wilson said, without a formal model showing that this change is part of a combined effort that will achieve a useful result, Chrome will continue to be against it, and will instead pursue methods like I described to achieve useful fingerprinting reduction. Harming users and webdevs for what is currently just a fig-leaf is not something we're interested in.

hsivonen commented 4 years ago

If users are given choice, we can't protect users who use the ability to make choices from being fingerprinted on those choices.

I believe that's the problem the working groups (CSS WG + privacy WG + other related WGs like i18n, etc.) need to work together to figure out.

For clarity: I'm not suggesting taking away choice from users. That exercising choice makes a user fingerprintable is just a fact and there's nothing for WGs to work out about it. (I think most of this issue is probably not a standard-setting one but a browser product decision one.)

What I see as a problem is that what I believe to be substantial populations of users who don't need to exercise such choice and could be protected are not. It's not particularly nice to know that there are users who cannot be protected, but that's not a good reason not to protect the substantial user population who could be protected.

Consider the following types of users:

  1. Users who never install fonts.
  2. Users who install fonts unknowingly by installing apps that add fonts that are available system-wide.
  3. Users who knowingly install fonts for non-browser use but don't change the default font prefs in their browser.
  4. Users whose language is well-served by system fonts but who as a matter of individual taste still change their browser font prefs.
  5. Users whose language is merely covered by system fonts and who as a matter of community norm install additional fonts for their language.
  6. Users whose language isn't covered by system fonts and who have to install fonts for stuff to work at all.
  7. Web developers who want to prototype with local non-system fonts before deploying with @font-face.

(The taxonomy is simplified: The most notable complication is the one seen upthread on Windows 10 with Chinese and Japanese: That there are fonts that are bundled with the system and that are conditionally enabled. For example, for someone in Japan, having the conditionally-enabled Japanese fonts enumerable probably isn't a substantial fingerprinting vector. For someone in Europe who has a Japanese IME in the text input menu, they are. For the purpose of the below paragraphs, I'm hand-waving conditionally-enabled system fonts into group 1.)

Users in group 1 need no protection mechanisms compared to status quo. Evidently users in group 4 can change browser prefs and could uncheck whatever "don't expose user-installed fonts to the Web" checkbox to opt out of protection.

Browsers cannot protect users in group 6 without developing all-encompassing font download mechanisms as part of the browser.

Groups 2 and 3 could be protected but aren't. As a user in group 3 (previously in group 4), I'm unhappy that I'm not made indistinguishable from group 1. It should be within technical feasibility to do so without breaking use cases for groups 5, 6, and 7, but the details do need careful thought. In particular, it would be good to know what language communities are in group 5 and with what details (e.g. in the context of particular operating systems only or out of habit despite operating system font repertoire having improved). Group 4, as noted, will manage.