Identify fonts by name instead of URL

jyasskin commented 4 years ago

In https://github.com/w3cping/font-anti-fingerprinting/issues/3#issuecomment-597077705, @hsivonen suggested the interesting idea of identifying popular fonts by their names instead of by their URLs, to reduce the incentive to centrally host fonts.

@hsivonen, are you thinking of pulling the font's name out of the font-family descriptor or the font's name table or from somewhere else?

hsivonen commented 4 years ago

are you thinking of pulling the font's name out of the font-family descriptor or the font's name table or from somewhere else?

For matching purposes, I'd expect the matching of local() against the browser-managed set of local fonts to use the same names that would match if the font were a system font.

At least in Firefox on Linux, both the user-facing name of the font (e.g. 'Fira Sans Light Italic') and the PostScript name (e.g. 'FiraSans-LightItalic') work, so a site could specify either. (Google Fonts specifies both.) I don't know exactly how the user-facing name is formed from TrueType/OpenType fields. In particular, I'm not at all sure that matching on the user-facing name works for fonts that have the user-facing name in multiple natural languages. CC @jfkthame

For tallying what's popular, local('Fira Sans Light Italic') and local('FiraSans-LightItalic') should increment the same counter. The values should be counted from the @font-face descriptors. Two reasons for that:

1. The software doing the counting can't necessarily extract these names from the font file by loading the url(), because a subset font file might have intentionally obfuscated names, either to mechanically comply with the "Reserved Font Name" part of the Open Font License 1.1 or to defeat unlicensed use of proprietary fonts.
2. If a site doesn't have local() in the src descriptor of its @font-face rule, the rule wouldn't match against the cache even if the font were in the cache. In that sense, it isn't useful to define popularity by how often a Web Font generated from a particular original is loaded; what matters is how often a font is used with local() specified. This also makes moot the problem of guessing, from the font files, what the real name of a subsetted font with an obfuscated name would have been.

However, collapsing the user-facing name(s) and PostScript name to one counter might be best done by examining a corpus of known fonts that are licensing-wise suitable candidates for being browser-managed fonts.

(The above should not, at least at this time, be taken as an endorsement of the general idea of having the browser eagerly populate a cross-site-available font cache.)
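
As a rough illustration of the tallying described above, here is a minimal Python sketch. It assumes an alias table (`ALIASES`, hypothetical) has already been derived from a corpus of vetted fonts, pulls local() names out of `src` descriptor strings with a simplified regular expression rather than a real CSS parser, and counts each @font-face rule at most once per font even when the rule lists both names. None of these choices come from the thread; they are assumptions made to keep the example small.

```python
import re
from collections import Counter

# Hypothetical alias table, as if derived from a vetted font corpus.
# Keys are lowercased names; values are the single counter key per face.
ALIASES = {
    "fira sans light italic": "FiraSans-LightItalic",
    "firasans-lightitalic": "FiraSans-LightItalic",
}

# Simplified pattern for local("..."), local('...') and unquoted local(...).
# A real implementation would use a CSS parser instead.
LOCAL_RE = re.compile(r"""local\(\s*(?:"([^"]*)"|'([^']*)'|([^)]*?))\s*\)""")


def local_names(src):
    """Return the names given in local() entries of one src descriptor."""
    return [(dq or sq or bare).strip() for dq, sq, bare in LOCAL_RE.findall(src)]


def tally(src_descriptors):
    """Count, per canonical face, how many src descriptors reference it."""
    counts = Counter()
    for src in src_descriptors:
        faces = set()
        for name in local_names(src):
            key = ALIASES.get(name.lower())
            if key is not None:  # names outside the vetted corpus are ignored
                faces.add(key)
        counts.update(faces)  # one increment per rule, not per alias
    return counts
```

With this sketch, a rule that specifies both the user-facing name and the PostScript name increments the same counter once, and names that are not in the vetted corpus are never counted.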

jfkthame commented 4 years ago

Regarding src:local(...), the spec currently says that

For OpenType and TrueType fonts, this string is used to match only the Postscript name or the full font name in the name table of locally available fonts

and moreover that

For OpenType fonts with multiple localizations of the full font name, the US English version is used (language ID = 0x409 for Windows and language ID = 0 for Macintosh) or the first localization when a US English full font name is not available [...] User agents that also match other full font names [...] are considered non-conformant. This is done [...] to avoid matching inconsistencies across font versions and OS localizations

Current browsers vary in how accurately they implement this, IIRC. I think I've seen cases where some UAs match font family names in local(...), which is incorrect; the local(...) name is supposed to refer to a specific face.
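
To make the quoted matching rule concrete, here is a small Python sketch of how a name could be checked against one face's name-table records. The record type and function names are hypothetical, the sample localized name is made up for illustration, and the case-insensitive comparison is an assumption of this sketch rather than something stated in the quotes above. Name IDs 4 (full font name) and 6 (PostScript name), and platform IDs 1 (Macintosh) and 3 (Windows), follow the OpenType name table.

```python
from dataclasses import dataclass

FULL_NAME = 4        # OpenType name ID: full font name
POSTSCRIPT_NAME = 6  # OpenType name ID: PostScript name
# (platform ID, language ID) pairs for US English on Windows and Macintosh
US_ENGLISH = {(3, 0x409), (1, 0)}


@dataclass(frozen=True)
class NameRecord:
    platform_id: int
    language_id: int
    name_id: int
    value: str


def matchable_full_name(records):
    """Pick the US English full name, else the first localization."""
    full = [r for r in records if r.name_id == FULL_NAME]
    for r in full:
        if (r.platform_id, r.language_id) in US_ENGLISH:
            return r.value
    return full[0].value if full else None


def local_matches(requested, records):
    """True if `requested` names this face by PostScript or full name."""
    candidates = {r.value for r in records if r.name_id == POSTSCRIPT_NAME}
    full = matchable_full_name(records)
    if full is not None:
        candidates.add(full)
    # Assumption: compare case-insensitively.
    return requested.casefold() in {c.casefold() for c in candidates}


# Hypothetical records for one face; the German full name is invented.
EXAMPLE_RECORDS = [
    NameRecord(3, 0x407, FULL_NAME, "Fira Sans Leicht Kursiv"),
    NameRecord(3, 0x409, FULL_NAME, "Fira Sans Light Italic"),
    NameRecord(3, 0x409, POSTSCRIPT_NAME, "FiraSans-LightItalic"),
]
```

Under these assumptions, `local_matches("Fira Sans", EXAMPLE_RECORDS)` is false, because a family name does not identify a specific face, and the invented German full name does not match either, because the US English full name takes precedence.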

jfkthame commented 4 years ago

If "popular" (whatever that's determined to mean) web fonts are to be cached for cross-site use, identified by src:local(...) names in @font-face rules and not served from a centralized location, isn't there some risk of the wrong resource ending up cached under a given local(...) key?

Suppose we were to include Zapfino in the set of "aggressively-cachable" fonts, and then the first site the user visits happens to have CSS that says

@font-face {
    font-family: flourish;
    src: local("Zapfino"), /* use preinstalled Zapfino on macOS */
         url("fonts/MyCalligraphicFont.woff"); /* alternative for other systems */
}

This will result (for users who don't have Zapfino installed locally) in the site's substitute font being cached as "Zapfino" and used everywhere thereafter.
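
The failure mode above can be shown with a toy model. This Python sketch assumes a deliberately naive cross-site cache (the class name and the second URL are hypothetical) that stores whatever resource the first site fell back to, keyed only by the local() name.

```python
class NaiveNameKeyedCache:
    """Toy cross-site cache keyed only by the local() name (unsafe)."""

    def __init__(self):
        self._by_name = {}

    def resolve(self, local_name, fallback_url):
        # The first site to mention a name decides what every later
        # site receives for that name.
        if local_name not in self._by_name:
            self._by_name[local_name] = fallback_url
        return self._by_name[local_name]


cache = NaiveNameKeyedCache()
# First site: falls back to its own substitute font.
first = cache.resolve("Zapfino", "fonts/MyCalligraphicFont.woff")
# Second site (hypothetical URL): expects the real Zapfino.
second = cache.resolve("Zapfino", "https://example.org/fonts/Zapfino.woff")
```

Here `second` ends up being the first site's substitute, which is the poisoning scenario described in the comment.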

jfkthame commented 4 years ago

Another failure mode would be if the first version encountered for a given "popular" font name happens to be a subsetted resource, which is then cached as the "canonical" resource the browser will use for that name.

Are we envisaging that the canonical list of "popular" web fonts would come with some kind of metadata -- SHA256 checksums or whatever -- for each corresponding resource, so that the browser can know whether what it has just fetched is the correct, canonical resource that should be cached? If they're not coming from a centralized source -- which I agree has its own issues -- then ISTM some such mechanism will be needed to assure correctness/integrity.

hsivonen commented 4 years ago

If "popular" (whatever that's determined to mean) web fonts are to be cached for cross-site use, identified by src:local(...) names in @font-face rules and not served from a centralized location, isn't there some risk of the wrong resource ending up cached under a given local(...) key?

I think there's only a little risk now but more risk over time.

To avoid the browser caching totally bogus fonts, the set of fonts from which the popular subset is chosen for eager caching would need to be vetted for licensing. (I was assuming that for Chrome the vetted repository would be the Google Fonts set of original fonts that Google Fonts builds its subset fonts from.) So there wouldn't be a risk of getting a garbage font. The risk would be limited to getting a wrong font from the repository.

Right now, if a site specifies a different font (as in different design) in local() and url(), local() could match when the user has the font installed, since locally-installed fonts aren't blocked. Therefore, the site has to be OK with the outcome of local() matching, so it's not really breakage.

A more realistic case than Zapfino would be Montserrat, which Fedora bundles, which is the fourth family in the Google Fonts popularity ranking, and which would obviously be eligible licensing-wise for eager caching if browsers had a license-vetted set of candidate fonts. Do we really need to care if a designer did:

@font-face {
    font-family: differentforfedora;
    src: local("Montserrat"), /* use preinstalled Montserrat on Fedora */
         url("fonts/SomethingElse.woff"); /* alternative for other systems */
}

...and then ended up getting Montserrat on all systems? I think we don't need to treat that as breakage that we need to avoid.

So I think right now, local() is trustworthy in the sense that designers have to be OK with the possibility of it matching.

Once browsers block user-installed fonts by default, things get more complicated. local("Montserrat") would still match on Fedora, but e.g. Fira Sans, while popular, would no longer be matched if installed by the user.

So once browsers block user-installed fonts, malicious poisoning of the local() tallies becomes more feasible, but sites that do it would still risk being so successful at messing with the numbers that the site's own appearance would change.

hsivonen commented 4 years ago

Another failure mode would be if the first version encountered for a given "popular" font name happens to be a subsetted resource, which is then cached as the "canonical" resource the browser will use for that name.

For this reason, I think it's not feasible to automatically take files from the Web and expose them to different sites; instead, browsers would have to pull the eagerly-cached fonts from a vetted repository of unsubsetted, appropriately-licensed fonts.

hsivonen commented 4 years ago

For sites like mine, this poses a risk: if the original font looks bad with the Microsoft rasterizer and I've taken steps to make the copy my site serves look better with it, then specifying local() would make my site look bad on Windows whenever local() matches. But in this scenario, a bad appearance caused by specifying local() would be my fault even today. (So I don't specify it.)

jfkthame commented 4 years ago

Do we really need to care if a designer did: ... ...and then ended up getting Montserrat on all systems? I think we don't need to treat that as breakage that we need to avoid.

That wasn't my concern, but rather that a site's alternative fallback would potentially "poison" the cache such that other sites that use local("Montserrat") could end up getting an alternative that is not what they expect.

Because of this, it seems that the eagerly-cached fonts have to come from a vetted repository (i.e. centralization, which I'm not sure is good). Unless we maintain a list of the vetted resources and their checksums, in which case perhaps the browser could safely add them to its cross-site cache regardless of where it first encounters them?
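
A minimal Python sketch of the checksum idea follows, assuming the browser ships a list mapping each vetted font name to the SHA-256 digest of its canonical, unsubsetted file. The class name, method names, and sample bytes are hypothetical, and a real list would likely need to handle multiple versions per name.

```python
import hashlib


class VettedFontCache:
    """Cross-site cache that only admits resources matching a vetted digest."""

    def __init__(self, vetted_digests):
        # Maps font name -> expected SHA-256 hex digest of the canonical file.
        self._vetted = dict(vetted_digests)
        self._cache = {}

    def offer(self, name, data):
        """Cache `data` under `name` only if its digest is the vetted one."""
        expected = self._vetted.get(name)
        if expected is None:
            return False  # not on the vetted list at all
        if hashlib.sha256(data).hexdigest() != expected:
            return False  # substitute, subset, or otherwise altered file
        self._cache[name] = data
        return True

    def get(self, name):
        return self._cache.get(name)


# Placeholder bytes standing in for real font files.
CANONICAL = b"canonical Montserrat bytes (placeholder)"
SUBSTITUTE = b"some other font bytes (placeholder)"
vetted = {"Montserrat": hashlib.sha256(CANONICAL).hexdigest()}
```

With such a check, it would not matter which site the browser first fetched the file from: a substitute or subsetted file served as a url() fallback would be rejected, and only the canonical bytes would enter the cross-site cache.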

hsivonen commented 4 years ago

A browser vendor vetting a repository of fonts that are eligible to be treated as popular is a form of centralization, but it's not the same kind of tracking-enabling centralization as sites themselves pointing to a central CDN. It's more like CRLite. (But again, my opinion above is how I'd do the eager caching with minimally-bad incentives if eager caching is to be done, and I'm not saying that eager caching should be done.)

litherum commented 4 years ago

Identifying fonts just by name is a little scary because there are a gazillion different versions of popular fonts, and many of them are busted in a variety of ways.

w3cping / font-anti-fingerprinting

Identify fonts by name instead of URL #8