Script-aware fallback

raphlinus commented 5 years ago

Clearly one of the trickier aspects to integrating with a text layout system is script aware fallback. Issue #28 is one way, provide metadata. The problem is, it's not really supported by platform text libraries.

What is supported, and used by most web browsers, is an API for choosing a fallback font based on a Unicode code point and other data, generally locale preferences. In CoreText, that's the not-well-documented CTFontCopyDefaultCascadeListForLanguages. In DirectWrite, it's IDWriteFontFallback with IDWriteTextAnalysisSource. In discussions with @pcwalton and further thought, I think the best thing to do is to implement an abstraction for those APIs in font-kit.

This will change and increase the API surface of font-kit quite a bit. In particular, we need to adopt a locale representation, and have code to plumb that to platform API calls. I suggest we use fluent-locale-rs, as this has promise to be a standard across the Rust ecosystem.

Discussion is welcome, as is an offer to implement the functionality. However, I'm also quite willing to do it myself, especially as I expect this to interact with the design of the rest of skribo, including its API.

mikeday commented 5 years ago

What about for Fontconfig? We're facing this exact issue right now and throwing around various extensions to @font-face rules that might help us on systems that don't have a platform font stack.

Edit: oh I see you covered Fontconfig in script_matching.md

raphlinus commented 5 years ago

It's a good question. One of the issues I found in my research is that some systems (like my stock Debian) have thin config files that don't do CJK properly, and there are various blog posts on how to configure that. The Firefox approach, of having a configurable "pref" that defaults to hardcoding known fonts, should work expediently for users, but is another configuration point and will degrade as fonts change (and there's lots of activity with quality free CJK fonts, so I expect this to be an issue). I did address it on the skribo side, but since the work is planned for this repo, it's worth capturing.

mikeday commented 5 years ago

We have separate fonts.css for Linux, MacOS, and Windows, but yes that requires ongoing maintenance work to keep it up to date. Also CSS is not really the right format for this as @font-face rules only have one simple sorting mechanism so you can't easily express language/script-based preferences in conjunction with that, ah well. Not that users always use appropriate language tags in the first place! :laughing:

raphlinus commented 5 years ago

Agree with above. This is slightly beyond the scope of my work, but maybe with awareness of the script-specific issues (I might blog about it) we might push Linux distros to ship a more precise fontconfig.

And yeah, CSS is not quite adequate. When I make "font collection" objects in skribo, I have to think about how much of that functionality to export in builders, or to treat it pretty much like CSS, and relegate the script-aware stuff to the system fallback fonts. That also gets into how much of the "font collection" and "font family" concepts spill over from being skribo to font-kit data types. The next couple weeks are going to be fun, I predict.

raphlinus commented 5 years ago

I discussed this with @pcwalton and I think we've come up with an API that represents the intersection of the Windows and Mac based approaches. This API takes:

- A string.
- The base font (used for style matching).
- Additional explicit style parameters (weight, width, italic).
- A single locale.

It returns:

- An ordered list of fonts. Each font also has style synthesis (and maybe scale) metadata.
- A substring length, saying that the fallback is valid for that substring.
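To make the shape of this concrete, here is a rough Rust sketch of the inputs and outputs listed above. Every name in it (`FontFallback`, `fallback_for_str`, `StyleRequest`, `FallbackFont`, `FallbackResult`, `ToyFallback`, `itemize`) is hypothetical and not part of font-kit. The locale is a plain string only to keep the sketch dependency-free, and the substring length is assumed to be a byte offset into the UTF-8 input.

```rust
use std::ops::Range;

/// Explicit style parameters passed alongside the base font.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct StyleRequest {
    pub weight: f32,
    pub width: f32,
    pub italic: bool,
}

/// One returned font. `id` stands in for the unique font ID discussed in #40.
#[derive(Clone, Debug, PartialEq)]
pub struct FallbackFont {
    pub id: String,
    pub synthetic_bold: bool,
    pub synthetic_italic: bool,
    pub scale: f32,
}

/// Ordered fonts plus the byte length of the prefix they are valid for.
#[derive(Clone, Debug, PartialEq)]
pub struct FallbackResult {
    pub fonts: Vec<FallbackFont>,
    pub valid_len: usize,
}

/// Hypothetical abstraction over the platform fallback APIs.
pub trait FontFallback {
    fn fallback_for_str(
        &self,
        text: &str,
        base_font_id: &str,
        style: StyleRequest,
        locale: &str,
    ) -> FallbackResult;
}

/// Toy implementation for illustration only: one font for ASCII, one for the rest.
pub struct ToyFallback;

impl FontFallback for ToyFallback {
    fn fallback_for_str(&self, text: &str, _base: &str, _style: StyleRequest, _locale: &str) -> FallbackResult {
        let first_is_ascii = text.chars().next().map_or(true, |c| c.is_ascii());
        // The answer stays valid until the first character of the other kind.
        let valid_len = text
            .char_indices()
            .find(|(_, c)| c.is_ascii() != first_is_ascii)
            .map_or(text.len(), |(i, _)| i);
        let id = if first_is_ascii { "toy-latin" } else { "toy-other" };
        let font = FallbackFont {
            id: id.to_string(),
            synthetic_bold: false,
            synthetic_italic: false,
            scale: 1.0,
        };
        FallbackResult { fonts: vec![font], valid_len }
    }
}

/// Covers a whole string by querying again after each returned substring.
pub fn itemize<F: FontFallback>(
    source: &F,
    text: &str,
    base_font_id: &str,
    style: StyleRequest,
    locale: &str,
) -> Vec<(Range<usize>, Vec<FallbackFont>)> {
    let mut runs = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let rest = &text[start..];
        let result = source.fallback_for_str(rest, base_font_id, style, locale);
        let mut len = result.valid_len.min(rest.len());
        if len == 0 {
            // Always advance by at least one character.
            len = rest.chars().next().map_or(rest.len(), |c| c.len_utf8());
        }
        while !rest.is_char_boundary(len) {
            len += 1;
        }
        runs.push((start..start + len, result.fonts));
        start += len;
    }
    runs
}
```

A caller would loop as `itemize` does: query, consume the valid prefix, and query again from the next offset.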

This interface is quite a bit like IDWriteFontFallback::MapCharacters but with a couple important differences, to accommodate other platforms. Notably, it returns a list of fonts rather than a single font, and takes a single locale. On Windows, the implementation of this method is a very thin wrapper around MapCharacters.

On macOS, it can ignore the string and just return the result of CTFontCopyDefaultCascadeListForLanguages (but see below).

The guarantee is essentially this: the composition of the returned list of fonts will render the string up to the returned substring length. Further, the order of fonts returned will respect the requested locale (primarily used for Han unification).

The returned fonts should match in unique ID (see #40) so that existing loaded resources can be reused.

Rationale for taking a single locale

For correct handling of multi-locale fallback, it is the caller's responsibility to determine the most relevant locale for the string. In an extreme case, for "ल骨" and a locale list of "en-US, zh-CN, mr-IN", there should be two different queries, the first using "mr-IN" and the second "zh-CN", to get the Marathi-specific allographs and Han unification correct.
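As a rough illustration of what the caller's side could look like, the sketch below splits a string into script runs and picks one locale per run from the user's locale list. It is a simplification under loud assumptions: the script ranges are two hard-coded Unicode blocks rather than the real Script property, and `script_for_language` is a made-up lookup table. All function and type names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Script {
    Devanagari,
    Han,
    Other,
}

/// Block-based approximation; a real implementation would use the Unicode Script property.
pub fn script_of(c: char) -> Script {
    match c as u32 {
        0x0900..=0x097F => Script::Devanagari,
        0x4E00..=0x9FFF => Script::Han, // CJK Unified Ideographs block only
        _ => Script::Other,
    }
}

/// Hypothetical language-to-script table, just enough for the example.
fn script_for_language(lang: &str) -> Script {
    match lang {
        "mr" | "hi" | "ne" => Script::Devanagari,
        "zh" | "ja" => Script::Han,
        _ => Script::Other,
    }
}

/// First locale in the list whose language uses `script`, else the first locale.
pub fn locale_for_script<'a>(script: Script, locales: &[&'a str]) -> Option<&'a str> {
    locales
        .iter()
        .copied()
        .find(|loc| {
            let lang = loc.split('-').next().unwrap_or("");
            script != Script::Other && script_for_language(lang) == script
        })
        .or_else(|| locales.first().copied())
}

/// Splits `text` into same-script runs, each paired with the locale to query with.
pub fn split_by_locale<'t, 'l>(
    text: &'t str,
    locales: &[&'l str],
) -> Vec<(&'t str, Option<&'l str>)> {
    let mut runs = Vec::new();
    let mut run_start = 0;
    let mut current: Option<Script> = None;
    for (i, c) in text.char_indices() {
        let s = script_of(c);
        match current {
            Some(prev) if prev == s => {}
            Some(prev) => {
                runs.push((&text[run_start..i], locale_for_script(prev, locales)));
                run_start = i;
                current = Some(s);
            }
            None => current = Some(s),
        }
    }
    if let Some(prev) = current {
        runs.push((&text[run_start..], locale_for_script(prev, locales)));
    }
    runs
}
```

For "ल骨" with the list "en-US, zh-CN, mr-IN", this yields one run queried with "mr-IN" and one queried with "zh-CN", matching the two queries described above.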

We could take multiple locales, which might increase accuracy in some cases, but would also increase the risk of differences in behavior across platforms. For example, the above string might render correctly on Windows, which can handle the locale list, but then apply neither the Marathi nor the Simplified Chinese rules. (Note: allographs are usually applied with the 'locl' feature in Devanagari OpenType fonts, as opposed to font selection, but the principle stands).

Using script hints for filtering

A typical macOS device will have more than 30 fonts in the cascade list. To improve efficiency (as there is a cost to considering each font in the list), the implementation may well filter the list. I think a good approach is to use script to rule out fonts - for example, if the source text contains no code points in the range U+0900 to U+097F, then any Devanagari fonts may be excluded. Which fonts are script-specific might be determined from well-known font names, analysis of cmap coverage, or other sources.

Alternatively, the caller might do this filtering; we'll need to see which is cleaner and more efficient. In any case, the API should allow it.
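A minimal sketch of the rule-out filter, whichever side ends up owning it. `CascadeEntry` and `filter_cascade` are hypothetical names, and the per-font `exclusive_range` metadata is assumed to come from one of the sources mentioned above (font names, cmap analysis, and so on). Fonts without that metadata are never excluded.

```rust
/// One font in the cascade list, with optional "only useful for this range" metadata.
#[derive(Clone, Debug, PartialEq)]
pub struct CascadeEntry {
    pub family: String,
    /// Inclusive code point range the font exists to serve; `None` means never rule it out.
    pub exclusive_range: Option<(u32, u32)>,
}

/// Keeps cascade order, dropping script-specific fonts whose range the text never touches.
pub fn filter_cascade(cascade: &[CascadeEntry], text: &str) -> Vec<CascadeEntry> {
    cascade
        .iter()
        .filter(|entry| match entry.exclusive_range {
            None => true,
            Some((lo, hi)) => text.chars().any(|c| (lo..=hi).contains(&(c as u32))),
        })
        .cloned()
        .collect()
}
```

The filter only ever removes fonts, so the relative order from the platform cascade list is preserved.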

Android

It should be possible to basically just list the contents of fonts.xml, using the locale to prioritize script matches; this would be quite similar to macOS.

The number of fonts is large, however, so it might be even more important to do script-based filtering as above. Fortunately, the lang attributes explicitly identify the script, so it should be possible to identify the script coverage reliably.

Note also that the fonts.xml file has a dire warning that this mechanism is going away soon. We should try to get input from the Android team on what NDK-friendly mechanism will replace it. @nona-google can you comment?

Linux/FontConfig

I think the API sketched above can work using FontConfig, but I haven't dug into the details. I'd love to have help with this also.

nona-google commented 5 years ago

Yes, fonts.xml is dying.

We are now proposing a new API for Java and NDK, named SystemFonts. https://developer.android.com/reference/android/graphics/fonts/SystemFonts

The equivalent APIs will be in the NDK too, but there is no public documentation yet. (Please note that we are still refining the API surface. These draft APIs may be different from the final released version).

raphlinus commented 5 years ago

@nona-google Thanks, this is very helpful! In particular, it looks like the getLocaleList method might be useful for doing the script-based filtering as mentioned above.

nona-google commented 5 years ago

It is not clear to me how you would implement the script-based filter, but please note that the returned locale does not completely reflect the font's cmap coverage. For example, NotoSansCJK covers Latin characters and some symbols, but we don't attach the Latn or Zsym script to that font. This is because we only give extra score based on the locale info. Android filters by coverage first, then gives extra score based on locale preference, as you know well :) So filtering out fonts only by the script returned by getLocaleList() sounds too aggressive to me.
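The coverage-first, locale-second ordering described here can be sketched roughly as follows. This is a heavily simplified illustration and not the actual minikin logic: `FontInfo`, `covers`, and `pick_font` are hypothetical names, coverage is a list of code point ranges standing in for the cmap, and the locale score is a plain exact-match bonus.

```rust
/// Simplified stand-in for a system font entry.
pub struct FontInfo {
    pub name: String,
    /// Inclusive code point ranges, standing in for real cmap coverage.
    pub coverage: Vec<(u32, u32)>,
    /// Locale tags attached to the font (used for scoring only, never for filtering).
    pub locales: Vec<String>,
}

fn covers(font: &FontInfo, c: char) -> bool {
    let cp = c as u32;
    font.coverage.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}

/// Step 1: keep only fonts that cover `c`. Step 2: prefer a locale match among those.
pub fn pick_font<'a>(fonts: &'a [FontInfo], c: char, preferred_locale: &str) -> Option<&'a FontInfo> {
    let mut best: Option<(&FontInfo, u32)> = None;
    for font in fonts.iter().filter(|f| covers(f, c)) {
        let score = if font.locales.iter().any(|l| l == preferred_locale) { 1 } else { 0 };
        // Strictly-greater keeps the earliest font on ties, preserving list order.
        if best.map_or(true, |(_, s)| score > s) {
            best = Some((font, score));
        }
    }
    best.map(|(f, _)| f)
}
```

With this ordering, a CJK font tagged only with a CJK locale can still be chosen for a Latin character it covers, which is the case a filter based purely on the locale's script would wrongly exclude.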

BTW, if you are looking for font matching logic, we will also provide font matching API in NDK (not for Java, sorry). This is actually exposing minikin::FontCollection::itemize function.

I want to point the API documentation but not yet in public :/

nona-google commented 5 years ago

Sorry, even in ndk-r19, the relevant information is not included. I'll notify you once the NDK API documents are public.

raphlinus commented 5 years ago

I agree, for Latn and Zsym in particular (also Zyyy) it's very difficult to use script coverage to choose fonts. This is why I'm suggesting we use script metadata to rule certain fonts out, very especially the long tail of seldom-used scripts. Likely you'd never rule out the CJK fonts, because the question of script coverage for them is so complex. But Sharada and SoraSompeng I think you can exclude safely unless there are U+111xx or U+110xx code points, respectively. I think we will need to do further experimentation, but I also think it's likely we can get the list of fonts down from many dozens.

Having the itemize function available might also be a good approach - as you can see this is close to what we plan for Windows. We can't use it for everything (among other things, I don't see how to make it support unicode-range for Web fonts), but when we get to the fallback chain having it choose the font for a given string might well be practical and efficient. I look forward to seeing the NDK.

mikeday commented 5 years ago

Presumably you could also pass in a locale and the empty string and get a suitable list of fonts back?

raphlinus commented 5 years ago

@mikeday I like the idea but don't know how to implement that on Windows. Related, I'm not sure how best to implement this on old Windows - the IDWriteFontFallback API is 8.1+.

pcwalton commented 5 years ago

@raphlinus Yikes, the Windows 8.1 dependence will be a problem for Servo eventually. We don't have to fix this now, but we will need to come up with a solution for Windows 7 at some point.

raphlinus commented 5 years ago

I should do more research on Windows 7, but at the absolute worst, hard-coding a list of system fonts along with metadata (so it knows to put Meiryo at the front for ja-JP locale) should work at least ok. The biggest question is whether the API proposed above is valid. My sense is that it is, but if we're uncertain about 7 and want more confidence, lemme know and I'll dig in.
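For reference, the worst-case hard-coded approach could be as small as the sketch below. The table contents, the default order, and the naive locale prefix matching are all illustrative assumptions, and `hardcoded_cjk_fallback` is a hypothetical name; a real table would need more metadata and more scripts.

```rust
/// Illustrative table of (locale tag, family name); order is the default fallback order.
const CJK_TABLE: [(&str, &str); 4] = [
    ("zh-CN", "Microsoft YaHei"),
    ("zh-TW", "Microsoft JhengHei"),
    ("ja", "Meiryo"),
    ("ko", "Malgun Gothic"),
];

/// Returns the hard-coded list with the font matching `locale` moved to the front.
pub fn hardcoded_cjk_fallback(locale: &str) -> Vec<&'static str> {
    let mut fonts: Vec<&'static str> = CJK_TABLE.iter().map(|(_, family)| *family).collect();
    // Naive matching: exact tag, or tag followed by a subtag (so "ja-JP" matches "ja").
    let matched = CJK_TABLE
        .iter()
        .position(|(tag, _)| locale == *tag || locale.starts_with(&format!("{}-", tag)));
    if let Some(pos) = matched {
        let preferred = fonts.remove(pos);
        fonts.insert(0, preferred);
    }
    fonts
}
```

A locale that matches nothing in the table (for example en-US) just gets the default order back.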

raphlinus commented 5 years ago

Research on Windows 7 below (it was bothering me, obviously).

Qt has a hardcoded list of fonts and includes the Han unification logic in there (there's a separate font list for each of the CJK locales). It doesn't have any script awareness beyond CJK. I haven't researched how it deals with the non-CJK fonts.

Blink calls into Uniscribe for this, using quite a bit of trickery - it creates a metafile DC so it can intercept the font request, then uses ScriptStringOut. It calls this when the request for the IDWriteFactory2 fails. Based on a quick read of the code, I think the approach is workable - like the MapCharacters approach, it will yield a single font. I'm not 100% sure how to wire up the locale, but at the very least it should be sensitive to system settings, which might be Good Enough for many cases.

In looking through the Blink codebase, I also found code to query the registry for font linking. Again, I don't fully understand how this works, but it might be worth investigating further.

Current Firefox is probably most similar to the Qt approach: it doesn't appear to do deep queries into the platform, but rather uses the pref system, which is pre-initialized with known font names.

raphlinus commented 5 years ago

Looking more closely at DirectWrite, I think another possibility for Win7+ is to create a TextLayout object, then pass a custom TextRenderer to the Draw method, gathering the font references in the callback. On further search, I see Gecko doing this, but for a reason I can't determine, it doesn't wire up its aRunScript argument to set the locale. Maybe @jfkthame can shed more light on why?

jfkthame commented 5 years ago

> I see Gecko doing this, but for a reason I can't determine, it doesn't wire up its aRunScript argument to set the locale. Maybe @jfkthame can shed more light on why?

I don't have any particular insight into why. This was implemented in https://bugzilla.mozilla.org/show_bug.cgi?id=705594, and I can only assume that it seemed to work adequately with just passing the hardcoded en-US locale; but it may not have been tested in sufficient depth that locale-dependent differences would show up.

I'd guess the likeliest place for a dependency on the locale argument to show up would be with Han-unified characters, selecting between fonts preferred for ja/zh-Hans/zh-Hant. But within Firefox, this is likely to be handled by the prefs-based font fallback setup, and so this "last-ditch" fallback won't apply. Which could explain why the lack of locale support here hasn't been an issue in practice, even though it does in theory look like a shortcoming.

raphlinus commented 5 years ago

@jfkthame This seems likely to me, yes. The CJK stuff would have already been covered by pref, so this seems most effective in the long tail fonts like Ebrima.

I think this (custom renderer) should be the plan of record for Windows 7, and we can have a separate issue for it.

Also thanks for the bug link, it makes for interesting reading!

servo / font-kit

Script-aware fallback #37
