go-text / typesetting

High quality text shaping in pure Go.

fontscan: how to combine ResolveFace and ResolveFaceForLang? #139

Open dominikh opened 3 months ago

dominikh commented 3 months ago

ResolveFace returns the first face that covers a given rune, while ResolveFaceForLang returns the first face that covers a given language. But how do I find the first face that covers a given rune in a given language?

For example, we might have two fonts, cn0-4 and cn4-8, that cover disjoint sets of runes for Traditional Chinese, and two fonts, jp0-4 and jp4-8, that cover the same runes as the Chinese fonts, but for Japanese. The four fonts are registered in the order cn0-4, cn4-8, jp0-4, jp4-8.

I cannot just look for "rune 5", nor for "japanese" to find jp4-8. The first search would find cn4-8, and the second search would find jp0-4.

This also impacts shaping.SplitByFace, which currently discards language information.
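The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the face type, its half-open rune range, and resolveFaceForRuneAndLang are hypothetical and are not part of the fontscan API. The sketch prefers the first face that both covers the rune and matches the language, and falls back to the first face that merely covers the rune.

```go
package main

import "fmt"

// face is a hypothetical stand-in for a registered font face.
// It is NOT a go-text type; it only models the example from this issue.
type face struct {
	name   string
	lang   string // language the font was designed for (assumed metadata)
	lo, hi rune   // covered runes, as the half-open range [lo, hi)
}

// covers reports whether the face has a glyph for r.
func (f face) covers(r rune) bool { return f.lo <= r && r < f.hi }

// resolveFaceForRuneAndLang (hypothetical) returns the first face, in
// registration order, that covers r and matches lang. If no such face
// exists, it falls back to the first face that covers r in any language.
func resolveFaceForRuneAndLang(faces []face, r rune, lang string) (face, bool) {
	var fallback *face
	for i := range faces {
		f := &faces[i]
		if !f.covers(r) {
			continue
		}
		if f.lang == lang {
			return *f, true
		}
		if fallback == nil {
			fallback = f
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return face{}, false
}

func main() {
	// Registration order from the example: cn0-4, cn4-8, jp0-4, jp4-8.
	faces := []face{
		{"cn0-4", "zh-Hant", 0, 4},
		{"cn4-8", "zh-Hant", 4, 8},
		{"jp0-4", "ja", 0, 4},
		{"jp4-8", "ja", 4, 8},
	}
	if f, ok := resolveFaceForRuneAndLang(faces, 5, "ja"); ok {
		fmt.Println(f.name) // prints "jp4-8"
	}
}
```

A rune-only search for 5 would stop at cn4-8, and a language-only search for Japanese would stop at jp0-4; checking both conditions in a single pass is what selects jp4-8.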

dominikh commented 3 months ago

(It's probably also worth documenting that ResolveFaceForLang just maps languages to rune sets and uses those for the lookup; it doesn't consult the metadata of the fonts. The ideal combination of resolving by rune and language should probably also consult the font's locl feature; though really, segmenting by face should use grapheme clusters, not individual runes, and also handle Unicode normalization, etc.)

benoitkugler commented 3 months ago

Hmm, I was not aware that the same Unicode code point may have different glyph presentations depending on the language. Do you have some examples of fonts and languages that have this behavior?

Perhaps this issue would be resolved by rules (like the ones used by fontconfig) such as "for given language and family, use this family instead of that one" (related to #82).
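A rule of the kind mentioned above could look roughly like the sketch below. It is a minimal illustration, not fontconfig's actual rule syntax and not an existing go-text API: familyRule and substituteFamily are hypothetical names, and the rule contents are made-up examples.

```go
package main

import "fmt"

// familyRule is a hypothetical substitution rule: when text in language
// lang requests family, use replacement instead.
type familyRule struct {
	lang        string
	family      string
	replacement string
}

// substituteFamily (hypothetical) returns the replacement from the first
// rule matching both lang and family, or family unchanged if none matches.
func substituteFamily(rules []familyRule, lang, family string) string {
	for _, r := range rules {
		if r.lang == lang && r.family == family {
			return r.replacement
		}
	}
	return family
}

func main() {
	// Example rules only; a real configuration would come from the user.
	rules := []familyRule{
		{lang: "ja", family: "sans-serif", replacement: "Noto Sans CJK JP"},
		{lang: "zh-TW", family: "sans-serif", replacement: "Noto Sans CJK TC"},
	}
	fmt.Println(substituteFamily(rules, "ja", "sans-serif")) // prints "Noto Sans CJK JP"
	fmt.Println(substituteFamily(rules, "fr", "sans-serif")) // prints "sans-serif"
}
```

Such rules would run before face resolution, so the language influences which family is searched first rather than which runes are looked up.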

It's true that the segmentation process is limited, because we rely on HarfBuzz for normalization and cluster handling, which is a rather complex topic. I'm not sure how hard it would be to extract the HarfBuzz logic and apply it during segmentation.

dominikh commented 3 months ago

> Hmm, I was not aware that the same Unicode code point may have different glyph presentations depending on the language. Do you have some examples of fonts and languages that have this behavior?

The most famous example is Han unification (https://en.wikipedia.org/wiki/Han_unification). It also sometimes happens for different languages using Cyrillic. For Han, if you're not using a pan-CJK font like Noto Sans CJK, you will have different fonts for Japanese Kanji and Chinese Han. There are even regional differences: mainland China, Taiwan, Hong Kong, and Singapore each use slightly different glyph forms for the same code points.