go-text / typesetting

High quality text shaping in pure Go.

Implement font language coverage and querying #95

Closed: whereswaldon closed this issue 7 months ago

whereswaldon commented 11 months ago

This has been discussed previously (see the quoted comment at the end of this post).

I've discovered what I think is a compelling use case for including language info in footprints: determining the "primary" font of a piece of text. I'd like to set the default line height for a paragraph to the builtin line height of the primary font in the text. However, it's difficult to define "primary" font because you can't know a priori what font will be used. It's chosen based on aspect and the codepoints in the text.

You could consider using heuristics like the most-frequently-occurring font within the text, but you can also trivially create pathological cases that defeat such logic.

The best option I've been able to devise is this:

the "primary" font is the font that will be selected for the given query term when shaping the user's system language (in a UI context)

This should result in a stable choice of primary font that will work well with the rest of an application's UI.

Implementing this is tricky as there isn't a good mapping between system language.Language and language.Script (and as far as I understand, there cannot be). For that reason, I think the only way to query "which font will be used for this query when displaying the system language" is to expose a FontMap ResolveFace-like operation that acts on languages instead of runes. This, in turn, requires our footprints to carry supported languages so that they can be efficiently queried.

@benoitkugler Does this make sense and seem like a good approach? I'm happy to work on this feature, or split the work between us if you have a concrete idea of how to implement parts of this.


> By the way, I think there is one change we may safely forecast: adding the set of languages supported by a font. It may be deduced from the runeSet but, like the scriptSet, it is quite expensive to compute, so I guess we would rather store it on the footprint.
>
> I should be able to post a PR soon if you think we should add it before the index format is widely used.

Originally posted by @whereswaldon in https://github.com/go-text/typesetting/issues/87#issuecomment-1629478341

benoitkugler commented 11 months ago

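Interesting approach! Let me start with the easy answer:

> I'm happy to work on this feature, or split the work between us if you have a concrete idea of how to implement parts of this.

I do have a precise idea for the "footprint part" (inspired by fontconfig):

* collect, for every common language, a representative sample (as a string)

* for performance reasons, choose a mapping from these languages to a byte (there are fewer than 256 languages)

* then, for each font, collect the runeSet, and filter the languages, keeping only the ones whose sample is included in the runeSet.

A langSet would be represented by a bit set: type langSet [8]uint32. This would require mapping language.Language to the internal byte code, but it would save quite a bit of space in the index.

Having said that, I'm not sure I understand exactly what the ResolveForLang algorithm would be. Could you elaborate on that part?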
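For illustration, the bit set proposed above could look like the following minimal sketch. The langID type and the add/contains method names are assumptions made for this example, not the names used in the library.

```go
package main

import "fmt"

// langID is a hypothetical byte-sized code for a language (at most 256 values).
type langID uint8

// langSet is a 256-bit set: bit i is set when the language with code i is supported.
type langSet [8]uint32

// add marks the language as supported.
func (ls *langSet) add(l langID) {
	ls[l>>5] |= uint32(1) << (l & 31)
}

// contains reports whether the language is in the set.
func (ls langSet) contains(l langID) bool {
	return ls[l>>5]&(uint32(1)<<(l&31)) != 0
}

func main() {
	var ls langSet
	ls.add(3)
	ls.add(200)
	fmt.Println(ls.contains(3), ls.contains(4), ls.contains(200))
}
```

The whole set takes 32 bytes per font, regardless of how many languages the font supports.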

whereswaldon commented 11 months ago

> Interesting approach! Let me start with the easy answer:
>
> > I'm happy to work on this feature, or split the work between us if you have a concrete idea of how to implement parts of this.
>
> I do have a precise idea for the "footprint part" (inspired by fontconfig):
>
> * collect, for every common language, a representative sample (as a string)
>
> * for performance reasons, choose a mapping from these languages to a byte (there are fewer than 256 languages)
>
> * then, for each font, collect the runeSet, and filter the languages, keeping only the ones whose sample is included in the runeSet.
>
> A langSet would be represented by a bit set: type langSet [8]uint32. This would require mapping language.Language to the internal byte code, but it would save quite a bit of space in the index.

Sounds good to me! Just curious, is this language metadata unreliable?
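The sampling steps quoted above can be sketched roughly as follows. This is only an illustration: the runeSet here is a plain map standing in for the real footprint type, and the language codes and sample strings are made up for the example.

```go
package main

import "fmt"

// runeSet is a simplified stand-in for a font's set of supported runes.
type runeSet map[rune]bool

// langSample pairs a hypothetical byte-sized language code with a
// representative sample string. The entries below are illustrative only.
type langSample struct {
	id     uint8
	sample string
}

var samples = []langSample{
	{id: 0, sample: "abcdefghijklmnopqrstuvwxyz"},
	{id: 1, sample: "абвгдежз"},
}

// covers reports whether every rune of the sample is in the set.
func covers(rs runeSet, sample string) bool {
	for _, r := range sample {
		if !rs[r] {
			return false
		}
	}
	return true
}

// supportedLangs keeps only the languages whose sample is fully covered.
func supportedLangs(rs runeSet) []uint8 {
	var out []uint8
	for _, ls := range samples {
		if covers(rs, ls.sample) {
			out = append(out, ls.id)
		}
	}
	return out
}

func main() {
	latin := runeSet{}
	for _, r := range "abcdefghijklmnopqrstuvwxyz" {
		latin[r] = true
	}
	fmt.Println(supportedLangs(latin))
}
```

The result would be computed once per font when building the index and stored on the footprint, so that queries never need to rescan the rune set.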

> Having said that, I'm not sure I understand exactly what the ResolveForLang algorithm would be. Could you elaborate on that part?

My goal would be to identify the font matching the current query that would be used to display a given language. A simple implementation would perform the same steps as ResolveFace except would stop at the first face supporting the target language instead of testing for support of a particular rune.

Perhaps this is a bad idea, but I can't think of another way to identify the font that the user will expect to be primary within the text. If you know of other approaches, please share them. :D
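To make the idea concrete, here is a minimal sketch of such a lookup. The candidate type, the resolveForLang name, and the font names are all hypothetical; the real FontMap orders its candidates according to the query, which this example simply takes as given.

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a face and its footprint.
type candidate struct {
	name  string
	langs map[string]bool // languages the footprint reports as supported
}

// resolveForLang walks the candidates in their existing priority order
// (assumed to be already sorted for the current query) and returns the
// name of the first one that supports the target language.
func resolveForLang(candidates []candidate, lang string) (string, bool) {
	for _, c := range candidates {
		if c.langs[lang] {
			return c.name, true
		}
	}
	return "", false
}

func main() {
	candidates := []candidate{
		{name: "UI Sans", langs: map[string]bool{"en": true, "fr": true}},
		{name: "CJK Fallback", langs: map[string]bool{"en": true, "ja": true}},
	}
	name, ok := resolveForLang(candidates, "ja")
	fmt.Println(name, ok)
}
```

Because the loop tests a language rather than a rune, the answer does not depend on the text being shaped, which is what makes the choice of primary font stable.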

benoitkugler commented 11 months ago

> My goal would be to identify the font matching the current query that would be used to display a given language. A simple implementation would perform the same steps as ResolveFace except would stop at the first face supporting the target language instead of testing for support of a particular rune.
>
> Perhaps this is a bad idea, but I can't think of another way to identify the font that the user will expect to be primary within the text. If you know of other approaches, please share them. :D

Thanks for the details, I get it now. We basically want to compute the intersection, over the runes used in a given language, of the fonts supporting these runes. That makes sense!
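Read literally, that intersection could be computed as in the sketch below. The font type and the sample text are assumptions for the example; a real implementation would presumably consult the precomputed language set instead of scanning runes at query time.

```go
package main

import "fmt"

// font is a hypothetical stand-in pairing a name with its rune coverage.
type font struct {
	name  string
	runes map[rune]bool
}

// runesOf builds a rune coverage map from a string.
func runesOf(s string) map[rune]bool {
	m := make(map[rune]bool)
	for _, r := range s {
		m[r] = true
	}
	return m
}

// fontsCovering intersects, over every rune of the language sample,
// the set of fonts supporting that rune. Input order is preserved.
func fontsCovering(fonts []font, sample string) []string {
	alive := make([]bool, len(fonts))
	for i := range alive {
		alive[i] = true
	}
	for _, r := range sample {
		for i, f := range fonts {
			if alive[i] && !f.runes[r] {
				alive[i] = false // this font drops out of the intersection
			}
		}
	}
	var out []string
	for i, f := range fonts {
		if alive[i] {
			out = append(out, f.name)
		}
	}
	return out
}

func main() {
	fonts := []font{
		{name: "basic", runes: runesOf("abcdef")},
		{name: "extended", runes: runesOf("abcdef\u00e9")},
	}
	fmt.Println(fontsCovering(fonts, "caf\u00e9"))
}
```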

benoitkugler commented 11 months ago

> Just curious, is this language metadata unreliable?

This list would give you the languages, but we would also need a sample for each one. Besides, this table is used internally by harfbuzz to convert a regular language.Language to the internal opentype tag.

benoitkugler commented 11 months ago

> I'm happy to work on this feature, or split the work between us if you have a concrete idea of how to implement parts of this.

I'm away from a computer for a week, so I would be happy to let you work on it.

benoitkugler commented 11 months ago

Some databases of language text samples:

https://kermitproject.org/utf8.html#glass

https://gitlab.freedesktop.org/fontconfig/fontconfig/-/tree/main/fc-lang

I'm not sure what the licenses are, though.

whereswaldon commented 11 months ago

It appears that the fontconfig .orth files are available under this license:

# Copyright © 2002 Keith Packard
#
# Permission to use, copy, modify, distribute, and sell this software and its
# documentation for any purpose is hereby granted without fee, provided that
# the above copyright notice appear in all copies and that both that
# copyright notice and this permission notice appear in supporting
# documentation, and that the name of the author(s) not be used in
# advertising or publicity pertaining to distribution of the software without
# specific, written prior permission.  The authors make no
# representations about the suitability of this software for any purpose.  It
# is provided "as is" without express or implied warranty.
#
# THE AUTHOR(S) DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
# INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO
# EVENT SHALL THE AUTHOR(S) BE LIABLE FOR ANY SPECIAL, INDIRECT OR
# CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
# DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
# TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
# PERFORMANCE OF THIS SOFTWARE.
#

I'm honestly unsure if us parsing their .orth library and building our own code out of it requires us to carry this license. The .orth files are purely data (they contain ranges of codepoints used by each language). It's hard for me to reason about how normal software licensing applies in this case, since we're not modifying "the software." To be defensive, we could carry this license on the single source code file we generated from the .orth library. It's not very restrictive.
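Since the files are just codepoint ranges, generating code from them should be mostly mechanical. The sketch below parses a simplified version of the format. It assumes each data line is either a single hexadecimal codepoint or a start-end range, with # starting a comment, and it skips include lines rather than following them. The real files may use syntax not handled here, and the input in main is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// runeRange is an inclusive range of codepoints.
type runeRange struct{ lo, hi rune }

// parseOrth reads a simplified .orth-like text (an assumed subset of the
// format) and returns the codepoint ranges it lists.
func parseOrth(src string) ([]runeRange, error) {
	var out []runeRange
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // drop the comment
		}
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "include") {
			continue // includes are not followed in this sketch
		}
		loStr, hiStr, isRange := strings.Cut(line, "-")
		lo, err := strconv.ParseUint(strings.TrimSpace(loStr), 16, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid codepoint %q: %w", loStr, err)
		}
		hi := lo
		if isRange {
			hi, err = strconv.ParseUint(strings.TrimSpace(hiStr), 16, 32)
			if err != nil {
				return nil, fmt.Errorf("invalid codepoint %q: %w", hiStr, err)
			}
		}
		out = append(out, runeRange{lo: rune(lo), hi: rune(hi)})
	}
	return out, sc.Err()
}

func main() {
	ranges, err := parseOrth("# made-up sample\n0041-005A\n00E9 # single codepoint\n")
	fmt.Println(ranges, err)
}
```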

benoitkugler commented 8 months ago

@whereswaldon I'll take a stab at generating the language samples from the fontconfig .orth files, if that's OK with you.

whereswaldon commented 8 months ago

Absolutely! I don't think I could tackle it myself for quite a while.