[BUG] emoji-picker fallback replaces **all** emoji with fallback versions, in most cases

ferdnyc commented 7 months ago

Describe the bug When browsing in the emoji-picker window using a non-default emoji font, in most cases activating the "Fallback" checkbox will replace ALL emoji with the ones from the default emoji font, not only the missing glyphs.

To Reproduce Steps to reproduce the behavior:

Open "Emoji Picker", with Fallback unchecked
Select a non-default font (I used "Twemoji") which has some coverage gaps
Select a category, such as "food", that contains missing emoji glyphs
Click the "Fallback" checkbox to enable fallback

Expected behavior Only the missing glyphs not present in the Twemoji font will be filled in with glyphs from the default Noto Color Emoji font.

Screenshots or videos

'food' category in Twemoji with fallback off

'food' category in Twemoji with fallback enabled

'food' category in Noto Color Emoji with fallback off

emoji-picker version? emoji-picker-2.25.3-1.fc39.noarch from Fedora repo

ibus version? Not applicable, but ibus-1.5.29-1.fc39.x86_64 from Fedora repo

Distribution and version? Fedora 39

Desktop and version? GNOME Shell 45.5

Xorg or Wayland? Wayland

Additional context

For some reason, this doesn't happen in the "regional" category — and only in the "regional" category.

'regional' category in Twemoji with fallback off

'regional' category in Twemoji with fallback on

'regional' category in Noto Color Emoji (fallback off)

ferdnyc commented 7 months ago

Interestingly, "Fallback" appears to function... well, ''differently'', if I select a non-emoji font, like Symbola, which does contain some Emoji characters. Then, activating Fallback does replace only the missing glyphs with their emoji presentations:

But that's actually kind of super weird, and I'm not sure it's how fallback is really supposed to work.

(Unicode TR51 defines ''fallback'' presentations for Emoji, but they're something different than font-fallback. One of the definitions they give for an emoji fallback presentation involves displaying a composed emoji as the individual emoji that make up the sequence, instead of the product of their composition.

For example, the :rainbow_flag: emoji is formed by composing the :white_flag: emoji and the :rainbow: emoji together using a Zero-Width Joiner. In implementations where :rainbow_flag: is unavailable, the fallback presentation would be to display :white_flag::rainbow:.)
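To make the TR51 notion of a fallback presentation concrete, here is a minimal sketch that splits a ZWJ sequence into the emoji it is composed of. The helper name `zwj_fallback_components` is made up for illustration; it only performs the string split and says nothing about what any font can render.

```python
from typing import List

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def zwj_fallback_components(sequence: str) -> List[str]:
    """Split a ZWJ sequence into the individual emoji it is composed of."""
    return [part for part in sequence.split(ZWJ) if part]


# rainbow flag = WHITE FLAG + VARIATION SELECTOR-16 + ZWJ + RAINBOW
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
print(zwj_fallback_components(rainbow_flag))
```

The two returned strings are the white flag (with its variation selector) and the rainbow, which is what a TR51-style fallback presentation would display side by side.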

mike-fabian commented 1 month ago

@ferdnyc

Test file to reproduce the problem only using pango:

pango-test.txt

$ cat ~/pango-test.txt
<span font="Twemoji 48" fallback="false">🤯😀🫨</span> Twemoji, fallback=false
<span font="Twemoji 48" fallback="true">🤯😀🫨</span> Twemoji, fallback=true
<span font="Symbola 48" fallback="false">😇︎🫩︎</span> Symbola, fallback=false
<span font="Symbola 48" fallback="true">😇︎🫩︎</span> Symbola, fallback=true
$

Running pango-view --markup ~/pango-test.txt gives the following result:

Screenshot

When fallback is true, even the emoji for which glyphs are available in the Twemoji font are replaced by glyphs from "Noto Color Emoji". But if the main font is Symbola, then the fallback to "Noto Color Emoji" happens only for the glyphs which are really lacking in the Symbola font.

mike-fabian commented 1 month ago

@ferdnyc

This used to work just as you expected, a few years ago. I also agree that this would be the correct behaviour.

I had already noticed a while ago that it didn't work anymore as it used to but had no time to investigate and then forgot about it again.

I didn't change the code in emoji-picker at all so I suspect that something either in Pango or fontconfig has changed.

emoji-picker does the same as shown in the test file above:

<span font="fontname size" fallback="true">some emoji</span>

And this does not work anymore as it used to work.

mike-fabian commented 1 month ago

Now I need to find out whether this is because of a change in Pango or in fontconfig ...

ferdnyc commented 1 month ago

Indeed, it does sound like Pango is the culprit.

I wonder if this is somehow related to your old bug https://gitlab.gnome.org/GNOME/pango/-/issues/289, or the related https://gitlab.gnome.org/GNOME/pango/-/issues/298 — both of which are still open, though the first one is at least somewhat addressed since your original report was:

With pango 1.40.12 and fontconfig from git master, it is not possible to choose the font used for emoji from pango anymore.

And obviously that's no longer the case. (Except when fallback is activated. Sometimes.)

ferdnyc commented 1 month ago

Hmf. Interestingly, on my Fedora 40 system I get this, when running the test file through pango-view --markup:

Hmm... probably because activating Fallback in Emoji Picker with Symbola selected doesn't fill in that glyph, either. I guess my default emoji font is missing it.

But this one works:

pango-test2.txt

ferdnyc commented 1 month ago

Hmf. Interestingly, on my Fedora 40 system I get this, when running the test file through pango-view --markup:

Ah, Fedora 40 is still on Noto Color Emoji 20231130. I updated to the 20241008 version from Rawhide, and all is well:

mike-fabian commented 1 month ago

Indeed, it does sound like Pango is the culprit.

I wonder if this is somehow related to your old bug https://gitlab.gnome.org/GNOME/pango/-/issues/289, or the related https://gitlab.gnome.org/GNOME/pango/-/issues/298 — both of which are still open, though the first one is at least somewhat addressed since your original report was:

With pango 1.40.12 and fontconfig from git master, it is not possible to choose the font used for emoji from pango anymore.

And obviously that's no longer the case. (Except when fallback is activated. Sometimes.)

Neither bug seems to be resolved in the way I would need for emoji-picker. The second one in particular does not seem to have been addressed at all.

Of course I want the old behaviour back from the time when I first implemented the font selection and the fallback option: I want to show as many glyphs as possible with the selected fonts and use fallback only for the glyphs which the selected font lacks. That would give the most useful information to the user, one would see easily which emoji are available in a font and which other font(s) one could use for the missing emoji.

So I would like to have that behaviour back.

I wonder whether there is any way to force that behaviour with the current pango and fontconfig ...

mike-fabian commented 1 month ago

U+1FAE8 🫨 shaking face
U+1F92F 🤯 shocked face with exploding head

$ fc-match "Twemoji:lang=und-zsye:charset=1fae8"
NotoColorEmoji.ttf: "Noto Color Emoji" "Regular"
$ fc-match "Twemoji:lang=und-zsye:charset=1f92f"
Twemoji.ttf: "Twemoji" "Regular"

So fontconfig’s fc-match does give us Twemoji when we request Twemoji and a code point for which Twemoji has a glyph, and falls back to Noto Color Emoji only when Twemoji lacks a glyph for the code point.

And then Pango overrides that result?

If I request the generic emoji family or no family at all, I always get Noto Color Emoji:

$ fc-match "emoji:lang=und-zsye:charset=1f92f"
NotoColorEmoji.ttf: "Noto Color Emoji" "Regular"
$ fc-match "emoji:lang=und-zsye:charset=1fae8"
NotoColorEmoji.ttf: "Noto Color Emoji" "Regular"
$ fc-match ":lang=und-zsye:charset=1fae8"
NotoColorEmoji.ttf: "Noto Color Emoji" "Regular"
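For repeating these fc-match checks from a script, the pattern strings can be built and passed to the command from Python. This is only a rough sketch: `fc_pattern` and `fc_match_family` are hypothetical helper names, and it assumes the family name contains no characters that would need escaping in a fontconfig pattern.

```python
import subprocess


def fc_pattern(family: str, codepoint: int, lang: str = "und-zsye") -> str:
    """Build a pattern such as 'Twemoji:lang=und-zsye:charset=1fae8'."""
    return f"{family}:lang={lang}:charset={codepoint:x}"


def fc_match_family(pattern: str) -> str:
    """Return the family fc-match picks for the pattern (needs fontconfig installed)."""
    result = subprocess.run(
        ["fc-match", "-f", "%{family}", pattern],
        capture_output=True, text=True, check=True)
    return result.stdout


print(fc_pattern("Twemoji", 0x1FAE8))
print(fc_pattern("", 0x1FAE8))  # no family requested at all
```

Calling `fc_match_family(fc_pattern("Twemoji", 0x1F92F))` would then run the same query as the second fc-match command above; the result depends on the fonts installed.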

mike-fabian commented 1 month ago

@ferdnyc I think theoretically I could implement a workaround as follows (only for emoji which are single code points!):

For each emoji to be displayed
   if the requested font has the emoji
       use <span font="font size" fallback="false">emoji</span>
       i.e. use fallback="false" always, no matter what the checkbox says.
   else
       use fallback as chosen by the checkbox

Whether a font has an emoji could be checked with fontconfig:

$ fc-list "Twemoji:lang=und-zsye:charset=1fae8"
$

No result, that means Twemoji does not have U+1FAE8.

$ fc-list "Twemoji:lang=und-zsye:charset=1f92f"
/usr/share/fonts/twemoji/Twemoji.ttf: Twemoji:style=Regular
$

Here we have a result, that means Twemoji does have U+1F92F.

This way I could avoid getting a fallback when it is not necessary because the requested font does have that glyph.
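The pseudocode above could look roughly like this in Python. It is a sketch under assumptions, not emoji-picker's actual code: `emoji_markup` is a made-up name, and `font_codepoints` is assumed to be a set of code points the requested font covers, obtained by some other means.

```python
from html import escape
from typing import Set


def emoji_markup(emoji: str, font: str, size: int,
                 font_codepoints: Set[int], checkbox_fallback: bool) -> str:
    """Build Pango markup, forcing fallback off when the font has the emoji.

    Only single code point emoji can be checked against the set; for
    anything longer the checkbox value is used unchanged.
    """
    in_font = len(emoji) == 1 and ord(emoji) in font_codepoints
    fallback = False if in_font else checkbox_fallback
    value = "true" if fallback else "false"
    return (f'<span font="{escape(font)} {size}" fallback="{value}">'
            f'{escape(emoji)}</span>')


covered = {0x1F92F}  # hypothetical coverage: has U+1F92F, lacks U+1FAE8
print(emoji_markup("\U0001F92F", "Twemoji", 48, covered, True))
print(emoji_markup("\U0001FAE8", "Twemoji", 48, covered, True))
```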

But there are several problems with that idea:

I would probably need to write my own Python interface to fontconfig (it looks like https://pypi.org/project/Python-fontconfig/ does not do what I would need)
It might cause a significant slowdown of emoji-picker when displaying a page like the “people” category, which has 606 emoji at the moment, if I need to do this extra check for every emoji
It still doesn't solve the problem for emoji which are not single code points but sequences; with fontconfig I can only check whether a font has a glyph for a code point

For example consider this emoji sequence:

🙂‍↕️ U+1F642 U+200D U+2195 U+FE0F “head shaking vertically”

Checking with fontconfig

$ fc-list "Twemoji:charset=1f642"
/usr/share/fonts/twemoji/Twemoji.ttf: Twemoji:style=Regular
$ fc-list "Twemoji:charset=200d"
/usr/share/fonts/twemoji/Twemoji.ttf: Twemoji:style=Regular
$ fc-list "Twemoji:charset=2195"
/usr/share/fonts/twemoji/Twemoji.ttf: Twemoji:style=Regular
$ fc-list "Twemoji:charset=fe0f"
$

So Twemoji has all the code points of the emoji (I think I could ignore whether a font has U+200D ZERO WIDTH JOINER or U+FE0F VARIATION SELECTOR-16, a font does not need to have glyphs for unprintable characters like these).

But even if I know that Twemoji does have glyphs for all the code points such a sequence is composed of, I still don’t know whether the font has a glyph for the whole sequence. fontconfig cannot answer this.
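The per-code-point check described here can be sketched as follows. `components_in_charset` is a hypothetical helper and `charset` is assumed to be a set of code points the font covers. As noted above, a `True` result is only a necessary condition: it does not prove the font has a glyph for the whole sequence.

```python
from typing import Set

# Joiner and variation selector: a font need not have glyphs for these.
IGNORABLE = {0x200D, 0xFE0F}


def components_in_charset(sequence: str, charset: Set[int]) -> bool:
    """True if every printable code point of the sequence is in charset."""
    return all(ord(char) in charset
               for char in sequence
               if ord(char) not in IGNORABLE)


head_shaking_vertically = "\U0001F642\u200D\u2195\uFE0F"
charset = {0x1F642, 0x200D, 0x2195}  # mirrors the fc-list results above
print(components_in_charset(head_shaking_vertically, charset))
```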

So I still don’t know what I could do here, I have no good idea for a workaround.

mike-fabian commented 1 month ago

Screenshot

ferdnyc commented 1 month ago

@mike-fabian

Whether a font has an emoji could be checked with fontconfig:
$ fc-list "Twemoji:lang=und-zsye:charset=1fae8"
$
No result, that means Twemoji does not have U+1FAE8.
$ fc-list "Twemoji:lang=und-zsye:charset=1f92f"
/usr/share/fonts/twemoji/Twemoji.ttf: Twemoji:style=Regular
$ 
Here we have a result, that means Twemoji does have U+1F92F.

This way I could avoid getting a fallback when is not necessary because the requested font does have that glyph.

But there are several problems with that idea:

I would probably need to write my own Python interface to fontconfig (it looks like https://pypi.org/project/Python-fontconfig/ does not do what I would need)

it might cause a significant slowdown emoji-picker when displaying a page like the “people” category which has 606 emoji at the moment if I need to do this extra check for every emoji

Well, I do have a workaround for part of that issue.

The same way fontconfig can be used to match fonts based on parameters like charset, it can also be used to query a font's available charsets.

If you use fc-list to get the path to a given font, you can fc-query that file to extract its data. And with a custom format (courtesy of the FcPatternFormat(3) syntax), properties can be expanded.

The list of all glyphs present in Twemoji, for example, is:

$ fc-query $(fc-list -f '%{file}' Twemoji) -f '%{[]charset{%{charset}}}' 
20 23 2a 30-39 a9 ae 200d 203c 2049 20e3 2122 2139 2194-2199 21a9-21aa 231a-231b 2328 23cf 23e9-23f3 23f8-23fa 24c2 25aa-25ab 25b6 25c0 25fb-25fe 2600-2604 260e 2611 2614-2615 2618 261d 2620 2622-2623 2626 262a 262e-262f 2638-263a 2640 2642 2648-2653 265f-2660 2663 2665-2666 2668 267b 267e-267f 2692-2697 2699 269b-269c 26a0-26a1 26a7 26aa-26ab 26b0-26b1 26bd-26be 26c4-26c5 26c8 26ce-26cf 26d1 26d3-26d4 26e9-26ea 26f0-26f5 26f7-26fa 26fd 2702 2705 2708-270d 270f 2712 2714 2716 271d 2721 2728 2733-2734 2744 2747 274c 274e 2753-2755 2757 2763-2764 2795-2797 27a1 27b0 27bf 2934-2935 2b05-2b07 2b1b-2b1c 2b50 2b55 3030 303d 3297 3299 e50a 1f004 1f0cf 1f170-1f171 1f17e-1f17f 1f18e 1f191-1f19a 1f1e6-1f1ff 1f201-1f202 1f21a 1f22f 1f232-1f23a 1f250-1f251 1f300-1f321 1f324-1f393 1f396-1f397 1f399-1f39b 1f39e-1f3f0 1f3f3-1f3f5 1f3f7-1f4fd 1f4ff-1f53d 1f549-1f54e 1f550-1f567 1f56f-1f570 1f573-1f57a 1f587 1f58a-1f58d 1f590 1f595-1f596 1f5a4-1f5a5 1f5a8 1f5b1-1f5b2 1f5bc 1f5c2-1f5c4 1f5d1-1f5d3 1f5dc-1f5de 1f5e1 1f5e3 1f5e8 1f5ef 1f5f3 1f5fa-1f64f 1f680-1f6c5 1f6cb-1f6d2 1f6d5-1f6d7 1f6dd-1f6e5 1f6e9 1f6eb-1f6ec 1f6f0 1f6f3-1f6fc 1f7e0-1f7eb 1f7f0 1f90c-1f93a 1f93c-1f945 1f947-1f9ff 1fa70-1fa74 1fa78-1fa7c 1fa80-1fa86 1fa90-1faac 1fab0-1faba 1fac0-1fac5 1fad0-1fad9 1fae0-1fae7 1faf0-1faf6 e0030-e0039 e0061-e007a e007f fe4e5-fe4ee fe82c fe82e-fe837

(Presented in a compact format that still forces you to parse out and expand ranges, but at least it's only a SINGLE call that will — with sufficient processing and expansion — net you the "needs fallback" state of every [single-codepoint] emoji in one fell swoop, rather than having to make 600+ separate calls.)

Still doesn't even begin to address the ZWJ-combined emoji issue; it feels like that information has to be stored SOMEWHERE in the font data, but I'm at a loss for where/how it would even be stored, never mind queried.

(For that matter, how are the glyphs for those combined emoji stored and accessed, when the appropriate sequence of code points has been encountered in a string and needs to be rendered?)

ferdnyc commented 1 month ago

Oh, actually you don't even need that complex expansion formatting — turns out it doesn't do anything.

This:

$ fc-query $(fc-list -f '%{file}' Twemoji) -f '%{[]charset{%{charset}}}' 
20 23 2a 30-39 a9 ae 200d 203c 2049 20e3 2122 2139 2194-2199 21a9-21aa 231a-231b 2328 23cf 23e9-23f3 23f8-23fa 24c2 25aa-25ab 25b6 25c0 25fb-25fe 2600-2604 260e 2611 2614-2615 2618 261d 2620 2622-2623 2626 262a 262e-262f 2638-263a 2640 2642 2648-2653 265f-2660 2663 2665-2666 2668 267b 267e-267f 2692-2697 2699 269b-269c 26a0-26a1 26a7 26aa-26ab 26b0-26b1 26bd-26be 26c4-26c5 26c8 26ce-26cf 26d1 26d3-26d4 26e9-26ea 26f0-26f5 26f7-26fa 26fd 2702 2705 2708-270d 270f 2712 2714 2716 271d 2721 2728 2733-2734 2744 2747 274c 274e 2753-2755 2757 2763-2764 2795-2797 27a1 27b0 27bf 2934-2935 2b05-2b07 2b1b-2b1c 2b50 2b55 3030 303d 3297 3299 e50a 1f004 1f0cf 1f170-1f171 1f17e-1f17f 1f18e 1f191-1f19a 1f1e6-1f1ff 1f201-1f202 1f21a 1f22f 1f232-1f23a 1f250-1f251 1f300-1f321 1f324-1f393 1f396-1f397 1f399-1f39b 1f39e-1f3f0 1f3f3-1f3f5 1f3f7-1f4fd 1f4ff-1f53d 1f549-1f54e 1f550-1f567 1f56f-1f570 1f573-1f57a 1f587 1f58a-1f58d 1f590 1f595-1f596 1f5a4-1f5a5 1f5a8 1f5b1-1f5b2 1f5bc 1f5c2-1f5c4 1f5d1-1f5d3 1f5dc-1f5de 1f5e1 1f5e3 1f5e8 1f5ef 1f5f3 1f5fa-1f64f 1f680-1f6c5 1f6cb-1f6d2 1f6d5-1f6d7 1f6dd-1f6e5 1f6e9 1f6eb-1f6ec 1f6f0 1f6f3-1f6fc 1f7e0-1f7eb 1f7f0 1f90c-1f93a 1f93c-1f945 1f947-1f9ff 1fa70-1fa74 1fa78-1fa7c 1fa80-1fa86 1fa90-1faac 1fab0-1faba 1fac0-1fac5 1fad0-1fad9 1fae0-1fae7 1faf0-1faf6 e0030-e0039 e0061-e007a e007f fe4e5-fe4ee fe82c fe82e-fe837

is actually identical to this:

$ fc-query $(fc-list -f '%{file}' Twemoji) -f '%{charset}' |fmt
20 23 2a 30-39 a9 ae 200d 203c 2049 20e3 2122 2139 2194-2199 21a9-21aa
231a-231b 2328 23cf 23e9-23f3 23f8-23fa 24c2 25aa-25ab 25b6 25c0 25fb-25fe
2600-2604 260e 2611 2614-2615 2618 261d 2620 2622-2623 2626 262a 262e-262f
2638-263a 2640 2642 2648-2653 265f-2660 2663 2665-2666 2668 267b 267e-267f
2692-2697 2699 269b-269c 26a0-26a1 26a7 26aa-26ab 26b0-26b1 26bd-26be
26c4-26c5 26c8 26ce-26cf 26d1 26d3-26d4 26e9-26ea 26f0-26f5 26f7-26fa
26fd 2702 2705 2708-270d 270f 2712 2714 2716 271d 2721 2728 2733-2734
2744 2747 274c 274e 2753-2755 2757 2763-2764 2795-2797 27a1 27b0 27bf
2934-2935 2b05-2b07 2b1b-2b1c 2b50 2b55 3030 303d 3297 3299 e50a 1f004
1f0cf 1f170-1f171 1f17e-1f17f 1f18e 1f191-1f19a 1f1e6-1f1ff 1f201-1f202
1f21a 1f22f 1f232-1f23a 1f250-1f251 1f300-1f321 1f324-1f393 1f396-1f397
1f399-1f39b 1f39e-1f3f0 1f3f3-1f3f5 1f3f7-1f4fd 1f4ff-1f53d 1f549-1f54e
1f550-1f567 1f56f-1f570 1f573-1f57a 1f587 1f58a-1f58d 1f590 1f595-1f596
1f5a4-1f5a5 1f5a8 1f5b1-1f5b2 1f5bc 1f5c2-1f5c4 1f5d1-1f5d3 1f5dc-1f5de
1f5e1 1f5e3 1f5e8 1f5ef 1f5f3 1f5fa-1f64f 1f680-1f6c5 1f6cb-1f6d2
1f6d5-1f6d7 1f6dd-1f6e5 1f6e9 1f6eb-1f6ec 1f6f0 1f6f3-1f6fc 1f7e0-1f7eb
1f7f0 1f90c-1f93a 1f93c-1f945 1f947-1f9ff 1fa70-1fa74 1fa78-1fa7c
1fa80-1fa86 1fa90-1faac 1fab0-1faba 1fac0-1fac5 1fad0-1fad9 1fae0-1fae7
1faf0-1faf6 e0030-e0039 e0061-e007a e007f fe4e5-fe4ee fe82c fe82e-fe837

(With wrapping added, this time, to keep things readable.)
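Expanding that compact output into something that can be queried is short in Python. A minimal sketch, assuming the input is exactly the whitespace-separated hex values and `start-end` ranges shown above; `parse_charset` is a made-up name.

```python
from typing import Set


def parse_charset(text: str) -> Set[int]:
    """Expand fc-query '%{charset}' output into a set of code points."""
    codepoints: Set[int] = set()
    for token in text.split():
        first, _, last = token.partition("-")
        start = int(first, 16)
        end = int(last, 16) if last else start
        codepoints.update(range(start, end + 1))
    return codepoints


# A few tokens copied from the Twemoji output above.
sample = parse_charset("20 23 2a 30-39 1f90c-1f93a 1fae0-1fae7")
print(0x1F92F in sample)  # inside the 1f90c-1f93a range
print(0x1FAE8 in sample)  # just past the 1fae0-1fae7 range
```

A single fc-query call plus this expansion gives a set that can be tested once per emoji, instead of one fc-list call per emoji.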

mike-fabian commented 1 month ago

$ fc-query $(fc-list -f '%{file}' Twemoji) -f '%{charset}' |fmt

That is a good idea, thank you very much!

But I still wonder whether implementing a workaround which only works for single code point emoji makes sense.

Should I do that? Maybe it is better than nothing.

Of course I would very much prefer to fix this for the emoji sequences as well but I have no idea how I could do that at the moment.

mike-fabian commented 1 month ago

Still doesn't even begin to address the ZWJ-combined emoji issue; it feels like that information has to be stored SOMEWHERE in the font data, but I'm at a loss for where/how it would even be stored, never mind queried.

(For that matter, how are the glyphs for those combined emoji stored and accessed, when the appropriate sequence of code points has been encountered in a string and needs to be rendered?)

I don’t know how exactly that works either at the moment.

mike-fabian commented 1 month ago

I made some limited progress using this:

from typing import List
from typing import Tuple
from typing import Dict
from typing import Any
import sys
from gi import require_version # type: ignore
require_version('Gtk', '3.0')
from gi.repository import Gtk # type: ignore
require_version('Pango', '1.0')
from gi.repository import Pango

def get_fonts_used_for_text(
        font: str, text: str, fallback: bool = True) -> List[Tuple[str, Dict[str, Any]]]:
    '''Return a list of fonts which were really used to render a text

    :param font: The font requested to render the text in
    :param text: The text to render
    :param fallback: Whether to enable font fallback. If disabled, then
                     glyphs will only be used from the closest matching
                     font on the system. No fallback will be done to other
                     fonts on the system that might contain the glyphs needed
                     for the text.

    Examples:

    (Don’t run CI checks regularly on these examples, it depends too much
    on the fonts installed on the system used to do the test)

    >>> get_fonts_used_for_text('DejaVu Sans Mono', '😀 ')
    [('😀', {'font': 'Noto Color Emoji', 'glyphcount': 1}), (' ', {'font': 'DejaVu Sans Mono', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('DejaVu Sans', '日本語 नमस्ते')
    [('日本語 ', {'font': 'Droid Sans Fallback', 'glyphcount': 4}), ('नमस्ते', {'font': 'FreeSans', 'glyphcount': 5})]

    >>> get_fonts_used_for_text('DejaVu Sans', '日本語 🕉️')
    [('日本語 ', {'font': 'Droid Sans Fallback', 'glyphcount': 4}), ('🕉️', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️')
    [('🙂\u200d↕️', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️', fallback=False)
    [('🙂\u200d↕️', {'font': 'Twemoji', 'glyphcount': 3})]

    “Twemoji” has no glyph for the flag of Sark (added in Unicode 16.0) but “Noto Color Emoji” has it.
    Even though “Twemoji” has no glyph for the flag of Sark, Pango renders the sequence of two
    code points (U+1F1E8 U+1F1F6) as one glyph when “Twemoji” is specified and fallback is not allowed
    (Visually the glyph shown appears empty, there is no “Tofu”):

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶', fallback=False)
    [('🇨🇶', {'font': 'Twemoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶')
    [('🇨🇶', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

    “Twemoji” does not have the glyph for this single code point emoji but “Noto Color Emoji” has it
    (visually the glyph shown when Twemoji is used is a “Tofu” block with the code point inside):

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=False)
    [('\U0001fae9', {'font': 'Twemoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=True)
    [('\U0001fae9', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

    Both “Twemoji” and “Noto Color Emoji” have the glyph for this single code point emoji,
    both render it well when inspected visually:

    >>> get_fonts_used_for_text('Twemoji', '🤥', fallback=False)
    [('🤥', {'font': 'Twemoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🤥', fallback=True)
    [('🤥', {'font': 'Noto Color Emoji', 'glyphcount': 1})]
    '''
    fonts_used = []
    text_utf8 = text.encode('UTF-8', errors='replace')
    label = Gtk.Label()
    pango_context = label.get_pango_context()
    pango_layout = Pango.Layout(pango_context)
    pango_font_description = Pango.font_description_from_string(font)
    pango_layout.set_font_description(pango_font_description)
    pango_attr_list = Pango.AttrList()
    pango_attr_fallback = Pango.attr_fallback_new(fallback)
    pango_attr_list.insert(pango_attr_fallback)
    pango_layout.set_attributes(pango_attr_list)
    pango_layout.set_text(text)
    pango_layout_line = pango_layout.get_line_readonly(0)
    gs_list = pango_layout_line.runs
    number_of_runs = len(gs_list)
    for glyph_item in gs_list:
        pango_item = glyph_item.item
        offset = pango_item.offset
        length = pango_item.length
        _num_chars = pango_item.num_chars
        pango_glyph_string = glyph_item.glyphs
        num_glyphs = pango_glyph_string.num_glyphs
        pango_analysis = pango_item.analysis
        pango_font = pango_analysis.font
        font_description_used = pango_font.describe()
        run_text = text_utf8[offset:offset + length].decode('UTF-8', errors='replace')
        run_family = font_description_used.get_family()
        fonts_used.append((run_text, {'font': run_family, 'glyphcount': num_glyphs}))
    return fonts_used

def _init() -> None:
    '''Initialization'''
    return

def _del() -> None:
    '''Cleanup'''
    return

class __ModuleInitializer: # pylint: disable=too-few-public-methods,invalid-name
    def __init__(self) -> None:
        _init()

    def __del__(self) -> None:
        return

if __name__ == "__main__":
    import doctest
    (FAILED, _ATTEMPTED) = doctest.testmod()
    sys.exit(FAILED)

mike-fabian commented 1 month ago

As you can see in the comments I can now detect how many glyphs were used to render an emoji, i.e. I can detect for some sequences that they are not supported by a font if they render with more than one glyph:

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️')
    [('🙂\u200d↕️', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️', fallback=False)
    [('🙂\u200d↕️', {'font': 'Twemoji', 'glyphcount': 3})]

But there are sequences where this “trick” doesn’t work:

    “Twemoji” has no glyph for the flag of Sark (added in Unicode 16.0) but “Noto Color Emoji” has it.
    Even though “Twemoji” has no glyph for the flag of Sark, Pango renders the sequence of two
    code points (U+1F1E8 U+1F1F6) as one glyph when “Twemoji” is specified and fallback is not allowed
    (Visually the glyph shown appears empty, there is no “Tofu”):

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶', fallback=False)
    [('🇨🇶', {'font': 'Twemoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶')
    [('🇨🇶', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

So now I wonder how I can detect whether a glyph is empty.

mike-fabian commented 1 month ago

Also, in case of a single code point, when fallback is not allowed, and a font which does not have that glyph is used, Pango still renders it using one glyph, but that glyph is a “Tofu” replacement glyph (a box with the code point inside):

    “Twemoji” does not have the glyph for this single code point emoji but “Noto Color Emoji” has it
    (visually the glyph shown when Twemoji is used is a “Tofu” block with the code point inside):

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=False)
    [('\U0001fae9', {'font': 'Twemoji', 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=True)
    [('\U0001fae9', {'font': 'Noto Color Emoji', 'glyphcount': 1})]

So I need to detect empty and Tofu glyphs somehow ...
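One possible starting point, offered as an assumption rather than a confirmed solution: Pango marks glyphs it could not find in the font by setting `PANGO_GLYPH_UNKNOWN_FLAG` in the glyph id, and uses `PANGO_GLYPH_EMPTY` for characters that should not be drawn at all. If the per-glyph ids of a run are accessible from the Python bindings, they could be classified as in this sketch (`classify_glyph` is a hypothetical helper; the two constant values are taken from Pango's C headers).

```python
# Values of the corresponding macros in Pango's C headers.
PANGO_GLYPH_EMPTY = 0x0FFFFFFF
PANGO_GLYPH_UNKNOWN_FLAG = 0x10000000


def classify_glyph(glyph_id: int) -> str:
    """Classify a Pango glyph id as 'empty', 'unknown' (Tofu) or 'font'."""
    if glyph_id == PANGO_GLYPH_EMPTY:
        return "empty"
    if glyph_id & PANGO_GLYPH_UNKNOWN_FLAG:
        return "unknown"
    return "font"


# An unknown glyph id carries the code point in its low bits.
print(classify_glyph(PANGO_GLYPH_UNKNOWN_FLAG | 0x1FAE9))
```

This would only catch Pango's own placeholder glyphs. A font that maps a sequence to a real but blank glyph, which may be what happens with the flag of Sark, would still be reported as "font", so that case probably needs a separate check (for example on the glyph's ink extents).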

ferdnyc commented 4 weeks ago

@mike-fabian

One option to make the doctests universal/reproducible, at the cost of (admittedly) a good deal of detail, would be to only return a boolean value indicating whether the font used was the one requested, rather than its exact identity.

On my system, I hit some failures with the original code — as your own comments indicated would likely be the case — when the fallback font chosen didn't match what was selected on your system. But this version passes with flying colors, by eliminating the dependence on exact font identities:

#!/usr/bin/env python3

from typing import List
from typing import Tuple
from typing import Dict
from typing import Any
import sys
from gi import require_version # type: ignore
require_version('Gtk', '3.0')
from gi.repository import Gtk # type: ignore
require_version('Pango', '1.0')
from gi.repository import Pango

def get_fonts_used_for_text(
        font: str, text: str, fallback: bool = True) -> List[Tuple[str, Dict[str, Any]]]:
    '''Return a list of fonts which were really used to render a text

    :param font: The font requested to render the text in
    :param text: The text to render
    :param fallback: Whether to enable font fallback. If disabled, then
                     glyphs will only be used from the closest matching
                     font on the system. No fallback will be done to other
                     fonts on the system that might contain the glyphs needed
                     for the text.

    Examples:

    (Don’t run CI checks regularly on these examples, it depends too much
    on the fonts installed on the system used to do the test)

    >>> get_fonts_used_for_text('DejaVu Sans Mono', '😀 ')
    [('😀', {'requested': False, 'glyphcount': 1}), (' ', {'requested': True, 'glyphcount': 1})]

    >>> get_fonts_used_for_text('DejaVu Sans', '日本語 नमस्ते')
    [('日本語 ', {'requested': False, 'glyphcount': 4}), ('नमस्ते', {'requested': False, 'glyphcount': 5})]

    >>> get_fonts_used_for_text('DejaVu Sans', '日本語 🕉️')
    [('日本語 ', {'requested': False, 'glyphcount': 4}), ('🕉️', {'requested': False, 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️')
    [('🙂\u200d↕️', {'requested': False, 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️', fallback=False)
    [('🙂\u200d↕️', {'requested': True, 'glyphcount': 3})]

    “Twemoji” has no glyph for the flag of Sark (added in Unicode 16.0) but “Noto Color Emoji” has it.
    Even though “Twemoji” has no glyph for the flag of Sark, Pango renders the sequence of two
    code points (U+1F1E8 U+1F1F6) as one glyph when “Twemoji” is specified and fallback is not allowed
    (Visually the glyph shown appears empty, there is no “Tofu”):

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶', fallback=False)
    [('🇨🇶', {'requested': True, 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶')
    [('🇨🇶', {'requested': False, 'glyphcount': 1})]

    “Twemoji” does not have the glyph for this single code point emoji but “Noto Color Emoji” has it
    (visually the glyph shown when Twemoji is used is a “Tofu” block with the code point inside):

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=False)
    [('\U0001fae9', {'requested': True, 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=True)
    [('\U0001fae9', {'requested': False, 'glyphcount': 1})]

    Both “Twemoji” and “Noto Color Emoji” have the glyph for this single code point emoji,
    both render it well when inspected visually:

    >>> get_fonts_used_for_text('Twemoji', '🤥', fallback=False)
    [('🤥', {'requested': True, 'glyphcount': 1})]

    >>> get_fonts_used_for_text('Twemoji', '🤥', fallback=True)
    [('🤥', {'requested': False, 'glyphcount': 1})]
    '''
    fonts_used = []
    text_utf8 = text.encode('UTF-8', errors='replace')
    label = Gtk.Label()
    pango_context = label.get_pango_context()
    pango_layout = Pango.Layout(pango_context)
    pango_font_description = Pango.font_description_from_string(font)
    pango_layout.set_font_description(pango_font_description)
    pango_attr_list = Pango.AttrList()
    pango_attr_fallback = Pango.attr_fallback_new(fallback)
    pango_attr_list.insert(pango_attr_fallback)
    pango_layout.set_attributes(pango_attr_list)
    pango_layout.set_text(text)
    pango_layout_line = pango_layout.get_line_readonly(0)
    gs_list = pango_layout_line.runs
    number_of_runs = len(gs_list)
    for glyph_item in gs_list:
        pango_item = glyph_item.item
        offset = pango_item.offset
        length = pango_item.length
        _num_chars = pango_item.num_chars
        pango_glyph_string = glyph_item.glyphs
        num_glyphs = pango_glyph_string.num_glyphs
        pango_analysis = pango_item.analysis
        pango_font = pango_analysis.font
        font_description_used = pango_font.describe()
        run_text = text_utf8[offset:offset + length].decode('UTF-8', errors='replace')
        run_family = font_description_used.get_family()
        fonts_used.append((run_text, {'requested': run_family == font, 'glyphcount': num_glyphs}))
    return fonts_used

def _init() -> None:
    '''Initialization'''
    return

def _del() -> None:
    '''Cleanup'''
    return

class __ModuleInitializer: # pylint: disable=too-few-public-methods,invalid-name
    def __init__(self) -> None:
        _init()

    def __del__(self) -> None:
        return

if __name__ == "__main__":
    import doctest
    (FAILED, _ATTEMPTED) = doctest.testmod()
    sys.exit(FAILED)

(TBH I... just can't decide whether that change makes the tests less useful, or if it doesn't actually matter.)
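A possible middle ground, sketched with a hypothetical helper `mark_requested`: leave the original function returning real font names, and derive the boolean form from its output only where a machine-independent comparison is wanted. The case-insensitive comparison is an added assumption, and it still expects `requested_family` to be a bare family name without a size.

```python
from typing import Any, Dict, List, Tuple

Run = Tuple[str, Dict[str, Any]]


def mark_requested(fonts_used: List[Run], requested_family: str) -> List[Run]:
    """Replace each run's 'font' name by a 'requested' boolean."""
    wanted = requested_family.casefold()
    return [(text, {'requested': info['font'].casefold() == wanted,
                    'glyphcount': info['glyphcount']})
            for text, info in fonts_used]


# Output of the original function, copied from its first doctest.
original = [('😀', {'font': 'Noto Color Emoji', 'glyphcount': 1}),
            (' ', {'font': 'DejaVu Sans Mono', 'glyphcount': 1})]
print(mark_requested(original, 'DejaVu Sans Mono'))
```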

mike-fabian commented 4 weeks ago

@mike-fabian

One option to make the doctests universal/reproducible, at the cost of (admittedly) a good deal of detail, would be to only return a boolean value indicating whether the font used was the one requested, rather than its exact identity.

On my system, I hit some failures with the original code — as your own comments indicated would likely be the case — when the fallback font chosen didn't match what was selected on your system. But this version passes with flying colors, by eliminating the dependence on exact font identities: [...]

(TBH I... just can't decide whether that change makes the tests less useful, or if it doesn't actually matter.)

I am not sure; it might still fail depending on exactly which fonts are installed on the system where the test is run. A different version of Twemoji might be installed, for example. I have an enormous number of fonts installed on my personal system, so results for such font tests on my machine are typically already different than on a default installation of the same Fedora version. Running this successfully on every distribution out there is probably hopeless.

Also I want the function to return which font name was really used to display that in the context menu in emoji-picker.

As in this screenshot, where the requested font is "Symbola" but the font actually used for U+1FAE2 face with open eyes and hand over mouth is TH-Tshyn-P1. And I want to know that...

Screenshot

mike-fabian commented 4 weeks ago

In the meantime I have improved my code to detect whether a font seems to render an emoji sequence but the result is empty (like the flag of Sark in the Twemoji font), and whether a glyph for a single code point emoji is unavailable.

My new code is this:

from typing import List
from typing import Tuple
from typing import Dict
from typing import Any
import sys
from gi import require_version # type: ignore
require_version('Gtk', '3.0')
from gi.repository import Gtk # type: ignore
require_version('Pango', '1.0')
from gi.repository import Pango

def get_fonts_used_for_text(
        font: str, text: str, fallback: bool = True) -> List[Tuple[str, Dict[str, Any]]]:
    '''Return a list of fonts which were really used to render a text

    :param font: The font requested to render the text in
    :param text: The text to render
    :param fallback: Whether to enable font fallback. If disabled, then
                     glyphs will only be used from the closest matching
                     font on the system. No fallback will be done to other
                     fonts on the system that might contain the glyphs needed
                     for the text.

    Examples:

    (Don’t run CI checks regularly on these examples; they depend too much
    on the fonts installed on the system used to do the test.)

    >>> get_fonts_used_for_text('DejaVu Sans Mono', '😀 ')
    [('😀', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True, 'glyph-available': True}), (' ', {'font': 'DejaVu Sans Mono', 'glyph-count': 1, 'visible': False, 'glyph-available': True})]

    >>> get_fonts_used_for_text('DejaVu Sans', '日本語 नमस्ते')
    [('日本語 ', {'font': 'Droid Sans Fallback', 'glyph-count': 4, 'visible': True}), ('नमस्ते', {'font': 'FreeSans', 'glyph-count': 5, 'visible': True})]

    >>> get_fonts_used_for_text('DejaVu Sans', '日本語 🕉️')
    [('日本語 ', {'font': 'Droid Sans Fallback', 'glyph-count': 4, 'visible': True}), ('🕉', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True, 'glyph-available': True})]

    >>> get_fonts_used_for_text('DejaVu Sans', '🕉\uFE0F')
    [('🕉', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True, 'glyph-available': True})]

    >>> get_fonts_used_for_text('DejaVu Sans', '')
    []

    >>> get_fonts_used_for_text('DejaVu Sans', '\\n')
    []

    >>> get_fonts_used_for_text('DejaVu Sans', '\u0008') # BACKSPACE
    [('\\x08', {'font': 'DejaVu Sans', 'glyph-count': 1, 'visible': True, 'glyph-available': False})]

    >>> get_fonts_used_for_text('DejaVu Sans', '\u001b') # ESCAPE
    [('\\x1b', {'font': 'DejaVu Sans', 'glyph-count': 1, 'visible': True, 'glyph-available': False})]

    >>> get_fonts_used_for_text('DejaVu Sans', ' ')
    [(' ', {'font': 'DejaVu Sans', 'glyph-count': 1, 'visible': False, 'glyph-available': True})]

    >>> get_fonts_used_for_text('', 'a')
    [('a', {'font': 'DejaVu Sans', 'glyph-count': 1, 'visible': True, 'glyph-available': True})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️')
    [('🙂\u200d↕️', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True})]

    >>> get_fonts_used_for_text('Twemoji', '🙂‍↕️', fallback=False)
    [('🙂\u200d↕️', {'font': 'Twemoji', 'glyph-count': 3, 'visible': True})]

    “Twemoji” has no glyph for the flag of Sark (added in Unicode 16.0) but “Noto Color Emoji” has it.
    Even though “Twemoji” has no glyph for the flag of Sark, Pango renders the sequence of two
    code points (U+1F1E8 U+1F1F6) as one glyph when “Twemoji” is specified and fallback is not allowed
    (Visually the glyph shown appears empty, there is no “Tofu”):

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶', fallback=False)
    [('🇨🇶', {'font': 'Twemoji', 'glyph-count': 1, 'visible': False})]

    >>> get_fonts_used_for_text('Twemoji', '🇨🇶')
    [('🇨🇶', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True})]

    >>> get_fonts_used_for_text('Twemoji', '🏴󠁧󠁢󠁷󠁬󠁳󠁿', fallback=False)
    [('🏴\U000e0067\U000e0062\U000e0077\U000e006c\U000e0073\U000e007f', {'font': 'Twemoji', 'glyph-count': 1, 'visible': True})]

    >>> get_fonts_used_for_text('Twemoji', '🏴󠁧󠁢󠁷󠁬󠁳󠁿')
    [('🏴\U000e0067\U000e0062\U000e0077\U000e006c\U000e0073\U000e007f', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True})]

    “Twemoji” does not have the glyph for this single code point emoji but “Noto Color Emoji” has it
    (visually the glyph shown when Twemoji is used is a “Tofu” block with the code point inside):

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=False)
    [('\U0001fae9', {'font': 'Twemoji', 'glyph-count': 1, 'visible': True, 'glyph-available': False})]

    >>> get_fonts_used_for_text('Twemoji', '🫩', fallback=True)
    [('\U0001fae9', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True, 'glyph-available': True})]

    Both “Twemoji” and “Noto Color Emoji” have the glyph for this single code point emoji,
    both render it well when inspected visually:

    >>> get_fonts_used_for_text('Twemoji', '🤥', fallback=False)
    [('🤥', {'font': 'Twemoji', 'glyph-count': 1, 'visible': True, 'glyph-available': True})]

    >>> get_fonts_used_for_text('Twemoji', '🤥', fallback=True)
    [('🤥', {'font': 'Noto Color Emoji', 'glyph-count': 1, 'visible': True, 'glyph-available': True})]
    '''
    fonts_used = []
    text_utf8 = text.encode('UTF-8', errors='replace')
    label = Gtk.Label()
    pango_context = label.get_pango_context()
    pango_layout = Pango.Layout(pango_context)
    pango_font_description = Pango.font_description_from_string(font)
    pango_layout.set_font_description(pango_font_description)
    pango_attr_list = Pango.AttrList()
    pango_attr_fallback = Pango.attr_fallback_new(fallback)
    pango_attr_list.insert(pango_attr_fallback)
    pango_layout.set_attributes(pango_attr_list)
    pango_layout.set_text(text)
    pango_layout_line = pango_layout.get_line_readonly(0)
    gs_list = pango_layout_line.runs
    _number_of_runs = len(gs_list)
    for glyph_item in gs_list:
        pango_item = glyph_item.item
        offset = pango_item.offset
        length = pango_item.length
        _num_chars = pango_item.num_chars
        pango_glyph_string = glyph_item.glyphs
        num_glyphs = pango_glyph_string.num_glyphs
        pango_analysis = pango_item.analysis
        pango_font = pango_analysis.font
        font_description_used = pango_font.describe()
        run_text = text_utf8[offset:offset + length].decode(
            'UTF-8', errors='replace')
        run_family = font_description_used.get_family()
        pango_layout_run = Pango.Layout(pango_context)
        pango_layout_run.set_font_description(pango_font_description)
        pango_layout_run.set_attributes(pango_attr_list)
        pango_layout_run.set_text(run_text)
        pango_layout_run_line = pango_layout_run.get_line_readonly(0)
        visible = False
        ink_rect, logical_rect = pango_layout_run_line.get_pixel_extents()
        if ink_rect.width > 0 and ink_rect.height > 0:
            visible = True
        results_for_run = {
            'font': run_family,
            'glyph-count': num_glyphs,
            'visible': visible}
        # If it is only one character followed by a variation
        # selector, remove the variation selector before checking
        # whether the Pango font has that character:
        if len(run_text) == 2 and run_text[1] in ('\uFE0F', '\uFE0E'):
            run_text = run_text[0]
        if (num_glyphs == 1
            and len(run_text) == 1
            and hasattr(Pango.Font, 'has_char')):
            results_for_run['glyph-available'] = pango_font.has_char(run_text)
        fonts_used.append((run_text, results_for_run))
    return fonts_used

def emoji_font_fallback_needed(font: str, text: str) -> bool:
    '''
    Examples:

    Twemoji does not support the emoji sequence for “head shaking vertically”
    (U+1F642 U+200D U+2195, added in Unicode 15.1):

    >>> emoji_font_fallback_needed('Twemoji', '🙂‍↕️')
    True

    Twemoji does not have the flag of Sark (U+1F1E8 U+1F1F6, added in Unicode 16.0):

    >>> emoji_font_fallback_needed('Twemoji',  '🇨🇶')
    True

    Twemoji does not have U+1FAE9 FACE WITH BAGS UNDER EYES (added in Unicode 16.0):

    >>> emoji_font_fallback_needed('Twemoji', '🫩')
    True

    But Twemoji has U+1F925 LYING FACE (added in Unicode 9.0):

    >>> emoji_font_fallback_needed('Twemoji', '🤥')
    False

    Twemoji does support the emoji sequence for the flag of Wales
    (U+1F3F4 U+E0067 U+E0062 U+E0077 U+E006C U+E0073 U+E007F):

    >>> emoji_font_fallback_needed('Twemoji', '🏴󠁧󠁢󠁷󠁬󠁳󠁿')
    False

    Twemoji does not have regular Latin characters like “A”:

    >>> emoji_font_fallback_needed('Twemoji', 'A')
    True

    But of course any standard font has “A”:

    >>> emoji_font_fallback_needed('Sans', 'A')
    False

    If the text given contains more than one emoji, then we don’t know and
    the result is always True because a fallback might be needed:

    >>> emoji_font_fallback_needed('Twemoji', '🫩🤥')
    True

    >>> emoji_font_fallback_needed('Twemoji', '🏴󠁧󠁢󠁷󠁬󠁳󠁿🤥')
    True
    '''
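    fonts_used = get_fonts_used_for_text(font, text, fallback=False)
    if not fonts_used:
        # Empty text (or only a newline) produces no runs, so there is
        # nothing that could need a fallback:
        return False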
    if len(fonts_used) > 1:
        # If there is more than one run, that means the text contained more
        # than just a single emoji or a single character. A fallback
        # might be needed in that case, that is hard to tell. Just
        # assume it is needed for the moment:
        return True
    results_for_run = fonts_used[0][1]
    if results_for_run['glyph-count'] > 1:
        return True
    if not results_for_run['visible']:
        return True
    if 'glyph-available' in results_for_run and not results_for_run['glyph-available']:
        return True
    return False

def _init() -> None:
    '''Initialization'''
    return

def _del() -> None:
    '''Cleanup'''
    return

class __ModuleInitializer: # pylint: disable=too-few-public-methods,invalid-name
    def __init__(self) -> None:
        _init()

    def __del__(self) -> None:
        return

if __name__ == "__main__":
    import doctest
    (FAILED, _ATTEMPTED) = doctest.testmod()
    sys.exit(FAILED)

mike-fabian commented 4 weeks ago

Now I have glyph-count, visible, and glyph-available, and together these enable me to figure out whether a single emoji (which might be a sequence) is already supported by the requested font or not.

This new convenience function

def emoji_font_fallback_needed(font: str, text: str) -> bool:

seems to do correctly what we need.
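To illustrate how the three values combine, here is a small sketch that applies the same checks to per-run results copied from the fallback=False doctests above, without needing Pango. `fallback_reason` is a hypothetical helper written only for this illustration; it reports which of the three signals fires rather than just a boolean.

```python
from typing import Any, Dict, Optional


def fallback_reason(info: Dict[str, Any]) -> Optional[str]:
    '''Return why a single run would need a fallback, or None if it does not.

    Hypothetical helper mirroring the three checks described above.
    '''
    if info['glyph-count'] > 1:
        return 'sequence was not composed into a single glyph'
    if not info['visible']:
        return 'glyph has no ink (renders empty)'
    # 'glyph-available' is only present for single code point runs.
    if not info.get('glyph-available', True):
        return 'font has no glyph for the code point'
    return None


# Per-run results copied from the fallback=False doctests for Twemoji.
SAMPLES = {
    'head shaking vertically': {
        'font': 'Twemoji', 'glyph-count': 3, 'visible': True},
    'flag of Sark': {
        'font': 'Twemoji', 'glyph-count': 1, 'visible': False},
    'face with bags under eyes': {
        'font': 'Twemoji', 'glyph-count': 1, 'visible': True,
        'glyph-available': False},
    'lying face': {
        'font': 'Twemoji', 'glyph-count': 1, 'visible': True,
        'glyph-available': True},
    'flag of Wales': {
        'font': 'Twemoji', 'glyph-count': 1, 'visible': True},
}

for name, info in SAMPLES.items():
    print(f'{name}: {fallback_reason(info) or "no fallback needed"}')
```

Each of the three failing samples trips a different check, which is why all three values are needed.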

mike-fabian commented 4 weeks ago

If we want automatic test cases, it is probably better to add some in the tests/ subdirectory of the ibus-typing-booster source code.

There we can test only certain parts of the return values of functions if we want to and we can add conditionals depending on which fonts are installed or which distribution the test is run on.

The doctests in the itb_pango.py file above are probably more useful for documenting how that function works and how it can be used than for running automatic tests on a wide variety of systems.
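A rough sketch of what such a conditional test could look like follows. It is not the test that ended up in the tests/ subdirectory. It assumes that the module above is importable as `itb_pango` and that fontconfig's `fc-list` is on the PATH; the helper `installed_font_families` and the test class name are made up for illustration. The test is skipped when the fonts it relies on are missing, and it checks only the parts of the result that should not vary between systems.

```python
import shutil
import subprocess
import unittest
from typing import Set


def installed_font_families() -> Set[str]:
    '''Family names reported by fc-list; empty set if fc-list is missing.'''
    if shutil.which('fc-list') is None:
        return set()
    output = subprocess.run(
        ['fc-list', ':', 'family'], capture_output=True,
        encoding='utf-8', errors='replace', check=False).stdout
    families: Set[str] = set()
    for line in output.splitlines():
        # One font may list several comma-separated family names.
        families.update(
            name.strip() for name in line.split(',') if name.strip())
    return families


FAMILIES = installed_font_families()

try:
    import itb_pango  # assumption: the module shown above, on sys.path
except (ImportError, ValueError):
    itb_pango = None


@unittest.skipUnless(
    itb_pango is not None and {'Twemoji', 'Noto Color Emoji'} <= FAMILIES,
    'needs itb_pango plus the Twemoji and Noto Color Emoji fonts')
class SarkFlagFallbackTest(unittest.TestCase):
    def test_flag_of_sark_uses_fallback_font(self) -> None:
        runs = itb_pango.get_fonts_used_for_text('Twemoji', '🇨🇶')
        self.assertEqual(len(runs), 1)
        info = runs[0][1]
        # Check only that some other font took over and drew something,
        # not which exact fallback font was chosen.
        self.assertNotEqual(info['font'], 'Twemoji')
        self.assertTrue(info['visible'])


if __name__ == '__main__':
    unittest.main()
```

Asserting on `visible` and on "not Twemoji" rather than on the exact fallback family keeps the test meaningful on machines where a different emoji font wins the fallback.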

mike-fabian commented 4 weeks ago

@ferdnyc

As you are using Fedora, could you try one of the ibus-typing-booster-2.26.1 test builds from my copr repo please:

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/

There are builds for Fedora 39, Fedora 40, and Fedora 41.

You can install by enabling the repo and then using dnf to update:

sudo dnf copr enable mfabian/ibus-typing-booster 
sudo dnf update

With these builds, the font fallbacks in emoji-picker should finally work correctly again.

mike-fabian commented 4 weeks ago

I uploaded ibus-typing-booster-2.26.2 test builds to the copr repo now.

Compared to 2.26.1 these fix some minor issues for “Symbola” and “Twitter Color Emoji”.

I also added some test cases in the tests/ subdirectory and cleaned up the function documentation in itb_pango.py.

“Twitter Color Emoji” is not the same as “Twemoji”; it is a font with SVG images in an OpenType font. That means it can scale to any size, even very large sizes, without becoming blurry.

There is no package for Fedora but one can get it from https://github.com/13rac1/twemoji-color-font

The latest release is currently for Unicode 15.1.0:

https://github.com/13rac1/twemoji-color-font/releases/download/v15.1.0/TwitterColorEmoji-SVGinOT-Linux-15.1.0.tar.gz

Just download and unpack the tarball in ~/.fonts/

I tested that it works well on Fedora 40 and Fedora 41.

mike-fabian commented 4 weeks ago

@ferdnyc

“Twitter Color Emoji” is not the same as “Twemoji”; it is a font with SVG images in an OpenType font. That means it can scale to any size, even very large sizes, without becoming blurry.

Actually “Noto Color Emoji” is also available as an SVG in OpenType font. The Fedora package google-noto-color-emoji-fonts-20241008-1.fc41.noarch contains the bitmap version, the SVG in OpenType font is available here:

https://github.com/googlefonts/noto-emoji/blob/main/fonts/Noto-COLRv1.ttf

I downloaded it and put it into my ~/.fonts/ directory and made this screenshot comparing the SVG in OpenType version (left side) with the bitmap version (right side):

Screenshot

mike-fabian commented 3 weeks ago

Fix included in https://github.com/mike-fabian/ibus-typing-booster/releases/tag/2.26.6

mike-fabian commented 3 weeks ago

@ferdnyc

Weird special case when using the “Blobmoji” font (https://github.com/C1710/blobmoji, the old “blob” style Google emoji, a fork of Noto Color Emoji):

Screenshot

This font doesn’t really have a flag for Sark but shows its own replacement flag. Therefore, the glyph-count is 1 and visible is True and I cannot detect anymore that a fallback to a different font would be nice for the flag of Sark.

Compare this with the behaviour for “Twemoji”, which does not have the flag of Sark either but renders an empty glyph with zero ink extent, which I can detect as visible equal to False:

Screenshot-twemoji

Therefore, when fallback is enabled and the “Twemoji” font is used, a fallback is used for the flag of Sark:

Screenshot-twemoji-fallback
