Preedit of accented words can be improved

psads-git commented 3 years ago

Hi, Mike,

Preedit of accented words can be improved. Let me explain my idea. In Portuguese, many words contain an accented character (typically only one). Consider the word

estão

The prediction of the word should be fired as soon as the accent

~

is typed and not after the entire accented character is typed (ã), in order to improve the typing speed.

Thanks!

mike-fabian commented 3 years ago

Prediction is accent insensitive at the moment (for Portuguese).

I.e. estã and esta give you exactly the same predictions.

This has the advantage that one often doesn’t have to care about typing the accents at all and can just select them from the prediction.

Like this:

Screenshot

mike-fabian commented 3 years ago

For me, this is very helpful when typing French or Spanish. I don’t know these languages well and often make mistakes with the accents, but with this feature I can type the word without accents first if I am not sure and then select the correct version.

mike-fabian commented 3 years ago

Here is a list of languages where this accent insensitive matching makes sense:

https://github.com/mike-fabian/ibus-typing-booster/blob/main/engine/hunspell_suggest.py#L67

# List of languages where accent insensitive matching makes sense:
ACCENT_LANGUAGES = {
    'af': '',
    'ast': '',
    'az': '',
    'be': '',
    'bg': '',
    'br': '',
    'bs': '',
    'ca': '',
    'cs': '',
    'csb': '',
    'cv': '',
    'cy': '',
    'da': 'æÆøØåÅ',
    'de': '',

   and more like this ...

mike-fabian commented 3 years ago

As you see, pt is in that list.

You might notice that some languages have a list of special characters after the language name, for example:

    'da': 'æÆøØåÅ',

That is because for a native speaker of Danish, “å” is not an accented version of “a” but a completely different letter. In Danish, “å” is also sorted after “z”, not after “a”!

In German, an “ä” is considered as some variation of “a” and therefore sorted as a secondary difference after “a”.
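A minimal sketch of how such a per-language exception list could be applied when stripping accents. The function name `remove_accents`, its `keep` parameter, and the small `EXTRA_FOLDS` table are assumptions made for illustration; they are not taken from the ibus-typing-booster source. The extra table is needed because letters like “ø” and “æ” have no canonical Unicode decomposition, so NFD alone cannot reduce them to a base letter.

```python
import unicodedata

# Hypothetical fold table for letters that NFD cannot decompose.
EXTRA_FOLDS = {'ø': 'o', 'Ø': 'O', 'æ': 'ae', 'Æ': 'AE'}

def remove_accents(text: str, keep: str = '') -> str:
    """Strip accents from text, except for the characters listed in keep."""
    result = []
    for char in unicodedata.normalize('NFC', text):
        if char in keep:
            # Exception character: treated as a letter of its own.
            result.append(char)
        elif char in EXTRA_FOLDS:
            result.append(EXTRA_FOLDS[char])
        else:
            decomposed = unicodedata.normalize('NFD', char)
            result.append(''.join(
                c for c in decomposed if not unicodedata.combining(c)))
    return ''.join(result)

print(remove_accents('grün'))                           # German: no exceptions
print(remove_accents('Smørrebrød'))                     # everything folded
print(remove_accents('Smørrebrød', keep='æÆøØåÅ'))      # Danish exceptions kept
```

With the Danish exception string passed as `keep`, “Smørrebrød” stays unchanged and therefore no longer compares equal to “Smorrebrod”, which is the behaviour described for the Danish dictionary.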

mike-fabian commented 3 years ago

For example, when using Danish, typing “Smørrebrød” gives me a match:

Screenshot

mike-fabian commented 3 years ago

But typing “Smorrebrod” does not give me a match (When using the Danish dictionary):

Screenshot

mike-fabian commented 3 years ago

That is because in Danish, “ø” is considered a different letter, not a variation of “o”.

mike-fabian commented 3 years ago

Now this might depend on whether one is a native speaker of a language or not.

A non-native speaker of Danish who is trying to type Danish might find it helpful if “Smorrebrod” did actually match “Smørrebrød”.

psads-git commented 3 years ago

Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster.

https://user-images.githubusercontent.com/75945439/130484950-a992dbe6-8c56-42d5-9bd0-1bd737636990.mp4

mike-fabian commented 3 years ago

Currently this behaviour is hardcoded in the list shown above. When using Portuguese, you have no choice: matching is always accent insensitive. When using Danish, matching is accent sensitive for the characters in the exception list above and accent insensitive for all other accented characters.

As users might disagree about this, especially users who are native speakers and users who are not, this should not be just hardcoded.

There should be an option, probably with 3 values:

Accent insensitive matching:   [always | never | according to the language rules]

When this option is set to “always”, accent insensitive matching would occur even for Danish when typing “Smorrebrod”.

When this option is set to “never”, accent insensitive matching would never happen, not even for “grun” -> “grün” in German or “estao” -> “estão” in Portuguese.

When this option is set to “according to language rules” this would be the current behaviour, generally accent insensitive matching is done but some languages may have some characters which are exceptions and are matched accent sensitive.
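The three values could be modelled roughly as follows. This is only a sketch: the enum `AccentMatching`, the function `characters_to_keep`, and the convention of returning `None` for “do not strip anything” are assumptions for illustration. The table is a small excerpt. The `'pt'` entry is assumed to be empty because Portuguese matching is described as always accent insensitive, and languages missing from the table are assumed to be matched accent sensitively.

```python
from enum import Enum
from typing import Optional

class AccentMatching(Enum):
    ALWAYS = 'always'
    NEVER = 'never'
    LANGUAGE_RULES = 'according to the language rules'

# Excerpt of the hardcoded table; the value lists the exception characters.
ACCENT_LANGUAGES = {'da': 'æÆøØåÅ', 'de': '', 'pt': ''}

def characters_to_keep(language: str, mode: AccentMatching) -> Optional[str]:
    """Return the characters whose accents must be kept when matching.

    ''   means: strip every accent (fully accent insensitive).
    None means: strip nothing (fully accent sensitive).
    """
    if mode is AccentMatching.ALWAYS:
        return ''
    if mode is AccentMatching.NEVER:
        return None
    # Language rules: use the table; unknown languages stay accent sensitive.
    return ACCENT_LANGUAGES.get(language)

print(characters_to_keep('da', AccentMatching.ALWAYS))
print(characters_to_keep('da', AccentMatching.LANGUAGE_RULES))
print(characters_to_keep('pt', AccentMatching.NEVER))
```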

psads-git commented 3 years ago

Yeah, Mike, that is an excellent idea!

mike-fabian commented 3 years ago

Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster.

Yes, if you can type the accents correctly, the number of matching words is smaller and it is more likely that the correct one is among them. If I look in the Portuguese hunspell dictionary, I find:

$ grep ^agua /usr/share/myspell/pt_PT.dic 
aguaçal [CAT=nc,G=m,N=s]
aguaça  [CAT=nc,G=f,N=s]
aguaceiro/fp    [CAT=nc,G=m,N=s]
aguada/p    [CAT=nc,G=f,N=s]
aguadeiro/p [CAT=a_nc,G=m,N=s]
aguardar/XYPLD  [CAT=v,T=inf,TR=t]
aguardenteiro   [CAT=nc,G=m,N=s]
aguardente/p    [CAT=nc,G=f,N=s]
aguardentoso    [CAT=adj,N=s,G=m]
aguarela/p  [CAT=nc,G=f,N=s]
aguarelar/XYPL  [CAT=v,T=inf,TR=t]
aguarelista [CAT=nc,G=_,N=s]
aguarrás    [CAT=nc,G=f,N=s]
aguar/YPLM  [CAT=v,T=inf,TR=t,I=3]
aguas/PL    [$aguar$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
agua/PL [$aguar$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
aguamos/PL  [$aguar$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
aguais/PL   [$aguar$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
aguam/PL    [$aguar$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
agua/PL [$aguar$CAT=v,T=inf,TR=_$P=2,N=s,T=i]
aguai/PL    [$aguar$CAT=v,T=inf,TR=_$P=2,N=p,T=i]

and

$ grep ^água /usr/share/myspell/pt_PT.dic 
água-ardente/p  [CAT=nc,G=f,N=s]
água-chilra [CAT=nc,G=f,N=s]
água-forte  [CAT=nc,G=f,N=s]
água-furtada    [CAT=nc,G=f,N=s]
água/p  [CAT=nc,G=f,N=s]
água-marinha    [CAT=nc,G=f,N=s]
água-oxigenada  [CAT=nc,G=f,N=s]
água-pé/p   [CAT=nc,G=f,N=s]
águas-furtadas  [$água-furtada$CAT=nc,G=f,N=s$N=p]
água-tinta  [CAT=nc,G=f,N=s]

Accent insensitive matching gives you the results of both, no matter whether you typed “agua” or “água”.

(And what makes it worse is that the hunspell dictionaries have no information about which words are common and which are not!)

But if you know that you want “ág” plus something and not “ag” plus something, then accent sensitive matching would help.

So if you prefer typing the accents correctly yourself, accent sensitive matching is better.
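The difference can be illustrated with a few of the words from the two grep results. This is a simplified sketch, not the matching code of ibus-typing-booster: the helper names are made up, and real matching also involves the user database and more than plain prefix comparison.

```python
import unicodedata

# A handful of entries taken from the two dictionary listings above.
WORDS = ['aguaceiro', 'aguardar', 'aguarela',
         'água', 'água-forte', 'águas-furtadas']

def strip_accents(text: str) -> str:
    """Decompose to NFD and drop all combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def prefix_matches(typed: str, words: list, accent_sensitive: bool) -> list:
    """Return the words starting with typed, with or without accent folding."""
    if accent_sensitive:
        typed_nfc = unicodedata.normalize('NFC', typed)
        return [w for w in words
                if unicodedata.normalize('NFC', w).startswith(typed_nfc)]
    key = strip_accents(typed)
    return [w for w in words if strip_accents(w).startswith(key)]

print(prefix_matches('ág', WORDS, accent_sensitive=True))   # only the água words
print(prefix_matches('ág', WORDS, accent_sensitive=False))  # all six words
```

With accent sensitive matching, typing “ág” narrows the candidates to the “água” family, while accent insensitive matching returns both groups for either input.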

But this is really a matter of choice, I prefer accent insensitive matching a lot, even when typing my native language (which is German). I often type words without the accents and select the correct version.

For German, I can also type the accents correctly without problems if I want to, so accent sensitive matching would work for me there, although I still prefer accent insensitive matching even for German. For French, accent insensitive matching helps me a lot: I make too many mistakes when typing the accents and would often not get the right matches at all if the matching were accent sensitive.

So this really should be a user option.

mike-fabian commented 3 years ago

I haven't created that option yet, although I thought of it immediately when a user from Scandinavia requested exceptions for characters like ø, å, .... I was thinking about how exactly to make this optional and got confused, because one could make it optional in a more fine-grained way, and I was wondering whether I should do that and how exactly.

If there is only an option with 3 values like

Accent insensitive matching:   [always | never | according to the language rules]

then it is obviously already better than the current situation with no option. But what if one wants accent insensitive matching for one language and not for others?

For example, I as a native German speaker and learner of French and Spanish have the German, French, and Spanish dictionaries configured in ibus-typing-booster.

Maybe I would want accent sensitive matching in the German dictionary but accent insensitive matching in the French and Spanish dictionaries.

So it might be useful if one could set this up not with a single option for all dictionaries, but with more fine grained options per dictionary.

And then I wondered what the user interface for this should look like ... and got confused ...

A single option is rather easy to implement. Making this configurable per language might be even more useful, but the UI is going to get complicated.

psads-git commented 3 years ago

Indeed, Mike, it is essentially a matter of preference.

There is a small detail that may improve ibus-typing-booster performance, in the accent sensitive case. Consider the word

também

ibus-typing-booster should display its prediction as soon as the accent is typed and not after the accented character is typed (é, in this example).

psads-git commented 3 years ago

I think there is plenty of space in the gui:

Screenshot_2021-08-23_18-15-31

mike-fabian commented 3 years ago

You write: “as soon as the accent is typed”.

I wonder how you type your accents. There are many ways. I can type an ü for example:

Hit a key with ü directly on my keyboard
Use the t-latn-post input method (add it in the ibus-typing-booster setup) and type u"
Use the t-latn-pre input method and type "u
Type <dead_diaeresis> <u> (<dead_diaeresis> is a key which produces a dead " which does nothing at first and becomes ü when a u follows; this is similar to the t-latn-pre input method but a completely different mechanism)
Type <Multi_key> <quotedbl> <u>
Type <Multi_key> <u> <quotedbl>
Type u followed by a combining diaeresis (U+0308 COMBINING DIAERESIS)

On my keyboard layout (a heavily customized version of a US English layout), I can actually type all of the above to get an ü.

My preferred method is the t-latn-post input method, i.e. most of the time I type u" to get an ü.

mike-fabian commented 3 years ago

I think there is plenty of space in the gui:

Yes, at the end of each dictionary line in the setup tool, there could be a combobox where you have the above mentioned 3 choices (accent sensitive, insensitive, language rules).

psads-git commented 3 years ago

I cannot name the way I and everyone here in Portugal types accented words. However, the process consists of two steps:

One presses the key with the wanted accent;
One presses the key with the letter on which one wants to place the accent.

That is why my suggestion tends to save the second step!

mike-fabian commented 3 years ago

My current feeling is, that if I make only one option with these 3 choices for all languages, then I would probably regret it later. Because most likely I would need to expand it later to more fine grained control for each dictionary separately and that would be a nasty change with backwards compatibility problems.

So at the moment I think I should make this optional per language immediately, even if it is far more complicated to implement.

mike-fabian commented 3 years ago

I cannot name the way I and everyone here in Portugal types accented words. However, the process consists of two steps:
1. One presses the key with the wanted accent;

2. One presses the key with the letter on which one wants to place the accent.

It seems to me that you are using so-called “dead keys”.

psads-git commented 3 years ago

To add the option per language is even better than to all, since it gives more freedom to the user.

mike-fabian commented 3 years ago

To add the option per language is even better than to all, since it gives more freedom to the user.

Yes, and in the long run I will have to do that anyway, so I had better not postpone it.

mike-fabian commented 3 years ago

I guess you are using the first keyboard layout from /usr/share/X11/xkb/symbols/pt, which is:

default partial alphanumeric_keys
xkb_symbols "basic" {

    include "latin(type4)"
    name[Group1]="Portuguese";

    key <TLDE> { [     backslash,             bar,        notsign,          notsign ] };
    key <AE03> { [             3,      numbersign,       sterling,         sterling ] };
    key <AE04> { [             4,          dollar,        section,           dollar ] };
    key <AE11> { [    apostrophe,        question,      backslash,     questiondown ] };
    key <AE12> { [ guillemotleft,  guillemotright,   dead_cedilla,      dead_ogonek ] };

    key <AD11> { [          plus,        asterisk, dead_diaeresis,   dead_abovering ] };
    key <AD12> { [    dead_acute,      dead_grave,     dead_tilde,      dead_macron ] };
    key <BKSL> { [    dead_tilde, dead_circumflex,     dead_grave,       dead_breve ] };

    key <AC10> { [      ccedilla,        Ccedilla,     dead_acute, dead_doubleacute ] };
    key <AC11> { [     masculine,     ordfeminine,dead_circumflex,       dead_caron ] };

    key <LSGT> { [          less,         greater,      backslash,        backslash ] };

    include "level3(ralt_switch)"
};

The second layout in that file is one without dead keys:

partial alphanumeric_keys
xkb_symbols "nodeadkeys" {

    include "pt(basic)"
    name[Group1]="Portuguese (no dead keys)";

    key <AE12> { [ guillemotleft,  guillemotright,        cedilla,           ogonek ] };
    key <AD11> { [          plus,        asterisk,       quotedbl,         quotedbl ] };
    key <AD12> { [         acute,           grave                                   ] };
    key <BKSL> { [    asciitilde,     asciicircum                                   ] };
    key <AC10> { [      ccedilla,        Ccedilla,          acute,      doubleacute ] };
    key <AC11> { [     masculine,     ordfeminine,    asciicircum,            caron ] };
    key <AB10> { [         minus,      underscore,  dead_belowdot,         abovedot ] };
};

mike-fabian commented 3 years ago

So probably you are using a layout like this one:

https://en.wikipedia.org/wiki/Portuguese_keyboard_layout#/media/File:KB_Portuguese.svg

The keys marked in red on that layout are dead keys.

psads-git commented 3 years ago

My keyboard is similar to this one:

https://www.worten.pt/i/8466d924afbaa5bd14e604fa3ca649a377762776.jpg

psads-git commented 3 years ago

So probably you are using a layout like this one:

https://en.wikipedia.org/wiki/Portuguese_keyboard_layout#/media/File:KB_Portuguese.svg

Exactly, Mike!

mike-fabian commented 3 years ago

The problem with these dead keys is, that they are handled in a very special way.

They don’t go directly into the preëdit. Maybe you noticed the option “Use color for the compose preview” in the setup tool. Dead keys and compose are basically the same mechanism. Try to use that option and choose an obvious colour like I did in this screenshot:

Screenshot

If you do that, it makes it more obvious what is going on.

mike-fabian commented 3 years ago

https://user-images.githubusercontent.com/2330175/130492965-4d7224ca-1d81-46c2-90b6-a6f58dc1e6f3.mp4

psads-git commented 3 years ago

Thanks, Mike. I have just done that.

mike-fabian commented 3 years ago

The normal colour for the preëdit text is black.

I type est and it is black (and some completions are shown in yellow because I did choose that color for completions).

Now I type a dead_tilde and see est in black followed by a ~ in green.

Note that all completions have disappeared while the green ~ is there!

That is because the compose handling is a sort of preëdit inside a preëdit. The external preëdit (black) has to wait until the internal preëdit (green) is finished to continue searching for completions.

After the a has been typed, the green ~ plus the a combine to ã and this ã is black. Because the compose sequence is finished and the internal preëdit is now gone, the ã is now part of the “normal” preëdit.

mike-fabian commented 3 years ago

Try typing Tab while you see the green ~.

mike-fabian commented 3 years ago

Typing Tab while you see the green ~ gives you something like this:

https://user-images.githubusercontent.com/2330175/130494027-e3eb253f-1136-4afa-9f31-2a6b6c314a15.mp4

mike-fabian commented 3 years ago

So this shows you how the compose sequence starting with a dead ~ could be completed.

What you see in that list of possible completions might be somewhat different than what I see because I include only those completions in the list which can actually be typed on the current keyboard layout.

If I didn't limit it to those which can be typed on the current keyboard layout, there would almost always be hundreds of possible completions. For example there is this:

$ grep '^<dead_tilde>.*ᾶ'   /usr/share/X11/locale/en_US.UTF-8/Compose
<dead_tilde> <Greek_alpha>          : "ᾶ"   U1FB6 # GREEK SMALL LETTER ALPHA WITH PERISPOMENI

But if your keyboard layout doesn't even have a key <Greek_alpha>, then I omit this because it is probably not so interesting.

For a user of a Greek keyboard layout, I show this one but omit others which cannot be typed on the Greek keyboard layout.
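The filtering idea can be sketched as follows. This is an illustration only, not the ibus-typing-booster implementation: the function names and the `available_keysyms` parameter are hypothetical, and apart from the Greek line quoted above, the sample lines are merely written in the style of the Compose file rather than copied from it.

```python
import re

# One Compose definition: a run of <keysym> tokens, a colon, a quoted result.
COMPOSE_LINE = re.compile(r'^\s*((?:<[^>]+>\s*)+):\s*"((?:[^"\\]|\\.)+)"')

SAMPLE = '''
<dead_tilde> <a>                    : "ã"   atilde # LATIN SMALL LETTER A WITH TILDE
<dead_tilde> <n>                    : "ñ"   ntilde # LATIN SMALL LETTER N WITH TILDE
<dead_tilde> <greater>              : "≳"   U2273 # GREATER-THAN OR EQUIVALENT TO
<dead_tilde> <Greek_alpha>          : "ᾶ"   U1FB6 # GREEK SMALL LETTER ALPHA WITH PERISPOMENI
'''.splitlines()

def parse_compose_line(line):
    """Return (list of keysym names, result string) or None for other lines."""
    match = COMPOSE_LINE.match(line)
    if match is None:
        return None
    return re.findall(r'<([^>]+)>', match.group(1)), match.group(2)

def typeable_completions(lines, prefix, available_keysyms):
    """List (remaining keys, result) for sequences that start with prefix
    and whose remaining keys all exist on the current keyboard layout."""
    completions = []
    for line in lines:
        parsed = parse_compose_line(line)
        if parsed is None:
            continue
        keys, result = parsed
        rest = keys[len(prefix):]
        if keys[:len(prefix)] != prefix or not rest:
            continue
        if all(key in available_keysyms for key in rest):
            completions.append((rest, result))
    return completions

# A Latin layout offers a, n and greater but no Greek_alpha key.
for rest, result in typeable_completions(SAMPLE, ['dead_tilde'], {'a', 'n', 'greater'}):
    print(' '.join(rest), result)
```

Passing `{'Greek_alpha'}` as the set of available keysyms instead leaves only the Greek sequence, which mirrors what a user of a Greek layout would see.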

mike-fabian commented 3 years ago

In my video, I typed a dead ~ and Tab and then selected ≳ from the candidate list shown.

The candidate list shows in the first column what one could type to get this.

So in case of the ≳ it shows a > in the first column and ≳ in the second column.

This tells you that after typing a dead ~ you could type a > to get ≳.

So typing Tab when a compose sequence has been started but not finished yet tells you what choices you have to finish the sequence. This makes it easier to learn the possible compose sequences, certainly easier than reading the /usr/share/X11/locale/en_US.UTF-8/Compose file where all these sequences are defined.

psads-git commented 3 years ago

I guess there is no <Greek_alpha> key in my keyboard, Mike!

mike-fabian commented 3 years ago

While an unfinished compose sequence is being typed, ibus-typing-booster basically stops everything else it is doing, waits until the compose sequence is finished, and then continues with predictions. While the compose sequence is unfinished, the only things you can do are show possible completions with Tab, correct with Backspace, or cancel the compose sequence with Escape.

psads-git commented 3 years ago

While an unfinished compose sequence is being typed, ibus-typing-booster basically stops everything else it is doing, waits until the compose sequence is finished, and then continues with predictions. While the compose sequence is unfinished, the only things you can do are show possible completions with Tab, correct with Backspace, or cancel the compose sequence with Escape.

Got it, Mike!

mike-fabian commented 3 years ago

Even if I could get the ~ from the unfinished compose sequence, it would not be useful to complete anything. Because neither in the dictionaries nor in the database are things like est~ao which one could match with est~. The dictionary only has estão.

For matching, the dictionary and database are internally converted to NFD (Normalization form D):

https://unicode.org/reports/tr15/#Norm_Forms

I.e. what is matched against is actually something like esta~o where the ~ is a combining tilde.

And then the combining characters like the combining ~ are filtered out in case of accent insensitive matches and kept in case of accent sensitive matches like the Danish ø.
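A short demonstration with Python's standard `unicodedata` module shows what the NFD form looks like and why a tilde typed before the base letter cannot match it as a prefix. Treating the pending dead tilde as a combining tilde (U+0303) is an assumption made only for this comparison.

```python
import unicodedata

word_nfd = unicodedata.normalize('NFD', 'estão')

# The tilde is a separate combining character placed AFTER the base letter a.
print([f'U+{ord(c):04X}' for c in word_nfd])
# ['U+0065', 'U+0073', 'U+0074', 'U+0061', 'U+0303', 'U+006F']

# "est" followed by a tilde is not a prefix of the NFD form ...
typed_before_base = 'est' + '\u0303'
print(word_nfd.startswith(typed_before_base))   # False

# ... but "estã" is, once it has been decomposed in the same way.
typed_complete = unicodedata.normalize('NFD', 'estã')
print(word_nfd.startswith(typed_complete))      # True
```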

mike-fabian commented 3 years ago

That's why I asked how you type an ã because there are so many ways.

In case of using something like t-latn-post or t-latn-pre, the ~ would actually be part of the “normal” preëdit, not the compose preëdit. But it can be before or after the base character a. On some keyboard layouts one can actually type a followed by combining ~ (which is the NFD way!). Handwriting is usually done in the same way, writing the accent after the base character.

So I think starting to match something when only a ~ has been typed is near impossible, the possibilities are enormous.

Converting all the dictionaries on reading them to forms having ã, a~, and ~a would make the loaded dictionaries much bigger and the search much slower for very little gain.

psads-git commented 3 years ago

Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:

estã
estẽ 
estĩ 
estõ 
estũ

😉

psads-git commented 3 years ago

Converting all the dictionaries on reading them to forms having ã, a~, and ~a would make the loaded dictionaries much bigger and the search much slower for very little gain.

I agree that the gain would be small.

mike-fabian commented 3 years ago

So a simple ~ can match something only if the user uses t-latn-pre and often types est~ao that way. In that case est~ does match a previously typed estão, because it is remembered that what was actually typed was est~ao and what was committed was estão. The next time est~ is typed, this can complete to estão.

mike-fabian commented 3 years ago

See how est~ shows estão among the candidates here:

https://user-images.githubusercontent.com/2330175/130499129-dd1f8a1b-1ea0-4ab4-8d0e-ed2814bde7d1.mp4

This works only because I use t-latn-pre in this example and not a dead ~, that makes a big difference.

And of course I typed est~ao a few times before recording that video to make ibus-typing-booster learn this.

mike-fabian commented 3 years ago

On your Portuguese keyboard layout, using t-latn-pre would be very inconvenient though, because you don’t have a normal ~, only a dead ~. To get a normal ~ you would need to type the dead ~ twice, and then that could combine with the following letter using t-latn-pre.

I.e. with t-latn-pre on your keyboard layout, you would actually need to type ~~a to get an ã.

Makes no sense for you, I just mentioned t-latn-pre because this is yet another way to type this which can be very useful on layouts which do not have dead keys.

mike-fabian commented 3 years ago

Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:
estã
estẽ 
estĩ 
estõ 
estũ
😉

But ibus-typing-booster doesn't know you are typing Portuguese. One can have several languages configured at the same time. For example one could have a Spanish and a Portuguese dictionary configured at the same time. And in Spanish typing ~ could mean an n is coming to make a ñ. And while you are typing something into the preëdit, ibus-typing-booster cannot know from which of the several languages you may have configured the word you are typing is going to be.

mike-fabian commented 3 years ago

So I think matching something forward when only some accent has been typed like matching estão when only est~ has been typed doesn’t seem reasonably possible (except for special circumstances like when using t-latn-pre).

The amount of calculation for this would be huge, it would depend very much on which languages exactly are configured, lots of special casing, no high speed matching with patterns like regular expressions possible anymore.

So I think matching ~ will probably never work.

But matching accent sensitive, as discussed above, is possible and I think I will do that.

I.e. making ã match something different than what just a would match. That is possible and probably useful as a user option.

And it probably already would do most of what you want.

psads-git commented 3 years ago

Thanks, Mike. Your arguments have convinced me that matching ~ is not a good idea.

mike-fabian commented 3 years ago

Thanks, Mike. Your arguments have convinced me that matching ~ is not a good idea.

Great, but I'll do the other thing with the accent sensitive matching.

This might take quite a while though as it is really quite difficult to implement.

I think combobox buttons at the end of each dictionary line are a good idea, but I still need to think about how to save that to gsettings and read it back from there. I have a few ideas but I think I need to think about this for a few days before starting to implement anything.

psads-git commented 3 years ago

Thanks, Mike. That is nothing really urgent! So, take your time. No hurry at all!

mike-fabian commented 3 years ago

I remembered that there needs to be an extra option for the user database.

Each dictionary line needs to have an option whether to match accent insensitive [always | never | language rules].

And there needs to be an option whether to store accents the user typed in the user database.

Currently, accents are removed from the text the user typed when storing in the database.

That means if one types estã and then selects estão and commits, the ~ is dropped from the user input. So what is stored in the database is the user typed esta (without the ~) and then selected estão.

So the next time the user types either esta or estã, estão is a match in both cases because the user input with accents removed is esta in both cases and what was recorded in the database is also esta without the accent, so it matches.
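The current behaviour described above can be sketched with a tiny in-memory stand-in for the user database. The class `UserDatabaseSketch` and its methods are hypothetical; the real database also records the typing context and uses weighted scores, both of which are ignored here.

```python
import unicodedata
from collections import Counter

def strip_accents(text: str) -> str:
    """Decompose to NFD and drop all combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

class UserDatabaseSketch:
    """Remembers (typed input without accents, committed word) pairs."""

    def __init__(self):
        self._counts = Counter()

    def record(self, typed: str, committed: str) -> None:
        # Accents are removed from the typed input before storing it.
        self._counts[(strip_accents(typed), committed)] += 1

    def candidates(self, typed: str) -> list:
        # The lookup key is stripped in the same way, so esta and estã agree.
        key = strip_accents(typed)
        hits = Counter()
        for (stored_input, committed), count in self._counts.items():
            if stored_input == key:
                hits[committed] += count
        return [word for word, _ in hits.most_common()]

db = UserDatabaseSketch()
db.record('estã', 'estão')       # typed estã, selected estão
print(db.candidates('esta'))     # ['estão']
print(db.candidates('estã'))     # ['estão']
```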

The user database is language agnostic (which is a good thing!), it just records what the user typed in what context and which completion candidate was selected.

As some users may wish to make the matching more strict by matching accent sensitive, such an option has also to be added for the user database.

Maybe a simple checkbox is enough:

[✅] Accent sensitive matching in user database

An option with 3 values like for the dictionaries ([always | never | according to language rules]) doesn’t seem to make sense for the user database because the user database has no language. So maybe that simple checkbox which allows to switch it on or off is enough for the user database.

But one could also make it a more detailed option with more possibilities, maybe even allowing the user to specify a list of characters they want to match accent sensitively:

[✅] Accent sensitive matching in user database
Accent sensitive matching in user database only for [ÅåØø]

I.e. when [ ] Accent sensitive matching in user database is off, this would be the current behaviour, all accents are ignored. The second option does not matter then.

When [✅] Accent sensitive matching in user database is on, and the list of exception characters is empty, all accents would be kept in the user database.

When [✅] Accent sensitive matching in user database is on, and the list of exception characters is not empty, the exception characters would be stored with their accents in the user database when the user types such characters but all other accented characters still get their accents stripped.
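The three cases could translate into a rule like the following for what gets stored as the user input. This is a sketch of the proposal, not existing code: the function `stored_form` and its parameters are hypothetical, and only accents with a canonical Unicode decomposition are handled (letters such as “ø” do not decompose and pass through unchanged either way).

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD and drop all combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def stored_form(typed: str, accent_sensitive: bool, exceptions: str = '') -> str:
    """Return the form of the user input that would be stored."""
    typed = unicodedata.normalize('NFC', typed)
    if not accent_sensitive:
        # Checkbox off: current behaviour, all accents are dropped.
        return strip_accents(typed)
    if not exceptions:
        # Checkbox on, empty exception list: all accents are kept.
        return typed
    # Checkbox on, exception list given: keep accents only on those characters.
    return ''.join(c if c in exceptions else strip_accents(c) for c in typed)

print(stored_form('estã', accent_sensitive=False))
print(stored_form('estã', accent_sensitive=True))
print(stored_form('blå estã', accent_sensitive=True, exceptions='ÅåØø'))
```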

These options for the user database would only have an effect on new input.

Stuff which is already in the user database cannot be changed later. Theoretically I could remove accents from user input which is already in the user database, but there is no way to put them back. Because if there is only a in the user database, one cannot know whether the user really typed a, or typed ã, ä, ... and the accent was stripped.

So after switching the [✅] Accent sensitive matching in user database to a different value, the effect would become visible slowly after more typing as the newly typed words get higher weight.

Side note: I have been thinking for a long time about a kind of expiry feature for the user database: words which have not been typed for a long, long time should fade away from the user database. Currently everything is kept forever. My user database, which is many years old, has 32 megabytes now:

$ ls ~/.local/share/ibus-typing-booster/user.db -lh
-rw-r--r--. 1 mfabian mfabian 32M  8月 24 08:43 /home/mfabian/.local/share/ibus-typing-booster/user.db

There is old junk inside because I tested some stuff many years ago.

That hurts less than one might think, because if I never type it again, it never gets a higher score. It stays with a count of 1 in the database forever, but with such a low count it is unlikely that it will ever show up as a candidate. Words which I type often have much higher counts. But a needlessly huge database makes everything slower. I am thinking about something similar to radioactive decay: if an entry has been typed 10 times, don’t keep that count forever but slowly reduce it over time and drop the entry when it reaches 0. As time passes, old junk which is never typed again would be dropped automatically. Words typed recently would get higher scores than words typed months ago ...
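The decay idea could look roughly like this. Everything here is an assumption for illustration: the half-life of 180 days, the drop threshold, and the function names are made up, and no such feature is described as implemented. Because a purely exponential decay never reaches exactly 0, the sketch drops entries once they fall below a small threshold instead.

```python
def decayed_count(count: float, days_since_last_use: float,
                  half_life_days: float = 180.0) -> float:
    """Halve the count for every half_life_days without use."""
    return count * 0.5 ** (days_since_last_use / half_life_days)

def expire(entries: dict, today: float,
           half_life_days: float = 180.0, threshold: float = 0.5) -> dict:
    """entries maps word -> (count, day of last use).

    Returns word -> decayed count, without entries that faded below threshold.
    """
    kept = {}
    for word, (count, last_used) in entries.items():
        value = decayed_count(count, today - last_used, half_life_days)
        if value >= threshold:
            kept[word] = value
    return kept

entries = {'junk': (1, 0), 'often': (10, 300)}
print(decayed_count(10, 180))      # 5.0
print(expire(entries, today=360))  # 'junk' is dropped, 'often' is kept
```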

mike-fabian / ibus-typing-booster

Preedit of accented words can be improved #231