Open psads-git opened 3 years ago
Prediction is accent insensitive at the moment (for Portuguese).
I.e. estã and esta give you exactly the same predictions.
This has the advantage that one often doesn’t have to care about typing the accents at all and can just select them from the prediction.
Like this:
For me, this is very helpful when typing French or Spanish. I don’t know these languages well and often make mistakes in the accents, but with this feature I can first write the word without the accents if I am not sure and then select the correct version.
Here is a list of languages where this accent insensitive matching makes sense:
https://github.com/mike-fabian/ibus-typing-booster/blob/main/engine/hunspell_suggest.py#L67
# List of languages where accent insensitive matching makes sense:
ACCENT_LANGUAGES = {
'af': '',
'ast': '',
'az': '',
'be': '',
'bg': '',
'br': '',
'bs': '',
'ca': '',
'cs': '',
'csb': '',
'cv': '',
'cy': '',
'da': 'æÆøØåÅ',
'de': '',
and more like this ...
As you see, pt is in that list.
You might notice that some languages have a list of special characters after the language name, for example:
'da': 'æÆøØåÅ',
That is because for a native speaker of Danish, “å” is not an accented version of “a” but a completely different letter. In Danish, “å” is also sorted after “z”, not after “a”!
In German, an “ä” is considered as some variation of “a” and therefore sorted as a secondary difference after “a”.
For example, when using Danish, typing "Smørrebrød” gives me a match:
But typing “Smorrebrod” does not give me a match (When using the Danish dictionary):
That is because in Danish, “ø” is considered a different letter, not a variation of “o”.
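A minimal sketch of how such a per-language exception list could be applied when stripping accents. The function name `remove_accents` and the small `NO_DECOMPOSITION` table are illustrative assumptions, not the actual ibus-typing-booster code. Letters like “ø” have no Unicode decomposition, so the sketch maps them by hand.

```python
import unicodedata

# Hypothetical fallback for letters that NFD cannot split into base + mark.
NO_DECOMPOSITION = {'ø': 'o', 'Ø': 'O', 'æ': 'ae', 'Æ': 'AE'}

def remove_accents(text: str, keep: str = '') -> str:
    '''Strip accents from text, except for the characters listed in keep.'''
    result = []
    for char in unicodedata.normalize('NFC', text):
        if char in keep:
            result.append(char)  # exception character: leave untouched
        elif char in NO_DECOMPOSITION:
            result.append(NO_DECOMPOSITION[char])
        else:
            decomposed = unicodedata.normalize('NFD', char)
            result.append(''.join(
                c for c in decomposed if not unicodedata.combining(c)))
    return ''.join(result)

DANISH_EXCEPTIONS = 'æÆøØåÅ'  # the 'da' entry quoted above
print(remove_accents('Smørrebrød', keep=DANISH_EXCEPTIONS))  # ø is kept
print(remove_accents('Smørrebrød'))                          # ø is folded to o
print(remove_accents('grün', keep=DANISH_EXCEPTIONS))        # ü is not an exception
```

With the Danish exceptions, “Smorrebrod” and “Smørrebrød” stay different after stripping, so they do not match. Without exceptions both become “Smorrebrod”.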
Now this might depend on whether one is a native speaker of a language or not.
A non-native speaker of Danish who is trying to type in Danish might find it helpful if “Smorrebrod” did actually match “Smørrebrød”.
Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster.
Currently this behaviour is hardcoded in the list shown above, so when using Portuguese, you have no choice: matching is always done accent insensitive. And when using Danish, matching for the characters in the above exception list is done accent sensitive, while matching for all other accented characters is done accent insensitive, even when using Danish.
As users might disagree about this, especially users who are native speakers and users who are not, this should not be just hardcoded.
There should be an option, probably with 3 values:
Accent insensitive matching: [always | never | according to the language rules]
When this option is set to “always”, accent insensitive matching would occur even for Danish when typing “Smorrebrod”.
When this option is set to “never”, accent insensitive matching would never happen, not even for “grun” -> “grün” in German or “estao” -> “estão” in Portuguese.
When this option is set to “according to language rules”, this would be the current behaviour: generally accent insensitive matching is done, but some languages may have some characters which are exceptions and are matched accent sensitive.
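The three values could behave roughly like the sketch below. All names (`AccentMode`, `fold`, `matches`) are hypothetical, the `'pt'` entry is assumed to be empty, and treating languages missing from the list as accent sensitive under “language rules” is also an assumption.

```python
import unicodedata
from enum import Enum

class AccentMode(Enum):
    ALWAYS = 'always'
    NEVER = 'never'
    LANGUAGE_RULES = 'according to the language rules'

# Excerpt of the list quoted above; the 'pt' value is an assumption.
ACCENT_LANGUAGES = {'da': 'æÆøØåÅ', 'de': '', 'pt': ''}

# Hypothetical fallback for letters without a Unicode decomposition.
NO_DECOMPOSITION = {'ø': 'o', 'Ø': 'O', 'æ': 'ae', 'Æ': 'AE'}

def nfc(text):
    return unicodedata.normalize('NFC', text)

def fold(text, keep=''):
    '''Strip accents, except for the characters listed in keep.'''
    out = []
    for char in nfc(text):
        if char in keep:
            out.append(char)
        else:
            char = NO_DECOMPOSITION.get(char, char)
            out.append(''.join(c for c in unicodedata.normalize('NFD', char)
                               if not unicodedata.combining(c)))
    return ''.join(out)

def matches(typed, word, language, mode):
    '''Return True if word starts with typed under the given mode.'''
    sensitive = mode is AccentMode.NEVER or (
        mode is AccentMode.LANGUAGE_RULES
        and language not in ACCENT_LANGUAGES)
    if sensitive:
        return nfc(word).startswith(nfc(typed))
    keep = ACCENT_LANGUAGES[language] if mode is AccentMode.LANGUAGE_RULES else ''
    return fold(word, keep).startswith(fold(typed, keep))

print(matches('Smorrebrod', 'Smørrebrød', 'da', AccentMode.ALWAYS))
print(matches('Smorrebrod', 'Smørrebrød', 'da', AccentMode.LANGUAGE_RULES))
print(matches('grun', 'grün', 'de', AccentMode.NEVER))
```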
Yeah, Mike, that is an excellent idea!
Thanks, Mike. The point is that typing the accent increases the predictive ability of ibus-typing-booster.
Yes, if you can type the accents correctly, the number of matching words is smaller and it is more likely that the correct one is among them. If I look in the Portuguese hunspell dictionary, I find:
$ grep ^agua /usr/share/myspell/pt_PT.dic
aguaçal [CAT=nc,G=m,N=s]
aguaça [CAT=nc,G=f,N=s]
aguaceiro/fp [CAT=nc,G=m,N=s]
aguada/p [CAT=nc,G=f,N=s]
aguadeiro/p [CAT=a_nc,G=m,N=s]
aguardar/XYPLD [CAT=v,T=inf,TR=t]
aguardenteiro [CAT=nc,G=m,N=s]
aguardente/p [CAT=nc,G=f,N=s]
aguardentoso [CAT=adj,N=s,G=m]
aguarela/p [CAT=nc,G=f,N=s]
aguarelar/XYPL [CAT=v,T=inf,TR=t]
aguarelista [CAT=nc,G=_,N=s]
aguarrás [CAT=nc,G=f,N=s]
aguar/YPLM [CAT=v,T=inf,TR=t,I=3]
aguas/PL [$aguar$CAT=v,T=inf,TR=_$P=2,N=s,T=p]
agua/PL [$aguar$CAT=v,T=inf,TR=_$P=3,N=s,T=p]
aguamos/PL [$aguar$CAT=v,T=inf,TR=_$P=1,N=p,T=p]
aguais/PL [$aguar$CAT=v,T=inf,TR=_$P=2,N=p,T=p]
aguam/PL [$aguar$CAT=v,T=inf,TR=_$P=3,N=p,T=p]
agua/PL [$aguar$CAT=v,T=inf,TR=_$P=2,N=s,T=i]
aguai/PL [$aguar$CAT=v,T=inf,TR=_$P=2,N=p,T=i]
and
$ grep ^água /usr/share/myspell/pt_PT.dic
água-ardente/p [CAT=nc,G=f,N=s]
água-chilra [CAT=nc,G=f,N=s]
água-forte [CAT=nc,G=f,N=s]
água-furtada [CAT=nc,G=f,N=s]
água/p [CAT=nc,G=f,N=s]
água-marinha [CAT=nc,G=f,N=s]
água-oxigenada [CAT=nc,G=f,N=s]
água-pé/p [CAT=nc,G=f,N=s]
águas-furtadas [$água-furtada$CAT=nc,G=f,N=s$N=p]
água-tinta [CAT=nc,G=f,N=s]
Accent insensitive matching gives you the results of both, no matter whether you typed “agua” or “água”.
(And what makes it worse is that the hunspell dictionaries have no information about which words are common and which are not!)
But if you know that you want “ág” plus something and not “ag” plus something, then accent sensitive matching would help.
So if you prefer typing the accents correctly yourself, accent sensitive matching is better.
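The difference in the number of candidates can be illustrated with a few of the words from the two grep results above. This is only a sketch with a hand-picked word list and hypothetical function names, not how the dictionaries are actually searched.

```python
import unicodedata

def strip_accents(text):
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if not unicodedata.combining(c))

# A few of the dictionary entries listed above, without their hunspell flags.
WORDS = ['aguaceiro', 'aguardar', 'aguardente', 'aguarela',
         'água', 'água-forte', 'água-marinha', 'águas-furtadas']

def candidates(typed, accent_sensitive):
    '''Prefix match against WORDS, with or without accents.'''
    if accent_sensitive:
        typed = unicodedata.normalize('NFC', typed)
        return [w for w in WORDS
                if unicodedata.normalize('NFC', w).startswith(typed)]
    typed = strip_accents(typed)
    return [w for w in WORDS if strip_accents(w).startswith(typed)]

print(candidates('águ', accent_sensitive=False))  # both groups of words
print(candidates('águ', accent_sensitive=True))   # only words starting with "águ"
```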
But this is really a matter of choice, I prefer accent insensitive matching a lot, even when typing my native language (which is German). I often type words without the accents and select the correct version.
For German, I can also type accents correctly without problems if I want to, so for German accent sensitive matching would work for me, although I still prefer accent insensitive matching, even for German. For French, accent insensitive matching helps me a lot, as I make too many mistakes when typing the accents and therefore often would not get the right matches at all if the matching were accent sensitive.
So this really should be a user option.
I didn't yet create that option although I immediately thought of this when a user from Scandinavia requested exceptions for characters like ø, å, ..., because I was thinking about how to exactly make this optional and got confused because one could even make it optional in a more fine grained way and I was wondering whether I should do that and how exactly.
If there is only an option with 3 values like
Accent insensitive matching: [always | never | according to the language rules]
then it is obviously already better than the current situation with no option. But what if one wants accent insensitive matching for one language and not for others?
For example, I as a native German speaker and learner of French and Spanish have the German, French, and Spanish dictionaries configured in ibus-typing-booster.
Maybe I would want accent sensitive matching in the German dictionary but accent insensitive matching in the French and Spanish dictionaries.
So it might be useful if one could set this up not with a single option for all dictionaries, but with more fine grained options per dictionary.
And then I wondered what the user interface for this should look like ... and got confused ...
A single option is rather easy to implement. Making this configurable per language might be even more useful than a single option, but the UI is going to get complicated.
Indeed, Mike, it is essentially a matter of preference.
There is a small detail that may improve ibus-typing-booster performance, in the accent sensitive case. Consider the word
também
ibus-typing-booster should display its prediction as soon as the accent is typed and not after the accented character is typed (é, in this example).
I think there is plenty of space in the gui:
You write: “as soon as the accent is typed”.
I wonder how you type your accents. There are many ways. For example, I can type an ü in any of these ways:
ü directly on my keyboard
u"
"u
<dead_diaeresis> <u>
<Multi_key> <quotedbl> <u>
<Multi_key> <u> <quotedbl>
u followed by a combining diaeresis (U+0308 COMBINING DIAERESIS)
On my keyboard layout (a heavily customized version of a US English layout), I can actually type all of the above to get an ü.
My preferred method is the t-latn-post input method, i.e. most of the time I type u" to get an ü.
I think there is plenty of space in the gui:
Yes, at the end of each dictionary line in the setup tool, there could be a combobox where you have the above mentioned 3 choices (accent sensitive, insensitive, language rules).
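I cannot name the way I and everyone here in Portugal type accented words. However, the process consists of two steps:
1. One presses the key with the wanted accent; 2. One presses the key with the letter on which one wants to place the accent.
That is why my suggestion tends to save the second step!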
My current feeling is, that if I make only one option with these 3 choices for all languages, then I would probably regret it later. Because most likely I would need to expand it later to more fine grained control for each dictionary separately and that would be a nasty change with backwards compatibility problems.
So at the moment I think I should make this optional per language immediately, even if it is far more complicated to implement.
I cannot name the way I and everyone here in Portugal types accented words. However, the process consists of two steps:
1. One presses the key with the wanted accent; 2. One presses the key with the letter on which one wants to place the accent.
It seems to me that you are using so-called “dead keys”.
To add the option per language is even better than to all, since it gives more freedom to the user.
Yes, and in the long run I will have to do that anyway, so I had better not postpone it.
I guess you are using the first keyboard layout from /usr/share/X11/xkb/symbols/pt, which is:
default partial alphanumeric_keys
xkb_symbols "basic" {
include "latin(type4)"
name[Group1]="Portuguese";
key <TLDE> { [ backslash, bar, notsign, notsign ] };
key <AE03> { [ 3, numbersign, sterling, sterling ] };
key <AE04> { [ 4, dollar, section, dollar ] };
key <AE11> { [ apostrophe, question, backslash, questiondown ] };
key <AE12> { [ guillemotleft, guillemotright, dead_cedilla, dead_ogonek ] };
key <AD11> { [ plus, asterisk, dead_diaeresis, dead_abovering ] };
key <AD12> { [ dead_acute, dead_grave, dead_tilde, dead_macron ] };
key <BKSL> { [ dead_tilde, dead_circumflex, dead_grave, dead_breve ] };
key <AC10> { [ ccedilla, Ccedilla, dead_acute, dead_doubleacute ] };
key <AC11> { [ masculine, ordfeminine,dead_circumflex, dead_caron ] };
key <LSGT> { [ less, greater, backslash, backslash ] };
include "level3(ralt_switch)"
};
The second layout in that file is one without dead keys:
partial alphanumeric_keys
xkb_symbols "nodeadkeys" {
include "pt(basic)"
name[Group1]="Portuguese (no dead keys)";
key <AE12> { [ guillemotleft, guillemotright, cedilla, ogonek ] };
key <AD11> { [ plus, asterisk, quotedbl, quotedbl ] };
key <AD12> { [ acute, grave ] };
key <BKSL> { [ asciitilde, asciicircum ] };
key <AC10> { [ ccedilla, Ccedilla, acute, doubleacute ] };
key <AC11> { [ masculine, ordfeminine, asciicircum, caron ] };
key <AB10> { [ minus, underscore, dead_belowdot, abovedot ] };
};
So probably you are using a layout like this one:
https://en.wikipedia.org/wiki/Portuguese_keyboard_layout#/media/File:KB_Portuguese.svg
The keys marked in red on that layout are dead keys.
My keyboard is similar to this one:
https://www.worten.pt/i/8466d924afbaa5bd14e604fa3ca649a377762776.jpg
Exactly, Mike!
The problem with these dead keys is that they are handled in a very special way.
They don’t go directly into the preëdit. Maybe you noticed the option “Use color for the compose preview” in the setup tool. Dead keys and compose are basically the same mechanism. Try to use that option and choose an obvious colour like I did in this screenshot:
If you do that, it makes it more obvious what is going on.
Thanks, Mike. I have just done that.
The normal colour for the preëdit text is black.
I type est and it is black (and some completions are shown in yellow because I chose that colour for completions).
Now I type a dead_tilde and see est in black followed by a ~ in green.
Note that all completions have disappeared while the green ~ is there!
That is because the compose handling is a sort of preëdit inside a preëdit. The external preëdit (black) has to wait until the internal preëdit (green) is finished to continue searching for completions.
After the a has been typed, the green ~ plus the a combine to ã and this ã is black. Because the compose sequence is finished and the internal preëdit is now gone, the ã is now part of the “normal” preëdit.
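The “preëdit inside a preëdit” behaviour described here can be modelled with a toy class. This is only a sketch of the idea. The class name, the tiny compose table and the word list are made up for illustration and do not reflect the real implementation.

```python
class Preedit:
    '''Toy model of a compose preëdit nested inside the normal preëdit.'''

    # Hypothetical compose table: (dead key, base letter) -> result
    COMPOSE = {('dead_tilde', 'a'): 'ã',
               ('dead_tilde', 'o'): 'õ',
               ('dead_tilde', 'n'): 'ñ'}

    def __init__(self, words):
        self.words = words
        self.text = ''        # the normal ("black") preëdit
        self.pending = None   # the unfinished compose sequence ("green")

    def type_key(self, key):
        if self.pending is not None:
            # Finish the compose sequence; unknown sequences are dropped.
            self.text += self.COMPOSE.get((self.pending, key), '')
            self.pending = None
        elif key.startswith('dead_'):
            self.pending = key
        else:
            self.text += key

    def completions(self):
        if self.pending is not None:
            return []  # wait until the compose sequence is finished
        return [w for w in self.words if w.startswith(self.text)]

preedit = Preedit(['esta', 'estão'])
for key in ['e', 's', 't']:
    preedit.type_key(key)
print(preedit.completions())     # completions for "est"
preedit.type_key('dead_tilde')
print(preedit.completions())     # empty while the dead key is pending
preedit.type_key('a')
print(preedit.text, preedit.completions())
```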
Try typing Tab while you see the green ~.
Typing Tab while you see the green ~ gives you something like this:
https://user-images.githubusercontent.com/2330175/130494027-e3eb253f-1136-4afa-9f31-2a6b6c314a15.mp4
So this shows you how the compose sequence starting with a dead ~ could be completed.
What you see in that list of possible completions might be somewhat different from what I see, because I include only those completions in the list which can actually be typed on the current keyboard layout.
If I didn't limit it to those which are possible to type on the current keyboard layout, there would almost always be hundreds of possible completions. For example there is this:
$ grep '^<dead_tilde>.*ᾶ' /usr/share/X11/locale/en_US.UTF-8/Compose
<dead_tilde> <Greek_alpha> : "ᾶ" U1FB6 # GREEK SMALL LETTER ALPHA WITH PERISPOMENI
But if your keyboard layout doesn't even have a key <Greek_alpha>, then I omit this because it is probably not so interesting.
For a user of a Greek keyboard layout, I show this one but omit others which cannot be typed on the Greek keyboard layout.
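The filtering by keyboard layout could look roughly like this. The data structure and the function name are assumptions for the sake of the example. Only the three sequences mentioned in this thread are included.

```python
# Hypothetical excerpt of parsed Compose data: key sequence -> result.
COMPOSE_SEQUENCES = {
    ('dead_tilde', 'a'): 'ã',
    ('dead_tilde', 'greater'): '≳',
    ('dead_tilde', 'Greek_alpha'): 'ᾶ',
}

def typable_completions(started, layout_keysyms):
    '''Completions of an unfinished compose sequence that the layout can type.

    started is the tuple of keys typed so far, layout_keysyms is the set of
    key symbols available on the current keyboard layout.
    '''
    completions = []
    for sequence, result in COMPOSE_SEQUENCES.items():
        remaining = sequence[len(started):]
        if (sequence[:len(started)] == started and remaining
                and all(key in layout_keysyms for key in remaining)):
            completions.append((remaining, result))
    return completions

portuguese_keys = {'dead_tilde', 'a', 'greater'}
greek_keys = {'dead_tilde', 'Greek_alpha'}
print(typable_completions(('dead_tilde',), portuguese_keys))
print(typable_completions(('dead_tilde',), greek_keys))
```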
In my video, I typed a dead ~ and Tab and then selected ≳ from the candidate list shown.
The candidate list shows in the first column what one could type to get this.
So in case of the ≳ it shows a > in the first column and ≳ in the second column.
This tells you that after typing a dead ~ you could type a > to get ≳.
So typing Tab when a compose sequence is started but not finished yet tells you what choices you have to finish the sequence. This makes it easier to learn the possible compose sequences, certainly easier than reading the /usr/share/X11/locale/en_US.UTF-8/Compose file where all these sequences are defined.
I guess there is no <Greek_alpha> key on my keyboard, Mike!
While an unfinished compose sequence is being typed, ibus-typing-booster basically stops everything else it is doing, waits until the compose sequence is finished, and then continues with predictions. While the compose sequence is unfinished, the only things you can do are to show possible completions with Tab, correct with Backspace, or cancel the compose sequence with Escape.
Got it, Mike!
Even if I could get the ~ from the unfinished compose sequence, it would not be useful to complete anything, because neither the dictionaries nor the database contain things like est~ao which one could match with est~. The dictionary only has estão.
For matching, the dictionary and database are internally converted to NFD (Normalization form D):
https://unicode.org/reports/tr15/#Norm_Forms
I.e. what is matched against is actually something like esta~o where the ~ is a combining tilde.
And then the combining characters like the combining ~ are filtered out in case of accent insensitive matches and kept in case of accent sensitive matches like the Danish ø.
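The NFD form can be inspected with Python's standard `unicodedata` module. The short sketch below (the helper name `strip_combining` is made up) shows that the tilde is stored after the base letter as a combining character, so an ASCII ~ typed before the vowel is not a prefix of the stored form.

```python
import unicodedata

word = unicodedata.normalize('NFD', 'estão')

# One name per code point; the tilde appears after the "a"
# as COMBINING TILDE (U+0303).
print([unicodedata.name(c) for c in word])

def strip_combining(text):
    '''Drop all combining characters from text that is already in NFD.'''
    return ''.join(c for c in text if not unicodedata.combining(c))

print(strip_combining(word))          # form used for accent insensitive matching
print(word.startswith('est~'))        # ASCII "~" before the vowel
print(word.startswith('esta\u0303'))  # base letter followed by the combining mark
```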
That's why I asked how you type an ã, because there are so many ways.
In case of using something like t-latn-post or t-latn-pre, the ~ would actually be part of the “normal” preëdit, not the compose preëdit. But it can be before or after the base character a. On some keyboard layouts one can actually type a followed by combining ~ (which is the NFD way!). Handwriting is usually done in the same way, writing the accent after the base character.
So I think starting to match something when only a ~ has been typed is near impossible, the possibilities are enormous.
Converting all the dictionaries on reading them to forms having ã, a~, and ~a would make the loaded dictionaries much bigger and the search much slower for very little gain.
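To get a feeling for the growth, here is a rough sketch that generates the three spellings for every accented letter of a word. The mapping from combining marks to ASCII characters and the function name are assumptions made only for this illustration.

```python
import itertools
import unicodedata

# Hypothetical ASCII stand-ins for a few combining marks.
ASCII_MARK = {'\u0303': '~', '\u0301': "'", '\u0308': '"'}

def typing_variants(word):
    '''All spellings of word with each accented letter written as a
    precomposed character, as base + ASCII mark, or as ASCII mark + base.'''
    choices = []
    for char in unicodedata.normalize('NFC', word):
        decomposed = unicodedata.normalize('NFD', char)
        if len(decomposed) == 2 and decomposed[1] in ASCII_MARK:
            base, mark = decomposed[0], ASCII_MARK[decomposed[1]]
            choices.append((char, base + mark, mark + base))
        else:
            choices.append((char,))
    return [''.join(parts) for parts in itertools.product(*choices)]

print(typing_variants('estão'))        # 3 spellings for one accented letter
print(len(typing_variants('órgão')))   # 9 spellings for two accented letters
```

A word with k accented letters gets 3**k spellings, so every accented dictionary entry would be stored at least three times.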
Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:
estã
estẽ
estĩ
estõ
estũ
😉
I agree that the gain would be small.
A simple ~ can only match something if the user uses t-latn-pre and often types est~ao that way. Then est~ does match a previously typed estão, because it is remembered that what was actually typed was est~ao and what was committed was estão. The next time est~ is typed, this can complete to estão.
See how est~ shows estão among the candidates here:
https://user-images.githubusercontent.com/2330175/130499129-dd1f8a1b-1ea0-4ab4-8d0e-ed2814bde7d1.mp4
This works only because I use t-latn-pre in this example and not a dead ~; that makes a big difference.
And of course I typed est~ao a few times before recording that video to make ibus-typing-booster learn this.
On your Portuguese keyboard layout, using t-latn-pre would be very inconvenient though, because you don’t have a normal ~, only a dead ~. You would need to type the dead ~ twice to get a normal ~, which could then combine with the following letter using t-latn-pre.
I.e. with t-latn-pre on your keyboard layout, you actually would need to type ~~a to get an ã.
That makes no sense for you; I just mentioned t-latn-pre because this is yet another way to type this which can be very useful on layouts which do not have dead keys.
Well, Mike, in Portuguese, that would be easy since all accented characters are vowels! Therefore, ibus-typing-booster could search for:
estã estẽ estĩ estõ estũ
😉
But ibus-typing-booster doesn't know you are typing Portuguese. One can have several languages configured at the same time.
For example one could have a Spanish and a Portuguese dictionary configured at the same time. And in Spanish typing ~ could mean an n is coming to make a ñ. And while you are typing something into the preëdit, ibus-typing-booster cannot know from which of the several languages you may have configured the word you are typing is going to be.
So I think matching something forward when only some accent has been typed, like matching estão when only est~ has been typed, doesn’t seem reasonably possible (except for special circumstances like when using t-latn-pre).
The amount of calculation for this would be huge, it would depend very much on which languages exactly are configured, lots of special casing, no high speed matching with patterns like regular expressions possible anymore.
So I think matching ~ will probably never work.
But matching accent sensitive, as discussed above, is possible and I think I will do that.
I.e. making ã match something different than what just a would match. That is possible and probably useful as a user option.
And it probably already would do most of what you want.
Thanks, Mike. Your arguments have convinced me that matching ~ is not a good idea.
Great, but I'll do the other thing with the accent sensitive matching.
This might take quite a while though as it is really quite difficult to implement.
I think combobox buttons at the end of each dictionary line are a good idea, but I still need to think about how to save that to gsettings and read it back from there. I have a few ideas but I think I need to think about this for a few days before starting to implement anything.
Thanks, Mike. That is nothing really urgent! So, take your time. No hurry at all!
I remembered that there needs to be an extra option for the user database.
Each dictionary line needs to have an option whether to match accent insensitive [always | never | language rules].
And there needs to be an option whether to store accents the user typed in the user database.
Currently, accents are removed from the text the user typed when storing in the database.
That means if one types estã and then selects estão and commits, the ~ is dropped from the user input. So what is stored in the database is that the user typed esta (without the ~) and then selected estão.
So the next time the user types either esta or estã, estão is a match in both cases because the user input with accents removed is esta in both cases and what was recorded in the database is also esta without the accent, so it matches.
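The current behaviour can be sketched with a toy in-memory stand-in for the user database. The class and method names are hypothetical and the real database stores more (for example the context). The sketch only shows why esta and estã behave the same.

```python
import unicodedata
from collections import Counter

def strip_accents(text):
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if not unicodedata.combining(c))

class UserDatabase:
    '''Toy in-memory stand-in for the user database.'''

    def __init__(self):
        # (stored input, committed phrase) -> number of times selected
        self.counts = Counter()

    def record(self, typed, committed):
        # Accents are removed from the typed input before storing.
        self.counts[(strip_accents(typed), committed)] += 1

    def lookup(self, typed):
        # The new input is stripped in the same way before matching.
        key = strip_accents(typed)
        return [phrase
                for (stored, phrase), _count in self.counts.most_common()
                if stored.startswith(key)]

db = UserDatabase()
db.record('estã', 'estão')   # stored as ('esta', 'estão')
print(db.lookup('esta'))
print(db.lookup('estã'))
```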
The user database is language agnostic (which is a good thing!), it just records what the user typed in what context and which completion candidate was selected.
As some users may wish to make the matching more strict by matching accent sensitive, such an option has also to be added for the user database.
Maybe a simple checkbox is enough:
[✅] Accent sensitive matching in user database
An option with 3 values like for the dictionaries ([always | never | according to language rules]) doesn’t seem to make sense for the user database because the user database has no language. So maybe that simple checkbox which allows switching it on or off is enough for the user database.
But one could also make it a more detailed option with more possibilities, maybe even allowing the user to specify a list of characters they want to match accent sensitively:
[✅] Accent sensitive matching in user database
Accent sensitive matching in user database only for [ÅåØø]
I.e. when [ ] Accent sensitive matching in user database is off, this would be the current behaviour, all accents are ignored. The second option does not matter then.
When [✅] Accent sensitive matching in user database is on, and the list of exception characters is empty, all accents would be kept in the user database.
When [✅] Accent sensitive matching in user database is on, and the list of exception characters is not empty, the exception characters would be stored with their accents in the user database when the user types such characters but all other accented characters still get their accents stripped.
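The three cases could be expressed as one function that decides what form of the user input gets stored. This is a sketch under the semantics described above. The function and parameter names are invented for the example.

```python
import unicodedata

def normalize_user_input(typed, accent_sensitive=False, exceptions=''):
    '''Return the form of the typed text to store in the user database.'''
    typed = unicodedata.normalize('NFC', typed)
    if accent_sensitive and not exceptions:
        return typed  # checkbox on, empty list: keep every accent
    result = []
    for char in typed:
        if accent_sensitive and char in exceptions:
            result.append(char)  # exception character, stored as typed
        else:
            # Checkbox off, or character not in the list: strip the accent.
            result.append(''.join(
                c for c in unicodedata.normalize('NFD', char)
                if not unicodedata.combining(c)))
    return ''.join(result)

print(normalize_user_input('estã'))                          # checkbox off
print(normalize_user_input('estã', accent_sensitive=True))   # on, no exception list
print(normalize_user_input('på grün', accent_sensitive=True, exceptions='ÅåØø'))
```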
These options for the user database would only have an effect on new input.
Stuff which is already in the user database cannot be changed later. Theoretically I could remove accents from user input which is already in the user database, but there is no way to put them back. Because if there is only a in the user database, one cannot know whether the user really typed a or ã, ä, ... and the accent was stripped.
So after switching the [✅] Accent sensitive matching in user database option to a different value, the effect would become visible slowly after more typing as the newly typed words get higher weight.
Side note: For a long time already I have been thinking of a kind of expiry feature in the user database: words which have not been typed for a long, long time should fade away from the user database. Currently everything is kept forever. My user database, which is many years old, has 32 megabytes now:
$ ls ~/.local/share/ibus-typing-booster/user.db -lh
-rw-r--r--. 1 mfabian mfabian 32M 8月 24 08:43 /home/mfabian/.local/share/ibus-typing-booster/user.db
There is old junk inside because I tested some stuff many years ago.
That hurts less than one might think because if I never type it again, it never gets a higher score. It stays with a count of 1 in the database forever, but with such a low count it is unlikely that it will ever show up as a candidate. Words which I type often have much higher counts. But a needlessly huge database makes everything slower. I am thinking about something similar to radioactive decay: if an entry has been typed 10 times, don’t keep that count forever but slowly reduce it over time and drop the entry if it reaches 0. As time passes, old junk which is never typed again would be dropped automatically. Words typed recently would get higher scores than words typed months ago ...
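The decay idea might be sketched like this. The half-life, the threshold and the function names are arbitrary assumptions. Since an exponentially decaying count never reaches exactly 0, the sketch drops entries once they fall below a small threshold instead.

```python
def decayed_count(count, days_since_last_use, half_life_days=180.0):
    '''Halve the count for every half_life_days that have passed.'''
    return count * 0.5 ** (days_since_last_use / half_life_days)

def expire(entries, today, threshold=0.5, half_life_days=180.0):
    '''Return the decayed scores of all entries that are still above threshold.

    entries maps phrase -> (count, day of last use); days are plain numbers.
    '''
    kept = {}
    for phrase, (count, last_used) in entries.items():
        score = decayed_count(count, today - last_used, half_life_days)
        if score >= threshold:
            kept[phrase] = score
    return kept

entries = {
    'old test junk': (1, 0),   # typed once, 1000 days ago
    'estão': (10, 900),        # typed 10 times, last time 100 days ago
}
print(expire(entries, today=1000))
```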
Hi, Mike,
Preedit of accented words can be improved. Let me explain my idea. In Portuguese, many words contain an accented character (typically, only one accented character). Consider the word
estão
The prediction of the word should be fired as soon as the accent ~ is typed and not after the entire accented character is typed (ã), in order to improve the typing speed.
Thanks!