[BUG] [Bengali] Alternative Character Replaces the Intended Character

mike-fabian / ibus-typing-booster

ibus-typing-booster is a completion input method for faster typing

https://mike-fabian.github.io/ibus-typing-booster/

[BUG] [Bengali] Alternative Character Replaces the Intended Character #531

Closed rank-coder closed 2 months ago

rank-coder commented 2 months ago

Describe the bug I have a layout that uses ড় character. There is another way of writing the letter: using ড and a dot below it. Since I used the standalone character in the layout, I expected typing-booster to use the same character used in the layout. But typing-booster is not using that dedicated character and instead, using ড and a dot below it. I confirmed it by using the m17n layout with and without typing-booster.

To Reproduce Steps to reproduce the behavior:

1. Use khipro layout from https://github.com/rank-coder/khipro-m17n
2. Try typing ড় by writing 'rf'
3. ড় will show up instead of ড়
4. To confirm it, press Backspace once right after the character and you'll see the dot erased. The same won't happen if you use that layout without typing-booster.
I have also confirmed this issue in other ways, such as in dictionaries. Words won't match if they are written the other way.

Expected behavior The layout should type the same character with and without typing-booster

Screenshots or videos I'm showing you writing the character with the same layout but with and without typing booster.

https://github.com/user-attachments/assets/a89181b2-e4df-4e3b-82c6-378570b63a07

ibus-typing-booster version? 2.25.14

ibus version? IBus 1.5.30

Distribution and version? Fedora Workstation 40 with Gnome

Desktop and version? Gnome 46

Xorg or Wayland? wayland

Additional context

mike-fabian commented 2 months ago

3. ড় will show up instead of ড়

I think these are both the same, both are (ড U+09A1 BENGALI LETTER DDA ় U+09BC BENGALI SIGN NUKTA), for both of them I can remove the dot with one backspace. Maybe cut&paste into github didn’t work right?

Anyway, you are right, when typing rf with your latest https://github.com/rank-coder/khipro-m17n/blob/main/bn-khipro-20.mim I get

ibus-m17n: ড় (ড় U+09DC BENGALI LETTER RRA)
ibus-typing-booster: ড় (ড U+09A1 BENGALI LETTER DDA ় U+09BC BENGALI SIGN NUKTA)

Investigating why this happens ...

mike-fabian commented 2 months ago

The difference is that ibus-typing-booster normalizes all output to NFC and ibus-m17n does not normalize the output.

mfabian@hathi:~
$ python3
Python 3.12.5 (main, Aug  7 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09dc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09dc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFKC', '\u09dc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFKD', '\u09dc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09a1\u09bc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09a1\u09bc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFKC', '\u09a1\u09bc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFKD', '\u09a1\u09bc')]
['09a1', '09bc']
>>> '\u09a1\u09bc'
'ড়'
>>> '\u09dc'
'ড়'
>>>

rank-coder commented 2 months ago

The difference is that ibus-typing-booster normalizes all output to NFC and ibus-m17n does not normalize the output.

Hm, I guess that means it's not possible to solve. And also maybe this is the reason you found the 2 characters same in my post earlier.

rank-coder commented 2 months ago

The same thing happens with all three of these: ড়, ঢ়, য়.

mike-fabian commented 2 months ago

Hm, I guess that means it's not possible to solve.

Well I could possibly make an option to not normalize to NFC. Although that normalization is in most cases very useful.

Currently ibus-typing-booster normalizes to NFC on display (preedit or lookup tables) and output and normalizes to NFD internally and in the database. The internal normalization to NFD (decomposed style) is to make the single character ড match ড U+09A1 BENGALI LETTER DDA ় U+09BC BENGALI SIGN NUKTA. If that were stored as ড় U+09DC BENGALI LETTER RRA then ড would not match.
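
The matching argument above can be checked with a minimal sketch. It is not code from ibus-typing-booster; the variable names are made up for illustration, and only `unicodedata.normalize` from the standard library is used.

```python
import unicodedata

RRA = "\u09dc"    # BENGALI LETTER RRA (single code point)
DDA = "\u09a1"    # BENGALI LETTER DDA
NUKTA = "\u09bc"  # BENGALI SIGN NUKTA


def nfd(text: str) -> str:
    """Return the canonically decomposed (NFD) form of text."""
    return unicodedata.normalize("NFD", text)


stored_raw = RRA       # stored without normalization (hypothetical)
stored_nfd = nfd(RRA)  # stored in NFD, i.e. DDA followed by NUKTA
typed = DDA            # the user has typed only the base letter so far

# Without normalization the typed prefix does not match the stored entry.
print(stored_raw.startswith(typed))
# With NFD on both sides the typed prefix matches.
print(stored_nfd.startswith(nfd(typed)))
```

The first check fails and the second succeeds, which is the reason given above for keeping the database in NFD.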

And also maybe this is the reason you found the 2 characters same in my post earlier.

My guess is that something normalized it when you were pasting into GitHub, and apparently normalization never produces U+09DC; it always produces U+09A1 U+09BC. I find that a bit weird because normalization to NFC usually produces a “maximally precomposed” result, so I had expected that [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09a1\u09bc')] would have produced ['09dc'], but it doesn’t do that.

mike-fabian commented 2 months ago

Doing this with Latin characters for example, one gets this:

u (U+0075 LATIN SMALL LETTER U)
ü (u U+0075 LATIN SMALL LETTER U ̈ U+0308 COMBINING DIAERESIS)
ü (ü U+00FC LATIN SMALL LETTER U WITH DIAERESIS)

mfabian@hathi:~
$ python3
Python 3.12.5 (main, Aug  7 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u0075\u0308')]
['00fc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u0075\u0308')]
['0075', '0308']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u00fc')]
['00fc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u00fc')]
['0075', '0308']
>>> '\u0075'
'u'
>>> '\u0075\u0308'
'ü'
>>> '\u00fc'
'ü'
>>>

mike-fabian commented 2 months ago

So for the Latin characters, NFC produces the precomposed version of the character.

But for the Bengali characters, it does not do that. I'm still confused why not.

mike-fabian commented 2 months ago

I asked ChatGPT ☺ and it told me this:

The difference in behavior you're observing between the Latin characters and Bengali characters during Unicode normalization, particularly in NFC (Normalization Form C), is rooted in the way Unicode normalization handles precomposed characters and canonical equivalence.

Latin Characters

Let’s look at your Latin character example first:

Latin Small Letter 'u' (U+0075) combined with Combining Diaeresis (U+0308):
- When you normalize this combination using NFC, Unicode recognizes that this combination can be represented by a precomposed character, 'ü' (U+00FC). Hence, NFC converts u + ̈ (U+0075 U+0308) into the single character 'ü' (U+00FC).
- NFD does the opposite: it decomposes 'ü' (U+00FC) back into 'u' (U+0075) and the combining diaeresis '̈' (U+0308).

This is expected behavior, as NFC is designed to replace sequences of characters with their precomposed counterparts if such exist.

Bengali Characters

Now, let’s consider the Bengali characters:

Bengali Letter DDA (U+09A1) combined with Bengali Sign Nukta (U+09BC):
- These two code points together represent a different character: Bengali Letter RRA (U+09DC), which is a precomposed character. In principle, NFC should convert this sequence into the precomposed form 'ড়' (U+09DC).
- However, in the Unicode normalization algorithm, not all sequences are recomposed into precomposed characters. The primary purpose of NFC is to ensure that characters that can be canonically composed are composed. Still, the algorithm doesn’t always force a re-composition if the decomposition has been explicitly provided, especially when dealing with sequences that are treated differently across languages.

Key Differences:

Explicit Canonical Decomposition:
- In the Unicode standard, Bengali characters like 'ড়' (U+09DC) have an explicit canonical decomposition to 'ড' (U+09A1) + Nukta '়' (U+09BC). This is stored as the canonical decomposition mapping for U+09DC.
- When a normalization algorithm encounters 'ড়' (U+09DC), it will decompose it according to this mapping.
Lack of Forced Recomposition:
- In NFC, while Latin characters are often recomposed because they frequently appear in Western texts and the precomposed forms are preferred, Bengali text handling may differ slightly due to the complex interactions between base characters and combining marks.
- The NFC algorithm doesn’t always force recomposition back into the precomposed form for all sequences, particularly if the input is already in a decomposed form like 'ড' + '়'. This behavior ensures that equivalent sequences remain stable across different normalization forms.
Language-Specific Handling:
- Unicode aims to handle text in a way that makes sense for each script. For Latin scripts, recomposing characters like 'ü' is more common and expected. However, in scripts like Bengali, the handling can be more conservative to avoid altering the intended composition of characters in a way that might not align with the script's usage.

Conclusion:

The behavior you see with the Latin script is a typical NFC operation where sequences are recomposed into precomposed characters. However, with Bengali characters, the NFC normalization doesn’t always recompose decomposed sequences into their precomposed forms, likely due to language-specific considerations and the desire to maintain stability across different representations of the same text. This distinction ensures that text remains consistent and predictable, even if it means not always recomposing certain sequences.

mike-fabian commented 2 months ago

Still not sure. Does that somehow mean that the decomposed form is the preferred form in Bengali?

In Latin script, the precomposed forms are surely preferred.

If NFC produces the preferred form (which is what I thought so far), then converting the output to NFC would be good.

I think I could add an option like

Normalize output to    [ NFC | NFD | NFKC | NFKD | do not normalize ]

But that would be a quite weird option only experts could understand at all.
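
Such an option could be sketched roughly as follows. This is only an illustration of the idea, not an existing ibus-typing-booster setting; the function name `normalize_output` and the use of `None` for "do not normalize" are assumptions.

```python
import unicodedata
from typing import Optional

# The four normalization forms accepted by unicodedata.normalize.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_output(text: str, form: Optional[str] = "NFC") -> str:
    """Normalize text to the chosen form; None means "do not normalize".

    Hypothetical helper, not part of ibus-typing-booster.
    """
    if form is None:
        return text
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    return unicodedata.normalize(form, text)


# With "do not normalize", the single code point from the layout survives.
print([f"{ord(x):04x}" for x in normalize_output("\u09dc", None)])
# With NFC, it is replaced by the two code point sequence.
print([f"{ord(x):04x}" for x in normalize_output("\u09dc", "NFC")])
```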

And actually I recently thought of changing ibus-m17n to convert the output to NFC as well. Something made me think that was a good idea, but I cannot remember the reason at the moment.

mike-fabian commented 2 months ago

@rank-coder

I wonder why this behaves differently in Bengali and Devanagari script:

Bengali:

ড় U+09A1 BENGALI LETTER DDA U+09BC BENGALI SIGN NUKTA
ড় U+09DC BENGALI LETTER RRA
র U+09B0 BENGALI LETTER RA

mfabian@hathi:~
$ python3
Python 3.12.5 (main, Aug  7 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09a1\u09bc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09a1\u09bc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09dc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09dc')]
['09a1', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09b0')]
['09b0']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09b0')]
['09b0']
>>> '\u09a1\u09bc'
'ড়'
>>> '\u09dc'
'ড়'
>>> '\u09b0'
'র'
>>>

Although ড় U+09A1 BENGALI LETTER DDA U+09BC BENGALI SIGN NUKTA and ড় U+09DC BENGALI LETTER RRA seem to be a pair of precomposed/decomposed representations of a character, normalization always converts to (or keeps the) decomposed representation.

Devanagari:

ड़ U+0921 DEVANAGARI LETTER DDA U+093C DEVANAGARI SIGN NUKTA
ऱ U+0931 DEVANAGARI LETTER RRA
ऱ U+0930 DEVANAGARI LETTER RA ़ U+093C DEVANAGARI SIGN NUKTA

mfabian@hathi:~
$ python3
Python 3.12.5 (main, Aug  7 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u0921\u093c')]
['0921', '093c']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u0921\u093c')]
['0921', '093c']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u0931')]
['0931']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u0931')]
['0930', '093c']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u0930\u093c')]
['0931']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u0930\u093c')]
['0930', '093c']
>>> '\u0921\u093c'
'ड़'
>>> '\u0931'
'ऱ'
>>> '\u0930\u093c'
'ऱ'
>>>

In Devanagari script, ड़ U+0921 DEVANAGARI LETTER DDA U+093C DEVANAGARI SIGN NUKTA seems to have no precomposed representation.

ऱ U+0930 DEVANAGARI LETTER RA ़ U+093C DEVANAGARI SIGN NUKTA and ऱ U+0931 DEVANAGARI LETTER RRA seem to be a pair of precomposed/decomposed representations of a character, and normalization converts between the precomposed/decomposed representations.
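
The comparison between the two scripts can be repeated with a small sketch that asks, for each single code point, whether it survives NFC. One detail worth checking is that the sequence U+0921 U+093C does appear as the canonical decomposition of a single code point, U+095C DEVANAGARI LETTER DDDHA, which behaves like U+09DC here. The helper name `survives_nfc` is made up for this example.

```python
import unicodedata


def survives_nfc(ch: str) -> bool:
    """Return True if NFC leaves the single character ch unchanged."""
    return unicodedata.normalize("NFC", ch) == ch


for cp in (0x00FC, 0x0931, 0x09DC, 0x095C):
    ch = chr(cp)
    print(
        f"U+{cp:04X} {unicodedata.name(ch)}: "
        f"decomposition={unicodedata.decomposition(ch)!r}, "
        f"survives NFC={survives_nfc(ch)}"
    )
```

All four characters have a canonical decomposition, but only U+00FC and U+0931 are recomposed by NFC.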

rank-coder commented 2 months ago

In Devanagari RRA = Nuqta below RA
But in Bangla RRA = Nuqta below DA

mike-fabian commented 2 months ago

In Devanagari RRA = Nuqta below RA
But in Bangla RRA = Nuqta below DA

In Bangla RRA = Nuqta below DDA

Yes, that’s right.

I still do not understand why the Unicode normalization to NFC produces the decomposed versions in Bangla. That seems quite unusual to me. But there must be some reason for it; I doubt that this is an accident or a bug in Unicode.

mike-fabian commented 2 months ago

The difference is that ibus-typing-booster normalizes all output to NFC and ibus-m17n does not normalize the output.

Just to make sure that this is not a Python problem, here is a comparison with the output of the uconv tool from the icu package on Fedora 40:

mfabian@hathi:~
$ echo -n -e '\x09\xdc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfc | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xdc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfd | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xdc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfkc | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xdc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfkd | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xa1\x09\xbc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfc | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xa1\x09\xbc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfd | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xa1\x09\xbc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfkc | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$ echo -n -e '\x09\xa1\x09\xbc' | uconv -f UTF-16BE -t UTF-16LE -x any-nfkd | od -t x2
0000000 09a1 09bc
0000004
mfabian@hathi:~
$

Same results as with Python.

mike-fabian commented 2 months ago

http://www.unicode.org/reports/tr15/#Primary_Exclusion_List_Table says:

5 Composition Exclusion

The concept of composition exclusion is a key part of the Unicode Normalization Algorithm. For normalization forms NFC and NFKC, which normalize Unicode strings to Composed forms, where possible, the basic process is first to fully decompose the string, and then to compose the string, except where blocked or excluded. (See D117, Canonical Composition Algorithm, in Section 3.11, Normalization Forms in [Unicode].) This section provides information about the types of characters which are excluded from composition during application of the Unicode Normalization Algorithm, and describes the data files which provide the definitive lists of those characters.

Composition exclusion characters have an associated binary character property in the [UCD]: Composition_Exclusion. It is a notable characteristic of the Unicode Normalization Algorithm that no composition exclusion character can occur in any normalized form of Unicode text: NFD, NFC, NFKD, or NFKC.

5.1 Composition Exclusion Types

Four types of canonically decomposable characters are excluded from composition in the Canonical Composition Algorithm. These four types are described and exemplified here.

Script-specific Exclusions

The term script-specific exclusion refers to certain canonically decomposable characters whose decomposition includes one of a small set of combining marks for particular Indian scripts, for Tibetan, or for Hebrew.

The list of such characters cannot be computed from the decomposition mappings in the Unicode Character Database, and must instead be explicitly listed.

The character U+0958 (क़) DEVANAGARI LETTER QA is an example of a script-specific composition exclusion.

The list of script-specific composition exclusions constituted a one-time adjustment to the Unicode Normalization Algorithm, defined at the time of the composition version in 2001 and unchanged since that version. The list can be divided into the following three general groups, all added to the Unicode Standard before Version 3.1:
- Many precomposed characters using a nukta diacritic in the Bangla/Bengali, Devanagari, Gurmukhi, or Odia/Oriya scripts, mostly consisting of additions to the core set of letters for those scripts.
- Tibetan letters and subjoined letters with decompositions that include either U+0FB7 TIBETAN SUBJOINED LETTER HA or U+0FB5 TIBETAN SUBJOINED LETTER SSA, and two two-part Tibetan vowel signs involving top and bottom pieces.
- A large collection of compatibility precomposed characters for Hebrew involving dagesh and/or other combining marks.
Although, in principle, the list of script-specific composition exclusions could be expanded to add newly encoded characters in future versions of the Unicode Standard, it is very unlikely to be extended for such characters, because the normalization forms of sequences are now taken into account before new characters are encoded.

mike-fabian commented 2 months ago

The character U+0958 (क़) DEVANAGARI LETTER QA is an example of a script-specific composition exclusion.

And indeed, this character shows the same behaviour:

$ python3
Python 3.12.5 (main, Aug 23 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u0958')]
['0915', '093c']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u0958')]
['0915', '093c']
>>>

mike-fabian commented 2 months ago

@rank-coder

https://www.unicode.org/Public/draft/UCD/ucd/CompositionExclusions.txt contains:

# ================================================
# (1) Script Specifics
#
# This list of characters cannot be derived from the UnicodeData.txt file.
#
# Included are the following subcategories:
#
# - Many precomposed characters using a nukta diacritic in the Devanagari,
#   Bangla/Bengali, Gurmukhi, or Odia/Oriya scripts.
# - Tibetan letters and subjoined letters with decompositions including 
#   U+0FB7 TIBETAN SUBJOINED LETTER HA or U+0FB5 TIBETAN SUBJOINED LETTER SSA.
# - Two two-part Tibetan vowel signs involving top and bottom pieces.
# - A large collection of compatibility precomposed characters for Hebrew
#   involving dagesh and/or other combining marks.
#
# This list is unlikely to grow.
#
# ================================================

0958    #  DEVANAGARI LETTER QA
0959    #  DEVANAGARI LETTER KHHA
095A    #  DEVANAGARI LETTER GHHA
095B    #  DEVANAGARI LETTER ZA
095C    #  DEVANAGARI LETTER DDDHA
095D    #  DEVANAGARI LETTER RHA
095E    #  DEVANAGARI LETTER FA
095F    #  DEVANAGARI LETTER YYA
09DC    #  BENGALI LETTER RRA
09DD    #  BENGALI LETTER RHA
09DF    #  BENGALI LETTER YYA
0A33    #  GURMUKHI LETTER LLA
0A36    #  GURMUKHI LETTER SHA
0A59    #  GURMUKHI LETTER KHHA
0A5A    #  GURMUKHI LETTER GHHA
0A5B    #  GURMUKHI LETTER ZA
0A5E    #  GURMUKHI LETTER FA
0B5C    #  ORIYA LETTER RRA
0B5D    #  ORIYA LETTER RHA
[... TIBETAN and HEBREW letters follow here ...]
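
A list in this format can be turned into a lookup table from decomposed sequence to single code point. The following is a rough sketch under stated assumptions: the sample text is a few lines copied from the excerpt above rather than the full file, each data line is assumed to hold exactly one code point before the `#`, and the function names are invented for the example.

```python
import unicodedata

# A few entries in the format of CompositionExclusions.txt (excerpt only).
SAMPLE = """\
# (1) Script Specifics
0958    #  DEVANAGARI LETTER QA
09DC    #  BENGALI LETTER RRA
09DD    #  BENGALI LETTER RHA
09DF    #  BENGALI LETTER YYA
"""


def parse_exclusions(text: str) -> list[int]:
    """Collect the code points from lines of the form 'XXXX  # NAME'."""
    code_points = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if data:
            code_points.append(int(data, 16))
    return code_points


def recomposition_map(code_points: list[int]) -> dict[str, str]:
    """Map each NFD sequence back to its excluded single code point."""
    mapping = {}
    for cp in code_points:
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            mapping[decomposed] = ch
    return mapping


for sequence, single in recomposition_map(parse_exclusions(SAMPLE)).items():
    parts = " ".join(f"U+{ord(c):04X}" for c in sequence)
    print(f"{parts} -> U+{ord(single):04X}")
```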

mike-fabian commented 2 months ago

Checking in which version of Unicode these 3 Bengali characters first appeared:

ড় U+09A1 BENGALI LETTER DDA U+09BC BENGALI SIGN NUKTA
ড় U+09DC BENGALI LETTER RRA

https://www.unicode.org/Public/draft/UCD/ucd/DerivedAge.txt contains:

0993..09A8    ; 1.1 #  [22] BENGALI LETTER O..BENGALI LETTER NA
09AA..09B0    ; 1.1 #   [7] BENGALI LETTER PA..BENGALI LETTER RA
09B2          ; 1.1 #       BENGALI LETTER LA
09B6..09B9    ; 1.1 #   [4] BENGALI LETTER SHA..BENGALI LETTER HA
09BC          ; 1.1 #       BENGALI SIGN NUKTA
09BE..09C4    ; 1.1 #   [7] BENGALI VOWEL SIGN AA..BENGALI VOWEL SIGN VOCALIC RR
09C7..09C8    ; 1.1 #   [2] BENGALI VOWEL SIGN E..BENGALI VOWEL SIGN AI
09CB..09CD    ; 1.1 #   [3] BENGALI VOWEL SIGN O..BENGALI SIGN VIRAMA
09D7          ; 1.1 #       BENGALI AU LENGTH MARK
09DC..09DD    ; 1.1 #   [2] BENGALI LETTER RRA..BENGALI LETTER RHA

So all 3 were already in Unicode 1.1 in 1993.

mike-fabian commented 2 months ago

As the 3 characters U+09A1 (BENGALI LETTER DDA), U+09BC (BENGALI SIGN NUKTA), and U+09DC (BENGALI LETTER RRA) were all introduced in the same version, Unicode 1.1, and U+09DC was probably (I am not 100% sure!) made a composition exclusion in that same version, the reason for that composition exclusion cannot be compatibility with an earlier version of Unicode.

mike-fabian commented 2 months ago

@rank-coder

So I am still puzzled, why ড় U+09DC BENGALI LETTER RRA is excluded from recomposition when normalizing to NFC.

Could it be that for some historical reasons, the decomposed version is the preferred version when writing Bengali?

I wonder what the right solution for this bug is.

If the decomposed version were to be preferred when writing Bengali, then Typing Booster’s normalization to NFC on output would probably be OK, as it produces the preferred version.

But is this really the case??

rank-coder commented 2 months ago

Even I'm not sure which one is preferred. Google's Gboard app normally produces the ড + ় version. Gboard gives an option to use the other one character version too (by long press). Samsung's keyboard does not provide a way to write the one character version. That's why for some dictionaries I can't use Samsung's keyboard. On Linux desktop, I've tried a popular keyboard layout named Probhat which by default uses the one character versions for ড়, ঢ়, য়. I don't know which one's preferred. But I know that I may need either one in situations. The expected behavior for typing-booster should be to follow the keyboard layout as is; so that when needed we can provide options to write both the versions. Right?

mike-fabian commented 2 months ago

@rank-coder

I've tried a popular keyboard layout named Probhat which by default uses the one character versions for ড়, ঢ়, য়.

I guess the layout you tried was /usr/share/m17n/bn-probhat.mim.

All of the currently existing m17n input methods produce the single character version:

mfabian@hathi:~
$ grep ড় /usr/share/m17n/*.mim
/usr/share/m17n/as-inscript2.mim:  ((G-[) "ড়")
/usr/share/m17n/as-itrans.mim:  (".D" "ড়্")
/usr/share/m17n/as-phonetic.mim:  ("R" ?ড়)
/usr/share/m17n/bn-disha.mim:  ("R","ড়")                                ; 09DC
/usr/share/m17n/bn-inscript2.mim:  ((G-[) "ড়")
/usr/share/m17n/bn-itrans.mim:  (".D" "ড়্")
/usr/share/m17n/bn-national-jatiya.mim:  ("p" "ড়")  ; U+09DC BENGALI LETTER RRA ≡ 09A1 ড 09BC ◌়
/usr/share/m17n/bn-probhat.mim:  ("R" ?ড়)
/usr/share/m17n/bn-unijoy.mim:  ("p" "ড়") ;; BENGALI LETTER RRA
/usr/share/m17n/mni-inscript2-beng.mim:  ((G-[) "ড়")
mfabian@hathi:~
$

And none produces the two character version:

mfabian@hathi:~
$ grep ড়  /usr/share/m17n/*.mim
mfabian@hathi:~
$

We could change that if we want, though; I maintain these m17n input methods.

rank-coder commented 2 months ago

No, please don't change those layouts, since the built-in Probhat layout on Linux systems behaves that way. Those layouts should use the one character version. Also, with Probhat, there's another way of writing the two character version (I've never needed it). And as I said earlier, the one character version is needed for some dictionary apps; otherwise searching will cause issues.

mike-fabian commented 2 months ago

@rank-coder

Even I'm not sure which one is preferred. Google's Gboard app normally produces the ড + ় version. Gboard gives an option to use the other one character version too (by long press).

That is quite surprising that Gboard has such an option. Do "normal" users know that this is possible or only computer nerds?

But even Gboard seems to produce the two character version by default ...

Samsung's keyboard does not provide a way to write the one character version.

That's why for some dictionaries I can't use Samsung's keyboard.

Hm.

In the hunspell dictionaries for Bengali on Fedora 40, the dictionaries contain the single character version many times:

mfabian@hathi:/usr/share/hunspell
$ grep -c  ড় bn_*
bn_BD.aff:1
bn_BD.dic:14260
bn_IN.aff:1
bn_IN.dic:14260
mfabian@hathi:/usr/share/hunspell
$

But the two character version 0 times:

mfabian@hathi:/usr/share/hunspell
$ grep -c ড় bn_*
bn_BD.aff:0
bn_BD.dic:0
bn_IN.aff:0
bn_IN.dic:0
mfabian@hathi:/usr/share/hunspell
$

Not sure what to make of this.
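
Because the two forms render identically (and may be normalized when pasted), the two grep commands above look the same in this thread. A check that spells out the code points avoids that ambiguity. This is a minimal sketch: the helper name is made up, the file is assumed to be UTF-8, and a small temporary file stands in for a real dictionary such as bn_BD.dic.

```python
import os
import tempfile

SINGLE = "\u09dc"            # BENGALI LETTER RRA
DECOMPOSED = "\u09a1\u09bc"  # BENGALI LETTER DDA + BENGALI SIGN NUKTA


def count_lines_containing(path: str, needle: str) -> int:
    """Count lines containing needle, similar to `grep -c` (assumes UTF-8)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if needle in line)


# Demo with a throwaway file: one line per form, plus one unrelated line.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".dic", delete=False
) as tmp:
    tmp.write("\u09ac\u09dc\n\u09aa\u09a1\u09bc\u09be\n\u0995\n")
    sample_path = tmp.name

single_count = count_lines_containing(sample_path, SINGLE)
decomposed_count = count_lines_containing(sample_path, DECOMPOSED)
os.remove(sample_path)

print("single code point form:", single_count)
print("two code point form:", decomposed_count)
```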

When ibus-typing-booster loads hunspell dictionaries, it converts them to NFD internally before attempting a match.

And what the user has typed is also converted to NFD before attempting to match against dictionary words or entries in the database of words ibus-typing-booster has already learned.

As you say, for some dictionaries you cannot use the Samsung keyboard because these dictionaries contain only the single character version and the Samsung keyboard does not produce that.

To avoid problems like that and to get consistent matching, I convert internally to NFD before matching. NFD is better than NFC for matching because NFD is the "maximally decomposed" version, so if the user has typed only the beginning of such a decomposed sequence, it can already match.

And when ibus-typing-booster needs to display something in the preedit or a candidate list or commit something (commit = insert the final output at your writing position), then it converts to NFC, because the NFC style is the preferred style. At least usually it is the preferred style, I am not so sure about Bengali now.
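
The match-in-NFD, emit-in-NFC flow described above can be sketched as follows. This is not the actual ibus-typing-booster code; `complete` and the word list are hypothetical, and the strings are illustrative code point sequences rather than entries from a real dictionary.

```python
import unicodedata


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def complete(typed: str, dictionary_words: list[str]) -> list[str]:
    """Match in NFD, then convert the candidates to NFC for output."""
    key = nfd(typed)
    return [nfc(word) for word in map(nfd, dictionary_words)
            if word.startswith(key)]


# One entry uses the single code point U+09DC, the other the sequence
# U+09A1 U+09BC followed by a vowel sign.
words = ["\u09ac\u09dc", "\u09ac\u09a1\u09bc\u09bf"]

# The user has typed only U+09AC and the base letter U+09A1 so far.
candidates = complete("\u09ac\u09a1", words)
for candidate in candidates:
    print([f"{ord(x):04x}" for x in candidate])
```

Both entries match, and both candidates come out with the two code point sequence, because NFC does not recompose U+09A1 U+09BC.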

But I know that I may need either one in situations.

Is this really important?

Currently your input method only produces the one character version:

mfabian@hathi:/local/mfabian/src/khipro-m17n (main)
$ grep ড় *.mim
                ("k" "ক") ("kh" "খ") ("g" "গ") ("gh" "ঘ") ("ng" "ঙ") ("c" "চ") ("ch" "ছ") ("j" "জ") ("jh" "ঝ") ("nff" "ঞ") ("tf" "ট") ("tff" "ঠ") ("df" "ড") ("dff" "ঢ") ("nf" "ণ") ("t" "ত")
 ("th" "থ") ("d" "দ") ("dh" "ধ") ("n" "ন") ("p" "প") ("ph" "ফ") ("b" "ব") ("v" "ভ") ("m" "ম") ("z" "য") ("r" "র") ("l" "ল") ("sh" "শ") ("sf" "ষ") ("s" "স") ("h" "হ") ("y" "য়") ("rf" "ড়") ("
rff" "ঢ়")
("ddf" "ড্ড") ("dfdf" "ড্ড") ("dfb" "ড্ব") ("dfz" "ড্য") ("dfr" "ড্র") ("rfg" "ড়্‌গ")
mfabian@hathi:/local/mfabian/src/khipro-m17n (main)
$

And never the two character version:

mfabian@hathi:/local/mfabian/src/khipro-m17n (main)
$ grep ড়  *.mim
mfabian@hathi:/local/mfabian/src/khipro-m17n (main)
$

The expected behavior for typing-booster should be to follow the keyboard layout as is; so that when needed we can provide options to write both the versions. Right?

I think that would be very confusing to users.

Also I am afraid something like this could happen: The user types the single character version, it is matched against the database of learned words. To get a consistent match, the user input is converted to NFD to match against the database which also contains NFD. If a match of a completion of the word typed is found, the user could select that match in the candidate list and commit it with space. Now the word which was matched in the database and completed the short user input is inserted at the writing position, and this will always be the 2 character version (it is converted to NFC on output but that does not even make a difference here for these Bengali characters). Note also that the word(s) (sometimes multiple words are matched) could contain more than one ড় U+09A1 BENGALI LETTER DDA U+09BC BENGALI SIGN NUKTA and maybe only the first one or part of the first one was typed by the user.

Not normalizing the contents of the database would not be a good idea at all, I think. Then one might have different forms of the same letters in the database and matching against user input would not give consistent results.

The Bengali hunspell dictionaries only contain the single character version at the moment. At least that is consistent. Containing a mix of both would probably be a really bad idea.

So I think it would also be a really bad idea if the typing booster database contained different forms of the same letters.

ibus-m17n does not have these problems with matching user input against dictionaries or databases, ibus-m17n just outputs exactly what the user has typed, it does not offer any completions.

But to make completions work consistently, a consistent representation has to be used when matching. So using NFD when matching is required, I think.
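A minimal sketch of why a common normalization form matters for matching. The helper name `match_key` is made up for illustration and is not taken from the ibus-typing-booster source:

```python
import unicodedata

ONE_CHAR = "\u09DC"        # BENGALI LETTER RRA
TWO_CHAR = "\u09A1\u09BC"  # BENGALI LETTER DDA + BENGALI SIGN NUKTA


def match_key(text: str) -> str:
    """Return the form used to compare user input with stored words (NFD)."""
    return unicodedata.normalize("NFD", text)


# The raw strings differ, so a naive comparison would miss the match.
print(ONE_CHAR == TWO_CHAR)                        # False
# After NFD both spellings are the same code point sequence.
print(match_key(ONE_CHAR) == match_key(TWO_CHAR))  # True
```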

Maybe it is better to figure out what the preferred version is and always output only the preferred version.

If the preferred version for these 3 Bengali characters

09DC # BENGALI LETTER RRA
09DD # BENGALI LETTER RHA
09DF # BENGALI LETTER YYA

are the one character versions, then ibus-typing-booster probably should always output only these one character versions, never the two character versions.

Currently, the conversion on output to NFC does not give me that.
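This can be checked with Python's `unicodedata` module. The sketch below (the helper name `nfc_codepoints` is only for illustration) shows that NFC leaves all three letters in their two character form:

```python
import unicodedata


def nfc_codepoints(text: str) -> list[str]:
    """Return the code points of the NFC form of text as hex strings."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFC", text)]


for name, single in [("RRA", "\u09DC"), ("RHA", "\u09DD"), ("YYA", "\u09DF")]:
    print(name, nfc_codepoints(single))
# RRA ['09A1', '09BC']
# RHA ['09A2', '09BC']
# YYA ['09AF', '09BC']
```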

But I could add another processing step after converting to NFC before output.

I am thinking of maintaining a list of exceptions, maybe all the characters mentioned in

https://www.unicode.org/Public/16.0.0/ucd/CompositionExclusions.txt

I.e. something like this:

exclusions_map = {
    "\u09A1\u09BC": "\u09DC",  # ড় -> ড়
    # Add other exclusions for different scripts as needed
}

and after converting to NFC and before producing the final output, add a step to forcibly recompose all sequences found in this exclusions_map.
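A rough sketch of what such a step could look like, with the other two Bengali letters added to the map. The function name `normalize_for_output` is hypothetical and this is not the code that ended up in ibus-typing-booster:

```python
import unicodedata

# NFC sequence -> preferred single code point (illustrative entries only).
exclusions_map = {
    "\u09A1\u09BC": "\u09DC",  # DDA + NUKTA  -> RRA
    "\u09A2\u09BC": "\u09DD",  # DDHA + NUKTA -> RHA
    "\u09AF\u09BC": "\u09DF",  # YA + NUKTA   -> YYA
}


def normalize_for_output(text: str) -> str:
    """Convert to NFC, then forcibly recompose the listed sequences."""
    result = unicodedata.normalize("NFC", text)
    for sequence, single in exclusions_map.items():
        result = result.replace(sequence, single)
    return result


print(normalize_for_output("\u09A1\u09BC") == "\u09DC")  # True
```

Ordinary NFC composition is unaffected; only the sequences listed in the map are touched.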

What do you think about this idea?

That would consistently produce the preferred results on output.

rank-coder commented 2 months ago

If it always produces the single character version then that would be just fine in my opinion.

mike-fabian commented 2 months ago

@rank-coder

No please don't change those layouts.

OK!

Since the built-in probhat layout in linux systems behave that way.

I guess you are talking about the X11 keyboard layout here

Those layouts should use the one character version.

If you are talking about the X11 keyboard layouts, these can only produce one character per key because of limitations.

/usr/share/X11/xkb/symbols/bd has a probhat layout, but that is just included from /usr/share/X11/xkb/symbols/in:

// Probhat keyboard layout for Bangla/Bengali.
xkb_symbols "probhat" {
    include "in(ben_probhat)"
    name[Group1]= "Bangla (Probhat)";
};

So the "real" definition of the probhat layout is in /usr/share/X11/xkb/symbols/in and it looks like this:

xkb_symbols "ben_probhat" {
   name[Group1]= "Bangla (India, Probhat)";
   key.type[group1]="FOUR_LEVEL";

   // Digits row:
   key <TLDE> { [ U200D, asciitilde   ] };
   key <AE01> { [ U09E7, exclam, U09F4 ] };
   key <AE02> { [ U09E8, at, U09F5 ] };
   key <AE03> { [ U09E9, numbersign, U09F6 ] };
   key <AE04> { [ U09EA, U09F3, U09F7, U09F2 ] };
   key <AE05> { [ U09EB, percent      ] };
   key <AE06> { [ U09EC, asciicircum  ] };
   key <AE07> { [ U09ED, U099E, U09FA ] };
   key <AE08> { [ U09EE, U09CE    ] };
   key <AE09> { [ U09EF, parenleft    ] };
   key <AE10> { [ U09E6, parenright, U09F8, U09F9 ] };
   key <AE11> { [ minus,     underscore   ] };
   key <AE12> { [ equal,     plus         ] };

   // Q row:
   key <AD01> { [   U09A6,  U09A7  ] };
   key <AD02> { [   U09C2,  U098A  ] };
   key <AD03> { [   U09C0,  U0988  ] };
   key <AD04> { [   U09B0,  U09DC, U20B9 ] }; // Rupee
   key <AD05> { [   U099F,  U09A0  ] };
   key <AD06> { [   U098F,  U0990  ] };
   key <AD07> { [   U09C1,  U0989  ] };
   key <AD08> { [   U09BF,  U0987  ] };
   key <AD09> { [   U0993,  U0994  ] };
   key <AD10> { [   U09AA,  U09AB  ] };
   key <AD11> { [   U09C7,  U09C8  ] };
   key <AD12> { [   U09CB,  U09CC, U09D7 ] };

   // A row:
   key <AC01> { [   U09BE,  U0985, U098C, U09E0 ] };
   key <AC02> { [   U09B8,  U09B7, U09E1, U09E3 ] };
   key <AC03> { [   U09A1,  U09A2, U09C4, U09E2 ] };
   key <AC04> { [   U09A4,  U09A5  ] };
   key <AC05> { [   U0997,  U0998  ] };
   key <AC06> { [   U09B9,  U0983, U09BD ] };
   key <AC07> { [   U099C,  U099D  ] };
   key <AC08> { [   U0995,  U0996  ] };
   key <AC09> { [   U09B2,  U0982  ] };
   key <AC10> { [   semicolon,  colon      ] };
   key <AC11> { [   apostrophe, quotedbl   ] };

   // Z row:
   key <AB01> { [   U09DF,  U09AF  ] };
   key <AB02> { [   U09B6,  U09DD  ] };
   key <AB03> { [   U099A,  U099B  ] };
   key <AB04> { [   U0986,  U098B  ] };
   key <AB05> { [   U09AC,  U09AD  ] };
   key <AB06> { [   U09A8,  U09A3  ] };
   key <AB07> { [   U09AE,  U0999  ] };
   key <AB08> { [   comma,  U09C3  ] };
   key <AB09> { [   U0964,  U0981, U09BC ] };
   key <AB10> { [   U09CD,  question   ] };
   key <BKSL> { [   U200C,  U0965  ] };

   include "level3(ralt_switch)"
};

As you see, it contains U09DC. But these X11 keyboard layouts cannot have more than one Unicode code point per key.

The fact that U09DC is there does not necessarily mean that this is the desired output (although after our discussion, I think U+09DC is indeed the preferred output for Bengali).

There are some keyboard layouts which want to produce more than one character per key press but as this is impossible with the current X11 keyboard implementation, they use Compose as a hack. One such example is the Arabic layout.

/usr/share/X11/xkb/symbols/ara contains:

 key <AB05> {[           UFEFB,                UFEF5,                      any,     any ]};  // ‎ﻻ‎ ‎ﻵ‎

ﻻ U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

That is not the desired character when pressing that key, it is a ligature, but one wants the individual characters. To achieve that, the Compose mechanism is used, /usr/share/X11/locale/en_US.UTF-8/Compose contains:

# Decomposition of four Arabic Lam-Alef ligatures
<UFEFB> : "لا"  # ARABIC LETTER LAM plus ARABIC LETTER ALEF

So the U+FEFB produced by the key is expanded into U+0644 ARABIC LETTER LAM U+0627 ARABIC LETTER ALEF.
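As an aside, the Compose rule is a separate mechanism from Unicode normalization, but for this ligature the expansion happens to match the compatibility decomposition. A small sketch (the helper name `expand_ligature` is made up):

```python
import unicodedata

LIGATURE = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def expand_ligature(text: str) -> str:
    """Apply compatibility normalization (NFKC), which splits the ligature."""
    return unicodedata.normalize("NFKC", text)


print([f"U+{ord(c):04X}" for c in expand_ligature(LIGATURE)])
# ['U+0644', 'U+0627']

# Canonical normalization (NFC) leaves the ligature untouched.
print(unicodedata.normalize("NFC", LIGATURE) == LIGATURE)  # True
```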

This weird hack is currently only used for Arabic and Khmer.

And it was broken for years and we fixed it only recently.

So that there are only single characters in the X11 Probhat layout does not prove that these are the preferred ones, it might be just because of the technical limitations.

And also, with probhat, there's another way of writing the two character version. (I've never needed the two character version)

ড় U+09A1 BENGALI LETTER DDA U+09BC BENGALI SIGN NUKTA
ড় U+09DC BENGALI LETTER RRA

These are all in the X11 probhat layout, on different keys,

key <AD04> { [ U09B0, U09DC, U20B9 ] }; // Rupee
key <AC03> { [ U09A1, U09A2, U09C4, U09E2 ] };
key <AB09> { [ U0964, U0981, U09BC ] };

So yes, you could write both versions with that layout by hitting different keys.

With /usr/share/m17n/bn-probhat.mim it looks like only writing the one character version is possible:

mfabian@hathi:~
$ grep ড় /usr/share/m17n/bn-probhat.mim 
mfabian@hathi:~
$ grep ড় /usr/share/m17n/bn-probhat.mim 
  ("R" ?ড়)
mfabian@hathi:~
$ grep ় /usr/share/m17n/bn-probhat.mim
mfabian@hathi:~
$

And as I said earlier, one character version is needed for some dictionary apps. Otherwise searching will cause issues.

mike-fabian commented 2 months ago

@rank-coder

If it always produces the single character version then that would be just fine in my opinion.

This seems to be the way to go, I’ll try to do that.

I hope it will be fast enough, as I have to process this list of exceptions for every commit.
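One way to keep the per-commit cost low might be to compile all exceptions into a single regular expression once, so each commit needs only one pass over the text. This is a sketch under that assumption; the names `_PATTERN` and `recompose` are illustrative, and whether it is measurably faster than repeated `str.replace` calls would need to be measured:

```python
import re
import unicodedata

# NFC sequence -> preferred single code point (illustrative entries only).
exclusions_map = {
    "\u09A1\u09BC": "\u09DC",  # DDA + NUKTA  -> RRA
    "\u09A2\u09BC": "\u09DD",  # DDHA + NUKTA -> RHA
    "\u09AF\u09BC": "\u09DF",  # YA + NUKTA   -> YYA
}

# Built once at import time; longer keys first in case keys ever overlap.
_PATTERN = re.compile(
    "|".join(
        re.escape(seq)
        for seq in sorted(exclusions_map, key=len, reverse=True)
    )
)


def recompose(text: str) -> str:
    """Convert to NFC, then replace every listed sequence in one pass."""
    nfc = unicodedata.normalize("NFC", text)
    return _PATTERN.sub(lambda m: exclusions_map[m.group(0)], nfc)


print(recompose("\u09AC\u09BE\u09A1\u09BC\u09BF") == "\u09AC\u09BE\u09DC\u09BF")
# True
```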

mike-fabian commented 2 months ago

https://www.unicode.org/Public/16.0.0/ucd/CompositionExclusions.txt

also contains

1D15E # MUSICAL SYMBOL HALF NOTE

which is even weirder than the Bengali characters:

ᴕ U+1D15 LATIN LETTER SMALL CAPITAL OU
e U+0065 LATIN SMALL LETTER E
𝅗𝅥 U+1D15E MUSICAL SYMBOL HALF NOTE

$ python3
Python 3.12.5 (main, Aug 23 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u1d15e')]
['1d15', '0065']

Why could anybody want 𝅗𝅥 U+1D15E MUSICAL SYMBOL HALF NOTE to change into ᴕ U+1D15 LATIN LETTER SMALL CAPITAL OU e U+0065 LATIN SMALL LETTER E by normalization to NFC??
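The `['1d15', '0065']` result appears to come from the string literal rather than from normalization: in Python, `\u` takes exactly four hex digits, so `'\u1d15e'` is U+1D15 followed by the letter `e`. Code points above U+FFFF need the eight digit `\U` escape. A minimal check (the helper name `codepoints` is made up for illustration):

```python
import unicodedata

HALF_NOTE = "\U0001D15E"  # MUSICAL SYMBOL HALF NOTE, one code point
MISPARSED = "\u1d15e"     # U+1D15 followed by the letter "e", two code points


def codepoints(text: str) -> list[str]:
    """Return the code points of text as lowercase hex strings."""
    return [f"{ord(c):04x}" for c in text]


print(len(HALF_NOTE), len(MISPARSED))                       # 1 2
print(codepoints(MISPARSED))                                # ['1d15', '0065']
print(codepoints(unicodedata.normalize("NFC", HALF_NOTE)))  # ['1d157', '1d165']
```

With the `\U` escape, NFC yields U+1D157 MUSICAL SYMBOL VOID NOTEHEAD followed by U+1D165 MUSICAL SYMBOL COMBINING STEM, so the half note behaves like the Bengali letters: it is decomposed and not recomposed.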

rank-coder commented 2 months ago

So that there are only single characters in the X11 Probhat layout does not prove that these are the preferred ones, it might be just because of the technical limitations.

And also, with probhat, there's another way of writing the two character version. (I've never needed the two character version)

ড় U+09A1 BENGALI LETTER DDA U+09BC BENGALI SIGN NUKTA
ড় U+09DC BENGALI LETTER RRA

These are all in the X11 probhat layout, on different keys,

key <AD04> { [ U09B0, U09DC, U20B9 ] }; // Rupee
key <AC03> { [ U09A1, U09A2, U09C4, U09E2 ] };
key <AB09> { [ U0964, U0981, U09BC ] };

So yes, you could write both versions with that layout by hitting different keys.

With /usr/share/m17n/bn-probhat.mim it looks like only writing the one character version is possible:

Yes, I like it the way it's in the X11 probhat layout, as I know I can type both versions. In the m17n version they couldn't add the Nuqta sign since there's no AltGr state in m17n. Only shift is there.

rank-coder commented 2 months ago

Another thing..

The letters ব and র

I have never found র as a 2 character ব ়
ব and র aren't even remotely related.

ড় ঢ় য়

ড় is the retroflex flap version of ড in Bangla, Hindi, & Sanskrit. Same for ঢ়.

য় and য

The ya of Sanskrit becomes ja in Bangla most of the time; in such cases য is used. In the cases where the ya remains ya, we use য়.

But র, ড়, ঢ় and য় all are considered separate letters in the alphabet and listed in the alphabet always.

They are not like জ with Nuqta (জ়), ফ + Nuqta (ফ়) which are sometimes (very rarely, mostly in the usage guideline section in dictionaries) seen in some writings but are not considered separate letters.

mike-fabian commented 2 months ago

@rank-coder

In the m17n version they couldn't add the Nuqta sign since there's no AltGr state in m17n. Only shift is there.

Actually using AltGr is possible in m17n. But all the keys on AltGr in the "in(ben_probhat)" layout in xkeyboard-config were missing in bn-probhat.mim. I have added them in this commit:

https://github.com/mike-fabian/m17n-db/commit/d36305bbb2a2364cc4797587c489c69e24241861

mike-fabian commented 2 months ago

@rank-coder

The letters ব and র

I have never found র as a 2 character ব ়
ব and র aren't even remotely related.

ব U+09AC BENGALI LETTER BA
র U+09B0 BENGALI LETTER RA
় U+09BC BENGALI SIGN NUKTA

No problem here, I think:

$ python3
Python 3.12.5 (main, Aug 23 2024, 00:00:00) [GCC 14.2.1 20240801 (Red Hat 14.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09AC')]
['09ac']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09AC')]
['09ac']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09B0')]
['09b0']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09B0')]
['09b0']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFD', '\u09AC\u09BC')]
['09ac', '09bc']
>>> [f'{ord(x):04x}' for x in unicodedata.normalize('NFC', '\u09AC\u09BC')]
['09ac', '09bc']
>>>

র U+09B0 BENGALI LETTER RA is not decomposed to a two character version when normalizing to NFD. So no problem here.

And the two character sequence ব U+09AC BENGALI LETTER BA ় U+09BC BENGALI SIGN NUKTA (which might make no sense and you have never encountered it) is not composed to a single character when normalizing to NFC.

So I think we have no problem with these characters.

ড় ঢ় য়

ড় U+09DC BENGALI LETTER RRA
ঢ় U+09DD BENGALI LETTER RHA
য় U+09DF BENGALI LETTER YYA

These 3 are on the composition exclusion list

https://www.unicode.org/Public/16.0.0/ucd/CompositionExclusions.txt

which causes a problem for us as we want them to be recomposed when Typing Booster commits.

I think I can fix that by adding an extra tweak before committing to recompose them to single characters.
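Instead of typing the table by hand, it could perhaps be derived from the exclusion list itself. The sketch below assumes the file layout is a hex code point followed by `#` and the character name, with `#` also starting comment lines; the embedded excerpt and the function name `build_exclusions_map` are illustrative only:

```python
import unicodedata

# Excerpt in the assumed style of CompositionExclusions.txt.
SAMPLE = """\
# (1) Script Specifics
09DC    #  BENGALI LETTER RRA
09DD    #  BENGALI LETTER RHA
09DF    #  BENGALI LETTER YYA
"""


def build_exclusions_map(lines):
    """Map the NFC form of each listed character back to the character."""
    mapping = {}
    for line in lines:
        field = line.split("#", 1)[0].strip()  # drop comments
        if not field:
            continue
        char = chr(int(field, 16))
        nfc = unicodedata.normalize("NFC", char)
        if nfc != char:  # only keep entries NFC actually splits
            mapping[nfc] = char
    return mapping


exclusions_map = build_exclusions_map(SAMPLE.splitlines())
print({k.encode("unicode_escape"): v.encode("unicode_escape")
       for k, v in exclusions_map.items()})
```

Lines that contain only a comment are skipped, so entries that are commented out in the file would not end up in the map.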

ড় is the retroflex flap version of ড in Bangla, Hindi, & Sanskrit. Same for ঢ়.

য় and য

The ya of Sanskrit becomes ja in Bangla most of the time; in such cases য is used. In the cases where the ya remains ya, we use য়.

But র, ড়, ঢ় and য় all are considered separate letters in the alphabet and listed in the alphabet always.

So overall I think everything should be fine when I add this extra composing step before committing.

mike-fabian commented 2 months ago

@rank-coder

In the m17n version they couldn't add the Nuqta sign since there's no AltGr state in m17n. Only shift is there.

Actually using AltGr is possible in m17n. But all the keys on AltGr in the "in(ben_probhat)" layout in xkeyboard-config were missing in bn-probhat.mim. I have added them in this commit:

mike-fabian/m17n-db@d36305b

What do you think of this improvement?
Apparently you didn’t know that you can use G- to mean AltGr in .mim files. Maybe this is useful for your development of your bn-khipro.

mike-fabian commented 2 months ago

@rank-coder

There are ibus-typing-booster-2.25.16 test builds in my copr repo: https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/

You can get the test packages by doing:

dnf copr enable mfabian/ibus-typing-booster
sudo dnf update ibus-typing-booster

I think they solve the problem. Can you please test and tell me whether it works for you as well?

rank-coder commented 2 months ago

@rank-coder

There are ibus-typing-booster-2.25.16 test builds in my copr repo: https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/

You can get the test packages by doing:
dnf copr enable mfabian/ibus-typing-booster
sudo dnf update ibus-typing-booster
I think they solve the problem. Can you please test and tell me whether it works for you as well?

Can I roll back to a stable version after testing it?

mike-fabian commented 2 months ago

@rank-coder There are ibus-typing-booster-2.25.16 test builds in my copr repo: https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/ You can get the test packages by doing:
dnf copr enable mfabian/ibus-typing-booster
sudo dnf update ibus-typing-booster
I think they solve the problem. Can you please test and tell me whether it works for you as well?
Can I roll back to a stable version after testing it?

Yes, Example:

Last metadata expiration check: 0:52:37 ago on Thu 12 Sep 2024 07:12:20 AM CEST.
Installed Packages
ibus-typing-booster.noarch             2.25.16-1.fc40             @copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster
Available Packages
ibus-typing-booster.noarch             2.25.3-1.fc40              fedora                                                     
ibus-typing-booster.noarch             2.25.15-1.fc40             copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster 
ibus-typing-booster.noarch             2.25.15-1.fc40             updates                                                    
ibus-typing-booster.src                2.25.15-1.fc40             copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster 
ibus-typing-booster.noarch             2.25.16-1.fc40             copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster 
ibus-typing-booster.src                2.25.16-1.fc40             copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster 
mfabian@f40:~$ sudo dnf downgrade ibus-typing-booster-2.25.3-1.fc40
Last metadata expiration check: 0:53:03 ago on Thu 12 Sep 2024 07:12:20 AM CEST.
Dependencies resolved.
=============================================================================================================================
 Package                                   Architecture           Version                       Repository              Size
=============================================================================================================================
Downgrading:
 emoji-picker                              noarch                 2.25.3-1.fc40                 fedora                  56 k
 ibus-typing-booster                       noarch                 2.25.3-1.fc40                 fedora                 1.2 M
 ibus-typing-booster-tests                 noarch                 2.25.3-1.fc40                 fedora                  81 k

Transaction Summary
=============================================================================================================================
Downgrade  3 Packages

Total download size: 1.3 M
Is this ok [y/N]: 
Operation aborted.
mfabian@f40:~$

Or call sudo dnf downgrade ibus-typing-booster without a specific version:

mfabian@f40:~$ sudo dnf downgrade ibus-typing-booster
Last metadata expiration check: 0:54:45 ago on Thu 12 Sep 2024 07:12:20 AM CEST.
Dependencies resolved.
=============================================================================================================================
 Package                     Arch     Version             Repository                                                    Size
=============================================================================================================================
Downgrading:
 emoji-picker                noarch   2.25.15-1.fc40      copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster    53 k
 ibus-typing-booster         noarch   2.25.15-1.fc40      copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster   1.2 M
 ibus-typing-booster-tests   noarch   2.25.15-1.fc40      copr:copr.fedorainfracloud.org:mfabian:ibus-typing-booster    72 k

Transaction Summary
=============================================================================================================================
Downgrade  3 Packages

Total download size: 1.3 M
Is this ok [y/N]:

You can also download a specific version from https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ and install that manually.

mike-fabian commented 2 months ago

@rank-coder Did you try it? Does it work well?

rank-coder commented 2 months ago

@rank-coder Did you try it? Does it work well?

I have not had the chance to try it yet.

rank-coder commented 2 months ago

@rank-coder Did you try it? Does it work well?

I tried it rn. It works well.

mike-fabian commented 2 months ago

@rank-coder Did you try it? Does it work well?

I tried it rn. It works well.

Great, I have released 2.25.16 with this fix: https://github.com/mike-fabian/ibus-typing-booster/releases/tag/2.25.16

mike-fabian commented 1 month ago

https://corp.unicode.org/pipermail/unicode/2024-October/011095.html