orling / grapheme-splitter

A JavaScript library that breaks strings into their individual user-perceived characters.
MIT License

Support for Khmer language (non spacing mark U+17D2 COENG) #22

Open bbalet opened 5 years ago

bbalet commented 5 years ago

Thanks for your lib, it is very helpful.

However, I am experiencing issues with the Khmer language and the combining mark U+17D2 (see: https://r12a.github.io/scripts/khmer/block#char17D2), which is specific to Khmer and renders the following consonant as a subscript of the preceding one. For example, the glyph ញ្ច, which is the combination of the three code points ញ ្ ច, is treated by the splitter as the two glyphs ញ្ and ច. Note that this does not work like a ligature as in #12, but like the combinations of consonants and vowels in other Indic scripts (and such combinations are supported by the splitter).

Let me explain further with another example. The word ខ្ញុំ is composed of only one glyph. What is interesting about this word is that the vowel OM (vowel sign U + NIKAHIT) is applied to the subscript consonant NYO and not to the consonant KHA, yet the whole sequence forms a single glyph and it looks as if the vowel is applied to the first consonant KHA.

But the splitter considers this glyph as two glyphs (note that the combining mark ្ COENG is not discarded but just combined with ខ KHA, as the algorithm treats it as an Other character): ខ្ ញុំ

Btw, some useful tools: https://r12a.github.io/app-analysestring/ https://r12a.github.io/uniview/ https://r12a.github.io/pickers/khmer/

And sample text for testing purposes: ខ្ញុំ កញ្ចក់
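The code points behind the two sample words can be listed with a few lines of plain JavaScript. This is only a small sketch: `codePoints` is a hypothetical helper name, not part of grapheme-splitter, and the words are written as escapes so the sequence is unambiguous.

```javascript
// Hypothetical helper (not part of any library): list code points as U+XXXX labels.
function codePoints(text) {
    // Spreading a string iterates by code point, not by UTF-16 code unit.
    return [...text].map(
        (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
    );
}

// The two sample words from this issue, written as escapes.
const khnhom = '\u1781\u17D2\u1789\u17BB\u17C6';        // KHA, COENG, NYO, vowel sign U, NIKAHIT
const kanhchak = '\u1780\u1789\u17D2\u1785\u1780\u17CB'; // KA, NYO, COENG, CA, KA, BANTOC

console.log(codePoints(khnhom).join(' '));
console.log(codePoints(kanhchak).join(' '));
```

In both words U+17D2 COENG sits between two consonants, which is exactly where the splitter places a break.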

bbalet commented 5 years ago

For those having the same issue: I am not clear either about how to patch the lib or about the author's feedback. Meanwhile, you can wrap the call to the lib as follows (it might not be very optimized because I am not so good at JS):

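// Assumes a CommonJS environment with the grapheme-splitter package installed.
const GraphemeSplitter = require('grapheme-splitter');

function splitKhmerRunes(text) {
    // normalize() returns a new string, so keep the result.
    text = text.normalize();
    const splitter = new GraphemeSplitter();
    const graphemes = splitter.splitGraphemes(text);
    let previousCodePoint = null;

    for (let i = 0; i < graphemes.length; i++) {
        if (previousCodePoint === 0x17D2) {
            // The previous cluster ends with COENG: merge this one into it.
            graphemes[i - 1] += graphemes[i];
            graphemes.splice(i, 1);
            i--;
        }
        // Last code point of the (possibly merged) cluster.
        previousCodePoint = [...graphemes[i]].pop().codePointAt(0);
    }
    return graphemes;
}

JLHwung commented 5 years ago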

@bbalet

But the splitter considers this glyph as two glyphs

grapheme-splitter implements the Extended Grapheme Cluster boundary algorithm of UAX #29, section 3.1. So yes, it works as intended: since U+1789 is neither SpacingMark nor Extend, a break occurs between U+17D2 and U+1789, so there are two grapheme clusters.
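To make that break concrete, here is a deliberately tiny sketch of the two rules involved: do not break before an Extend character, otherwise break. It is not how grapheme-splitter is implemented and it ignores every other UAX #29 rule; `splitDefaultSketch` and the `EXTEND` set are hypothetical names, and the set covers only the marks that occur in the sample words of this issue.

```javascript
// Illustrative subset only, NOT the Unicode property table: the marks in the
// sample words that are treated as Extend (COENG, vowel sign U, NIKAHIT, BANTOC).
const EXTEND = new Set([0x17D2, 0x17BB, 0x17C6, 0x17CB]);

// Hypothetical sketch of default clustering restricted to two rules.
function splitDefaultSketch(text) {
    const clusters = [];
    for (const ch of text) {
        if (EXTEND.has(ch.codePointAt(0)) && clusters.length > 0) {
            // Do not break before an Extend character.
            clusters[clusters.length - 1] += ch;
        } else {
            // Any other character (here: a consonant) starts a new cluster.
            clusters.push(ch);
        }
    }
    return clusters;
}

// U+1781 U+17D2 | U+1789 U+17BB U+17C6: the consonant U+1789 starts a new cluster.
console.log(splitDefaultSketch('\u1781\u17D2\u1789\u17BB\u17C6').length); // 2
```

COENG attaches to the consonant before it, and nothing in these default rules attaches the following consonant to COENG.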

The definitions are meant to be defaults and do not exclude more advanced definitions of tailored grapheme clusters. Besides the Khmer ខ្ញុំ, for example, the Devanagari क्षि is also a linguistic grapheme, but both of them consist of two extended grapheme clusters. As stated in UAX #29:

The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives of as characters than is provided by individual Unicode code points. ... The term cluster is used to emphasize that the term grapheme is used differently in linguistics.

So grapheme-splitter will not fix this issue, since it is a faithful implementation of UAX #29, section 3.1.

As for the workaround you pasted, I suggest you scrutinize the whole Khmer block if you would like to develop a serious Khmer-tailored grapheme cluster algorithm. Otherwise it looks good, as it matches your requirement.
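As a starting point for such a tailoring, the merging step of the workaround can be separated from the library call, so it can be tested on any array of default clusters. This is only a sketch under the same single-rule assumption (join a cluster to the previous one when that one ends with U+17D2); `mergeAfterCoeng` and `endsWithCoeng` are hypothetical names, and the rest of the Khmer block is not considered.

```javascript
const COENG = 0x17D2;

// Hypothetical helper: does the string end with U+17D2 COENG?
function endsWithCoeng(s) {
    const chars = [...s];
    return chars.length > 0 && chars[chars.length - 1].codePointAt(0) === COENG;
}

// Hypothetical tailoring step: merge each cluster into the previous one
// when the previous cluster ends with COENG. Returns a new array.
function mergeAfterCoeng(clusters) {
    const merged = [];
    for (const cluster of clusters) {
        const last = merged.length - 1;
        if (last >= 0 && endsWithCoeng(merged[last])) {
            merged[last] += cluster;
        } else {
            merged.push(cluster);
        }
    }
    return merged;
}

// The two default clusters reported in this issue become one.
console.log(mergeAfterCoeng(['\u1781\u17D2', '\u1789\u17BB\u17C6']).length); // 1
```

The input would normally be the array returned by the splitter; it is passed in as a plain array here so that the sketch does not depend on any package.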

bbalet commented 5 years ago

I think you should change the short description of the lib. Indeed, stating that the lib is "A JavaScript library that breaks strings into their individual user-perceived characters" is not true for all Brahmic scripts (Devanagari, which you quoted in your example, is a Brahmic script just like Khmer).

The content of the README is misleading, with sentences such as "It can be used to properly split JavaScript strings into what a human user would call separate letters" and examples that make you feel it works for all cases.

The Unicode standard is a bit difficult to grasp or, if I may say so, it is difficult to understand the consequences of some of its choices. So you should keep the quote, along with one or two examples of the limitations of the lib.

rakusai commented 5 years ago

@bbalet To satisfy your needs, you may try https://www.npmjs.com/package/split-graphemes. This library supports ស្ដិ៍, which includes ្ (U+17D2).

In Chrome (macOS), when you press backspace, it will not delete them as one letter, but when you select the letter using shift + right arrow, it will select them as one letter. I guess that backspace is meant to undo the typing (I would like to confirm with a native Khmer user whether this behavior is useful or not).

Also, I understand that if ស្ដិ៍ is split across two lines like

Hello, ស
ដ! How are you?

people cannot read it. macOS never splits it across two lines. So if you want a more practical library, try the library above.

bbalet commented 5 years ago

@rakusai Yes, you are right. Westerners can still read a text if the font doesn't support the ligature, for example ex-æquo and ex-aequo, or œuf and oeuf. But Khmer people consider ស្ដ a different entity from សដ. They have a different meaning and pronunciation (sɗɑː and sɑːt). I will give your lib a try :)

rakusai commented 5 years ago

FYI, I should point out that many editors cannot render Khmer at all.

Khmer: ការសាងសង់ស្ពាននេះ បានឆ្លងកាត់ឧបសគ្គជាច្រេីន។

English: The construction of the bridge went through many obstacles.

Sublime Text: image

BBEdit (= TextWrangler): image

You see lots of meaningless signs: U+17D2 (rendered as a dotted circle with a plus mark). This is not a font problem; the problem is in the app's implementation. We cannot trust these apps to test the rendering.

Even the default macOS components have some bugs: in TextEdit, paste the sample text above, place the caret at the end of the line, and press shift + left arrow. It will not let me move the caret to the left.

image

So for Khmer, Chromium is perhaps the most tested and reliable implementation. You may want to check its source code.

bbalet commented 5 years ago

VSCode, Notepad++, and NetBeans work fine with Khmer. I am currently working in Cambodia and I've asked my team to work with VSCode.

kotpal commented 5 years ago

Microsoft's Uniscribe library functions ScriptItemize and ScriptBreak provide the necessary character break information that I am looking for. Uniscribe FTW