Support for Hebrew diacritics and other grapheme extenders

mathiasbynens / esrever

A Unicode-aware string reverser written in JavaScript.

https://git.io/esrever

MIT License


Support for Hebrew diacritics and other grapheme extenders #5

Open noomorph opened 10 years ago

noomorph commented 10 years ago

Hello, I've used your online demo http://mothereff.in/reverse-string and tried entering some Hebrew with niqqud (diacritics) there. Here is what I got:

Actual result: שָׁלוֹם (shalom) gets reversed to םֹולָׁש, which is nonsense because lamed (ל) picks up the diacritics from ש (compare שָׁ -> לָׁ).

Expected: שָׁלוֹם should at least be reversed to םוֹלשָׁ, so that each letter keeps its diacritics.

What do you think?

mathiasbynens commented 10 years ago

Why this doesn’t work right now: U+05B8 HEBREW POINT QAMATS is not assigned to any of the dedicated combining mark blocks, which is all Esrever currently looks at, but it does act like a combining mark. As CodePoints.net says:

In text U+05B8 behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.

You’re right: Esrever should probably support grapheme extenders, and not just combining marks.

As per http://www.unicode.org/reports/tr44/#Grapheme_Extend:

Grapheme_Extend property = Me category + Mn category + Other_Grapheme_Extend property

@Boldewyn Can you confirm this is the correct way to get all Grapheme Extenders?
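A rough way to see the difference between the two sets is to compare Unicode property escapes in a modern JavaScript engine. This is only a sketch: it assumes a runtime with ES2018 `\p{…}` support, the answers depend on the Unicode version that runtime ships with, and the helper names `isGraphemeExtend` and `isMnOrMe` are made up for illustration.

```javascript
// Property escapes require the `u` flag (ES2018+).
// Both helper names are hypothetical, not part of Esrever.
const isGraphemeExtend = (ch) => /^\p{Grapheme_Extend}$/u.test(ch);
const isMnOrMe = (ch) => /^[\p{Mn}\p{Me}]$/u.test(ch);

const samples = {
  'U+05B8 HEBREW POINT QAMATS': '\u05B8',
  'U+05C1 HEBREW POINT SHIN DOT': '\u05C1',
  'U+200C ZERO WIDTH NON-JOINER': '\u200C', // listed under Other_Grapheme_Extend
  'U+05E9 HEBREW LETTER SHIN': '\u05E9',    // a base letter, for contrast
};

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, {
    graphemeExtend: isGraphemeExtend(ch),
    mnOrMe: isMnOrMe(ch),
  });
}
```

Characters such as U+200C are Grapheme_Extend without being Mn or Me, which is the part contributed by Other_Grapheme_Extend.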

noomorph commented 10 years ago

Thanks for the very quick response. For my needs (just a tiny demo) I've decided to use a RegExp instead of String.prototype.split:

"שָׁבּת שָׁלוֹם".match(/.[\u0591-\u05C7]*/g).reverse().join('');

If you run it in a browser, it gives you "םוֹלשָׁ תבּשָׁ". The idea is to greedily include the diacritics in every match until another (non-niqqud) character is met.

Of course, this approach is neither universal nor accurate, partly because some of the characters in the U+05Cx range are not diacritics, and it works only for Hebrew. I'm just saying that it worked for me.
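For anyone who wants to try it, here is the same idea wrapped in a small function, as a sketch only. The name `reverseHebrew` is made up, and `[\s\S]` replaces `.` so that newlines are matched too. The last call shows the inaccuracy mentioned above: U+05BE HEBREW PUNCTUATION MAQAF is punctuation, not a diacritic, but it falls inside the range and so gets glued to the preceding letter.

```javascript
// Hypothetical helper: splits into "one character + trailing Hebrew marks"
// chunks, then reverses the chunks. Hebrew-only and approximate.
function reverseHebrew(str) {
  const chunks = str.match(/[\s\S][\u0591-\u05C7]*/g) || [];
  return chunks.reverse().join('');
}

// shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom")
const shalom = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseHebrew(shalom) === '\u05DD\u05D5\u05B9\u05DC\u05E9\u05B8\u05C1'); // true

// alef, maqaf (punctuation), bet: the maqaf stays attached to the alef
console.log(reverseHebrew('\u05D0\u05BE\u05D1') === '\u05D1\u05D0\u05BE'); // true
```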

If there is any way I can be helpful for you, just tell me. Thanks!

Boldewyn commented 10 years ago

About getting the Grapheme Extenders: Well, that's the definition ;-) The UCD maintains the Grapheme_Extend property separately, so you should be fine using that directly.

The glossary of UAX44 for diacritics also suggests that combining characters alone are not sufficient:

[...] Some diacritics are not combining characters, and some combining characters are not diacritics.

noomorph commented 10 years ago

Do you mean that esrever works as expected? If yes then I also agree: sha-l-o-m gets reversed to m-o-la-sh, and this is phonetically correct. I've marked letters which get qamatz, with asterisk.

The only problem I see here is the U+05C1 and U+05C2 characters, the shin and sin dots for שׁ (shin) and שׂ (sin). They do not make any sense when moved to another letter, because their only purpose is to specify which ש letter it is.

I think it's better to keep them together with ש.

The other thing is the final forms of Hebrew letters (sofit): מ (not at the end of a word) becomes ם (at the end), e.g. שלום -> מולש.

I think this is also worth a note when reversing words.

mathiasbynens commented 10 years ago

@noomorph I was explaining that Esrever works as currently advertised, in that it only takes care of combining marks.

But I agree we should change that and also take care of grapheme extenders.

mathiasbynens commented 10 years ago

Do Grapheme_Extend characters only apply to Grapheme_Base characters?

noomorph commented 10 years ago

Thank you, I will be watching this thread. Unfortunately, I have never delved into the depths of Unicode, so I cannot help. =(

patch commented 10 years ago

The job would be much easier if JavaScript supported \X in regular expressions for matching a Unicode extended grapheme cluster.

$ perl -CS -Mutf8 -E 'say join("", reverse("שָׁלוֹם" =~ /\X/g))'
םוֹלשָׁ
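For comparison, JavaScript runtimes that ship `Intl.Segmenter` can split a string into extended grapheme clusters without `\X`. The sketch below assumes such a runtime (for example a recent Node.js or browser); `reverseGraphemes` is a made-up name, and the result depends on the Unicode data built into the engine.

```javascript
// Hypothetical helper built on Intl.Segmenter (not available in older engines).
function reverseGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each item yielded by segment() has a `segment` property holding one cluster.
  return Array.from(segmenter.segment(str), (s) => s.segment)
    .reverse()
    .join('');
}

// shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom")
const word = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseGraphemes(word) === '\u05DD\u05D5\u05B9\u05DC\u05E9\u05B8\u05C1'); // true
```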

Boldewyn commented 10 years ago

@patch not necessarily for this project. When the browser is built against an old Unicode version, the results are outdated and incorrect for newer codepoints. With the major Unicode 7 update on the horizon, this is not only an academic problem.

(E.g., for the API in codepoints.net I use PHP's implementation of NFC/NFD transformations. The PHP version uses some Unicode 5.X data internally, therefore some newer Unicode 6 codepoints get incorrect transformations.)

mathiasbynens commented 10 years ago

The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.
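A very rough sketch of that two-step idea follows. It is not a TR29 implementation: it approximates a grapheme cluster as "one code point followed by any Grapheme_Extend characters", which ignores most of the segmentation rules (Hangul syllables, emoji sequences, and so on). It assumes ES2018 property escapes, and the names `reverseCodeUnits`, `CLUSTER` and `reverseString` are made up for illustration.

```javascript
// Naive reversal by UTF-16 code unit; on its own this breaks surrogate
// pairs and detaches marks from their base characters.
const reverseCodeUnits = (s) => s.split('').reverse().join('');

// Simplified stand-in for a grapheme cluster: one code point (the `u` flag
// makes [\s\S] consume a whole surrogate pair) plus trailing extenders.
const CLUSTER = /[\s\S]\p{Grapheme_Extend}*/gu;

function reverseString(str) {
  // Step 1: reverse the code units inside each cluster, which also swaps
  // the two halves of every surrogate pair.
  const flipped = str.replace(CLUSTER, reverseCodeUnits);
  // Step 2: reverse the whole string naively. Each cluster is flipped a
  // second time, so it ends up intact but in reversed position.
  return reverseCodeUnits(flipped);
}

// shin + qamats + shin dot, lamed, vav + holam, final mem ("shalom")
const input = '\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD';
console.log(reverseString(input) === '\u05DD\u05D5\u05B9\u05DC\u05E9\u05B8\u05C1'); // true
console.log(reverseString('a\u{1D306}b') === 'b\u{1D306}a'); // true
```

Doing the per-cluster flip first means the final pass can stay a plain code-unit reversal, at the cost of walking the string twice.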