k-takata / Onigmo

Onigmo is a regular expressions library forked from Oniguruma.

Name2CType data wrong for many Indic scripts? #146

Open deepestblue opened 4 years ago

deepestblue commented 4 years ago

I found this when trying to use Ruby Regexp on Tamil Unicode codepoint data.

irb(main):002:0> "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9".scan(/[[:alpha:]]+/).each { |s| puts s.dump }
"\u0BAE\u0BC0\u0BA9"
"\u0BA9"
=> ["மீன", "ன"]
irb(main):003:0>

Notice that both \u0BC0 and \u0BCD are combining marks in the Mark, Nonspacing [Mn] character category, which should match the [:alpha:] class. But \u0BCD does not seem to match the class. Stack Overflow told me Ruby uses Onigmo under the hood, and I found the following excerpt in name2ctype.h in CR_Alpha, CR_Alnum, etc.

    0x0bca, 0x0bcc,
    0x0c01, 0x0c03,

Notice the missing 0x0bcd.
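To see exactly which codepoint breaks the match, the word can be tested one character at a time. This is only a sketch: `alpha_report` is a hypothetical helper name, not part of Ruby or Onigmo.

```ruby
# Report, for each codepoint in a string, whether it matches [[:alpha:]].
# alpha_report is a hypothetical helper written for this illustration.
def alpha_report(str)
  str.each_char.map do |ch|
    [format("U+%04X", ch.ord), ch.match?(/\A[[:alpha:]]\z/)]
  end
end

word = "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9"
alpha_report(word).each { |codepoint, ok| puts "#{codepoint} alpha=#{ok}" }
```

Consistent with the scan output above, only U+0BCD should be reported as not matching.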

P.S. I found a number of other missing Indic codepoints as well in that file. If you agree this is a bug I can look in the file some more and do an audit. Thanks!

JoergWMittag commented 4 years ago

See "Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby?" on Stack Overflow for a discussion, partially reproduced below:

The two characters in question are (I have marked some interesting things in bold):

The Ruby documentation for the Regexp class does not explicitly spell out what [[:alpha:]] matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]] as an example, saying it matches anything with the Unicode property Nd (Decimal Number).
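The documented [[:digit:]] behaviour can be checked directly. The sketch below uses two Tamil codepoints chosen for illustration: U+0BE7 (TAMIL DIGIT ONE, category Nd) and U+0BF0 (TAMIL NUMBER TEN, category No). `digit_info` is a hypothetical helper name.

```ruby
# Compare [[:digit:]] with the Unicode general category Nd for one character.
# digit_info is a hypothetical helper written for this illustration.
def digit_info(ch)
  {
    posix_digit: ch.match?(/\A[[:digit:]]\z/),
    nd:          ch.match?(/\A\p{Nd}\z/)
  }
end

["\u0BE7", "\u0BF0"].each do |ch|
  puts format("U+%04X %p", ch.ord, digit_info(ch))
end
```

If [[:digit:]] follows Nd as documented, the two columns agree for both characters: true for U+0BE7 and false for U+0BF0.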

While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.

On the other hand, the documentation for Onigmo does explicitly specify the workings of [[:alpha:]]. In fact, it specifies it in two different places, and they contradict each other:

So what seems to be going on is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation is incorrect, and the Ruby documentation is imprecise.
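This reading can be checked from Ruby by comparing the POSIX bracket expression with the Alphabetic property and the Mn general category for the two characters. A minimal sketch, where `classify` is a hypothetical helper name:

```ruby
# Classify one character against [[:alpha:]], \p{Alphabetic} and \p{Mn}.
# classify is a hypothetical helper written for this illustration.
def classify(ch)
  {
    posix_alpha: ch.match?(/\A[[:alpha:]]\z/),
    alphabetic:  ch.match?(/\A\p{Alphabetic}\z/),
    mn:          ch.match?(/\A\p{Mn}\z/)
  }
end

["\u0BC0", "\u0BCD"].each do |ch|
  puts format("U+%04X %p", ch.ord, classify(ch))
end
```

Both characters are in Mn, but only U+0BC0 has the Alphabetic property, and [[:alpha:]] tracks Alphabetic rather than the general category.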

deepestblue commented 4 years ago

Thanks, Joerg.

> While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.

Given [[:digit:]] matches Unicode category Nd, for the sake of consistency I'd rather [[:alpha:]] match the union of Unicode category Letter and Unicode category Mark, rather than Unicode property Alphabetic.
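Until or unless [[:alpha:]] changes, the union described here can be spelled out explicitly with general-category properties. A sketch of that workaround, with `letter_or_mark_runs` as a hypothetical helper name:

```ruby
# Scan for runs of characters in general category Letter (L) or Mark (M).
# letter_or_mark_runs is a hypothetical helper written for this illustration.
def letter_or_mark_runs(str)
  str.scan(/[\p{L}\p{M}]+/)
end

word = "\u0BAE\u0BC0\u0BA9\u0BCD\u0BA9"
puts word.scan(/[[:alpha:]]+/).map(&:dump).inspect  # split at U+0BCD
puts letter_or_mark_runs(word).map(&:dump).inspect  # one run for the whole word
```

Note that \p{M} also admits marks outside any particular script, so whether this union is the right definition for a given application is a separate question.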