Open deepestblue opened 4 years ago
See "Why do some Unicode combining markers (like \u0BCD) not match [:alpha:] in Ruby?" on Stack Overflow for a discussion, partially reproduced below:
The two characters in question are (I have marked some interesting things in bold):
The Ruby documentation for the Regexp class does not explicitly spell out what [[:alpha:]] matches, but it does say that the POSIX bracket expressions match non-ASCII characters, and it gives [[:digit:]] as an example, saying it matches anything with the Unicode property Nd (Decimal Number).
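The [[:digit:]] behavior described above can be checked with a short sketch. The constant and method names here are hypothetical, chosen only for illustration, and U+0BE7 (TAMIL DIGIT ONE) is picked as a sample Nd character because the thread deals with Tamil data.

```ruby
# TAMIL DIGIT ONE, general category Nd (sample character chosen for illustration).
TAMIL_DIGIT_ONE = "\u0BE7"

# POSIX bracket expression: Unicode-aware for UTF-8 strings.
def posix_digit?(str)
  str.match?(/[[:digit:]]/)
end

# \d shorthand: matches ASCII digits only in Ruby.
def ascii_digit?(str)
  str.match?(/\d/)
end

puts posix_digit?(TAMIL_DIGIT_ONE)        # true
puts ascii_digit?(TAMIL_DIGIT_ONE)        # false
puts TAMIL_DIGIT_ONE.match?(/\p{Nd}/)     # true
```
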
While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.
On the other hand, the documentation for Onigmo does explicitly specify the workings of [[:alpha:]]. In fact, it specifies it in two different places, and they contradict each other:
In doc/RE, it says that [[:alpha:]] matches Letter | Mark.
In doc/UnicodeProps.txt, it seems to imply that [[:alpha:]] matches Alphabetic.
So, what seems to be going on is that the Unicode Consortium does not consider U+0BCD to be alphabetic, and therefore Onigmo and Ruby do not classify it as [[:alpha:]]. In that case, the Onigmo documentation is incorrect, and the Ruby documentation is imprecise.
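One way to see the difference between the two readings of [[:alpha:]] is to compare it against \p{Alphabetic} and \p{M} for the two characters in question. This is a minimal sketch; `class_report` and the constant names are hypothetical helpers, not part of Ruby or Onigmo.

```ruby
VOWEL_SIGN_II = "\u0BC0"  # TAMIL VOWEL SIGN II
VIRAMA        = "\u0BCD"  # TAMIL SIGN VIRAMA

# Reports which of the three classes a single character falls into.
def class_report(ch)
  {
    posix_alpha: ch.match?(/[[:alpha:]]/),
    alphabetic:  ch.match?(/\p{Alphabetic}/),
    mark:        ch.match?(/\p{M}/),
  }
end

[VOWEL_SIGN_II, VIRAMA].each do |ch|
  puts format("U+%04X %p", ch.ord, class_report(ch))
end
```

If [[:alpha:]] follows the Alphabetic property, the `posix_alpha` and `alphabetic` columns should agree for both characters, while `mark` is true for both.
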
Thanks, Joerg.
While not explicitly documented, it makes sense to equate the Regexp POSIX bracket expression [[:alpha:]] with the Unicode property Alphabetic, which would mean that U+0BC0 matches and U+0BCD doesn't.
Given [[:digit:]] matches Unicode category Nd, for the sake of consistency I'd rather [[:alpha:]] match the union of Unicode category Letter and Unicode category Mark, rather than Unicode property Alphabetic.
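Until the behavior changes, the Letter | Mark union can be spelled out explicitly with Unicode category escapes. The sketch below is a possible workaround, not the behavior of [[:alpha:]] itself; `LETTER_OR_MARK` and `TAMIL_WORD` are hypothetical names, and the sample word is written with escapes so that the trailing U+0BCD is visible.

```ruby
# Union of general categories Letter (L) and Mark (M).
LETTER_OR_MARK = /[\p{L}\p{M}]/

# The word "Tamil" in Tamil script: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD.
TAMIL_WORD = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

def letter_or_mark?(ch)
  ch.match?(LETTER_OR_MARK)
end

# [[:alpha:]]+ stops before the final virama; the explicit union does not.
puts TAMIL_WORD[/[[:alpha:]]+/].length        # 4
puts TAMIL_WORD[/[\p{L}\p{M}]+/].length       # 5
```
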
I found this when trying to use Ruby Regexp on Tamil Unicode codepoint data.
Notice that both \u0BC0 and \u0BCD are combining vowel markers in the Mark, Nonspacing [Mn] character category, which should match the [:alpha:] class. But \u0BCD does not seem to match the class. Stack Overflow told me Ruby uses Onigmo under the hood, and I found the following excerpt in name2ctype.h in CR_Alpha, CR_Alnum, etc.
Notice the missing 0x0bcd.
P.S. I found a number of other missing Indic codepoints as well in that file. If you agree this is a bug I can look in the file some more and do an audit. Thanks!
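An audit of this kind could also be approximated from Ruby itself, without reading name2ctype.h, by listing code points that are in Letter or Mark but are not matched by [[:alpha:]]. This is only a sketch: the method name and the choice of the Tamil block (U+0B80..U+0BFF) as the scan range are assumptions made for illustration, and the result depends on the Unicode version bundled with the Ruby in use.

```ruby
# Returns the code points in `range` that are Letter or Mark
# but are not matched by the POSIX bracket expression [[:alpha:]].
def letters_and_marks_outside_alpha(range)
  range.select do |cp|
    ch = cp.chr(Encoding::UTF_8)
    ch.match?(/[\p{L}\p{M}]/) && !ch.match?(/[[:alpha:]]/)
  end
end

TAMIL_BLOCK = 0x0B80..0x0BFF

letters_and_marks_outside_alpha(TAMIL_BLOCK).each do |cp|
  puts format("U+%04X", cp)
end
```

The same method can be pointed at other Indic blocks by passing a different range; ranges that include surrogate code points (U+D800..U+DFFF) would need to skip them, since `Integer#chr` raises for those.
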