Closed by GoogleCodeExporter 9 years ago
Sorry for the delay.
RE2 works on a sequence of Unicode code points, so it sees "u:" and "ü" as
completely different.
Typically people address this ambiguity by normalizing their strings into a
standard form and then writing regular expressions that expect that form. For
example, you might convert all strings to NFC before doing any matching. RE2
takes the position that this is beyond the scope of the regular expression
library: it is usually much better to have those conventions in the surrounding
code.
See http://www.unicode.org/reports/tr15/ for more than you ever wanted to know
about normalization.
Original comment by rsc@golang.org
on 1 Nov 2012 at 4:45
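For anyone who wants to see the difference concretely, here is a minimal sketch using Go's standard `regexp` package, which follows RE2 syntax. It assumes that "u:" in this thread stands for `u` followed by U+0308 (combining diaeresis); the helper name `matchesWhole` is made up for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesWhole reports whether pattern matches the entire string s.
// (Hypothetical helper, not part of any library.)
func matchesWhole(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)$`).MatchString(s)
}

func main() {
	precomposed := "\u00fc" // "ü" as a single code point, U+00FC
	decomposed := "u\u0308" // "u" followed by U+0308 COMBINING DIAERESIS

	// Count code points in each form.
	fmt.Println(len([]rune(precomposed)), len([]rune(decomposed))) // 1 2

	// The two strings are different sequences of code points.
	fmt.Println(precomposed == decomposed) // false

	// A pattern written in one form does not match text in the other.
	fmt.Println(matchesWhole(precomposed, decomposed)) // false
	fmt.Println(matchesWhole(decomposed, decomposed))  // true
}
```

Both forms usually render identically on screen, which is why normalizing before matching matters.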
Hey, thanks for the answer. We do normalize; currently that leads to
separated diacritics, i.e. the "u:" kind of encoding. The question then is: is
that still recognized as a letter?
Original comment by danielob...@gmail.com
on 1 Nov 2012 at 10:35
RE2 doesn't know about letters. It knows about Unicode code points, and so it
will see the "u:" as two Unicode code points. If you write a regular expression
containing "u:" then it will match, of course, but a single character class
like "." or "[^a-z]" will only match one of those two code points. You need
".." to match "u:".
Original comment by rsc@swtch.com
on 2 Nov 2012 at 12:59
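The behaviour of "." and "[^a-z]" described above can be checked with a short sketch, again using Go's `regexp` package (RE2 syntax) and again assuming that "u:" means `u` plus U+0308. The helper `fullMatch` is a hypothetical name for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// fullMatch reports whether pattern matches all of s, not just a part of it.
// (Hypothetical helper, not part of any library.)
func fullMatch(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)$`).MatchString(s)
}

func main() {
	decomposed := "u\u0308" // two code points: "u" and the combining diaeresis

	fmt.Println(fullMatch(`.`, decomposed))    // false: "." consumes one code point
	fmt.Println(fullMatch(`..`, decomposed))   // true: one "." per code point
	fmt.Println(fullMatch(`[^a-z]`, "\u0308")) // true: the mark alone is one code point
	fmt.Println(fullMatch(`.`, "\u00fc"))      // true: precomposed "ü" is one code point
}
```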
It doesn't? What about "\pL"?
Original comment by danielob...@gmail.com
on 2 Nov 2012 at 10:24
If you believe that the two-code-point form we've been referring to as
"u:" is a letter, then RE2 doesn't know about letters. It does know
about Unicode properties, and there is a Unicode property for the single
code points that are called 'letters', and that is what \pL means. But
if you want to match "u:" you need to match it as two separate code
points. The u matches \pL and the diaeresis matches \p{Mn}
(non-spacing mark), so if you want to match letters in decomposed form,
you need to write something like (\pL\p{Mn}*)+ instead of \pL+. Again,
RE2 works on Unicode code points. That's all it knows.
Original comment by rsc@golang.org
on 6 Nov 2012 at 7:22
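As a rough sketch of the difference between \pL+ and a pattern that also accepts combining marks, here is a Go example (Go's `regexp` uses RE2 syntax). The variable names and the sample word are made up for the illustration, and the `*` after `\p{Mn}` is an assumption that a letter may carry zero or more marks.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Letters only: every code point must have the Unicode property L.
	lettersOnly = regexp.MustCompile(`^\pL+$`)

	// A letter followed by zero or more non-spacing marks, repeated.
	lettersWithMarks = regexp.MustCompile(`^(?:\pL\p{Mn}*)+$`)
)

func main() {
	nfc := "M\u00fcnchen"  // precomposed "ü" (U+00FC)
	nfd := "Mu\u0308nchen" // "u" followed by U+0308 COMBINING DIAERESIS

	fmt.Println(lettersOnly.MatchString(nfc))      // true
	fmt.Println(lettersOnly.MatchString(nfd))      // false: U+0308 is Mn, not L
	fmt.Println(lettersWithMarks.MatchString(nfc)) // true
	fmt.Println(lettersWithMarks.MatchString(nfd)) // true
}
```

If the input is normalized to NFC first, the simpler \pL+ is enough for characters that have a precomposed form.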
OK, I get it. Sorry to cause so much agitation. I am not familiar enough with
the Unicode vocabulary and should have looked up "code point" when you
mentioned it.
It might warrant a mention in the docs, though, that it makes a lot of sense
to always normalize to NFC :), because otherwise the right pattern for a
letter effectively becomes (\pL\p{Mn}*). Or maybe some kind of shorthand
for this could be introduced into RE2?
Thanks a lot for bearing with me.
Original comment by danielob...@gmail.com
on 6 Nov 2012 at 8:51
Added a section to http://code.google.com/p/re2/wiki/CplusplusAPI.
Original comment by rsc@golang.org
on 26 Nov 2012 at 7:10
Original issue reported on code.google.com by
danielob...@gmail.com
on 24 Oct 2012 at 2:50