lamuguo / re2

Automatically exported from code.google.com/p/re2
BSD 3-Clause "New" or "Revised" License

implement or comment on diacritics handling #71

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
In Unicode, characters with umlauts, accents, etc. can be encoded either as a 
single symbol or as a base letter followed by a "suffix", i.e. ü can be encoded 
as u: (where the : here stands for a special combining version that marks it as 
a "diacritic", as these things are called). In our application we map 
everything to the second form, since it makes a lot of things easier. Does re2 
handle this form, i.e. does it parse the u: sequence as a single letter? If 
not, is there a known workaround?

Original issue reported on code.google.com by danielob...@gmail.com on 24 Oct 2012 at 2:50

GoogleCodeExporter commented 9 years ago
Sorry for the delay.

RE2 works on a sequence of Unicode code points, so it sees "u:" and "ü" as 
completely different. 

Typically people address this ambiguity by normalizing their strings into a 
standard form and then writing regular expressions that expect that form. For 
example, you might convert all strings to NFC before doing any matching. RE2 
takes the position that this is beyond the scope of the regular expression 
library: it is usually much better to have those conventions in the surrounding 
code.

See http://www.unicode.org/reports/tr15/ for more than you ever wanted to know 
about normalization.

Original comment by rsc@golang.org on 1 Nov 2012 at 4:45

GoogleCodeExporter commented 9 years ago
Hey, thanks for the answer. We do normalize. Currently that leads to 
separated diacritics, i.e. the "u:" kind of encoding. The question then is: is 
that still recognized as a letter? 

Original comment by danielob...@gmail.com on 1 Nov 2012 at 10:35

GoogleCodeExporter commented 9 years ago
RE2 doesn't know about letters. It knows about Unicode code points, and so it 
will see the "u:" as two Unicode code points. If you write a regular expression 
containing "u:" then it will match, of course, but a single character class 
like "." or "[^a-z]" will only match one of those two code points. You need 
".." to match "u:".

Original comment by rsc@swtch.com on 2 Nov 2012 at 12:59

GoogleCodeExporter commented 9 years ago
It doesn't? What about "\pL"?

Original comment by danielob...@gmail.com on 2 Nov 2012 at 10:24

GoogleCodeExporter commented 9 years ago
If you believe that the two-code point form we've been referring to as
"u:" is a letter, then RE2 doesn't know about letters. It does know
about Unicode properties, and there is a Unicode property for single
code points that are called 'letters', and that is what \pL means. But
if you want to match "u:" you need to match it as two separate code
points. The u matches \pL and the diaeresis matches \p{Mn}
(non-spacing mark) so if you want to match letters in decomposed form,
you need to write something like (\pL\p{Mn})+ instead of \pL+. Again,
RE2 works on Unicode code points. That's all it knows.

Original comment by rsc@golang.org on 6 Nov 2012 at 7:22
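
As a rough check of the two patterns named in the comment above, the following sketch runs them against a fully decomposed string. It again assumes Go's regexp package as a stand-in for RE2 syntax; the variable names are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	letters       = regexp.MustCompile(`\pL+`)
	letterAndMark = regexp.MustCompile(`(\pL\p{Mn})+`)
)

func main() {
	// Decomposed "üö": u, U+0308, o, U+0308.
	s := "u\u0308o\u0308"

	// \pL+ stops at the first combining mark, because U+0308 has
	// general category Mn, not L.
	fmt.Println(letters.FindString(s) == "u") // true

	// Pairing each letter with one non-spacing mark covers the whole string.
	fmt.Println(letterAndMark.FindString(s) == s) // true
}
```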

GoogleCodeExporter commented 9 years ago
OK, I get it. Sorry to cause so much agitation. I am not familiar enough with 
the Unicode vocabulary and should have looked up "code point" when you 
mentioned it. 
It might warrant a mention in the docs, though, that it makes a lot of sense 
to always normalize to NFC :). Otherwise the right pattern for a letter 
effectively becomes (\pL\p{Mn}*). Or maybe some kind of shorthand for this 
could be introduced into re2?
Thanks a lot for bearing with me.

Original comment by danielob...@gmail.com on 6 Nov 2012 at 8:51
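
The (\pL\p{Mn}*) pattern suggested in the comment above makes the mark optional, so it also covers text that mixes bare letters with decomposed ones. A minimal sketch, assuming Go's regexp package as a stand-in for RE2 syntax (the name word and the sample strings are invented for this example):

```go
package main

import (
	"fmt"
	"regexp"
)

// word matches a run of letters, each optionally followed by
// any number of non-spacing marks.
var word = regexp.MustCompile(`(?:\pL\p{Mn}*)+`)

func main() {
	decomposed := "Mu\u0308ller" // u followed by U+0308
	composed := "M\u00fcller"    // precomposed U+00FC

	fmt.Println(word.FindString(decomposed) == decomposed) // true
	fmt.Println(word.FindString(composed) == composed)     // true

	// Two words, both in decomposed form, separated by ", ".
	fmt.Println(len(word.FindAllString("Mu\u0308ller, Jose\u0301", -1))) // 2
}
```

The same pattern accepts both spellings, which is convenient when the input is not guaranteed to be normalized, though normalizing first, as suggested earlier in the thread, keeps the patterns simpler.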

GoogleCodeExporter commented 9 years ago
Added a section to http://code.google.com/p/re2/wiki/CplusplusAPI.

Original comment by rsc@golang.org on 26 Nov 2012 at 7:10