jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

Support for regex in Property Values #18

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I just found a chapter in the Unicode guidelines for regular expressions
concerning property values and would like to ask whether support for regex
patterns (or some subset thereof) in property values would be possible.
http://unicode.org/reports/tr18/#Wildcard_Properties

I am not aware of any implementation that already supports it, nor do I know
how much extra complexity it would add to the parser, but it looks like a
feature orthogonal to the Unicode properties and the set operations which regex
already has.

I realise that the use cases are rather specialised; in my case it
would allow investigating the Unicode character repertoire itself.
(Currently I can do something like that after grabbing all the character names
via unicodedata.)
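
For illustration, the unicodedata workaround mentioned above could look roughly like the sketch below. It is only an approximation of a wildcard property such as the `\p{name=/.../}` form described in TR18, done outside the regex engine; the function name `chars_with_name_matching` and the sample pattern are made up for this example and are not part of the regex module.

```python
import re
import sys
import unicodedata


def chars_with_name_matching(pattern):
    """Return every character whose Unicode name matches `pattern`.

    Hypothetical helper: it scans the whole code point range once and
    applies re.search to each character name.
    """
    name_re = re.compile(pattern)
    matches = []
    for codepoint in range(sys.maxunicode + 1):
        # unicodedata.name() returns the default "" for unnamed code points.
        name = unicodedata.name(chr(codepoint), "")
        if name and name_re.search(name):
            matches.append(chr(codepoint))
    return matches


# Characters whose name ends in "SMALL LETTER O", "... OH" or "... OMICRON".
small_o_like = chars_with_name_matching(r" SMALL LETTER (O|OH|OMICRON)$")
for char in small_o_like:
    print(f"{char} (hex: {ord(char):#x}) {unicodedata.name(char)}")
```

The resulting list could then be joined into an ordinary character class by hand, which is roughly what a wildcard property would have to do internally.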

Otherwise, on "normal" text, it could cover the cases where multiple
character ranges should be considered together (such as basic xxx, xxx
supplement, xxx extended ...). (I am not sure how exactly the current Script
property relates to this.)
E.g. some errors or even spoofing attempts might be checked for on graphically
similar characters from different ranges, cf.:
o (dec: 111; hex: 0x6f) LATIN SMALL LETTER O
ο (dec: 959; hex: 0x3bf) GREEK SMALL LETTER OMICRON
о (dec: 1086; hex: 0x43e) CYRILLIC SMALL LETTER O
օ (dec: 1413; hex: 0x585) ARMENIAN SMALL LETTER OH
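
A check of this kind can already be roughly sketched with unicodedata alone. The snippet below uses the first word of each character name (LATIN, GREEK, CYRILLIC, ...) as a crude stand-in for the range or script; that is an assumption made for this example, not the real Script property, and the helper names `name_prefixes` and `looks_mixed` are hypothetical.

```python
import unicodedata


def name_prefixes(text):
    """Return the set of first words of the Unicode names of the letters in `text`."""
    prefixes = set()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                # Crude heuristic: "CYRILLIC SMALL LETTER O" -> "CYRILLIC".
                prefixes.add(name.split()[0])
    return prefixes


def looks_mixed(text):
    """Flag text whose letters come from more than one name prefix."""
    return len(name_prefixes(text)) > 1


print(looks_mixed("hello"))        # plain Latin
print(looks_mixed("hell\u043e"))   # last letter is CYRILLIC SMALL LETTER O
```

A wildcard on the name or block property would make the same kind of test expressible directly in a pattern instead of in surrounding code.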

I'd like to stress that this is only meant as a proposal for consideration; it
surely wouldn't be worth extensive effort or the risk of becoming a source of
bugs.

Regards,
  Vlastimil Brom

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 15 Sep 2011 at 3:51

GoogleCodeExporter commented 9 years ago
I'm not sure how easy it is to parse, compile and use a regex which is embedded 
in a regex in order to find the matches while parsing the regex in which it is 
embedded. (Sometimes I think that they just write the specification without 
thinking about how it might be implemented! :-) They've already had second 
thoughts on requiring matching to handle normalisation transparently in the 
higher levels because of the practical problems with it.)

I'll let you know if I ever decide to attempt it. :-)

Original comment by re...@mrabarnett.plus.com on 15 Sep 2011 at 4:45