Closed GoogleCodeExporter closed 9 years ago
It folds the case according to the contents of:
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
"ß" folds to "ss", therefore "ß" is equivalent to "ss". Perl does the same.
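The equivalence can be checked outside any regex engine. A minimal sketch, assuming Python 3, whose built-in `str.casefold()` applies full case folding; `caseless_equal` is a hypothetical helper name used only for illustration and is not part of regex:

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()


print("ß".casefold())                        # ss
print(caseless_equal("ß", "ss"))             # True
print(caseless_equal("Straße", "STRASSE"))   # True
```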
An accented omicron doesn't match a non-accented omicron. That's not full
case-folding, but is a form of super-insensitivity, ignoring diacritics and
involving normalisation, which is a whole new problem! Have a look at:
http://unicode.org/review/pri179/
There does appear to be a bug, in that "Σ" should match "ς".
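The difference between case folding and accent-insensitive comparison can be sketched with the Python 3 standard library. This is only an illustration of the two notions, not of how regex matches; `strip_marks` is a hypothetical helper, and dropping combining marks after NFD decomposition is an assumed, deliberately crude way of ignoring diacritics:

```python
import unicodedata

OMICRON = "\N{GREEK SMALL LETTER OMICRON}"
OMICRON_TONOS = "\N{GREEK SMALL LETTER OMICRON WITH TONOS}"


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The capital, small and final sigma all share one case-folded form.
print(len({s.casefold() for s in "Σσς"}))                      # 1

# Case folding alone keeps the accent, so the omicrons stay different.
print(OMICRON_TONOS.casefold() == OMICRON.casefold())          # False

# Stripping marks first is a separate, looser comparison.
print(strip_marks(OMICRON_TONOS).casefold() == OMICRON.casefold())  # True
```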
Original comment by re...@mrabarnett.plus.com
on 17 Sep 2011 at 2:32
That bug is now fixed.
Original comment by re...@mrabarnett.plus.com
on 17 Sep 2011 at 4:50
Thanks for the quick fix and the pointer to the revised recommendation. For the
Greek accents, I just relied on the example in 5.18: Caseless Matching, which
involves both accented omicron and sigma variants.
It seems I will have to be more careful when using case-insensitive search.
vbr
Original comment by Vlastimil.Brom@gmail.com
on 17 Sep 2011 at 10:14
It occurred to me that it might be better to allow full case-folding to be
turned off, which leads to the questions:
Should there be a flag for full case-folding?
Should full case-folding be off by default?
Should full case-folding be off at least with VERSION0? (I think perhaps it
should!)
This leads to the proposed examples:
Ignore case, simple case-folding: (?i)ss
Ignore case, full case-folding: (?if)ss
The flag could be scoped, for example:
(?if)cla(?-f:ss)
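To make the distinction concrete, here is a rough pure-Python model of the two modes. It is a sketch under stated assumptions, not the regex implementation: `simple_fold` and `matches` are hypothetical names, and simple case-folding is approximated by keeping only the one-to-one results of `str.casefold()`, which is not exactly the simple mapping in CaseFolding.txt.

```python
def simple_fold(text: str) -> str:
    """Approximate simple case folding: keep only one-to-one mappings."""
    out = []
    for ch in text:
        folded = ch.casefold()
        # Assumption: a multi-character result is a "full" mapping, so skip it.
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


def matches(pattern: str, text: str, full: bool = False) -> bool:
    """Case-insensitive equality; `full` plays the role of the proposed "f" flag."""
    fold = str.casefold if full else simple_fold
    return fold(pattern) == fold(text)


for candidate in ["ss", "sS", "ß"]:
    print(candidate, matches("ss", candidate), matches("ss", candidate, full=True))
```

In this model "ss" and "sS" match in both modes, while "ß" matches only when `full=True`.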
Original comment by re...@mrabarnett.plus.com
on 18 Sep 2011 at 8:15
I think the ability to turn full case-folding on and off for case-insensitive
matches would be very useful.
I believe that in VERSION0 it should be off by default;
for VERSION1, it would probably be expected to be on by default, as it is
mentioned in the Unicode recommendation after all.
(Although I would personally prefer to have it off there too, as I consider it too
much magic, strongly language- and application-dependent, and possibly
incompletely defined in the Unicode data files; e.g. for German, one can in
certain circumstances also expect ß <=> sz, Ue <=> Ü etc.)
But as long as it is (un)settable, I am happy with either default.
Just a comment on the status and syntax of the "f" flag: would it be an individual
flag rather than a compound one, "if"?
How would it work as an inline flag in the absence of "i"? (Would it be ignored, or raise an exception?)
Would it stay "alive" after the single "i" was unset, so that the next "i"
would become "if"?
(?if)ab(?-i)cd(?i)efß...
I am not sure, however, whether a compound "if" in alternation with "i" would
be clearer...
Regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 18 Sep 2011 at 9:17
I believe that most of the time you would want full case-folding either on or
off, so the flag would indicate "if ignoring case, perform full case-folding".
In your example:
(?if)ab(?-i)cd(?i)efß...
you turn on full case-folding for the pattern, but it would affect only those
parts which are case-insensitive.
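The rule described above can be modelled with a few lines of Python. This is only a sketch of the proposed semantics, assuming that inline flags apply from their position onward; it does not reflect how regex actually parses patterns, and `FLAG_GROUP`, `mode_for` and `folding_by_segment` are hypothetical names. The standard `re` module is used here only to split the pattern string.

```python
import re

# Matches inline flag groups such as "(?if)", "(?-i)" or "(?i-f)".
FLAG_GROUP = re.compile(r"\(\?([a-z]*)(?:-([a-z]+))?\)")


def mode_for(active: set) -> str:
    """'f' only matters while 'i' is on."""
    if "i" not in active:
        return "exact"
    return "full" if "f" in active else "simple"


def folding_by_segment(pattern: str) -> list:
    """Return (literal_text, mode) pairs for a pattern of literals and flag groups."""
    active, pos, result = set(), 0, []
    for m in FLAG_GROUP.finditer(pattern):
        if m.start() > pos:
            result.append((pattern[pos:m.start()], mode_for(active)))
        active |= set(m.group(1))        # flags being turned on
        active -= set(m.group(2) or "")  # flags being turned off
        pos = m.end()
    if pos < len(pattern):
        result.append((pattern[pos:], mode_for(active)))
    return result


print(folding_by_segment("(?if)ab(?-i)cd(?i)efß"))
# [('ab', 'full'), ('cd', 'exact'), ('efß', 'full')]
```

The "f" flag stays set across `(?-i)`, but it has no effect on "cd" because that part is case-sensitive.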
As for having it on by default, how would you turn it off? Would that mean that
in VERSION1, IGNORECASE is really `regex.IGNORECASE | regex.FULLCASEFOLDING`,
and that for simple case-folding you would need to write `regex.IGNORECASE &
~regex.FULLCASEFOLDING`, or "(?i-f)" inline? (Remember that for VERSION0 it would
be off by default, and turned on by `regex.FULLCASEFOLDING` or "(?f)".)
Original comment by re...@mrabarnett.plus.com
on 18 Sep 2011 at 9:57
Ok, I somehow anticipated the full case-folding to be on by default, as this is
what regex does now, after this feature was introduced. In that case the
separate f-flag would need to be "subtracted" as in your example, if simple
case conversion were needed. It might not read very elegantly, but I guess it
is now rather a question of what the users of regex consider "normal" for case
insensitivity (beyond ASCII).
Personally, I would rather have full case-folding switched off by default, but
I might be biased towards a more explicit control over the matching behaviour
(which would involve studying UNIDATA/CaseFolding.txt and possibly the relevant
Unicode documentation before using this feature). (On the other hand, this is
the same for other features, like properties etc., but these are more explicit.)
vbr
Original comment by Vlastimil.Brom@gmail.com
on 18 Sep 2011 at 10:51
regex 0.1.20110922 now supports both simple and full case-folding.
In version 0 behaviour, full case-folding is off by default.
In version 1 behaviour, full case-folding is on by default.
Full case-folding is controlled by the FULLCASE flag or "(?f)". The flag
affects how the IGNORECASE flag works.
Original comment by re...@mrabarnett.plus.com
on 22 Sep 2011 at 12:30
Thanks for the update,
I'd just like to clarify a possible interference of the new flag with the
custom modification to achieve the VERSION1 default mentioned in
http://code.google.com/p/mrab-regex-hg/issues/detail?id=20#c2
There may be some bug in my "converter" to V1, or the new f flag isn't set with
the default V1; however, after manually setting V1 or f, this works as expected.
(Another v1 feature - nested sets - works ok by default.)
>>> import regex
>>> regex.DEFAULT_VERSION == regex.VERSION0
True
>>> import regex_v1
>>> regex_v1.DEFAULT_VERSION == regex_v1.VERSION1
True
>>> regex_v1.DEFAULT_VERSION == regex.VERSION1
True
>>> regex.findall(u"ss", u"-ss-sS-ß-")
[u'ss']
>>> regex.findall(u"(?i)ss", u"-ss-sS-ß-")
[u'ss', u'sS']
>>> regex.findall(u"(?V1i)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>> regex.findall(u"(?if)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>>
>>> regex_v1.findall(u"ss", u"-ss-sS-ß-")
[u'ss']
>>> regex_v1.findall(u"(?i)ss", u"-ss-sS-ß-")
[u'ss', u'sS'] ############################# <-- no full casefolding here ####
>>> regex_v1.findall(u"(?V1i)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>> regex_v1.findall(u"(?if)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>>
>>> regex.findall(ur"[\w--[aeiouy]]", u"abcdefghij")
[]
>>> regex.findall(ur"(?V1)[\w--[aeiouy]]", u"abcdefghij")
[u'b', u'c', u'd', u'f', u'g', u'h', u'j']
>>> regex_v1.findall(ur"[\w--[aeiouy]]", u"abcdefghij")
[u'b', u'c', u'd', u'f', u'g', u'h', u'j'] ####### <-- nested set ok ####
>>>
Am I missing something?
vbr
Original comment by Vlastimil.Brom@gmail.com
on 22 Sep 2011 at 7:14
No, you're not missing anything; that's a bug. It is now fixed in regex
0.1.20110922a.
Original comment by re...@mrabarnett.plus.com
on 22 Sep 2011 at 8:20
Original issue reported on code.google.com by
Vlastimil.Brom@gmail.com
on 17 Sep 2011 at 12:37