casefolding specification

GoogleCodeExporter commented 9 years ago

First, thanks for the new release (regex 0.1.20110917) ! (I especially like the 
changed fuzzy matching behaviour as discussed in 
http://code.google.com/p/mrab-regex-hg/issues/detail?id=12#c28 )

I'd like to ask about the specification of the case-folding behaviour used in 
case insensitive matching.
Is it chapter 5.18 of the Unicode standard
http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
Or did I miss something else?
I tried some patterns, where I thought these would be "caselessly" equivalent 
(based on the above)
>>> for m in regex.findall(ur"(?V1i)[ΣΟ]",u"ρς στΣόο"): print m,
... 
σ Σ ο

Here I'd have thought, the accented lowercase omicron or the positional lower 
sigma variant would be matched too.

On the other hand, the sharp s (which is more frequent in my texts) seems to be 
matched in all directions.

>>> for m in regex.findall(ur"(?V1i)ẞ",u"-s-S-ss-SS-ß-ẞ-"): print m,
... 
ss SS ß ẞ
>>> for m in regex.findall(ur"(?V1i)ss",u"-s-S-ss-SS-ß-ẞ-"): print m,
... 
ss SS ß ẞ
>>> 

I thought that only changes in case should be reflected in matching; now 
there is effectively an equivalence between lowercase ss and ß, which is 
not (at least not always) what is expected, both with respect to the current 
German orthography and for dealing with text preceding that official 
orthography regulation.

Is there now some way to handle these characters as distinct (other than not 
using the i flag)?

Where can I maybe find the specification for this behaviour? - it seems, that I 
will need to reflect it in the search patterns.

(I can't comment competently on the behaviour of the "prominent" case of the 
Turkic "i"s; personally I believe there must be other comparable cases, once 
we begin to care about them... I'd support the view of some contributors in the 
respective py-list thread ( 
http://mail.python.org/pipermail/python-list/2011-September/1280544.html ) 
that such cases are better dealt with individually, on an application basis, if 
need be. I'd just prefer keeping the flags repertoire shorter, if 
possible :-) )

Regards,
 Vlastimil Brom

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 17 Sep 2011 at 12:37

GoogleCodeExporter commented 9 years ago

It folds the case according to the contents of:

http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

"ß" folds to "ss", therefore "ß" is equivalent to "ss". Perl does the same.

An accented omicron doesn't match a non-accented omicron. That's not full 
case-folding, but is a form of super-insensitivity, ignoring diacritics and 
involving normalisation, which is a whole new problem! Have a look at:

http://unicode.org/review/pri179/
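
To make the distinction concrete, here is a minimal sketch (Python 3, standard library only) of what such accent-ignoring matching would involve on top of case folding. The helper names `strip_marks` and `loose_key` are made up for illustration; this is not something the regex module does.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD and drop combining marks such as U+0301."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def loose_key(text):
    """Hypothetical comparison key: ignore diacritics, then ignore case."""
    return strip_marks(text).casefold()

ACCENTED_OMICRON = "\u03cc"  # GREEK SMALL LETTER OMICRON WITH TONOS
CAPITAL_OMICRON = "\u039f"   # GREEK CAPITAL LETTER OMICRON

# Case folding alone keeps the accent, so the two stay different.
print(ACCENTED_OMICRON.casefold() == CAPITAL_OMICRON.casefold())  # False
# Removing the combining mark first makes them compare equal.
print(loose_key(ACCENTED_OMICRON) == loose_key(CAPITAL_OMICRON))  # True
```

The extra normalisation step is exactly the part that goes beyond case folding.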

There does appear to be a bug, in that "Σ" should match "ς".
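
The foldings mentioned in this comment can be checked independently of the regex module: Python 3's built-in `str.casefold()` also applies Unicode full case folding. The sketch below uses only that built-in; `caseless_equal` is an illustrative helper, not part of any library.

```python
# Code points are written as escapes to avoid encoding ambiguity.
SHARP_S = "\u00df"          # LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
SIGMA = "\u03a3"            # GREEK CAPITAL LETTER SIGMA
FINAL_SIGMA = "\u03c2"      # GREEK SMALL LETTER FINAL SIGMA

def caseless_equal(a, b):
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()

print(caseless_equal(SHARP_S, "ss"))          # True: sharp s folds to "ss"
print(caseless_equal(CAPITAL_SHARP_S, "SS"))  # True: both fold to "ss"
print(caseless_equal(SIGMA, FINAL_SIGMA))     # True: both fold to small sigma
print(caseless_equal("\u03cc", "\u039f"))     # False: the accent is kept
```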

Original comment by re...@mrabarnett.plus.com on 17 Sep 2011 at 2:32

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

That bug is now fixed.

Original comment by re...@mrabarnett.plus.com on 17 Sep 2011 at 4:50

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks for the quick fix and the pointer to the revised recommendation. With the 
Greek accents I just relied on the example in 5.18: Caseless Matching, which 
involves both accented omicron and sigma variants.
It seems I'll have to be more careful when using the case-insensitive search.
vbr

Original comment by Vlastimil.Brom@gmail.com on 17 Sep 2011 at 10:14

GoogleCodeExporter commented 9 years ago

It occurred to me that it might be better to allow full case-folding to be 
turned off, which leads to the questions:

Should there be a flag for full case-folding?

Should full case-folding be off by default?

Should full case-folding be off at least with VERSION0? (I think perhaps it 
should!)

This leads to the proposed examples:

Ignore case, simple case-folding: (?i)ss
Ignore case, full case-folding: (?if)ss

The flag could be scoped, for example:

(?if)cla(?-f:ss)

Original comment by re...@mrabarnett.plus.com on 18 Sep 2011 at 8:15

GoogleCodeExporter commented 9 years ago

I think, the ability to set the case-folding on and off for case-insensitive 
matches would be very useful.
I believe in VERSION0 it should be off by default; 
for VERSION1, it would probably be expected to be on by default, as it is 
mentioned in the Unicode recommendation after all.
(Although I would personally prefer to have it off too, as I consider it too 
much magic, strongly language and application dependent, and possibly 
incompletely defined in the Unicode data files; e.g. for German, one can in 
certain circumstances also expect ß <=> sz, Ue <=> Ü etc.)

But as long as it is (un)settable, I am happy with either default.

Just a comment on the "f" flag status and syntax: would it be an individual 
flag rather than a compound one, "if"?

How would this work as an inline flag in the absence of "i"? (ignoring/exception)
Would it stay "alive" after the single "i" was unset, so that the next "i" 
would become "if"?

(?if)ab(?-i)cd(?i)efß...

I am not sure, however, whether a compound "if" in alternation with "i" would 
be clearer...

Regards,
  vbr

Original comment by Vlastimil.Brom@gmail.com on 18 Sep 2011 at 9:17

GoogleCodeExporter commented 9 years ago

I believe that most of the time you would want full case-folding either on or 
off, so the flag would indicate "if ignoring case, perform full case-folding". 
In your example:

(?if)ab(?-i)cd(?i)efß...

you turn on full case-folding for the pattern, but it would affect only those 
parts which are case-insensitive.

As for having it on by default, how would you turn it off? Would that mean that 
in VERSION1, IGNORECASE is really `regex.IGNORECASE | regex.FULLCASEFOLDING`, 
and that for simple case-folding you would need to write `regex.IGNORECASE & 
~regex.FULLCASEFOLDING`, or "(?i-f)" inline? (Remember that for VERSION0 it would 
be off by default, turned on by `regex.FULLCASEFOLDING` or "(?f)".)

Original comment by re...@mrabarnett.plus.com on 18 Sep 2011 at 9:57

GoogleCodeExporter commented 9 years ago

Ok, I somehow anticipated the full case-folding to be on by default, as this is 
what regex does now, after this feature was introduced. In that case the 
separate f-flag would need to be "subtracted" as in your example, if simple 
case conversion were needed. It might not read very elegantly, but I guess, now 
it is rather a question, what the users of regex consider "normal" for case 
insensitivity (beyond ascii).

Personally, I would rather have full case-folding switched off by default, but 
I might be biased towards a more explicit control over the matching behaviour 
(which would involve studying UNIDATA/CaseFolding.txt and possibly the relevant 
Unicode documentation before using this feature). (On the other hand, this is 
the same for other features, like properties etc., but these are more explicit.)

vbr

Original comment by Vlastimil.Brom@gmail.com on 18 Sep 2011 at 10:51

GoogleCodeExporter commented 9 years ago

regex 0.1.20110922 now supports both simple and full case-folding.

In version 0 behaviour, full case-folding is off by default.

In version 1 behaviour, full case-folding is on by default.

Full case-folding is controlled by the FULLCASE flag or "(?f)". The flag 
affects how the IGNORECASE flag works.
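
As a rough illustration of the difference between the two modes, the sketch below builds simple and full folding tables from a few CaseFolding.txt entries relevant to this thread. The status codes (C = common, S = simple, F = full) are those defined in that file's header; the helpers `build_table` and `fold` are invented for this example and are not how the regex module is implemented.

```python
# Excerpt of CaseFolding.txt (format: code; status; mapping; # name).
EXCERPT = """\
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
"""

def build_table(data, statuses):
    """Map each character to its folding, keeping only the given statuses."""
    table = {}
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(c, 16)) for c in mapping.split())
            table[chr(int(code, 16))] = folded
    return table

SIMPLE = build_table(EXCERPT, {"C", "S"})  # one character maps to one character
FULL = build_table(EXCERPT, {"C", "F"})    # a mapping may expand, e.g. to "ss"

def fold(text, table):
    """Fold a string character by character; unlisted characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

print(fold("SS", SIMPLE) == fold("\u00df", SIMPLE))  # False: sharp s stays as is
print(fold("SS", FULL) == fold("\u00df", FULL))      # True: both become "ss"
```

Under simple folding the capital sharp s still folds to the small sharp s, so those two remain equivalent to each other while staying distinct from "ss".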

Original comment by re...@mrabarnett.plus.com on 22 Sep 2011 at 12:30

GoogleCodeExporter commented 9 years ago

Thanks for the update,
I'd just like to clarify a possible interference of the new flag with the 
custom modification to achieve the VERSION1 default mentioned in
http://code.google.com/p/mrab-regex-hg/issues/detail?id=20#c2

There may be some bug in my "converter" to V1, or the new f flag isn't set with 
default V1; however, after manually setting V1 or f this works as expected.
(Another v1 feature - nested sets - works ok by default.)

>>> import regex
>>> regex.DEFAULT_VERSION == regex.VERSION0
True
>>> import regex_v1
>>> regex_v1.DEFAULT_VERSION == regex_v1.VERSION1
True
>>> regex_v1.DEFAULT_VERSION == regex.VERSION1
True
>>> regex.findall(u"ss", u"-ss-sS-ß-")
[u'ss']
>>> regex.findall(u"(?i)ss", u"-ss-sS-ß-")
[u'ss', u'sS']
>>> regex.findall(u"(?V1i)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>> regex.findall(u"(?if)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>> 
>>> regex_v1.findall(u"ss", u"-ss-sS-ß-")
[u'ss']
>>> regex_v1.findall(u"(?i)ss", u"-ss-sS-ß-")
[u'ss', u'sS'] ############################# <-- no full casefolding here ####
>>> regex_v1.findall(u"(?V1i)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>> regex_v1.findall(u"(?if)ss", u"-ss-sS-ß-")
[u'ss', u'sS', u'\xdf']
>>> 
>>> regex.findall(ur"[\w--[aeiouy]]", u"abcdefghij")
[]
>>> regex.findall(ur"(?V1)[\w--[aeiouy]]", u"abcdefghij")
[u'b', u'c', u'd', u'f', u'g', u'h', u'j']
>>> regex_v1.findall(ur"[\w--[aeiouy]]", u"abcdefghij")
[u'b', u'c', u'd', u'f', u'g', u'h', u'j'] ####### <-- nested set ok ####
>>> 

Am I missing something?

vbr

Original comment by Vlastimil.Brom@gmail.com on 22 Sep 2011 at 7:14

GoogleCodeExporter commented 9 years ago

No, you're not missing something, that's a bug. Now fixed in regex 
0.1.20110922a.

Original comment by re...@mrabarnett.plus.com on 22 Sep 2011 at 8:20

jamadden / mrab-regex-hg

casefolding specification #19