evaluation of case-folding (quantifiers, alternates ...)?

GoogleCodeExporter commented 9 years ago

Hi,
I just found a case where I am not quite sure about the expected matching 
behaviour, namely the combination of simple casefolding with quantifiers, but 
also with alternates and possibly other constructs.

It probably all depends on the point at which casefolding equivalence is checked. 
It seems that only the literal part of the pattern is considered:

>>> for m in regex.findall(ur"(?V1i)ss",u"-s-S-sS-ß-ẞ-"): print m,
... 
sS ß ẞ

Quantifiers and other metacharacters, however, do not yield the equivalent matches:

>>> for m in regex.findall(ur"(?V1i)s+",u"-s-S-sS-ß-ẞ-"): print m,
... 
s S sS

>>> for m in regex.findall(ur"(?V1i)s[st]",u"-s-S-sS-ß-ẞ-"): print m,
... 
sS
>>> for m in regex.findall(ur"(?V1i)s(?:s|t)",u"-s-S-sS-ß-ẞ-"): print m,
... 
sS

>>> for m in regex.findall(ur"(?V1i)[s][s]",u"-s-S-sS-ß-ẞ-"): print m,
... 
sS ß ẞ

Is this the expected behaviour?

In http://unicode.org/reports/tr18/#RL1.5
it isn't quite clear to me whether only literals are meant, or also patterns 
which would themselves match the "foldable" substring.

(I noticed this while trying to match any possibly complex combination of s, z, 
ß in a historical text, where I tried to rely on casefolding by simply using "[sz]+".)

(regex-0.1.20120705, py 2.7.3, win XPp)

regards
   vbr

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 27 Jul 2012 at 10:18

GoogleCodeExporter commented 9 years ago

A pattern like, say, "s+" could conceivably match "ß" ("\N{LATIN SMALL LETTER 
SHARP S}"), but the extra complexity it would introduce into the implementation 
is just not worthwhile.

The rule is that a string of codepoint literals in the pattern must match a 
string of codepoints in the text being searched, without 'splitting' codepoints 
in either one.
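
A minimal sketch of that rule in plain Python 3, offered as an illustration only and not as the regex module's actual implementation. It assumes `str.casefold()` as a stand-in for full case-folding, and `find_folded_literal` is a hypothetical helper name. A match may only begin and end on codepoint boundaries of the text:

```python
def find_folded_literal(literal, text):
    """Find substrings of text whose case-folded form equals the folded literal.

    Every match covers whole codepoints of text; a codepoint is never split.
    (Hypothetical helper for illustration, not part of the regex module.)
    """
    target = literal.casefold()
    matches = []
    i = 0
    while i < len(text):
        folded = ""
        end = None
        for j in range(i, len(text)):
            # Fold one whole codepoint of the text at a time.
            folded += text[j].casefold()
            if folded == target:
                end = j + 1
                break
            if not target.startswith(folded):
                # The folded text overshoots or diverges from the literal.
                break
        if end is not None:
            matches.append(text[i:end])
            i = end
        else:
            i += 1
    return matches


print(find_folded_literal("ss", "-s-S-sS-ß-ẞ-"))  # ['sS', 'ß', 'ẞ']
print(find_folded_literal("s", "ß"))              # []
```

The literal "ss" finds "ß" because the whole codepoint folds to "ss", while a lone "s" finds nothing in "ß": it would have to stop halfway through the codepoint.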

Consider this: trying to match "s+" against "ß" would mean it would have to 
match the first half of the codepoint on the first iteration and the second 
half of the codepoint on the second iteration.

Or this: trying to match "s[st]" against "ß" would mean it would have to match 
the first half of the codepoint with the literal "s" and second half of the 
codepoint with the character set "[st]".

Or this: trying to match "(s)s" against "ß" would mean it would have to 
capture only the first half of the codepoint. Clearly impossible.
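
To make the "half a codepoint" problem concrete, here is a small Python 3 sketch (an illustration under assumptions, not code from the regex module). `folded_index_map` is a hypothetical helper that records, for each character of the case-folded text, which codepoint of the original text it came from:

```python
def folded_index_map(text):
    """Map each character of text.casefold() back to its source index in text.

    (Hypothetical helper for illustration only.)
    """
    origins = []
    for index, char in enumerate(text):
        # One source codepoint may fold to several characters.
        origins.extend([index] * len(char.casefold()))
    return origins


text = "aßb"
folded = text.casefold()
origins = folded_index_map(text)

print(folded)   # assb
print(origins)  # [0, 1, 1, 2]
```

Both "s" characters in the folded text come from index 1 of the original, so a group that captured only the first of them would have no start and end offsets in the original text that describe it.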

It's just not worth the trouble; the cost outweighs any potential benefits. 
Such problems with implementation have been recognised more recently by the 
Unicode Consortium.

The regex of the Perl programming language gives pretty much the same results 
as the regex module. I say "pretty much" because the regex module optimises 
"[s]" to "s", so "[s][s]" becomes "ss", which can then match both "ß" and 
"ẞ"; Perl's regex doesn't return those particular matches.

So, in summary, yes, that is the expected behaviour.

Original comment by re...@mrabarnett.plus.com on 28 Jul 2012 at 1:02

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Thanks for the confirmation. I somehow suspected it, and am glad I didn't 
misunderstand the casefolding behaviour.
I agree that it isn't worth the extra complexity it would require.

regards,
   vbr

Original comment by Vlastimil.Brom@gmail.com on 28 Jul 2012 at 6:32

jamadden / mrab-regex-hg

evaluation of case-folding (quantifiers, alternates ...)? #76