Closed GoogleCodeExporter closed 9 years ago
A pattern like, say, "s+" could conceivably match "ß" ("\N{LATIN SMALL LETTER
SHARP S}"), but the extra complexity it would introduce into the implementation
is just not worthwhile.
The rule is that a string of codepoint literals in the pattern must match a
string of codepoints in the text being searched, without 'splitting' codepoints
in either one.
Consider this: trying to match "s+" against "ß" would mean it would have to
match the first half of the codepoint on the first iteration and the second
half of the codepoint on the second iteration.
Or this: trying to match "s[st]" against "ß" would mean it would have to match
the first half of the codepoint with the literal "s" and second half of the
codepoint with the character set "[st]".
Or this: trying to match "(s)s" against "ß" would mean it would have to
capture only the first half of the codepoint. Clearly impossible.
It's just not worth the trouble; the cost outweighs any potential benefits.
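The "half a codepoint" problem in the cases above can be sketched in plain Python. This is only an illustration using the standard library's str.casefold(), not how the regex module is implemented; the helper names fold_with_offsets and splits_codepoint are hypothetical and exist only for this example.

```python
def fold_with_offsets(text):
    """Case-fold text one codepoint at a time.

    Returns the folded string and, for each folded character, the index
    of the source codepoint it came from.
    """
    folded = []
    origin = []
    for index, char in enumerate(text):
        for folded_char in char.casefold():
            folded.append(folded_char)
            origin.append(index)
    return "".join(folded), origin


def splits_codepoint(text, start, end):
    """True if the folded span [start, end) starts or ends inside a source codepoint."""
    _, origin = fold_with_offsets(text)
    starts_inside = 0 < start < len(origin) and origin[start] == origin[start - 1]
    ends_inside = 0 < end < len(origin) and origin[end] == origin[end - 1]
    return starts_inside or ends_inside


# "ß" is a single codepoint that folds to two characters.
print(fold_with_offsets("ß"))        # ('ss', [0, 0])

# A lone "s" (one iteration of "s+", or the group in "(s)s") would cover
# only folded position 0, which ends in the middle of "ß".
print(splits_codepoint("ß", 0, 1))   # True

# The whole literal "ss" covers the complete codepoint, so nothing is split.
print(splits_codepoint("ß", 0, 2))   # False
```

Under this view, a literal string such as "ss" can line up with whole codepoints in the text, while a single repeated, grouped, or set-based "s" would need a match boundary inside "ß".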
Such problems with implementation have been recognised more recently by the
Unicode Consortium.
Perl's regex implementation gives pretty much the same results as the regex
module. I say "pretty much" because the regex module optimises "[s]" to "s",
so "[s][s]" becomes "ss", which can then match both "ß" and "ẞ"; Perl's regex
doesn't return those particular matches.
So, in summary, yes, that is the expected behaviour.
Original comment by re...@mrabarnett.plus.com
on 28 Jul 2012 at 1:02
Thanks for the confirmation. I suspected as much and am glad I didn't
misunderstand the casefolding behaviour.
I agree that it isn't worth the extra complexity it would require.
regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 28 Jul 2012 at 6:32
Original issue reported on code.google.com by
Vlastimil.Brom@gmail.com
on 27 Jul 2012 at 10:18