Hi,
I just noticed a slightly surprising behaviour of nested character sets in
regex; however, it might rather be a documentation issue, or simply my
incorrect understanding. According to the examples and explanations in:
http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Nested_sets_and_set_operations
I somehow expected that, for the set operations, each element of the
"outer set" must be either an explicit set between square brackets [...] or a
predefined character class shorthand like \s.
It turns out that the inner brackets aren't required, hence:
>>> regex.findall(r"(?V1)[[b-e]--cd]", "abcdef")
['b', 'e']
and even:
>>> regex.findall(r"(?V1)[b-e--cd]", "abcdef")
['b', 'e']
>>>
work. On the other hand, a pattern whose first operand contains no range
(no hyphen) raises an error:
>>> regex.findall(r"(?V1)[bcde--cd]", "abcdef")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 318, in findall
    return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 499, in _compile
    caught_exception.pos)
error: bad character range at position 12
>>>
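For comparison, here is a small cross-check of the result I would expect from subtracting "cd" from "bcde". It deliberately avoids regex and uses only the standard library: plain Python sets, plus a negative lookahead in re as a stand-in for the set difference. It only illustrates the expected outcome and says nothing about how regex parses the set internally.

```python
import re

text = "abcdef"

# Set difference {b, c, d, e} - {c, d}, computed with plain Python sets.
expected = sorted(set("bcde") - set("cd"))

# Stand-in for the subtraction in stdlib re: the negative lookahead
# rejects c and d before a character from b-e is matched.
via_lookahead = re.findall(r"(?![cd])[b-e]", text)

print(expected)       # ['b', 'e']
print(via_lookahead)  # ['b', 'e']
```

Both give the same result as the two working patterns above, so I would expect [bcde--cd] to yield it as well.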
I noticed this while debugging a typo in a character property; e.g., instead
of:
>>> regex.findall(r"(?V1)[a-zA-Z--\p{Lu}]", "aBcDeF pLu")
['a', 'c', 'e', 'p', 'u']
there was something like:
>>> regex.findall(r"(?V1)[a-zA-Z--p\{Lu}]", "aBcDeF pLu")
['a', 'B', 'c', 'D', 'e', 'F']
>>>
i.e. the second part is interpreted as a literal set and subtracted, whereas I
expected that brackets would be needed for this (in which case an error
like "bad character range" would be raised for the above typo).
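To illustrate my reading of that output, the sketch below reproduces it with the standard library only. It assumes that the mistyped operand is taken as the literal characters p, {, L, u and }; that is my interpretation of the result, not something confirmed from the regex source. The names literal_operand, by_sets and by_lookahead are just for this illustration.

```python
import re
import string

text = "aBcDeF pLu"

# Assumption: the mistyped operand p\{Lu} is read as the literal
# characters p, {, L, u and }, not as the Unicode property \p{Lu}.
literal_operand = set("p{Lu}")

# ASCII letters of the test string, minus the literal characters.
by_sets = [ch for ch in text
           if ch in string.ascii_letters and ch not in literal_operand]

# The same subtraction in stdlib re, using a negative lookahead.
by_lookahead = re.findall(r"(?![p\{Lu}])[a-zA-Z]", text)

print(by_sets)       # ['a', 'B', 'c', 'D', 'e', 'F']
print(by_lookahead)  # ['a', 'B', 'c', 'D', 'e', 'F']
```

Both match the output of the mistyped pattern, which is consistent with the operand being subtracted as a plain set of literals.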
Anyway, I'd like to ask for clarification on whether this is the expected
behaviour, and possibly for the docs to specify this more explicitly.
Thanks and regards
vbr
Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 15 Dec 2014 at 4:03