jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

nested sets behaviour #131

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,
I just noticed a slightly surprising behaviour of nested character sets in 
regex; however, it might rather be a documentation issue, or simply my 
incorrect understanding. According to the examples and explanations in:
http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Nested_sets_and_set_operations
I somehow expected that, for the set operations, the respective elements of the 
"outer set" must either be an explicit set between square brackets [...] or a 
predefined character class shorthand like \s.

It turns out that the inner brackets aren't required, hence:

>>> regex.findall(r"(?V1)[[b-e]--cd]", "abcdef")
['b', 'e']

and even:
>>> regex.findall(r"(?V1)[b-e--cd]", "abcdef")
['b', 'e']
>>> 

work. On the other hand, a pattern whose first operand contains no range 
(no hyphen) causes an error:
>>> regex.findall(r"(?V1)[bcde--cd]", "abcdef")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 318, in findall
    return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 499, in _compile
    caught_exception.pos)
error: bad character range at position 12
>>> 
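
For comparison, the result that the working `--` examples above return can be reproduced with plain Python sets and the standard `re` module. This is only a sketch of the set-difference semantics, not of how regex parses a pattern; the helpers `expand` and `difference_class` are hypothetical names introduced here, and they handle only literal characters and simple `x-y` ranges.

```python
import re

def expand(spec):
    """Expand a simple class body such as 'b-e' or 'bcde' into a set of
    characters. Only literals and 'x-y' ranges are handled (no escapes)."""
    chars = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # 'x-y' range: add every character from x to y inclusive
            lo, hi = spec[i], spec[i + 2]
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars

def difference_class(left, right):
    """Build an ordinary character class equal to left minus right.
    Assumes the difference is non-empty ('[]' is not a valid class)."""
    remaining = expand(left) - expand(right)
    return "[" + "".join(re.escape(c) for c in sorted(remaining)) + "]"

pattern = difference_class("b-e", "cd")
print(pattern)
print(re.findall(pattern, "abcdef"))
```

Under this reading, `b-e` and `bcde` expand to the same set, so both would be expected to give the same matches once `cd` is subtracted.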

I noticed this while debugging a typo in a character property; e.g., instead 
of:
>>> regex.findall(r"(?V1)[a-zA-Z--\p{Lu}]", "aBcDeF pLu")
['a', 'c', 'e', 'p', 'u']
there was something like:
>>> regex.findall(r"(?V1)[a-zA-Z--p\{Lu}]", "aBcDeF pLu")
['a', 'B', 'c', 'D', 'e', 'F']
>>> 

i.e. the second part is interpreted as a set of literal characters and 
subtracted, while I expected that brackets would have been needed for this (in 
which case an error like "bad character range" would be raised on the above typo).
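
The two results above are consistent with a plain set-difference reading, which can be checked without regex at all. The sketch below is an illustration under two assumptions: the letters are restricted to ASCII, where `str.isupper()` coincides with `\p{Lu}`, and the mistyped operand `p\{Lu}` is taken as the literal characters `p`, `{`, `L`, `u`, `}`.

```python
import string

text = "aBcDeF pLu"
letters = set(string.ascii_letters)          # stands in for [a-zA-Z]

# Intended pattern: subtract the uppercase letters (\p{Lu}, ASCII only here)
intended = letters - {c for c in letters if c.isupper()}

# Mistyped pattern: subtract the literal characters of "p{Lu}"
mistyped = letters - set("p{Lu}")

print([c for c in text if c in intended])    # ['a', 'c', 'e', 'p', 'u']
print([c for c in text if c in mistyped])    # ['a', 'B', 'c', 'D', 'e', 'F']
```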

Anyway, I'd like to ask for clarification on whether this is the expected 
behaviour, and possibly for it to be specified more explicitly in the docs.

Thanks and regards
    vbr

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 15 Dec 2014 at 4:03

GoogleCodeExporter commented 9 years ago
It wasn't handling the set difference operator '--' correctly.

Fixed in regex 2014.12.15.

Original comment by re...@mrabarnett.plus.com on 15 Dec 2014 at 8:04