"bad set" error for unescaped ] at the beginning of the set

GoogleCodeExporter commented 9 years ago

Hi,
I just found an inconsistency between regex and re in the handling of sets (it 
might be related to the recently added set operations).
I thought a pattern like "[][]" would be legal (although probably not very 
readable). It does work in re, but in regex it causes a "bad set" error:

>>> print re.sub(r"([][])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([][])", r"-", u"a[b]c")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 194, in sub
  File "regex.pyc", line 334, in _compile
  File "_regex_core.pyc", line 243, in _parse_pattern
  File "_regex_core.pyc", line 257, in _parse_sequence
  File "_regex_core.pyc", line 270, in _parse_item
  File "_regex_core.pyc", line 369, in _parse_element
  File "_regex_core.pyc", line 503, in _parse_paren
  File "_regex_core.pyc", line 243, in _parse_pattern
  File "_regex_core.pyc", line 257, in _parse_sequence
  File "_regex_core.pyc", line 270, in _parse_item
  File "_regex_core.pyc", line 382, in _parse_element
  File "_regex_core.pyc", line 924, in _parse_set
  File "_regex_core.pyc", line 933, in _parse_set_union
  File "_regex_core.pyc", line 943, in _parse_set_symm_diff
  File "_regex_core.pyc", line 950, in _parse_set_inter
  File "_regex_core.pyc", line 957, in _parse_set_diff
  File "_regex_core.pyc", line 971, in _parse_set_imp_union
  File "_regex_core.pyc", line 978, in _parse_set_member
  File "_regex_core.pyc", line 1046, in _parse_set_item
  File "_regex_core.pyc", line 933, in _parse_set_union
  File "_regex_core.pyc", line 943, in _parse_set_symm_diff
  File "_regex_core.pyc", line 950, in _parse_set_inter
  File "_regex_core.pyc", line 957, in _parse_set_diff
  File "_regex_core.pyc", line 971, in _parse_set_imp_union
  File "_regex_core.pyc", line 978, in _parse_set_member
  File "_regex_core.pyc", line 1048, in _parse_set_item
error: bad set

It can be easily remedied (once I had tracked the problem down in a more 
complex pattern) by escaping the square brackets:

>>> print re.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>> 

Using regex-0.1.20110510 python 2.7.1, Win XP

regards,
   vbr

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 18 May 2011 at 11:09
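
For reference, the two spellings from the report can be compared using the standard library alone. This is a minimal sketch, assuming Python 3 and only the stdlib re module; the helper name replace_brackets is made up for illustration, and the behaviour of regex itself is not exercised here.

```python
import re

# Both spellings describe the same set: a literal "]" and a literal "[".
UNESCAPED = r"([][])"   # accepted by re; reported above as "bad set" in regex
ESCAPED = r"([\]\[])"   # explicit escapes; reported above to work in both


def replace_brackets(pattern, text):
    """Replace every character matched by `pattern` with a dash."""
    return re.sub(pattern, "-", text)


print(replace_brackets(UNESCAPED, "a[b]c"))
print(replace_brackets(ESCAPED, "a[b]c"))
```

The escaped form is the one that does not depend on how an engine treats a "[" inside a set.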

GoogleCodeExporter commented 9 years ago

It sees the second "[" and thinks it's the start of a nested set.

I've modified my sources to treat a "[" in a set as a literal if it fails to 
parse it as a nested set. Seems to work.

Original comment by re...@mrabarnett.plus.com on 18 May 2011 at 3:52

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks for the fix.
However, I now see that I had probably oversimplified the real regex pattern 
that was causing problems; it seems that some characters, here exemplified with 
"-", still cause problems (regex-0.1.20110610); cf.

>>> print regex.sub(r"([][-])", r"-", u"a[b]c")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 219, in sub
  File "regex.pyc", line 371, in _compile
  File "_regex_core.pyc", line 296, in parse_pattern
  File "_regex_core.pyc", line 310, in parse_sequence
  File "_regex_core.pyc", line 323, in parse_item
  File "_regex_core.pyc", line 427, in parse_element
  File "_regex_core.pyc", line 563, in parse_paren
  File "_regex_core.pyc", line 296, in parse_pattern
  File "_regex_core.pyc", line 310, in parse_sequence
  File "_regex_core.pyc", line 323, in parse_item
  File "_regex_core.pyc", line 440, in parse_element
  File "_regex_core.pyc", line 1024, in parse_set
  File "_regex_core.pyc", line 1034, in parse_set_union
  File "_regex_core.pyc", line 1045, in parse_set_symm_diff
  File "_regex_core.pyc", line 1053, in parse_set_inter
  File "_regex_core.pyc", line 1061, in parse_set_diff
  File "_regex_core.pyc", line 1074, in parse_set_imp_union
  File "_regex_core.pyc", line 1082, in parse_set_member
  File "_regex_core.pyc", line 1135, in parse_set_item
error: bad set
>>> print re.sub(r"([][-])", r"-", u"a[b]c")
a-b-c
>>> 

It seems that any character in the position of "-" in the above pattern causes 
this error; only ([][]) currently works.
(I tried to test the new set operators like | ~ & - here, but I found that 
ordinary characters like "a" cause it as well.) 

Just to be sure, the actual pattern I am using (which works with re) is e.g.:
print regex.sub("([][$.\\\\*+|?()^{}-])", r"\\\1", u"a[b]c.d?e*f{}gh&i\j@k")
i.e. an older homebrew version of regex.escape(..., special_only=True)

regards,
    vbr

Original comment by Vlastimil.Brom@gmail.com on 15 Jun 2011 at 9:02

GoogleCodeExporter commented 9 years ago

The problem is that regex can now have a set inside a set, so a literal "[" in 
a set needs to be escaped.

Instead of r"([][-])" write r"([]\[-])".

Or would it be better if it behaved like re and required the NEW flag for a 
nested set?

Original comment by re...@mrabarnett.plus.com on 15 Jun 2011 at 10:11
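
Applied to the longer pattern from the previous comment, the suggestion amounts to escaping only the "[" inside the set. The following is a sketch under that assumption, checked here against the stdlib re module only (Python 3); the names SPECIAL_CHARS and escape_special are invented for the example and are not part of either module.

```python
import re

# The leading "]" is a literal because it is the first set member, "[" is
# escaped as suggested above, and the trailing "-" is a literal because it
# comes last in the set.
SPECIAL_CHARS = re.compile(r"([]\[$.\\*+|?()^{}-])")


def escape_special(text):
    """Prefix each special character in `text` with a backslash."""
    return SPECIAL_CHARS.sub(r"\\\1", text)


print(escape_special(r"a[b]c.d?e*f{}gh&i\j@k"))
```

Characters outside the set, such as "&" and "@", are left alone.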

GoogleCodeExporter commented 9 years ago

Thanks for the clarification.
I thought it would have been resolved by the fallback fix above, but apparently 
that is not possible in general.
I thought nested sets were only meaningful with some operators between them; 
these doubled symbols are (probably?) not normally present in non-nested sets, 
hence the nesting would only need to be evaluated if some of them appear in 
the pattern.

As for the policy regarding NEW, this would probably depend on the 
requirements for inclusion into the standard library...
For my individual use cases, I would rather have this feature available by 
default, but it doesn't matter much; the most important thing for me is that 
it can be made to work, be it by escaping the brackets or by setting (?n) in 
the patterns, depending on the decision.

On a related note, would it be possible to have some magic module-wide setting 
like regex.use_new(), which would enable the incompatible "new" features 
globally, e.g. right after the import, without the need to set the flag 
individually afterwards?
(In my script, I use regex if available, but sometimes only re; in this case, 
trying this setting once in a program would be more straightforward than 
trying the n-flag in all patterns requiring it. Not sure whether the internals 
would support it, or even whether the resetter "stop_using_new()" would ever 
be useful or possible...?)
Anyway, it's just a thoughtif this could possibly cause further problems or 
complications, it isn't worth it;
vbr

involve those 
For me

Original comment by Vlastimil.Brom@gmail.com on 15 Jun 2011 at 11:00

GoogleCodeExporter commented 9 years ago

Sorry for the "garbage" in the text, due to sending the message slightly 
prematurely:
the last sentence should contain: "... thought; if ..." and the text should end 
with "vbr". :-)

Original comment by Vlastimil.Brom@gmail.com on 15 Jun 2011 at 11:06

GoogleCodeExporter commented 9 years ago

The regex is parsed by recursive descent. By the time it discovers there's a 
problem it has already returned from the function where it decided to parse the 
nested set, so it's too late to take the alternative course. (Hmm, I wonder 
whether it's fixable with a hack...)

As for the NEW flag, how could it be turned on for one importer but not any 
others? You wouldn't want it to break another module which uses regex but 
expects it to be off.

Original comment by re...@mrabarnett.plus.com on 16 Jun 2011 at 12:30

GoogleCodeExporter commented 9 years ago

Re comment 6, there isn't a clever hack. The alternative I'll try is to disable 
nested sets and parse again if it finds a bad set. Seems to work so far.

Original comment by re...@mrabarnett.plus.com on 16 Jun 2011 at 1:03
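
The "parse again with nested sets disabled" idea can be illustrated with a toy parser. This is only a sketch of the retry strategy described above, not the regex module's actual code: BadSet, parse_set and parse_set_with_fallback are hypothetical names, and the toy grammar covers just literals, backslash escapes and nested sets.

```python
class BadSet(ValueError):
    """Raised when a set cannot be parsed (toy stand-in for 'bad set')."""


def parse_set(src, pos, nested):
    """Parse the set starting at src[pos] == '['.

    Returns (characters, position after the closing ']').
    """
    pos += 1                      # skip the opening "["
    chars = set()
    first = True                  # a "]" in first position is a literal
    while pos < len(src):
        ch = src[pos]
        if ch == "]" and not first:
            return chars, pos + 1
        if ch == "[" and nested:
            # Commit to a nested set; a failure surfaces as BadSet deeper down.
            inner, pos = parse_set(src, pos, nested)
            chars |= inner
        elif ch == "\\":
            if pos + 1 >= len(src):
                raise BadSet("bad set")
            chars.add(src[pos + 1])
            pos += 2
        else:
            chars.add(ch)
            pos += 1
        first = False
    raise BadSet("bad set")       # ran out of input before the closing "]"


def parse_set_with_fallback(src):
    """Try nested sets first; on failure, reparse with nesting disabled."""
    try:
        chars, _ = parse_set(src, 0, nested=True)
    except BadSet:
        chars, _ = parse_set(src, 0, nested=False)
    return chars


print(sorted(parse_set_with_fallback("[][-]")))
print(sorted(parse_set_with_fallback("[a[bc]]")))
```

In this toy, "[][-]" fails on the nested attempt and is reparsed as three literals, while "[a[bc]]" succeeds as a nested set. A pattern such as "[[a]b]" parses under both readings with different meanings, so a retry of this kind cannot always guess the intent.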

GoogleCodeExporter commented 9 years ago

Re 6: ok, that was the complication I hadn't considered...; to keep the 
setting within the given namespaces, it would probably be necessary to provide 
something like a regex_new module that is imported with the NEW flag 
behaviour, but this kind of "cloning" seems rather hackish too (not sure 
whether it could be achieved somehow virtually).
In any case, I can, of course, adjust the patterns explicitly to deal with 
regex or re respectively.
Thanks for the further improvements.

Original comment by Vlastimil.Brom@gmail.com on 16 Jun 2011 at 7:46
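
One way to keep such a setting local to a single importing module, without touching global state, is a thin wrapper around compile. The sketch below assumes a regex release that exposes regex.VERSION1 (the flag behind the (?V1) spelling used later in this thread) and falls back to the stdlib re otherwise; _engine, _EXTRA_FLAGS and compile_pattern are names invented for the example.

```python
try:
    import regex as _engine            # third-party module, if installed
    _EXTRA_FLAGS = _engine.VERSION1    # opt in for this module only (assumed flag)
except ImportError:
    import re as _engine               # standard-library fallback
    _EXTRA_FLAGS = 0


def compile_pattern(pattern, flags=0):
    """Compile `pattern` with this module's preferred engine and flags."""
    return _engine.compile(pattern, flags | _EXTRA_FLAGS)


# Escaped brackets are written the same way for either engine.
bracket = compile_pattern(r"[\]\[]")
print(bracket.sub("-", "a[b]c"))
```

Other modules that import regex directly are unaffected, which is the concern raised in comment 6.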

GoogleCodeExporter commented 9 years ago

As some older bug in my code seems to have reappeared with some recent regex 
version, I'd like to clarify the set behaviour.
Is the above-mentioned fallback behaviour gone for the V1 flag? (Cf. comments 1, 7)

Now I made sure to escape my patterns appropriately, hence it shouldn't be 
relevant anymore, but I wanted to understand the changes.

regards,
   vbr

=== regex-0.1.20110922a ===
>>> regex.sub(r"([][])", r"-", u"a[b]c")
u'a-b-c'
>>> regex.sub(r"(?V1)([][])", r"-", u"a[b]c")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 245, in sub
    return _compile(pattern, flags, kwargs).sub(repl, string, count, pos,
  File "C:\Python27\lib\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
    element = parse_element(source, info)
  File "C:\Python27\lib\_regex_core.py", line 587, in parse_element
    element = parse_paren(source, info)
  File "C:\Python27\lib\_regex_core.py", line 723, in parse_paren
    subpattern = parse_pattern(source, info)
  File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
    element = parse_element(source, info)
  File "C:\Python27\lib\_regex_core.py", line 600, in parse_element
    return parse_set(source, info)
  File "C:\Python27\lib\_regex_core.py", line 1206, in parse_set
    item = parse_set_union(source, info)
  File "C:\Python27\lib\_regex_core.py", line 1222, in parse_set_union
    items = [parse_set_symm_diff(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1232, in parse_set_symm_diff
    items = [parse_set_inter(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1242, in parse_set_inter
    items = [parse_set_diff(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1252, in parse_set_diff
    items = [parse_set_imp_union(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1277, in parse_set_imp_union
    items.append(parse_set_member(source, info))
  File "C:\Python27\lib\_regex_core.py", line 1286, in parse_set_member
    start = parse_set_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 1334, in parse_set_item
    return parse_set(source, info)
  File "C:\Python27\lib\_regex_core.py", line 1206, in parse_set
    item = parse_set_union(source, info)
  File "C:\Python27\lib\_regex_core.py", line 1222, in parse_set_union
    items = [parse_set_symm_diff(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1232, in parse_set_symm_diff
    items = [parse_set_inter(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1242, in parse_set_inter
    items = [parse_set_diff(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1252, in parse_set_diff
    items = [parse_set_imp_union(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 1277, in parse_set_imp_union
    items.append(parse_set_member(source, info))
  File "C:\Python27\lib\_regex_core.py", line 1286, in parse_set_member
    start = parse_set_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 1338, in parse_set_item
    raise error("bad set", True)
error: bad set
>>>

Original comment by Vlastimil.Brom@gmail.com on 26 Sep 2011 at 5:11

GoogleCodeExporter commented 9 years ago

The answer is yes, the fallback behaviour is gone. Although it helped in some 
cases, it certainly wasn't foolproof, so I thought it better just to let it 
fail.

Version 0: simple sets.

Version 1: nested sets.

Original comment by re...@mrabarnett.plus.com on 26 Sep 2011 at 5:29

GoogleCodeExporter commented 9 years ago

Ok, thanks; this explains the behaviour I noticed; the plain failure indeed 
worked in my case, as the pattern is now finally corrected :-)
vbr

Original comment by Vlastimil.Brom@gmail.com on 26 Sep 2011 at 5:41

jamadden / mrab-regex-hg

"bad set" error for unescaped ] at the beginning of the set #9