It sees the second "[" and thinks it's the start of a nested set.
I've modified my sources to treat a "[" in a set as a literal if it fails to
parse it as a nested set. Seems to work.
Original comment by re...@mrabarnett.plus.com
on 18 May 2011 at 3:52
Thanks for the fix;
however, I now see that I had probably oversimplified my real regex pattern.
It seems that some characters, exemplified here with "-", are still
causing problems (regex-0.1.20110610); cf.
>>> print regex.sub(r"([][-])", r"-", u"a[b]c")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "regex.pyc", line 219, in sub
File "regex.pyc", line 371, in _compile
File "_regex_core.pyc", line 296, in parse_pattern
File "_regex_core.pyc", line 310, in parse_sequence
File "_regex_core.pyc", line 323, in parse_item
File "_regex_core.pyc", line 427, in parse_element
File "_regex_core.pyc", line 563, in parse_paren
File "_regex_core.pyc", line 296, in parse_pattern
File "_regex_core.pyc", line 310, in parse_sequence
File "_regex_core.pyc", line 323, in parse_item
File "_regex_core.pyc", line 440, in parse_element
File "_regex_core.pyc", line 1024, in parse_set
File "_regex_core.pyc", line 1034, in parse_set_union
File "_regex_core.pyc", line 1045, in parse_set_symm_diff
File "_regex_core.pyc", line 1053, in parse_set_inter
File "_regex_core.pyc", line 1061, in parse_set_diff
File "_regex_core.pyc", line 1074, in parse_set_imp_union
File "_regex_core.pyc", line 1082, in parse_set_member
File "_regex_core.pyc", line 1135, in parse_set_item
error: bad set
>>> print re.sub(r"([][-])", r"-", u"a[b]c")
a-b-c
>>>
It seems that any character in the position of "-" in the above pattern
causes this error; only ([][]) currently works.
(I tried to test the new set operators like | ~ & - here, but I found that
ordinary characters like "a" cause it too.)
Just to be sure, the actual pattern I am using (which works with re) is e.g.:
print regex.sub("([][$.\\\\*+|?()^{}-])", r"\\\1", u"a[b]c.d?e*f{}gh&i\j@k")
i.e. an older homebrew version of regex.escape(..., special_only=True)
regards,
vbr
Original comment by Vlastimil.Brom@gmail.com
on 15 Jun 2011 at 9:02
The problem is that regex can now have a set inside a set, so a literal "[" in
a set needs to be escaped.
Instead of r"([][-])" write r"([]\[-])".
Or would it be better if it behaved like re and required the NEW flag for a
nested set?
Original comment by re...@mrabarnett.plus.com
on 15 Jun 2011 at 10:11
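The escaping advice above can be illustrated with a minimal sketch. It uses the standard library re module under Python 3 rather than regex, on the assumption that the escaped form is accepted by both; the names BRACKETS_AND_DASH, SPECIALS and escape_special are hypothetical and only stand in for the homebrew escape mentioned earlier in the thread.

```python
import re

# Escaped form suggested above: "\[" keeps the bracket a literal, so the
# set holds "]", "[" and "-" whether or not nested sets are supported.
BRACKETS_AND_DASH = re.compile(r"([]\[-])")

# Hypothetical version of the homebrew escape pattern, with the "["
# inside the set escaped in the same way.
SPECIALS = re.compile(r"([]\[$.\\*+|?()^{}-])")


def escape_special(text):
    """Prefix each regex metacharacter in text with a backslash."""
    return SPECIALS.sub(r"\\\1", text)


print(BRACKETS_AND_DASH.sub("-", "a[b]c"))  # a-b-c
print(escape_special("a[b]c.d?e*f{}"))
```

Because the bracket is escaped explicitly, the pattern does not depend on any fallback in the parser.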
Thanks for the clarification,
I thought it would have been resolved with the fallback fix above, but
apparently that is not possible in general.
I thought nested sets are only meaningful with some operators between
them; these doubled symbols are (probably?) not normally present in
non-nested sets, hence the nesting would only need to be evaluated if some
of them appear in the pattern.
As for the policy regarding NEW, this would probably rather depend on
requirements for the inclusion into the standard library...
For my individual use cases, I would rather like having this feature available
by default, but it doesn't matter much; the most important thing for me is that
it can be made to work, be it by escaping the brackets or by setting (?n) in the
patterns, depending on the decision.
On a related note, would it be possible to have some magic module-wide setting
like
regex.use_new(), which would enable the incompatible "new" features globally,
e.g. right after the import without the need to set the flag individually
afterwards?
(In my script, I am using regex, if available, but sometimes only re; in this
case, trying this setting once in a program would be more straightforward, than
trying the n-flag in all patterns requiring it.
(Not sure whether the internals would support it, or even whether the resetter
"stop_using_new()" would ever be useful or possible...?)
Anyway, it's just a thoughtif this could possibly cause further problems or
complications, it isn't worth it;
vbr
involve those
For me
Original comment by Vlastimil.Brom@gmail.com
on 15 Jun 2011 at 11:00
Sorry for the "garbage" in the text, due to sending the message slightly
prematurely:
the last sentence should contain: "... thought; if ..." and the text should end
with "vbr". :-)
Original comment by Vlastimil.Brom@gmail.com
on 15 Jun 2011 at 11:06
The regex is parsed by recursive descent. By the time it discovers there's a
problem it has already returned from the function where it decided to parse the
nested set, so it's too late to take the alternative course. (Hmm, I wonder
whether it's fixable with a hack...)
As for the NEW flag, how could it be turned on for one importer but not any
others? You wouldn't want it to break another module which uses regex but
expects it to be off.
Original comment by re...@mrabarnett.plus.com
on 16 Jun 2011 at 12:30
Re comment 6, there isn't a clever hack. The alternative I'll try is to disable
nested sets and parse again if it finds a bad set. Seems to work so far.
Original comment by re...@mrabarnett.plus.com
on 16 Jun 2011 at 1:03
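The retry strategy described above (parse with nested sets, and on a bad set parse again with nesting disabled) can be sketched with a toy parser. This is not the code in _regex_core; BadSet, parse_set and parse_set_with_fallback are hypothetical names, and the parser handles only literals, backslash escapes and nested brackets.

```python
class BadSet(Exception):
    """Raised when a set is not closed by a matching ']'."""


def parse_set(src, pos, nested):
    """Parse a set body starting just after '['; return (members, end)."""
    members = []
    first = True
    while pos < len(src):
        ch = src[pos]
        if ch == "]" and not first:
            return members, pos + 1           # closing bracket
        if ch == "[" and nested:
            inner, pos = parse_set(src, pos + 1, nested)
            members.append(inner)             # nested set as a sub-list
        elif ch == "\\" and pos + 1 < len(src):
            members.append(src[pos + 1])      # escaped literal
            pos += 2
        else:
            members.append(ch)                # plain literal, incl. leading ']'
            pos += 1
        first = False
    raise BadSet("bad set")


def parse_set_with_fallback(src):
    """Try nested sets first; on failure, reparse treating '[' as a literal."""
    try:
        members, _ = parse_set(src, 1, nested=True)
    except BadSet:
        members, _ = parse_set(src, 1, nested=False)
    return members


print(parse_set_with_fallback("[][-]"))    # [']', '[', '-']
print(parse_set_with_fallback("[a[bc]]"))  # ['a', ['b', 'c']]
```

The second parse starts from scratch, which sidesteps the problem that the recursive descent has already returned from the function that chose the nested interpretation.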
Re 6: ok, that was the complication I hadn't considered ...; to keep the
setting in the given namespaces, it would probably be necessary to provide
something like a regex_new module to import it with the NEW flag behaviour, but
this kind of "cloning" seems rather hackish too (not sure if it could be
achieved somehow virtually).
In any case, I can, of course, adjust the patterns explicitly to deal with
regex or re respectively.
Thanks for the further improvements.
Original comment by Vlastimil.Brom@gmail.com
on 16 Jun 2011 at 7:46
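For a script that uses regex when it is installed and re otherwise, one possible arrangement is sketched below in Python 3. The names re_impl, HAVE_REGEX and replace_brackets are hypothetical; the sketch assumes the escaped set is accepted by both modules, so no per-module pattern or global flag is needed for this particular case.

```python
try:
    import regex as re_impl   # third-party module, used when available
    HAVE_REGEX = True
except ImportError:
    import re as re_impl      # standard library fallback
    HAVE_REGEX = False

# The "[" is escaped, so the set reads the same with or without nested sets.
BRACKETS_AND_DASH = r"([]\[-])"


def replace_brackets(text):
    """Replace "[", "]" and "-" with "-" using whichever module was imported."""
    return re_impl.sub(BRACKETS_AND_DASH, "-", text)


print(replace_brackets("a[b]c"))  # a-b-c
```

Patterns that rely on behaviour specific to one module would still need a branch on HAVE_REGEX.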
As an older bug in my code seems to have reappeared with a recent regex
version, I'd like to clarify the set behaviour.
Is the above-mentioned fallback behaviour gone for the V1 flag? (Cf. comments 1, 7)
I have now made sure to escape my patterns appropriately, hence it shouldn't be
relevant anymore, but I wanted to understand the changes.
regards,
vbr
=== regex-0.1.20110922a ===
>>> regex.sub(r"([][])", r"-", u"a[b]c")
u'a-b-c'
>>> regex.sub(r"(?V1)([][])", r"-", u"a[b]c")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python27\lib\regex.py", line 245, in sub
return _compile(pattern, flags, kwargs).sub(repl, string, count, pos,
File "C:\Python27\lib\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
item = parse_item(source, info)
File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
element = parse_element(source, info)
File "C:\Python27\lib\_regex_core.py", line 587, in parse_element
element = parse_paren(source, info)
File "C:\Python27\lib\_regex_core.py", line 723, in parse_paren
subpattern = parse_pattern(source, info)
File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
item = parse_item(source, info)
File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
element = parse_element(source, info)
File "C:\Python27\lib\_regex_core.py", line 600, in parse_element
return parse_set(source, info)
File "C:\Python27\lib\_regex_core.py", line 1206, in parse_set
item = parse_set_union(source, info)
File "C:\Python27\lib\_regex_core.py", line 1222, in parse_set_union
items = [parse_set_symm_diff(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1232, in parse_set_symm_diff
items = [parse_set_inter(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1242, in parse_set_inter
items = [parse_set_diff(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1252, in parse_set_diff
items = [parse_set_imp_union(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1277, in parse_set_imp_union
items.append(parse_set_member(source, info))
File "C:\Python27\lib\_regex_core.py", line 1286, in parse_set_member
start = parse_set_item(source, info)
File "C:\Python27\lib\_regex_core.py", line 1334, in parse_set_item
return parse_set(source, info)
File "C:\Python27\lib\_regex_core.py", line 1206, in parse_set
item = parse_set_union(source, info)
File "C:\Python27\lib\_regex_core.py", line 1222, in parse_set_union
items = [parse_set_symm_diff(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1232, in parse_set_symm_diff
items = [parse_set_inter(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1242, in parse_set_inter
items = [parse_set_diff(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1252, in parse_set_diff
items = [parse_set_imp_union(source, info)]
File "C:\Python27\lib\_regex_core.py", line 1277, in parse_set_imp_union
items.append(parse_set_member(source, info))
File "C:\Python27\lib\_regex_core.py", line 1286, in parse_set_member
start = parse_set_item(source, info)
File "C:\Python27\lib\_regex_core.py", line 1338, in parse_set_item
raise error("bad set", True)
error: bad set
>>>
Original comment by Vlastimil.Brom@gmail.com
on 26 Sep 2011 at 5:11
The answer is yes, the fallback behaviour is gone. Although it helped in some
cases, it certainly wasn't foolproof, so I thought it better just to let it
fail.
Version 0: simple sets.
Version 1: nested sets.
Original comment by re...@mrabarnett.plus.com
on 26 Sep 2011 at 5:29
Ok, thanks; this explains the behaviour I noticed; the plain failure indeed
worked in my case, as the pattern is now finally corrected :-)
vbr
Original comment by Vlastimil.Brom@gmail.com
on 26 Sep 2011 at 5:41
Original issue reported on code.google.com by
Vlastimil.Brom@gmail.com
on 18 May 2011 at 11:09