Closed magnumripper closed 7 years ago
Hi magnum, thank's a lot.
I found a more simple test case: (a)\1 => works ((a)\1) => segfaults
I'll take a look at it when I'm back at the hotel this evening...
As with PCRE, I assign numbers to groups with respect to the opening paranthesis. So in (a(b))
, \1
refers to (a...)
and \2
refers to (b)
In ((a)\1)
, the \1
refers to the outer group, thus it indirectly refers to itself, which creates an endless recursion. Otherwise, ((a)\2)
works.
I think this problem could be mitigated by the parser, where I'll add a check. So since 1.5.0, ((a)\1)
-like regexes will be syntactically wrong. (In general: A group MUST NOT contain a reference to itself, neither directly nor indirectly.)
Regards, Jan
That makes some sense. But (([a-z])\2\2){2,2}
will give me two identical groups, never aaabbb. Is that a bug or is it correct behavior? I guess a lot of things one can come up with are actually not very well defined. BTW my man page for regex(3) even says
The standard's definition of back references is vague. For example, does
`a\(\(b\)*\2\)*d' match `abbbd'? Until the standard is clarified, behavior in such
cases should not be relied on.
In my eyes this is correct behaviour. As you say, things are not very well defined. So I always look what PCRE is doing, and try to implement that behaviour:
$ echo aa | perl -ne 'if (m/(\w)\1/)[print "match\n"}else{print "no match\n}'
match
$ echo ab | perl -ne 'if (m/(\w)\1/)[print "match\n"}else{print "no match\n}'
no match
In rexgen, a backreference does not refer to the regular expression, but to the concrete value created by the regular expression. This matches to PCRE in some way.
Perhaps one could make it configurable if backreferences refer to a regex or its created value. This should be no hard work and could be done in the parser as well.
This works fine:
Now, perhaps this is a weird regexp but I was hoping to get things like bbbYYY666 and hhhh333ffff
Workaround (but only for one length at a time):