janstarke / rexgen

API Documentation
https://github.com/janstarke/rexgen/blob/master/doc/api.md
GNU General Public License v2.0
52 stars 21 forks source link

segfault with dubious(?) regexp #34

Closed magnumripper closed 7 years ago

magnumripper commented 7 years ago
$ rexgen -v
rexgen-1.4.0

This works fine:

$ rexgen '([A-Za-z0-9])\1\1'
AAA
BBB
CCC
(...)
777
888
999

Now, perhaps this is a weird regexp but I was hoping to get things like bbbYYY666 and hhhh333ffff

$ rexgen '(([A-Za-z0-9])\1\1){3,4}'
Segmentation fault (core dumped)

Workaround (but only for one length at a time):

$ rexgen '([A-Za-z0-9])\1\1([A-Za-z0-9])\2\2([A-Za-z0-9])\3\3'
(...)
qqqTTTjjj
(...)
janstarke commented 7 years ago

Hi magnum, thank's a lot.

I found a more simple test case: (a)\1 => works ((a)\1) => segfaults

I'll take a look at it when I'm back at the hotel this evening...

janstarke commented 7 years ago

As with PCRE, I assign numbers to groups with respect to the opening paranthesis. So in (a(b)), \1 refers to (a...) and \2 refers to (b)

In ((a)\1), the \1 refers to the outer group, thus it indirectly refers to itself, which creates an endless recursion. Otherwise, ((a)\2) works.

I think this problem could be mitigated by the parser, where I'll add a check. So since 1.5.0, ((a)\1)-like regexes will be syntactically wrong. (In general: A group MUST NOT contain a reference to itself, neither directly nor indirectly.)

Regards, Jan

magnumripper commented 7 years ago

That makes some sense. But (([a-z])\2\2){2,2} will give me two identical groups, never aaabbb. Is that a bug or is it correct behavior? I guess a lot of things one can come up with are actually not very well defined. BTW my man page for regex(3) even says

     The standard's definition of back references is vague.  For example, does
     `a\(\(b\)*\2\)*d' match `abbbd'?  Until the standard is clarified, behavior in such
     cases should not be relied on.
janstarke commented 7 years ago

In my eyes this is correct behaviour. As you say, things are not very well defined. So I always look what PCRE is doing, and try to implement that behaviour:

$ echo aa | perl -ne 'if (m/(\w)\1/)[print "match\n"}else{print "no match\n}'
match
$ echo ab | perl -ne 'if (m/(\w)\1/)[print "match\n"}else{print "no match\n}'
no match

In rexgen, a backreference does not refer to the regular expression, but to the concrete value created by the regular expression. This matches to PCRE in some way.

Perhaps one could make it configurable if backreferences refer to a regex or its created value. This should be no hard work and could be done in the parser as well.