#618 regressions and more explicit methodology

This is a most useful project, especially given it can be easily incorporated into any CI workflow. So, please keep up the good work!

Until recently, our experience is its results are broadly in line with those of Nicolaas Weideman‘s RegexStaticAnalysis. #618, however, has caused significant (non-performance) regression and we’ve had to pin an earlier commit.

To use @gms1’s example from #621, a regex as simple as \s+ is currently considered unsafe because it’s interpreted as ^.*?\s+. But not having ^ is hardly equivalent to having a lazy .* in front of the expression. Lazy Kleene just backtracks in the other direction than the greedy Kleen, whereas a bare \s+ never has to backtrack:

pattern = "\s+" NFA constructed in: 1ms EDA analysis performed in: 1ms Does not contain EDA IDA analysis performed in: 1ms Does not contain IDA Total analysis time: 3

pattern = "^.*?\s+" NFA constructed in: 1ms EDA analysis performed in: 1ms Does not contain EDA IDA analysis performed in: 72ms Contains IDA, degree 1, with: a\x09\x09\x09...\x09a IDA exploit string as JSON: {"degree":1,"separators":["a\t\t"],"pumps":["\t"],"suffix":"a"} Prefix: "a\x09\x09" Pump 0: "\x09" Suffix: "a" Total analysis time: 74

The reason (a+)+ is exponential (#606) isn’t because it has the same semantics as ^.*(a+)+, but simply that it’s a nested Kleene expression.

As a side note, at some point it would probably be good to start documenting the methodology of this project more systematically. There’s unfortunately not much consensus on the exact scope of bad regexes (see eg this recent literature review), so it’s particularly helpful to at least be explicit about one’s methodology.

tjenkinson / redos-detector

#618 regressions and more explicit methodology #624