LGTM.com - false positive - polynomial-redos

ChALkeR commented 3 years ago

Description of the false positive

https://lgtm.com/projects/g/ExodusMovement/schemasafe/snapshot/5d16dfc8e862db856e8bef5c5f92845546eb6c05/files/src/pointer.js?sort=name&dir=ASC&mode=heatmap#x42c7ec6a02df0e9e:1

This regular expression that depends on library input may run slow on strings with many repetitions of '.'.

I don't think that /\/?[^/]*$/ has a polynomial ReDoS.

The actual regex pattern is /?[^/]*$.

1) The first part has length limited to 1 (i.e. 0 or 1 repetitions).

2) The first and the second parts don't have common allowed symbols (the first allows only /, the second doesn't allow /), so there should be no polynomial redos possible here, even if the first part would have allowed more repetitions.

Any of these two would suffice alone to make it not a ReDoS.

Also https://github.com/NicolaasWeideman/RegexStaticAnalysis returns this:

pattern = "/?[^/]*$"
NFA constructed in: 5ms
EDA analysis performed in: 15ms
Does not contain EDA
IDA analysis performed in: 37ms
Does not contain IDA

Unlike, for example, x*[^/]*$ which does not satisfy any of (1) and (2) above and has a polynomial ReDoS (IDA):

pattern = "x*[^/]*$"
NFA constructed in: 2ms
EDA analysis performed in: 4ms
Does not contain EDA
IDA analysis performed in: 133ms
Contains IDA, degree 1, with: xxx...xx/
        IDA exploit string as JSON:     {"degree":1,"separators":["x"],"pumps":["xx"],"suffix":"/"}
        Prefix:         "x"
        Pump 0:         "xx"
        Suffix:         "/"

Note that Weideman's tool analysis typically has to be rechecked because of differences in regex engines. And some patterns are not analyzable by it and require preparation.

Also refs: https://github.com/github/codeql/issues/2804#issuecomment-715554498

tausbn commented 3 years ago

Thank you for your (very thorough) report!

I agree that this looks like a false positive. I will ping the JavaScript team to see if they can figure out what's going on.

erik-krogh commented 3 years ago

That looks like a TP to me (at least the regular expression part).

What many forget is that a missing start anchor means that the regular expression will match any prefix.
So the regular expression /\/?[^/]*$/ is equivalent to the regular expression /^[^]*\/?[^/]*$/. (If the syntax is confusing, then this page might help).

So there are actually 3 parts in the regular expression. [^]*, \/?, and [^/]*.
The first and the third both match a potential infinite string, and they both match many of the same characters, which can cause polynomial ReDoS.

Here is a little proof of concept you can run to see the worst-case runtime:

var reg = /\/?[^/]*$/;

var str = 'a';

while(true) {
    const startTime = new Date().getTime();
    reg.test(str + "/");
    const endTime = new Date().getTime();

    console.log("Str size:" + str.length + " | Time: " + (endTime - startTime) + "ms");
    str += str;
}

The performance is only quadratic in the size of the input, so it's not that bad.

ChALkeR commented 3 years ago

@erik-krogh Hm. That's basically the same as [^/]*$ or a*$ in this case, which shows the exact same results.

It's... interesting that this isn't optimized automatically.

Thanks!

In this case, e.g. a*$ also should be detected as it's also quadratic in v8 impl:

let reg = /a*$/;
let str = 'a';
while(true) {
    const startTime = new Date().getTime();
    reg.test(str + "/");
    const endTime = new Date().getTime();
    console.log("Str size:" + str.length + " | Time: " + (endTime - startTime) + "ms");
    str = 'a'.repeat(str.length * 2); // + removed
}

ChALkeR commented 3 years ago

cc @davisjam perhaps to confirm the above?

davisjam commented 3 years ago

@ChALkeR As of ~2 years ago (when I took measurements), regexes like a*$ can be quadratic in many regexes engines under partial-match semantics, including V8. These regex engines try each starting offset, i.e. implicitly adding a leading ^.*?.

Some engines optimize this away, either by (a) choosing good starting offsets [Perl does this], or (b) testing multiple offsets in lockstep [like the Thompson algorithm, which automata-based libraries like RE2 do]
V8 published an experimental automaton-based algorithm for "simple" regexes circa January 2021. It should (?) run in linear time on this example. I'm not sure if that's in Node.js yet and/or which versions of Node.js would be protected.

github / codeql

LGTM.com - false positive - polynomial-redos #6525