Incorrect backref expansion, and ignored look ahead assrtions

pygy commented 2 years ago

Hello, in this hairy example, the detector incorrectly expands ("|'|(?=(?:[^'"]|\\['"\]])+\])) ... (?!\5) to ("|'|(?=(?:[^'"]|\\['"\]])+\])) ... (?!(?:"|'|)) which is wrong. It also fails to take the negative look ahead into account and thinks that the character class that follows the (?!\5) bit can match a quote when it can't.

tjenkinson commented 2 years ago

Yikes that's complicated!

Is there something in particular that you think is wrong with it? (?!(?:"|'|)) looks like it's working as expected to me.

thinks that the character class that follows the (?!\5) bit can match a quote when it can't

Are you sure? Couldn't \5 contain " or ', meaning \5 could be " and therefore ' could match [^\\\]?

pygy commented 2 years ago

What \5 matches depend on what the capture caught. See here for a simple example

Generally, (?!x)y is equivalent to a set difference y \ x (if x and y are character classes)

So when " is captured, (?!\5)[^\\\]] is equivalent to [^"\\\]]. When ' is captured, it is equivalent to [^'\\\]].

You'll note that the RegExp is otherwise buggy because the capture can be empty, and in that case, (?!\5) becomes (?!) which will never match... I was trying to debug a ReDOS using your tool, and I caught this while experimenting.

Edit: form... aargh

tjenkinson commented 2 years ago

What \5 matches depend on what the capture caught. See here for a simple example

Exactly. So when you say

thinks that the character class that follows the (?!\5) bit can match a quote when it can't

I don't think that's correct?

If we simplify yours to

("|')stuff(?!\1)[^\\\]]

then "stuff' or 'stuff" would match this, so it could match a quote?

(?!) which will never match

huh that's an interesting edge case

tjenkinson commented 2 years ago

Oh I think I see what you're saying.

\5 gets downgraded to (?:"|'|) because it copies the contents of the referenced group, but removes the lookahead. In reality it would only be one of " or ' or empty, but I'm not sure how we could represent that in the downgraded pattern. Getting the detector to work without the downgrade and understand that behaviour would be nice but is complicated.

Here's some info on when the downgrade happens:

https://github.com/tjenkinson/redos-detector/blob/94ac2398b367f15b48551e4c5bf18713528471df/src/downgrade-pattern.ts#L142-L153

So I think in this case the reason it downgrades is the second point, because of the + in the positive lookahead in ("|'|(?=(?:[^'"]|\\['"\]])+\])). So really given the + is in the lookahead the downgrade might not actually need to happen for this case, given the group would always match a fixed length string 🤔

pygy commented 2 years ago

^^^ This is what drove me here, specifically the last line

The tool has captured " as the 5th value and thinks that [[^\\\]] can match the same thing as \5, i.e. also ".

But it can't because it should take (?!\5)[[^\\\]] in consideration. The [[^\\\]] part will not be matched, because (?!\5) will fail, since \5 matches ".

pygy commented 2 years ago

Reducing the test case:

/(a|)(?:(?:(?!\1)[^b]|b.))+\1/ is detected as a ReDOS, whereas /(a)(?:(?:(?!\1)[^b]|b.))+\1/ isn't.

The empty choice in the capture is throwing the engine astray.

There is no ReDOS though. with no initial a, the regexp will only match (?:b.)+.

tjenkinson commented 2 years ago

For that we can reduce it to

(a)([^b])+

👍 vs

(a|)([^b])+

:-1:

Which does look like it could be a bug. Will dig into it thanks!

tjenkinson commented 2 years ago

Ok so this is just a limitation of how it works at the moment

(a|).+$

fails just like

(a?).+$

does.

This is because it generates

a .
. .
. .

and stops going further because it detects it would be an infinite loop.

If it has to stop due to an infinite loop and there is at least one trail, right now it always assumes infinite backtracks as sometimes that is the case.

With

a.+

internally it gets to

a a
. .
. .

and still detects an infinite loop, but no trails are emitted, given every trail it generates has both sides being identical, so it's considered safe.

I think https://github.com/tjenkinson/redos-detector/issues/87#issuecomment-1133640620 might still be fixable though with a bit of work, given group 5 is actually a fixed length

pygy commented 2 years ago

I saw your post about downgrading after posting my last two, and didn't have the time to respond immediately...

After downgrading, the detector logic isn't tripped by the (?=(x))\1 pattern used to polyfill atomic groups, that's already awesome.

In this case, there's a lookahead in the group, not a group in a lookahead, and it is indeed finite in size, so that's probably a bug in your pattern size detection logic.

Not that I'd want to use the RegExp in the example, it tries to do too much with too few captures and it ultimately fails at what was intended.

... and I just see your latest post as I'm typing this... We agree :-)

pygy commented 2 years ago

I've yet to look into the details of your detector logic, and I'm not sure I'll have the time to dig in...

You probably know that the folks over at lgtm.com/codeql (i.e. GitHub/MS) also do ReDoS detection, it may be worth looking into the way they handle the a?.+$. It is a polynomial ReDoS (O(length ^ 2) for aaa\n-like inputs)...

pygy commented 2 years ago

(a?)(?:(?!a)[^])+\1 is also a false positive.

(a?)[^a]+\1 which is equivalent is properly detectes as ReDoS-free.

tjenkinson commented 2 years ago

Yeh I had a look at the codeql stuff. Not super familiar with the language though so found it quite confusing. Interestingly there are patterns it didn’t catch that this does, so don’t think that could be used to guarantee something is redos free, but it’s a good quick indicator :)

That last example is expected at the moment as it doesn’t actually track lookaheads along with what it’s matching

pygy commented 2 years ago

Oh, wow, I thought the QL was a high level veneer on top of imperative code... It's a bloody language... that's insane (-ly cool, eh ;-).

Indeed, their logic isn't perfect either, I've also reported false positive to them...

For example they don't understand look arounds, so they miss atomic polyfills.

You may want to exchange with @erik-krogh though, assuming both of you have the time, even if the implementation language differs, both project could benefit from conceptual cross-pollination.

tjenkinson commented 2 years ago

@pygy can you remember what pattern you put in for https://github.com/tjenkinson/redos-detector/issues/87#issuecomment-1133640620 ?

I’m wondering why \5 is there given in your original pattern it gets downgraded due to it thinking the group is infinite size

tjenkinson commented 2 years ago

Actually did you hack the \5 into that screenshot?

pygy commented 2 years ago

I didn't... I tinkered with it quite a bit I'm afraid, I don't remember exactly what the RegExp was at that point I'll try to reproduce what I did. the \5 was the one at the end of the RegExp, the closing quote of the string.

tjenkinson commented 2 years ago

Cool that’s fine I was just confused for a minute how it output that

tjenkinson commented 2 years ago

The downgrade on the fixed group size is fixed now, and your initial link now shows slightly fewer results. Results like

are now gone.

It does still show invalid ones though given it doesn't use the lookaheads to invalidate some of the options.

If we pretend it did though tweaking the pattern there's still another problem later on.

(?:[^'"]|\\['"\]])+ because \\ can match both sides of the disjunction.

If you then add \ to the left side it goes green, although might not match what you actually need anymore 😆

pygy commented 2 years ago

Thanks :-). I actually fixed the RegExp in the mean time (for a point release that just fixes the ReDoS while minimally changing the original buggy behavior. A better one will be released later, with complete disjunctions for bare and quoted values, rather than trying to match both with the same capture.

tjenkinson commented 2 years ago

Cool. Do you have a link to the project?

pygy commented 2 years ago

That's for https://mithril.js.org, it's the selector parser for the hyperscript DSL, e.g. m('input[type=password].fancy').

Nothing has been released yet, the fix must be back-ported ported to older versions and we only have the infrastructure code to make releases on master...

tjenkinson commented 2 years ago

Ah nice, so this helped find a ReDoS in mithriljs?

pygy commented 2 years ago

No, the CodeQL/LGTM folks have scanned GitHub and raised security issues all around.

That was a while ago. The vulnerability is very minor though, because the DSL is only meant to be called with static strings, not with app user input (as that could lead to arbitrary content injection). So at worse, a dev could ReDoS their own dev env if they tried deliberately. while(true); is less effort for the same result.

I was playing with your detector to verify attempts at fixing the problem (another approach is to time how long it takes to match pathological input of growing size, I was hoping to get a formal seal of approval rather than empirical, but spotty evidence).

erik-krogh commented 2 years ago

Indeed, their logic isn't perfect either, I've also reported false positive to them...

For example they don't understand look arounds, so they miss atomic polyfills.

You may want to exchange with @erik-krogh though, assuming both of you have the time, even if the implementation language differs, both project could benefit from conceptual cross-pollination.

Yes, the analysis implemented in CodeQL doesn't really handle lookaheads/lookbehinds, backreferences, and some other non-standard features of regular expressions.
And there are other limitations.
Our implementation starts with a description of what we are doing, and some references to the research our implementation is based on.

The test cases I've build are pretty comprehensive, and also covers a decent amount of what our analysis doesn't catch (or where we get an FP).

My usual goto for a third-party regexp checker is https://regex.rip/, which is based on some of the same research as our own CodeQL analysis, but we don't share any implementation.
(It's open-source :smiley:).

What I linked to above all attempts to catch ReDoS with exponential worst-case.
We also have an analysis that attempts to catch ReDoS where the worst-case is only polynomial.
Theorem 3 from this paper very well describes the idea behind the implementation.

I haven't looked at your implementation, but I tried to throw some tests at it, and it looks like you are not really checking if the attack-string is rejected by the regexp (which is required for backtracking to kick in).
~~For example your tool falsely reports a regexp like /(a*a*x)*/ as vulnerable. That regexp is not vulnerable to ReDoS because it literally matches any string (it's not anchored).~~ Edit: I see now. That regexp is vulnerable to polynomial redos. I was too focused on the polynomial case.

the CodeQL/LGTM folks have scanned GitHub and raised security issues all around.

Yep, I was part of that. We scanned a lot of code across multiple programming languages.
We found a decent amount of issues, most of which were more severe than the one in Mithril.

tjenkinson commented 2 years ago

hey @erik-krogh thanks for that. loads of useful info!

We found a decent amount of issues, most of which were more severe than the one in Mithril.

That's awesome

Theorem 3 from this paper very well describes the idea behind the implementation.

Will have a read of that

https://github.com/superhuman/rxxr2 looks great but with all my initial searching I never managed to find it :/

Could do with some tags adding and maybe mentioning ReDoS somewhere in the readme 😆

For example your tool falsely reports a regexp like /(a*a*x)*/ as vulnerable.

rxxr2 doesn't catch patterns like

/(a*a*x)?/.test('a'.repeat(1000000) + '!')

or /(a*a*a*a**a*x)?/ etc

though? Isn't this also a case where

That regexp is not vulnerable to ReDoS because it literally matches any string (it's not anchored).

applies? It could match any string.

erik-krogh commented 2 years ago

rxxr2 doesn't catch patterns like

/(a*a*x)?/.test('a'.repeat(1000000) + '!')

That's correct, it also doesn't catch /X(a*a*)Y/, because they only have a polynomial worst-case runtime, and that analysis is only looking for regular expression with exponential worst-case runtime.
Something like //g (real example) has a exponential worst-case performance.
For an example of just how catastrophic exponential worst-case can be: (takes a minute on my machine).

time node -e '/<!--(.|\s)*?-->/g.test("<!--" + " ".repeat(30) +  "--")'

or /(a*a*a*a*a*x)?/ etc

Still only polynomial. Although the runtime is something like O(n^5).

My CodeQL query catches those polynomial examples, but we only flag them if those regular expressions are run on inputs that might be controlled by an attacker.
(As opposed to the exponential worst-case regular expressions, we always flag those).

erik-krogh commented 2 years ago

though? Isn't this also a case where

That regexp is not vulnerable to ReDoS because it literally matches any string (it's not anchored).

applies? It could match any string.

I'm not sure if I understand the question.
Yes a regular expression like /(a*a*x)?/ is vulnerable to polynomial ReDoS.
Yes the regular expression as a whole matches any string.
But the subterm (a*a*x) does not, and the backtracking can still happen inside that subterm.
If you read theorem 3 it only requires that you reach a rejecting state (q_r ∉ F) starting from the end of the vulnerable subterm (q'), not that the regular expression as a whole has strings it doesn't match.

(I wasn't clear enough in my original comment).

This stuff is complicated. I've on multiple occasions gotten false positive reports that turned out to be true positives.

tjenkinson commented 2 years ago

My CodeQL query catches those polynomial examples, but we only flag them if those regular expressions are run on inputs that might be controlled by an attacker.

Interesting. Is the reason you don't flag those in all cases to prevent too much noise? I get that a*a*x would need a long input string to be a problem, but something like a*a*a*a*a*a*a*x wouldn't, even if it's not exponential.

I'm not sure if I understand the question.

I was confused by

For example your tool falsely reports a regexp like /(a*a*x)*/ as vulnerable. That regexp is not vulnerable to ReDoS because it literally matches any string (it's not anchored).

But I think we have the same understanding and actually the a*a*x bit is vulnerable, but polynomial not exponential? Would "SuperlinearBackTracking.qll" flag a*a*x? Also is there an environment somewhere for running those queries easily?

Maybe this tool could also optionally take in the max length of the input string and then that could be factored in to cap the max number of potential backtracks 🤔

I will have a read of that paper, but it sounds very mathematical so not sure if I'll follow it all

erik-krogh commented 2 years ago

I get that aax would need a long input string to be a problem, but something like aaaaaaa*x wouldn't, even if it's not exponential.

Yeah. Only distinguishing between polynomial and exponential might be too simple, but it's what we do for now.
I remember seeing some tool that could tell you how high the exponent was in the polynomial worst-case, but I don't remember where I found it. (I think it was a Python thing, but I'm most likely wrong).

But I think we have the same understanding and actually the aax bit is vulnerable, but polynomial not exponential? Would "SuperlinearBackTracking.qll" flag a*a*x?.

Yes. It's vulnerable, yes we flag it.

Also is there an environment somewhere for running those queries easily?

The easiest way is probably to use https://lgtm.com/.
You can find the help pages for the queries here: exponential-redos, polynomial-redos.
If you click the Open in query console button you can run the query on open-source projects (you can add projects to that list in Project list).

Otherwise you can see some more documentation at https://codeql.github.com/, which includes how to run CodeQL locally.
It's free for open-source and academia.

Maybe this tool could also optionally take in the max length of the input string and then that could be factored in to cap the max number of potential backtracks thinking

We do some of that for the polynomial case.
But we just assume that any length check is good enough.

I will have a read of that paper, but it sounds very mathematical so not sure if I'll follow it all

Yeah.
I learned that syntax in one of those computer science courses that no one thinks they'll every need.
But I just happened to end up in a position where I need all of that.

tjenkinson commented 2 years ago

We do some of that for the polynomial case. But we just assume that any length check is good enough.

Cool thanks good to know it's not a crazy idea. I might look into that at some point.

I think the example on https://lgtm.com/rules/1511427506129/ is wrong by the way. Shouldn't it be

^0\.\d+E?\d+

?

The paper was really interesting. I didn't follow a lot of the snippets but some of the background is still useful

erik-krogh commented 2 years ago

I think the example on https://lgtm.com/rules/1511427506129/ is wrong by the way. Shouldn't it be
^0\.\d+E?\d+

Yes it should.
That also how it shows in the source-code.
I'll look into it.

tjenkinson commented 1 year ago

Closing this in favour of https://github.com/tjenkinson/redos-detector/issues/426

tjenkinson / redos-detector

Incorrect backref expansion, and ignored look ahead assrtions #87