rust-lang / rust-clippy

A bunch of lints to catch common mistakes and improve your Rust code. Book: https://doc.rust-lang.org/clippy/
https://rust-lang.github.io/rust-clippy/

lint suboptimal regexps #646

Open llogiq opened 8 years ago

llogiq commented 8 years ago

See https://en.wikipedia.org/wiki/ReDoS, e.g. the pattern (a+)+

@BurntSushi I think we could get a big benefit here. However, I'm not as deep in the implementation as you are – could you give us some hints on how to detect those cases?
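To illustrate the failure mode: a backtracking engine matching (a+)+ against a long run of a's followed by a character that can never match must try every way of splitting the run between the inner and outer repetitions. A minimal sketch in Rust, using a toy matcher written just for this example (not clippy or regex-crate code) that models the pattern (a+)+b and counts the states it explores:

```rust
// Toy backtracking matcher for the pattern `(a+)+b`, counting explored
// states. Illustrative only; no real engine is structured like this.
fn try_match(input: &[u8], pos: usize, steps: &mut u64) -> bool {
    *steps += 1;
    for end in (pos + 1)..=input.len() {
        // The inner `a+` consumes input[pos..end]; it must be all 'a's.
        if !input[pos..end].iter().all(|&b| b == b'a') {
            break;
        }
        if end == input.len() {
            // Ran out of input: the trailing `b` can never match here.
            continue;
        }
        // The outer `+` backtracks by recursing on every possible split.
        if try_match(input, end, steps) {
            return true;
        }
    }
    false
}

fn main() {
    for n in [10usize, 15, 20] {
        let input = vec![b'a'; n];
        let mut steps = 0u64;
        let matched = try_match(&input, 0, &mut steps);
        println!("n = {n:2}: matched = {matched}, states explored = {steps}");
    }
}
```

The explored-state count doubles with every extra a (512, 16384, 524288 for n = 10, 15, 20), which is exactly the exponential blowup ReDoS exploits.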

mcarton commented 8 years ago

:+1: It would be a nice one.

llogiq commented 8 years ago

I am not aware of any lint tool that does this yet, so...yeah, it'd be plenty cool. :smile:

BurntSushi commented 8 years ago

Thankfully, we don't need this!

The attack exploits the fact that most regular expression implementations have exponential worst-case time complexity.

Rust's regex crate guarantees worst-case linear time complexity. :D

BurntSushi commented 8 years ago

If someone were using pcre bindings, then maybe this lint would be useful. ;-)

BurntSushi commented 8 years ago

To clarify, from the OP:

  • the engine may convert it to a deterministic finite-state automaton (DFA) and run the input through the result;
  • the engine may try all the possible paths one by one until a match is found or all the paths have been tried and failed ("backtracking");[4][5]
  • the engine may consider all possible paths through the nondeterministic automaton in parallel;
  • the engine may convert the nondeterministic automaton to a DFA lazily (i.e., on the fly, during the match).

The OP (correctly) states that the first two approaches are problematic while the latter two do not exhibit any pathological behavior. Indeed, regex implements the latter two but not the former two. (Well, it does have a backtracking engine, but it is bounded.)
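The third approach is easy to sketch: keep a set of live NFA states and advance the whole set one input byte at a time. Each byte is processed once, so matching costs at most len(input) × number_of_states — linear in the input. A hand-rolled toy NFA for the pattern a+b (illustrative only, nothing like the regex crate's actual internals):

```rust
// "All paths in parallel" (Thompson NFA simulation) for the pattern `a+b`.
// States: 0 = start, 1 = seen one or more 'a', 2 = accept.
use std::collections::HashSet;

fn nfa_match(input: &[u8]) -> bool {
    let mut current: HashSet<u32> = HashSet::from([0]);
    for &b in input {
        let mut next = HashSet::new();
        for &s in &current {
            match (s, b) {
                (0, b'a') => { next.insert(1); } // consume first 'a'
                (1, b'a') => { next.insert(1); } // `a+` loops
                (1, b'b') => { next.insert(2); } // trailing `b` accepts
                _ => {}                          // dead path: just dropped
            }
        }
        current = next;
    }
    current.contains(&2)
}
```

Because the state set is deduplicated at every position, no path explosion is possible: no input can drive this strategy into exponential behavior.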

There are other ways to cause pathological behavior, e.g., a{100000000} could create a regex of a rather large size, which could cause DoS by exhausting memory. regex also protects against this by placing a user-configurable bound on how big regexes can be. Similarly, deep nesting like (((((((((((...)))))))))))) is also bounded. (Although this last one may only be fixed on my branch.)

llogiq commented 8 years ago

@BurntSushi Thanks, that's totally excellent. :smile:

In that case, we could lint suboptimal regexps that can be simplified without losing meaning. E.g. (a+a*) is equivalent to just (a+).
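One way such a rewrite could be sketched on a toy regex AST (hypothetical types, not clippy's or regex-syntax's real ones): in a concatenation, an adjacent X+X* or X*X+ pair matches exactly the same strings as X+ (ignoring capture-group semantics), so the pair can be collapsed.

```rust
// Toy regex AST with one simplification rule: `X+X*` == `X*X+` == `X+`.
#[derive(Clone, Debug, PartialEq)]
enum Ast {
    Lit(char),
    Star(Box<Ast>), // X*
    Plus(Box<Ast>), // X+
    Concat(Vec<Ast>),
}

enum Action { Keep, Drop, Replace }

fn simplify(ast: Ast) -> Ast {
    match ast {
        Ast::Concat(parts) => {
            let mut out: Vec<Ast> = Vec::new();
            for part in parts.into_iter().map(simplify) {
                // Decide first, mutate after, to keep the borrow checker happy.
                let action = match (out.last(), &part) {
                    // X+ followed by X*: the X* is redundant.
                    (Some(Ast::Plus(a)), Ast::Star(b)) if a == b => Action::Drop,
                    // X* followed by X+: the X+ subsumes the X*.
                    (Some(Ast::Star(a)), Ast::Plus(b)) if a == b => Action::Replace,
                    _ => Action::Keep,
                };
                match action {
                    Action::Drop => {}
                    Action::Replace => {
                        out.pop();
                        out.push(part);
                    }
                    Action::Keep => out.push(part),
                }
            }
            if out.len() == 1 { out.pop().unwrap() } else { Ast::Concat(out) }
        }
        Ast::Star(inner) => Ast::Star(Box::new(simplify(*inner))),
        Ast::Plus(inner) => Ast::Plus(Box::new(simplify(*inner))),
        lit => lit,
    }
}
```

With this, simplify(Concat([Plus(Lit('a')), Star(Lit('a'))])) collapses to Plus(Lit('a')), i.e. (a+a*) becomes (a+); a real lint would of course have to be much more careful about groups, laziness, and backreference semantics.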

BurntSushi commented 8 years ago

@llogiq Hmm, that sounds doable for some easier patterns. :-)