rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0

apply AFL to regex #203

Closed BurntSushi closed 1 year ago

BurntSushi commented 8 years ago

Currently, the parser is "fuzzed" to some extent with quickcheck, which ensures that a randomly generated AST can be roundtripped to concrete syntax and back. However, we should also try to apply more generic fuzzing techniques not only to the parser, but to regex searching itself. AFL seems like an ideal approach.

SeanRBurton commented 8 years ago

I'm willing to do this if you want.

BurntSushi commented 8 years ago

@SeanRBurton Yes, that'd be lovely, thank you!

lukaslueg commented 8 years ago

I'm currently hitting regex with AFL, calling Regex::new on an AFL-generated pattern and ::is_match(), ::find() and ::captures() on a static test string. So far I've only seen some "crashes" because regex exceeds the memory limit set by AFL - this could probably be circumvented by using ::with_size_limit(). I wonder what a more thorough strategy would look like.
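For illustration, a harness along those lines might look roughly like the sketch below (this uses the afl.rs `fuzz!` macro as it exists today and an arbitrary fixed haystack; the exact setup differs):

```rust
use afl::fuzz;
use regex::Regex;

fn main() {
    fuzz!(|data: &[u8]| {
        // AFL hands us arbitrary bytes; only valid UTF-8 can become a pattern.
        if let Ok(pattern) = std::str::from_utf8(data) {
            // Most generated patterns won't compile; that's fine.
            if let Ok(re) = Regex::new(pattern) {
                // Exercise the main search entry points on a static haystack.
                let haystack = "The quick brown fox jumps over the lazy dog";
                let _ = re.is_match(haystack);
                let _ = re.find(haystack);
                let _ = re.captures(haystack);
            }
        }
    });
}
```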

BurntSushi commented 8 years ago

@lukaslueg Could you please open issues with test cases that reproduce your problem? What is the memory limit?

lukaslueg commented 8 years ago

The memory limit is AFL's default of 50MB of VM size. Re-running with a manually set ulimit -v reproduced the out-of-memory hard abort. I'll post an example when I get one again - I did not pay too much attention to it since there is always ::with_size_limit().

Does it make sense to test the generated patterns against a randomly generated string instead of a static one? Basically the only things AFL is looking for right now are arithmetic overflows and panics...

BurntSushi commented 8 years ago

I'm definitely interested in regexes that use more than 50MB of heap. If you could create a new issue with examples that'd be great!

Does it make sense to test the generated patterns against a randomly generated string instead of a static one? Basically the only things AFL is looking for right now are arithmetic overflows and panics...

Yes, definitely.

lukaslueg commented 8 years ago

Is there a minimum length for the haystack? We want the regex to switch strategies (if it does that at all).

BurntSushi commented 8 years ago

It can switch strategies depending on the length, but the specific length isn't fixed. If the combined size of the input and the regex exceeds ~256KB, then the capturing engine will switch from backtracking to the Pike VM. So if your haystack is over 256KB, that should do it. Note though that this only happens when extracting captures (the captures method but not is_match or find).
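A hedged sketch of what forcing that switch could look like (the ~256KB threshold and the backtracking/Pike VM split are internal details that may change between versions):

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)-(\d+)").unwrap();
    // Pad the haystack well past the ~256KB threshold so that extracting
    // captures falls back from the bounded backtracker to the Pike VM.
    let mut haystack = "x ".repeat(200 * 1024);
    haystack.push_str("abc-123");
    // Only `captures` is affected; `is_match` and `find` take other paths.
    let caps = re.captures(&haystack).unwrap();
    assert_eq!(&caps[1], "abc");
    assert_eq!(&caps[2], "123");
}
```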

lukaslueg commented 8 years ago

I'll continue running AFL-generated regexes against a small haystack since matching a string of 260KB is painfully slow. Two crashing bugs and one OOM situation so far - I'll switch strategies when things quiet down on the small haystack.

BurntSushi commented 8 years ago

@lukaslueg Thank you! :-)

lukaslueg commented 8 years ago

Can you give some info on when regex actually executes the SIMD path?

BurntSushi commented 8 years ago

@lukaslueg Any time there are multiple literals in the prefix. For example, aaa|bbb|ccc|ddd|eee|fff should execute the SIMD algorithm.

Note that you need to compile like so:

RUSTFLAGS="-C target-feature=+ssse3" cargo build --release --features simd-accel

And certainly, you'll need a nightly compiler.

lukaslueg commented 8 years ago

That's what I'm doing. I've also patched MAX_SIZE_BYTES in backtrack.rs to be 256 bytes instead of 256KB in order to give the PikeVM a run without slowing things down too much. Any more suggestions?

BurntSushi commented 8 years ago

That's all I can think of at the moment for things based on length. There is also a reverse suffix literal optimization, e.g., \w+Foobar will search for Foobar before actually matching \w+, but I don't know if you have that kind of control with AFL or not.

lukaslueg commented 8 years ago

You do. It would be of great help if you could post some examples that you know will trigger certain behavior.

BurntSushi commented 8 years ago

Examples that should use the reverse suffix literal optimization:

lukaslueg commented 8 years ago

"\w+a" also? the shorter the better, since it's a random string we are matching against.

Any other behavior to target, like the SIMD path?

BurntSushi commented 8 years ago

No to \w+a. The reverse suffix literal optimization only kicks in when the literal is long enough (under the assumption that a longer literal yields fewer false positives). You can see the logic here: https://github.com/rust-lang-nursery/regex/blob/master/src/exec.rs#L1154-L1174

Other examples:

lukaslueg commented 8 years ago

I've added some examples as above which increase the number of execution paths covered. Things have quieted down though and I'll wait for the reported bugs to get fixed before continuing running AFL. I've got everything in a docker container so things can pick up fast once regex's codebase changes.

lukaslueg commented 8 years ago

Looking at #241: Is there a way to reliably force a certain regex engine without patching the code? It would be interesting to compare the results of different engines. I've also thought about comparing the results to RE2, yet this seems hard to actually get right without too many false positives.

BurntSushi commented 8 years ago

@lukaslueg Sorry, I missed your most recent comment here. Issue #241 actually contains the relevant code to force a particular regex engine. You may be surprised to find out that things like ExecBuilder are technically exposed through a publicly-accessible-but-not-public interface.

Comparing with RE2 (or even PCRE2) is probably a good idea.

FYI, I have a PR incoming that fixes most of your bugs. :-)

BurntSushi commented 7 years ago

I think @lukaslueg has done enough of this to make this issue closeable for now. I welcome more fuzzing though! :-)

lukaslueg commented 7 years ago

As a matter of fact, I just spent half a day reworking the fuzzing architecture to make it self-contained. The latest HEAD is as of just now being fuzzed in The Cloud© on a 24/7 machine.

There have already been some patterns turning up where the different regex engines disagree on match groups, most likely exposing bugs in at least one of them. I'll spend some time over the next few days minimizing them and filing bugs individually.

BurntSushi commented 7 years ago

@lukaslueg Awesome, thanks!

utkarshkukreti commented 7 years ago

<offtopic slightly> @lukaslueg that sounds very cool! Is the code/scripts you wrote for this available online somewhere? And are you using https://github.com/frewsxcv/afl.rs?

lukaslueg commented 7 years ago

The code is currently not available, I'll sync a repo once things have stabilized. I used to have a docker container with a custom-built llvm/rust/afl-combo but rust moved ahead of afl.rs and I switched to the container provided by afl.rs just recently. Turns out it is also quite old, unstable and not reproducible so I'm unhappy with this solution as well.

lukaslueg commented 7 years ago

I've started comparing results from libpcre to rust-regex using AFL. This is probably way more tricky than I once thought, because the regex engines seem to interpret things differently. For starters, the regex \b.\b produces a match with rust-regex but does not with libpcre on the same randomly generated haystack. I suspect libpcre handles \b differently than rust-regex. Some guidance on this @BurntSushi ? I could filter out patterns that are known to produce different results.

BurntSushi commented 7 years ago

@lukaslueg In rust-regex, \b is Unicode aware by default. In fact, most everything is Unicode aware. You'll probably need to enable Unicode support in PCRE, although I'm not sure if that will cover word boundaries. For rust-regex, you can revert to the ASCII versions of various things by disabling Unicode support. e.g., (?-u:\b) is an ASCII-only word boundary, (?-u:\w) is an ASCII-only word character class and so on.
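For example (a small sketch; 'δ' stands in for any non-ASCII word character):

```rust
use regex::Regex;

fn main() {
    // With Unicode enabled (the default), 'δ' is a word character,
    // so \b sees word boundaries around it.
    let unicode = Regex::new(r"\bδ\b").unwrap();
    assert!(unicode.is_match("δ"));

    // With ASCII-only word boundaries, 'δ' is not a word character,
    // so there is no boundary and the match fails.
    let ascii = Regex::new(r"(?-u:\b)δ(?-u:\b)").unwrap();
    assert!(!ascii.is_match("δ"));
}
```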

Thanks for doing this by the way. :-)

lukaslueg commented 7 years ago

I've given regex fuzzing love for the last 12 days on a dedicated machine, and around 900 million rounds were executed. Variations of #375 show up; due to #345, mismatches were not examined. All Quiet on the Western Front; close?

BurntSushi commented 7 years ago

@lukaslueg Hope so! Sorry I'm dragging my feet. These bugs take a lot of lead up time to load context into my brain, so I tend to put them off.

frewsxcv commented 6 years ago

fyi, i ran AFL on this crate a few weeks back and found this panic. i've also got a fuzz target linked if anyone wants to run afl.rs themselves for this crate

BurntSushi commented 6 years ago

@frewsxcv Thanks! It takes a while to fix these bugs. It takes a lot of time just to build up the context. I tend to let them pile up a little so I can amortize it.

frewsxcv commented 6 years ago

oh for sure! i didn't mean to suggest urgency in getting it fixed, just linking in case others wanted to see the fuzz target i used and the sort of bug it found. keep up the great work @BurntSushi :)

BurntSushi commented 6 years ago

@frewsxcv Oh ya no worries! Thanks for filing bugs!

sanmai-NL commented 6 years ago

@lukaslueg: Do you have thoughts on the advantages/disadvantages of using rust-fuzz/honggfuzz-rs instead/additionally?

lukaslueg commented 6 years ago

Back in June 2017, AFL ran for several weeks on some cloud node and found no (further) crashes apart from variations of #321 (which is still unresolved). The low-hanging fruit from obvious bugs has been picked.

It could still be interesting to have honggfuzz take a look at regex with respect to "behavior", though. My thought on how to proceed would be to try to fix #321 (so we have consistent behavior within regex) and then start running fuzzed inputs against regex and re2 / pcre2 to see if they agree.

I started doing that with pcre2 as a reference. It turned out that the rust-glue available at the time was not the best and idiosyncrasies in pcre2 produced an unmanageable amount of noise. I did not put too much effort into it, though.

ethanpailes commented 6 years ago

For anyone looking for more ways to fuzz, I believe any regex that parses should pass backends_are_consistent.

See #468 for more info.

lukaslueg commented 6 years ago

I've not fuzzed regex since #345 showed up because the duplicates cloud the view: one would have to manually and carefully inspect whether or not two diverging test cases are actually the same underlying problem in order not to waste anyone's time.

Turns out fuzzers are surprisingly good at producing funny regular expressions, so I think the work is worthwhile. Yet I also think we need to take the slow route, fix one problem at a time and start over - which is not as bad as it sounds.

Ping me if #345 gets fixed by cough someone. @BurntSushi maybe we can get something like backends_are_consistent as a #[doc(hidden)] into regex?

ethanpailes commented 6 years ago

There is already an internal module for doing various dubious things like lobotomizing the execution planner for testing purposes, so that could be a good place to put backends_are_consistent. It might also be worthwhile to expose an input param in backends_are_consistent so that the fuzzer can drive the randomness. I understand fuzzers and quickcheck operate a little differently.

lukaslueg commented 6 years ago

The input should probably be &[u8] for both the haystack and the regex. In case it's valid UTF-8, it runs a UTF-8 version; in any case it runs the byte version. The return type should be something like Option<Result<(), &str>>, which is None if the regex is invalid, Some(Ok(())) if nothing happened and Some(Err(e)) if the results are inconsistent.
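A rough sketch of that shape (hypothetical name, String instead of &str for the error, and comparing only the public str and bytes engines rather than the internal backends the real helper would exercise):

```rust
use regex::bytes;
use regex::Regex;

// Hypothetical helper: always run the byte-based engine, additionally run the
// str-based engine when the haystack is valid UTF-8, and report disagreement.
fn check_consistency(pattern: &[u8], haystack: &[u8]) -> Option<Result<(), String>> {
    // None: the pattern isn't valid UTF-8 or doesn't compile at all.
    let pat = std::str::from_utf8(pattern).ok()?;
    let byte_re = bytes::Regex::new(pat).ok()?;
    let byte_span = byte_re.find(haystack).map(|m| (m.start(), m.end()));

    if let (Ok(hay), Ok(str_re)) = (std::str::from_utf8(haystack), Regex::new(pat)) {
        let str_span = str_re.find(hay).map(|m| (m.start(), m.end()));
        if byte_span != str_span {
            // Some(Err(..)): the engines disagree on the match span.
            return Some(Err(format!("bytes: {:?}, str: {:?}", byte_span, str_span)));
        }
    }
    // Some(Ok(())): nothing suspicious happened.
    Some(Ok(()))
}
```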

davisjam commented 5 years ago

Some ideas here on a starting set of regexes for fuzzing.

  1. The rxxr2 project has a mechanism to automatically generate regex patterns and test them against itself and some other engines that follow a similar spec. Possibly a nice way to seed fuzzing efforts.
  2. I have a giant corpus of real JavaScript and Python regexes (~300K) from my FSE'18 paper that could provide a realistic set of starting points.

BurntSushi commented 5 years ago

@davisjam Those sound like great ideas! The corpus may need a bit of massaging since this crate doesn't support all the same features that JavaScript/Python do.

davisjam commented 5 years ago

@BurntSushi If leveraging my regex corpus would be of interest, I have a filtered set that compiles in Rust. I don't currently have the cycles to filter out redundant regexes (e.g. those that use identical feature sets) or attempt to bring those into the Rust engine, but I'd be happy to share what I've got so far.

BurntSushi commented 5 years ago

@davisjam Thanks! If you can put it in an accessible place, then I can grab it. I don't know when I'll get a chance to look at it, but it seems useful. One point of clarity: what is the license on your test cases?

davisjam commented 5 years ago

@BurntSushi

If you can put it in an accessible place, then I can grab it

I've added this to my TODO list. Won't be in the short term though. Maybe a month or two...

What is the license?

I'm a researcher, so the license is "Whatever benefits the world and also acknowledges/cites me".

BurntSushi commented 5 years ago

@davisjam Sounds good. The MIT license would be simplest.

BurntSushi commented 1 year ago

This fuzzer landed and it has been running on Google's infrastructure for a few years now.