Closed: BurntSushi closed this issue 1 year ago.
I'm willing to do this if you want.
@SeanRBurton Yes, that'd be lovely, thank you!
I'm currently hitting regex with AFL, calling `Regex::new` on an AFL-generated pattern and `::is_match()`, `::find()` and `::captures()` on a static test string. So far I've only seen some "crashes" because regex exceeds the memory limit set by AFL - this could probably be circumvented by using `::with_size_limit()`.
I wonder what a more thorough strategy would look like.
@lukaslueg Could you please open issues with test cases that reproduce your problem? What is the memory limit?
The memory limit is AFL's default of 50MB VM-size. Re-running with a manually set `ulimit -v` reproduced the out-of-memory hard abort. I'll post an example when I get one again - I did not pay too much attention to it since there is always `::with_size_limit()`.
Does it make sense to test the generated patterns against a randomly generated string instead of a static one? Basically the only thing AFL is looking for right now is arithmetic overflows and panics...
I'm definitely interested in regexes that use more than 50MB of heap. If you could create a new issue with examples that'd be great!
> Does it make sense to test the generated patterns against a randomly generated string instead of a static one? Basically the only thing AFL is looking for right now is arithmetic overflows and panics...
Yes, definitely.
Is there a minimum length for the haystack? We want the regex to switch strategies (if it does that at all).
It can switch strategies depending on the length, but the specific length isn't fixed. If the combined size of the input and the regex exceeds ~256KB, then the capturing engine will switch from backtracking to the Pike VM. So if your haystack is over 256KB, that should do it. Note though that this only happens when extracting captures (the `captures` method, but not `is_match` or `find`).
I'll continue running AFL-generated regexes against a small haystack, since matching a string of 260KB is painfully slow. Two crashing bugs and one OOM situation so far - I'll switch strategies when things quiet down on the small haystack.
@lukaslueg Thank you! :-)
Can you give some info on when regex actually executes the SIMD path?
@lukaslueg Any time there are multiple literals in the prefix. For example, `aaa|bbb|ccc|ddd|eee|fff` should execute the SIMD algorithm.
Note that you need to compile like so:
`RUSTFLAGS="-C target-feature=+ssse3" cargo build --release --features simd-accel`
And certainly, you'll need a nightly compiler.
That's what I'm doing. I've also patched `MAX_SIZE_BYTES` in `backtrack.rs` to be 256 instead of 256KB in order to give the Pike VM a rub without slowing things down too much. Any more suggestions?
That's all I can think of at the moment for things based on length. There is also a reverse suffix literal optimization, e.g., `\w+Foobar` will search for `Foobar` before actually matching `\w+`, but I don't know if you have that kind of control with AFL or not.
You do. It would be of great help if you could post some examples that you know will trigger certain behavior.
Examples that should use the reverse suffix literal optimization:
`\w+foobar`
`\w+foobar\w+foobar`
`\d+foobar\d+foobar`
`\d+foobar\d+(foo|baz)barbar`
`\w+a` also? The shorter the better, since it's a random string we are matching against. What about other behavior, like the SIMD path?
No to `\w+a`. The reverse suffix literal optimization only kicks in when the literal is long enough (under the assumption that a longer literal yields fewer false positives). You can see the logic here: https://github.com/rust-lang-nursery/regex/blob/master/src/exec.rs#L1154-L1174
Other examples:
`^foobar` (begin anchored literal)
`foobar$` (end anchored literal)
`^foobar$` (anchored literal)
`^\w+` (begin anchored)
`\w+$` (end anchored)
`^\w+$` (anchored)
I've added some examples as above which increase the number of execution paths covered. Things have quieted down, though, and I'll wait for the reported bugs to get fixed before continuing to run AFL. I've got everything in a Docker container, so things can pick up fast once regex's codebase changes.
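As a side note, the behavior of the anchored forms listed above can be sanity-checked with Python's `re` module as a stand-in (the anchoring semantics are the same as rust-regex's for these simple cases; the internal optimization choices, of course, are not observable this way):

```python
import re

# Begin-anchored literal: only matches at the start of the haystack.
assert re.search(r'^foobar', 'foobar baz') is not None
assert re.search(r'^foobar', 'baz foobar') is None

# End-anchored literal: only matches at the end.
assert re.search(r'foobar$', 'baz foobar') is not None

# Fully anchored literal: the haystack must be exactly the literal.
assert re.search(r'^foobar$', 'foobar') is not None

# Begin/end/fully anchored character classes.
assert re.search(r'^\w+', 'abc def').group() == 'abc'
assert re.search(r'\w+$', 'abc def').group() == 'def'
assert re.search(r'^\w+$', 'abcdef') is not None
```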
Looking at #241: Is there a way to reliably force a certain regex engine without patching the code? It would be interesting to compare the results of different engines. I've also thought about comparing the results to RE2, yet this seems hard to get right without too many false positives.
@lukaslueg Sorry, I missed your most recent comment here. Issue #241 actually contains the relevant code to force a particular regex engine. You may be surprised to find out that things like `ExecBuilder` are technically exposed through a publicly-accessible-but-not-public interface.
Comparing with RE2 (or even PCRE2) is probably a good idea.
FYI, I have a PR incoming that fixes most of your bugs. :-)
I think @lukaslueg has done enough of this to make this issue closeable for now. I welcome more fuzzing though! :-)
As a matter of fact, I just spent half a day reworking the fuzzing architecture to make it self-contained. The latest HEAD is as of just now being fuzzed in The Cloud© on a 24/7 machine.
There have already been some patterns turning up where the different regex engines disagree on match groups, most likely exposing bugs in at least one of them. I'll spend some time over the next few days minimizing them and filing bugs individually.
@lukaslueg Awesome, thanks!
<offtopic slightly> @lukaslueg that sounds very cool! Is the code/scripts you wrote for this available online somewhere? And are you using https://github.com/frewsxcv/afl.rs?
The code is currently not available; I'll sync a repo once things have stabilized. I used to have a Docker container with a custom-built llvm/rust/afl combo, but rust moved ahead of afl.rs and I switched to the container provided by afl.rs just recently. It turns out that one is also quite old, unstable and not reproducible, so I'm unhappy with this solution as well.
I've started comparing results from libpcre to rust-regex using AFL. This is probably way more tricky than once thought, because the regex engines seem to interpret things differently. For starters, the regex `\b.\b` produces a match with rust-regex but does not with libpcre on the same randomly generated haystack. I suspect libpcre handles `\b` differently than rust-regex. Any guidance on this, @BurntSushi? I could filter out patterns that are known to produce different results.
@lukaslueg In rust-regex, `\b` is Unicode-aware by default. In fact, most everything is Unicode-aware. You'll probably need to enable Unicode support in PCRE, although I'm not sure if that will cover word boundaries. For rust-regex, you can revert to the ASCII versions of various things by disabling Unicode support, e.g., `(?-u:\b)` is an ASCII-only word boundary, `(?-u:\w)` is an ASCII-only word character class, and so on.
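The Unicode/ASCII word-boundary divergence can be demonstrated with Python's `re` module as a stand-in, since it exposes both behaviors: the default corresponds to rust-regex's Unicode-aware `\b`, and `re.ASCII` roughly corresponds to `(?-u:\b)`:

```python
import re

# 'é' is a word character under Unicode rules but not under ASCII rules.
haystack = 'é'

# Unicode-aware \b: boundaries exist around 'é', so \b.\b matches.
assert re.search(r'\b.\b', haystack) is not None

# ASCII-only \b: 'é' is not a word character, there are no boundaries,
# so the same pattern finds no match in the same haystack.
assert re.search(r'\b.\b', haystack, re.ASCII) is None
```

This is the same shape of disagreement as the `\b.\b` mismatch between rust-regex and libpcre described above: two engines with different `\b` semantics, one haystack.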
Thanks for doing this by the way. :-)
I've given regex fuzzing love for the last 12 days on a dedicated machine, and around 900 million rounds were executed. Variations of #375 show up; due to #345, mismatches were not examined. All Quiet on the Western Front; close?
@lukaslueg Hope so! Sorry I'm dragging my feet. These bugs take a lot of lead up time to load context into my brain, so I tend to put them off.
FYI, I ran AFL on this crate a few weeks back and found this panic. I've also got a fuzz target linked if anyone wants to run afl.rs themselves for this crate.
@frewsxcv Thanks! It takes a while to fix these bugs. It takes a lot of time just to build up the context. I tend to let them pile up a little so I can amortize it.
Oh, for sure! I didn't mean to suggest urgency in getting it fixed, just linking in case others wanted to see the fuzz target I used and the sort of bug it found. Keep up the great work @BurntSushi :)
@frewsxcv Oh ya no worries! Thanks for filing bugs!
@lukaslueg: Do you have thoughts on the advantages/disadvantages of using rust-fuzz/honggfuzz-rs instead, or additionally?
Back in June 2017, AFL ran for several weeks on some cloud node and found no (further) crashes apart from variations of #321 (which is as yet unresolved). The low-hanging fruit from obvious bugs has been picked.
It could still be interesting to have honggfuzz take a look at regex with respect to "behavior", though. My thought on how to proceed would be to try to fix #321 (so we have consistent behavior within regex) and then start running fuzzed inputs against regex and re2/pcre2 to see if they agree.
I started doing that with pcre2 as a reference. It turned out that the Rust glue available at the time was not the best, and idiosyncrasies in pcre2 produced an unmanageable amount of noise. I did not put too much effort into it, though.
For anyone looking for more ways to fuzz, I believe any regex that parses should pass `backends_are_consistent`.
See #468 for more info.
I've not fuzzed regex since #345 showed up because the duplicates cloud the view: one would have to manually and carefully inspect whether or not two diverging test cases are actually the same underlying problem in order not to waste anyone's time.
It turns out fuzzers are surprisingly good at producing funny regular expressions, so I think the work is worthwhile. Yet I also think we need to take the slow route: fix one problem at a time and start over, which is not as bad as it sounds.
Ping me if #345 gets fixed by *cough* someone. @BurntSushi, maybe we can get something like `backends_are_consistent` as a `#[doc(hidden)]` into regex?
There is already an `internal` module for doing various dubious things like lobotomizing the execution planner for testing purposes, so that could be a good place to put `backends_are_consistent`. It might also be worthwhile to expose an input parameter on `backends_are_consistent` so that the fuzzer can drive the randomness. I understand fuzzers and quickcheck operate a little differently.
The input should probably be `&[u8]` for both the haystack and the regex. If it's valid UTF-8, it runs a UTF-8 version; in any case it runs the byte version. The return type should be something like `Option<Result<(), &str>>`, which is `None` if the regex is invalid, `Some(Ok(()))` if nothing happened, and `Some(Err(e))` if the results are inconsistent.
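That proposed shape could be sketched like so, using Python's `re` module as a stand-in: its `bytes` and `str` patterns loosely mirror the "byte version" and "UTF-8 version" engines, and tuples mimic the `Option<Result<..>>` return type. All names here are hypothetical, not the real rust-regex API:

```python
import re

def backends_are_consistent(pattern: bytes, haystack: bytes):
    """Differential-check sketch: None ~ invalid regex,
    ('ok', ...) ~ Some(Ok(())), ('err', msg) ~ Some(Err(e))."""
    try:
        byte_re = re.compile(pattern)
    except re.error:
        return None  # invalid pattern: nothing to compare
    byte_hit = byte_re.search(haystack) is not None
    try:
        text = haystack.decode('utf-8')
        text_re = re.compile(pattern.decode('utf-8'))
    except (UnicodeDecodeError, re.error):
        return ('ok', None)  # not valid UTF-8: only the byte engine ran
    text_hit = text_re.search(text) is not None
    if byte_hit != text_hit:
        return ('err', f'byte engine: {byte_hit}, utf-8 engine: {text_hit}')
    return ('ok', ())

# Python's bytes-\w is ASCII-only while str-\w is Unicode-aware, so this
# reports an inconsistency - the same class of divergence a fuzzer would hunt:
print(backends_are_consistent(rb'\w+', 'é'.encode()))
```

A real harness would of course compare spans and capture groups, not just hit/no-hit, and would drive the actual rust-regex engines rather than two flavors of one library.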
Some ideas here on a starting set of regexes for fuzzing.
@davisjam Those sound like great ideas! The corpus may need a bit of massaging, since this crate doesn't support all the same features that JavaScript/Python does.
@BurntSushi If leveraging my regex corpus would be of interest, I have a filtered set that compiles in Rust. I don't currently have the cycles to filter out redundant regexes (e.g. those that use identical feature sets) or attempt to bring those into the Rust engine, but I'd be happy to share what I've got so far.
@davisjam Thanks! If you can put it in an accessible place, then I can grab it. I don't know when I'll get a chance to look at it, but it seems useful. One point of clarity: what is the license on your test cases?
@BurntSushi
> If you can put it in an accessible place, then I can grab it
I've added this to my TODO list. Won't be in the short term though. Maybe a month or two...
> What is the license?
I'm a researcher, so the license is "Whatever benefits the world and also acknowledges/cites me".
@davisjam Sounds good. The MIT license would be simplest.
This fuzzer landed and it has been running on Google's infrastructure for a few years now.
Currently, the parser is "fuzzed" to some extent with quickcheck, which ensures that a randomly generated AST can be roundtripped to concrete syntax and back. However, we should also try to apply more generic fuzzing techniques not only to the parser, but to regex searching itself. AFL seems like an ideal approach.