google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Use bugs to measure fuzzer performance #165

Open jonathanmetzman opened 4 years ago

jonathanmetzman commented 4 years ago

My view of fuzzers is that "better" fuzzers can find more exploitable bugs than worse fuzzers. While coverage is a decent, easy-to-use, proxy for this, we should look into using crashes to determine fuzzer performance.

We already save the crashes found in each cycle/corpus snapshot.

Here's what's left to do that I can think of:

  1. Implementing a method for identifying crashes. ClusterFuzz's method (similar to stack hashing) works pretty well in practice.

  2. Finding a way to rank fuzzers based on the crashes they find.
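For (1), here is a minimal sketch of stack-hash-style grouping, assuming ASan-style stack traces; the frame count, the frame regex, and the normalization are illustrative, not ClusterFuzz's actual implementation:

```python
import hashlib
import re

NUM_FRAMES = 3  # illustrative; real deduplicators tune this

def crash_signature(stacktrace: str, num_frames: int = NUM_FRAMES) -> str:
    """Hash the function names of the top N sanitizer frames."""
    frames = []
    for line in stacktrace.splitlines():
        # Matches frames like "#0 0x4f2a1b in png_read_row /src/libpng/pngread.c:123".
        match = re.match(r'\s*#\d+ 0x[0-9a-f]+ in (\S+)', line)
        if match:
            frames.append(match.group(1))  # function name only; ignore addresses
        if len(frames) == num_frames:
            break
    return hashlib.sha1('|'.join(frames).encode()).hexdigest()

def group_crashes(stacktraces):
    """Group crashes by signature; each group approximates one bug."""
    groups = {}
    for trace in stacktraces:
        groups.setdefault(crash_signature(trace), []).append(trace)
    return groups
```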

lszekeres commented 4 years ago

I think the two biggest tasks for this are:

roachspray commented 4 years ago

In my initial pull request (#214), I put known CVE bugs under the benchmark's directory in known_bugs. My file naming is the CVE label.... but I am not sure this is the correct route. ...Whether there should exist some fuzzbench mapping of bugs?

jonathanmetzman commented 4 years ago

Whether there should exist some fuzzbench mapping of bugs?

You mean "where" right? I think the scheme you currently used is good for now.

roachspray commented 4 years ago

Whether there should exist some fuzzbench mapping of bugs?

You mean "where" right? I think the scheme you currently used is good for now.

OK that works. I just did not know if there might be a chance of bugs without a CVE label, but I am not going to worry about it. :)

mboehme commented 4 years ago

+1 for using bugs to measure fuzzer performance!

Additionally, we could also look at the history of oss-fuzz bugs to find very buggy versions of some oss-fuzz targets.

That would be awesome. Shouldn't ClusterFuzz have all the buggy versions, fixed versions, and reproducers?

jonathanmetzman commented 4 years ago

That would be awesome. Shouldn't ClusterFuzz have all the buggy versions, fixed versions, and reproducers?

It would have these things (though with a range of versions rather than the exact commit where the bug was introduced). The current fancy solution we have for integrating OSS-Fuzz benchmarks wouldn't work so well with this, though, because it uses a docker image that contains a checkout in which the bug occurs. Unless we make this image while the bug still reproduces on trunk, we won't have it. There's infra in OSS-Fuzz for taking an image and checking out a commit where a bug repros, but we need to incorporate that into fuzzbench.

mboehme commented 4 years ago

There are two interesting questions: (RQ.1) Which fuzzer finds more unknown bugs that we care about? (RQ.2) Which fuzzer finds the bugs we already know about faster?

In research, we have mostly focused on RQ.2 for practical reasons: (i) we don't have the computational resources to study bug discovery if we don't know whether there are any bugs in the program -- we better look where we know we can find bugs, and (ii) given two crashers, we don't know whether they reveal the same or two different bugs. Now, FuzzBench provides ample computational resources and OSS-Fuzz/ClusterFuzz can automatically de-duplicate.

FuzzBench enables RQ.1! For the subjects that are already in FuzzBench (and new interesting subjects taken from OSS-Fuzz), simply measure the number of (interesting) bugs found over time. You'll need to fix the oracle (sanitizer configuration) to facilitate a fair comparison. ClusterFuzz should be able to triage severity (to measure how much we care about the bug). Given Google's responsible disclosure policy, you can always git checkout @{'90 days ago'} for each subject.
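As a rough illustration of "measure the number of bugs found over time", here is a sketch that turns the per-cycle crash snapshots FuzzBench already saves into a distinct-bugs-over-time curve; `signature_fn` stands for whatever deduplication function is chosen (e.g. a stack-hash signature), and the snapshot format is assumed, not FuzzBench's actual schema:

```python
def bugs_over_time(snapshots, signature_fn):
    """snapshots: list of (cycle_time_seconds, [stacktrace, ...]) tuples."""
    seen = set()
    curve = []
    for time_seconds, stacktraces in sorted(snapshots):
        for trace in stacktraces:
            seen.add(signature_fn(trace))  # dedup across all snapshots so far
        curve.append((time_seconds, len(seen)))
    return curve  # e.g. [(900, 1), (1800, 3), ...]
```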

mboehme commented 4 years ago

@lszekeres wrote:

Additionally, we could also look at the history of oss-fuzz bugs to find very buggy versions of some oss-fuzz targets.

The following 18 projects have more than 100 items (mostly closed issues) in the OSS-Fuzz issue tracker:
skia 712 items
llvm 626 items
wireshark 246 items
poppler 242 items
proj4 195 items
openthread 176 items
open62541 168 items
mupdf 156 items
unicorn 152 items
llvm_libcxxabi 151 items
llvm_libcxx 145 items
yara 145 items
sqlite3 132 items
openvswitch 128 items
systemd 126 items
opensc 117 items
solidity 112 items
radare2 108 items

For SQLite3, the OSS-Fuzz build is already in fuzzbench: https://github.com/google/fuzzbench/tree/master/benchmarks/sqlite3_ossfuzz (3 lines of yaml :)

lszekeres commented 4 years ago

Thanks a lot for looking into this, @mboehme, this is very useful!

RQ.1 Which fuzzer finds more unknown bugs that we care about?

Yes, using unknown bugs for benchmarking might be interesting, but we aren't running FuzzBench continuously on the HEAD of the benchmarks/fuzz targets, like OSS-Fuzz does, so I don't expect to find a lot of completely new bugs with FuzzBench. The new bugs are typically found by OSS-Fuzz. What I think we could or should do instead is to create versions of oss-fuzz targets in which we can enable as many previously found bugs as possible and use them as fuzzbench benchmarks.

The following 18 projects have more than 100 items (mostly closed issues) in the OSS-Fuzz issue tracker:

These projects would be great candidates to create these very buggy benchmarks. I think we could do this automatically with some script that does something like the following:

  • Take some initial version of the code then repeat:
    • Find a relevant patch for an oss-fuzz bug
    • Try to revert it and see if the reproducer still works
    • If it does, keep the reverted patch, otherwise undo it

The number of successful bug re-introductions would depend on the initial version we took in step one, so it might be worth trying out several and finding the one that gives the best end result. The best initial version would probably be around the time when a lot of bugs were reported.

The relevant patches could be found on GitHub by searching for "Credit to OSS-Fuzz" or "bugs.chromium.org/p/oss-fuzz/issues" or both. @inferno-chromium and @jonathanmetzman might have more ideas on how to match the oss-fuzz bugs with patches.
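A rough sketch of that revert-and-check loop, assuming a `build_and_reproduce` helper that builds the checked-out tree and replays one reproducer; the git commands are standard, but the overall flow is only an illustration of the steps listed above:

```python
import subprocess

def try_reintroduce(repo_dir, fixes, build_and_reproduce):
    """fixes: list of (fix_commit_sha, reproducer_path) pairs for old oss-fuzz bugs."""
    reintroduced = []
    for fix_sha, reproducer in fixes:
        # Attempt to revert the fix without committing.
        revert = subprocess.run(['git', '-C', repo_dir, 'revert', '--no-commit', fix_sha])
        if revert.returncode != 0:
            subprocess.run(['git', '-C', repo_dir, 'revert', '--abort'])
            continue
        if build_and_reproduce(repo_dir, reproducer):
            # Reverting the fix made the reproducer crash again: keep it.
            subprocess.run(['git', '-C', repo_dir, 'commit', '-m',
                            'Reintroduce bug fixed by ' + fix_sha])
            reintroduced.append(fix_sha)
        else:
            # Undo the revert and move on to the next bug.
            subprocess.run(['git', '-C', repo_dir, 'reset', '--hard', 'HEAD'])
    return reintroduced
```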

WDYT?

mboehme commented 4 years ago

What I think we could or should do instead is to create versions of oss-fuzz targets in which we can enable as many previously found bugs as possible and use them as fuzzbench benchmarks.

Exactly. For example, for a buggy OSS-Fuzz project, we could check out the version where it was just added to OSS-Fuzz. All bugs found by the competing fuzzers are "naturally occurring" rather than cherry-picked. Now, don't get me wrong: I think benchmarking against a known set of bugs that we care about is also important (Kudos @roachspray!). Nevertheless, there are a few potential biases that you need to control for.

  • Firstly, are real bugs found by one method representative? Just because libfuzzer was good at finding those bugs and all other fuzzers turn out not to be as good doesn't mean that the other fuzzers perform worse at bug finding in general. The other fuzzers might be better at finding other bugs much faster, but we wouldn't measure that.
  • Secondly, how do you mitigate experimenter bias, where your choice of (available) bugs influences the outcome of the experiment? How do you prevent the chosen bugs from (inadvertently) benefiting one fuzzer more than another?

  • Take some initial version of the code then repeat:
    • Find a relevant patch for an oss-fuzz bug
    • Try to revert it and see if the reproducer still works
    • If it does, keep the reverted patch, otherwise undo it

Is this to generate bug-introducing "patches" and apply them to trunk, or for de-duplicating (by determining which patch fixed the found bug)? If the former, a future trunk might break the extracted "patch", no? Another idea would be to check out a known buggy version and de-dup the crashers, e.g. by building a sequence of versions where the bugs are fixed (found via your method) and going through this sequence of fixed versions one by one to find where the reproducer stops crashing.
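A minimal sketch of that "sequence of fixed versions" idea, assuming prebuilt binaries for each fixing version and a `reproduces` helper that replays one crasher against one build:

```python
def dedup_by_fixing_version(crashers, fixed_builds, reproduces):
    """fixed_builds: list of (version_label, build_dir), oldest fix first.

    Each crasher is labeled by the first fixed build in which it no longer
    reproduces; crashers that stop reproducing at the same version are
    treated as the same bug.
    """
    buckets = {}
    for crasher in crashers:
        label = 'unfixed'  # still crashes in every known fixed build
        for version, build_dir in fixed_builds:
            if not reproduces(build_dir, crasher):
                label = version
                break
        buckets.setdefault(label, []).append(crasher)
    return buckets
```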

mboehme commented 4 years ago

BTW: For AFLGo in 2017, we modified OSS-Fuzz to accept a commit hash via the helper.py script and the base-builder's compile script.

jonathanmetzman commented 4 years ago

BTW: For AFLGo in 2017, we modified OSS-Fuzz to accept a commit hash via the helper.py script and the base-builder's compile script.

I didn't know this. We did our own implementation of this recently. I'm planning on using it; I just want to make sure it's stable enough before I switch. (There also isn't much of an incentive to switch until we start measuring crashes... a chicken-or-egg problem.)

jonathanmetzman commented 4 years ago

What I think we could or should do instead is to create versions of oss-fuzz targets in which we can enable as many previously found bugs as possible and use them as fuzzbench benchmarks.

Exactly. For example, for a buggy OSS-Fuzz project, we could check out the version where it was just added to OSS-Fuzz. All bugs found by the competing fuzzers are "naturally occurring" rather than cherry-picked. Now, don't get me wrong: I think benchmarking against a known set of bugs that we care about is also important (Kudos @roachspray!). Nevertheless, there are a few potential biases that you need to control for.

Interesting, I never thought of this and I think it's a very nice way to control bias.

  • Firstly, are real bugs found by one method representative? Just because libfuzzer was good at finding those bugs and all other fuzzers turn out not to be as good doesn't mean that the other fuzzers perform worse at bug finding in general. The other fuzzers might be better at finding other bugs much faster, but we wouldn't measure that.
  • Secondly, how do you mitigate experimenter bias, where your choice of (available) bugs influences the outcome of the experiment? How do you prevent the chosen bugs from (inadvertently) benefiting one fuzzer more than another?
  • Take some initial version of the code then repeat:

    • Find a relevant patch for an oss-fuzz bug
    • Try to revert it and see if the reproducer still works
    • If it does, keep the reverted patch, otherwise undo it

Is this to generate bug-introducing "patches" and apply them to trunk, or for de-duplicating (by determining which patch fixed the found bug)? If the former, a future trunk might break the extracted "patch", no? Another idea would be to check out a known buggy version and de-dup the crashers, e.g. by building a sequence of versions where the bugs are fixed (found via your method) and going through this sequence of fixed versions one by one to find where the reproducer stops crashing.

I think we are all talking about how to dedup, right? My favored method for deduping is ClusterFuzz's algorithm (similar to stack hashing), since it's relatively simple and has been shown to be good enough in the real world. But if there are soundness problems with this method, we could try some of the other approaches proposed. One approach I'm interested in is considering two bugs to be the same if they were introduced and fixed by the same commit. This should be relatively straightforward, since the code in OSS-Fuzz for building at a specific commit was written to support bisection of the commits that introduced and fixed a bug. Of course, there are problems with this approach as well (there's no reason why someone can't introduce two completely different bugs together and fix them at the same time; also, finding the culprit commit is probably not possible if the fuzz target didn't exist yet when the bug was introduced, since that commit was never fuzzed in OSS-Fuzz).

lszekeres commented 4 years ago

For example, for a buggy OSS-Fuzz project, we could check out the version where it was just added to OSS-Fuzz.

Yes, and I think we could do even better than that. Taking the oldest version might not be the best option, because some bugs might have been introduced later. The opposite is true for taking HEAD: the context of some bugs might already be gone. The ideal might be somewhere in the middle. To demonstrate this with a dummy example, say we have the following timeline:

  • v0: bug A introduced
  • v1: bug A found
  • v2: bug A fixed
  • v3: bug B introduced
  • v4: bug C introduced
  • v5: bug C found
  • v6: context of A removed
  • v7: bug B found
  • v8: bug B fixed
  • v9: context of B removed
  • v10: bug C fixed
  • v11: context of C removed

If we took v0, bugs B and C are not in the code yet, so we'd only have bug A. If we took, say, v10+, the context of A and B is already gone, so we could only enable C (or nothing). A better option would be taking v4, as it has bugs B and C. The best option, though, would be taking v4 and reverting "fix A". This way we'd have all three bugs, A, B, and C, in the code.
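To make the dummy example concrete, here is a small sketch (versions are plain integers v0..v11; the introduced/fixed/context-removed data is just the timeline above) that scores each version by how many bugs are live plus how many fixes could still be reverted:

```python
BUGS = {  # bug: (introduced, fixed, context_removed), from the timeline above
    'A': (0, 2, 6),
    'B': (3, 8, 9),
    'C': (4, 10, 11),
}

def score_version(v, bugs=BUGS):
    live = {b for b, (i, f, _) in bugs.items() if i <= v < f}
    revertible = {b for b, (i, f, r) in bugs.items() if f <= v < r}
    return live, revertible

best = max(range(12), key=lambda v: sum(len(s) for s in score_version(v)))
print(best, score_version(best))  # -> v4 with live {B, C} and revertible {A}
```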

Is this to generate bug-introducing "patches" and apply them to trunk, or for de-duplicating (by determining which patch fixed the found bug)?

Neither; the idea is to auto-magically find or construct the optimal version described above, or at least something close to it, via some automated search. The goal of this is to maximize the number of naturally occurring real-world bugs we can have in one piece of code, rather than cherry-picking a few. The more bugs there are to find in the benchmark, the more statistically meaningful the experiments will be.

Re. the potential biases:

Just because libfuzzer was good at finding those bugs and all other fuzzers turn out not to be as good doesn't mean that the other fuzzers perform worse at bug finding in general. The other fuzzers might be better at finding other bugs much faster, but we wouldn't measure that.

We'd measure the union of all bugs found by all fuzzers, not just the ones found by one of them. This is also why I think it's important to use code that has as many bugs in it as possible. With more bugs there to find, the experiment is not just more meaningful, but also less biased.

Secondly, how do you mitigate experimenter bias, where your choice of (available) bugs influences the outcome of the experiment? How do you prevent the chosen bugs from (inadvertently) benefiting one fuzzer more than another?

The benefit of using real bugs is that it's not the experimenter who defines/chooses them, in contrast with artificially injected bugs where the available bugs depend on the injection technique and the experimenter. In our case there is no "choice of available bugs", the bugs are just the bugs that the developers of the tested program introduced. There'd be no "chosen bugs", the set of bugs that we'd use in the denominator is the union of the bugs that all fuzzers found in the code. Wdyt?

mboehme commented 4 years ago

I think we are all talking about how to dedup, right? My favored method for deduping is ClusterFuzz's algorithm (similar to stack hashing), since it's relatively simple and has been shown to be good enough in the real world. But if there are soundness problems with this method, we could try some of the other approaches proposed. One approach I'm interested in is considering two bugs to be the same if they were introduced and fixed by the same commit. This should be relatively straightforward, since the code in OSS-Fuzz for building at a specific commit was written to support bisection of the commits that introduced and fixed a bug.

In my opinion, stack hashing is fine, but there is some research to suggest otherwise. Of course, knowing where in the commit history the reproducer starts and stops reproducing gives a much better picture if there are no concerns about computational resources.

Of course, there are problems with this approach as well (there's no reason why someone can't introduce two completely different bugs together and fix them at the same time; also, finding the culprit commit is probably not possible if the fuzz target didn't exist yet when the bug was introduced, since that commit was never fuzzed in OSS-Fuzz).

You could just take the cross-product of both approaches: if one of the approaches says two reproducers reveal different bugs, they are categorized as different bugs. Stack hashing is needed because there may simply not be a patch yet for an entirely new bug.
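A tiny sketch of that cross-product, assuming each crash record carries a stack-hash signature and, when known, the commit that fixes it (two crashers only count as the same bug if every available signal agrees):

```python
def dedup_cross_product(crashes):
    """crashes: iterable of dicts with 'stack_signature' and, optionally, 'fix_commit'."""
    groups = {}
    for crash in crashes:
        # Missing fix commits fall back to the stack signature alone.
        key = (crash['stack_signature'], crash.get('fix_commit'))
        groups.setdefault(key, []).append(crash)
    return groups  # one group per deduplicated bug
```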

If we took v0, bugs B and C are not in the code yet, so we'd only have bug A. If we took, say, v10+, the context of A and B is already gone, so we could only enable C (or nothing). A better option would be taking v4, as it has bugs B and C. The best option, though, would be taking v4 and reverting "fix A". This way we'd have all three bugs, A, B, and C, in the code.

Why not simply fuzz v0 and v4 and build v2, v8, and v10 to de-dup the bugs A, B, and C? If you mix and match, (i) you have a lot of manual work to identify v4 as a suitable version where all future and past bug-introducing/fixing patches apply, and (ii) you have all kinds of (artificial) interactions between bugs. Some bugs may be masked by other bugs. Some new bugs may be introduced.

In our case there is no "choice of available bugs", the bugs are just the bugs that the developers of the tested program introduced.

Exactly. We've been saying the same thing :) Maybe one additional comment: The proposed technique to inject real bugs that were patched in a previous version still biases the set of considered bugs to (i) bugs that were found by a previous method (predominantly libfuzzer), and (ii) bugs that were actually patched.

mboehme commented 4 years ago

Related: https://en.wikipedia.org/wiki/Survivorship_bias (bugs that were not found by a previous method may also be ones that we care about).

lszekeres commented 4 years ago

Why not simply fuzz v0 and v4 and build v2, v8, and v10 to de-dup the bugs A, B, and C?

Because of scalability: running different versions separately means that each version is essentially a separate benchmark. If our goal is to have a lot of bugs, this probably won't scale, because we can't run so many trials. Say we have 10 programs with 30+ bugs each. Using the naive approach of a separate version per bug, that'd mean 300+ benchmarks -- 30 times more trials to run compared to having all bugs in a single version. If we find versions with "batches of bugs" we can probably reduce this, but I'm afraid even 10x more trials would be difficult for us to afford.

Maybe one additional comment: The proposed technique to inject real bugs that were patched in a previous version still biases the set of considered bugs to (i) bugs that were found by a previous method (predominantly libfuzzer), and (ii) bugs that were actually patched. Related: https://en.wikipedia.org/wiki/Survivorship_bias (bugs that were not found by a previous method may also be ones that we care about).

These bugs are not only found by libFuzzer, but by a wide variety of techniques (e.g., by all the fuzzers in fuzzbench). I think the best way to reduce all kinds of biases is to use benchmarks with as many real world bugs in them as possible. The more buggy the program, the better and less biased the experiment is. Further, we'd normalize the results in the sense that we'd look at which bugs each fuzzer can find out of the union of the bugs found by all fuzzers. Bugs that were not found by any of the methods are bugs that we simply don't know about and cannot measure. Can you think of a less biased approach than this?

lszekeres commented 4 years ago

If you mix and match, (i) you have a lot of manual work to identify v4 as a suitable version where all future and past bug-introducing/fixing patches apply

Yes, this is true, and I'm hoping that the mix-and-match part won't be necessary. I think given the numbers you posted in https://github.com/google/fuzzbench/issues/165#issuecomment-616545391 we should be able to find sufficiently buggy versions of these projects, e.g. just by taking all crashing inputs from the bug reports and doing some search to see on which version we can reproduce most of them. (E.g. finding v4 in the dummy example.) In projects with a history of ~200 total bugs, there should be specific versions where 30-50 bugs are reproducible, and I think that'd already be a pretty great benchmark, wdyt?
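A rough sketch of that search, with `build_at` and `reproduces` as assumed helpers (e.g. thin wrappers around the OSS-Fuzz build scripts and the fuzz target binary):

```python
def most_buggy_revision(candidate_revisions, crash_inputs, build_at, reproduces):
    """Return the revision that reproduces the most crashing inputs from the bug reports."""
    best_revision, best_count = None, -1
    for revision in candidate_revisions:
        build_dir = build_at(revision)
        count = sum(1 for crash in crash_inputs if reproduces(build_dir, crash))
        if count > best_count:
            best_revision, best_count = revision, count
    return best_revision, best_count
```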

mboehme commented 4 years ago

Thanks Laszlo. This is a really interesting discussion!

Why not simply fuzz v0 and v4 and build v2, v8, and v10 to de-dup the bugs A, B, and C?

Because of scalability: running different versions separately means that each version is essentially a separate benchmark. If our goal is to have a lot of bugs, this probably won't scale, because we can't run so many trials.

You could have (trials x runners) for v0 and v4 and (1 x measurer) for v2, v8, and v10. You can skip the measurers entirely if you go for stack hashing.

Say we have 10 programs with 30+ bugs each. Using the naive approach of a separate version per bug, that'd mean 300+ benchmarks -- 30 times more trials to run compared to having all bugs in a single version. If we find versions with "batches of bugs" we can probably reduce this, but I'm afraid even 10x more trials would be difficult for us to afford.

My point is that there should be no a priori distinction of bugs (because of survivorship bias). There is not one version per bug; there are only programs (that, for economic/scalability reasons, we know are buggy).

Maybe one additional comment: The proposed technique to inject real bugs that were patched in a previous version still biases the set of considered bugs to (i) bugs that were found by a previous method (predominantly libfuzzer), and (ii) bugs that were actually patched. Related: https://en.wikipedia.org/wiki/Survivorship_bias (bugs that were not found by a previous method may also be ones that we care about).

These bugs are not only found by libFuzzer, but by a wide variety of techniques (e.g., by all the fuzzers in fuzzbench).

The problem in this design is that you only look where you expect to find a bug -- you do not look where you do not expect to find a bug. Concretely, you suggest to only measure the bug finding performance of future fuzzers using bugs found by previous fuzzers. This suggestion obviously ignores bugs that we care about that can only be found by a future fuzzer. Yet, this future fuzzer might be much worse at finding those bugs we designed the benchmark for.

I think the best way to reduce all kinds of biases is to use benchmarks with as many real world bugs in them as possible. The more buggy the program, the better and less biased the experiment is.

Yes, to remain economical we should use benchmarks with as many real world bugs in them as possible. Yet, I feel artificially re-introducing future or past bugs will make you (i) find bugs you are not looking for (introduce new bugs due to change interaction), (ii) not find bugs you are looking for (mask bugs due to change interaction), and (iii) not find bugs you are not looking for (because of survivorship bias).

Now, you can easily design a quick experiment to test (i)+(ii), but without knowing all future fuzzers, you can't test (iii).

mboehme commented 4 years ago

I think given the numbers you posted in #165 (comment) we should be able to find sufficiently buggy versions of these projects, e.g. just by taking all crashing inputs from the bug reports and doing some search to see on which version we can reproduce most of them. (E.g. finding v4 in the dummy example.) In projects with a history of ~200 total bugs, there should be specific versions where 30-50 bugs are reproducible, and I think that'd already be a pretty great benchmark, wdyt?

Yes, this sounds most reasonable. Is there an API for the bug tracker to extract the git hashes for patches and buggy versions?

jonathanmetzman commented 4 years ago

You could just take the cross-product of both approaches: if one of the approaches says two reproducers reveal different bugs, they are categorized as different bugs. Stack hashing is needed because there may simply not be a patch yet for an entirely new bug.

+1

If you mix and match, (i) you have a lot of manual work to identify v4 as a suitable version where all future and past bug-introducing/fixing patches apply

Yes, this is true, and I'm hoping that the mix-and-match part won't be necessary. I think given the numbers you posted in #165 (comment) we should be able to find sufficiently buggy versions of these projects, e.g. just by taking all crashing inputs from the bug reports and doing some search to see on which version we can reproduce most of them. (E.g. finding v4 in the dummy example.) In projects with a history of ~200 total bugs, there should be specific versions where 30-50 bugs are reproducible, and I think that'd already be a pretty great benchmark, wdyt?

Agreed, if we can find cases like this. I haven't dug into it, but my impression from seeing ClusterFuzz bug reports is that ClusterFuzz doesn't find 30-50 bugs in a single target at once. I could easily be wrong, though, and this may also be a limitation of ClusterFuzz (we quit on the first bug we see). I think Marcel's numbers are for projects, not for individual targets (i.e. benchmarks).

If we find versions with "batches of bugs" we can probably reduce this, but I'm afraid even 10x more trials would be difficult for us to afford.

I agree with this, but I'm also a bit hesitant about the reintroduction idea :-(

Yes, this sounds most reasonable. Is there an API for the bug tracker to extract the git hashes for patches and buggy versions?

We don't have a public API, but I have enough access that I can probably get this data, and with scraping it should be doable. From a bug report such as this one, the link https://oss-fuzz.com/revisions?job=libfuzzer_asan_augeas&range=201711140523:201711141555 gives the range of commits that introduced the bug. In some cases (like the one I just linked to) the range contains only one commit; in other cases (such as https://oss-fuzz.com/revisions?job=libfuzzer_asan_imagemagick&range=201804080436:201804090435) it's an actual range. vulners.com may have an API for this data, but I couldn't figure out how to use it the way we want, and it would cost money beyond 1000 requests.
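If scraping is the route, the regression range can at least be pulled straight out of those links; a small sketch, assuming the reports keep the URL format shown above:

```python
from urllib.parse import urlparse, parse_qs

def parse_regression_range(url):
    """Extract the job name and the build range from an oss-fuzz.com revisions link."""
    params = parse_qs(urlparse(url).query)
    start, end = params['range'][0].split(':')
    return params['job'][0], start, end

print(parse_regression_range(
    'https://oss-fuzz.com/revisions?job=libfuzzer_asan_augeas'
    '&range=201711140523:201711141555'))
# -> ('libfuzzer_asan_augeas', '201711140523', '201711141555')
```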

mboehme commented 4 years ago

This is the range of possible bug-introducing commits, right? Without the "ClusterFuzz-Verified" label, the bug probably still lives in trunk. If we could figure out which bugs are still alive for a given revision, we could choose the revision with the maximum number of bugs alive.

hazimeh commented 4 years ago

I'm happy to see this discussion taking place, and it's unfortunate that I only came across it today. I would like to point out that work in this direction has been in active development since February 2019, in the form of the Magma benchmark.

We had reached out to @lszekeres and @inferno-chromium for collaboration with or incorporation into FuzzBench. Magma has grown since, with more ground-truth bugs and targets, and more stable documentation and infrastructure.

The key aspects of Magma that relate to this discussion:

We are still interested in collaborating with the FuzzBench team, mainly because of the sheer amount of compute power their system has access to, allowing for much more frequent evaluations and reports and thus boosting the development of Magma and its impact on fuzzing research.

jonathanmetzman commented 4 years ago

I'm happy to see this discussion taking place, and it's unfortunate that I only came across it today. I would like to point out that work in this direction has been in active development since February 2019, in the form of the Magma benchmark.

Thank you for your interest. As we discussed above, we're working on adding a more comprehensive set of "buggy" benchmarks from OSS-Fuzz (example). We are interested in contributions of techniques and benchmarks, though. Let's start with integrating a Magma benchmark into FuzzBench (integration guide). From there we can figure out how we will do crash detection and surface this information in reports. Hopefully this can be generic enough to work for any approach to buggy benchmarks.

jonathanmetzman commented 1 year ago

I can close this issue, as we've had bug-based benchmarking for a while now. I am still interested in discussing the use of Magma's techniques, as I think it has some nice advantages over our approach. @hazimeh, is this still something you find interesting?