Capturing it here since Twitter is not the best place for research discussions.

From - Brandon Falk - @gamozolabs - Aug 8 (https://twitter.com/gamozolabs/status/1292324287324381184)

Let's talk a bit about fuzzer benchmarking. I'll start off with some data directly from fuzzbench, of aflfast fuzzing libjpeg-turbo. Here's what the 20 runs of aflfast look like. https://fuzzbench.com/reports/2020-08-03/index.html

According to fuzzbench, the green line here is a better performing fuzzer. There's a 20x variance in the time-to-same-coverage between the exact same fuzzer, in our n=20 sample set.

Even though this is 20 of the exact same fuzzer, against the exact same binary, we see there is catastrophic variance. Further, we're using things like means and standard deviations to find statistical significance in these fuzzers.

It is apparent from this graph that this data (mainly, the last data points) are not even remotely normally distributed. They are clustered around largely 2 different points, ~3400, and ~3800. Using means and standard deviations here is statistically invalid.

This is very common across fuzzing. A hard block is randomly preventing the fuzzer from progressing, once the path is randomly taken, coverage avalanches for some time, leading to a new temporary plateau.

The variance in these runs easily exceeds 5-10% just in per-run noise and we're using this data to determine when something is significant because it's 5% better. It's important to note, with this graph, all but maybe 2 of these 20 lines are even close to the mean value.

Further, since all of this data is from wall-clock-time, there is variance which is introduced from just timing differences in the fuzzers and what they feed back. Here's AFL on the same binary 10 times. We can see, just in wall-clock time alone they have a 20% spread.

This means, that from just the random noise of the fuzzer picking up "unlucky", maybe expensive, inputs early. The fuzzer performance is permanently affected through the lifetime of the fuzzer. Just from fuzzer noise and timing noise there's easily 20-30% variance between runs.

This means, a 24 hour run may be equal to a 19 hour run of the exact same fuzzer. This isn't even factoring in differences between different fuzzers. When you factor that in, the variance easily jumps above 2x.

Even though you only care about the wall clock time at the end of the day, these fuzzers are so inefficient that it doesn't matter. If AFL gets 20% more performance, it's extremely unlikely that performance isn't portable to other mutation strategies.

Ultimately, I want to know the algorithmic performance of a fuzzer (eg, coverage over cases, not time), and from here, I can figure out if the slow-but-better algorithms can simply be upgraded with better code.

We see this with honggfuzz where it's using malloc() during mutation passes where it is entirely avoidable. Most fuzzers do stuff like this, they're not optimized at all. Every fuzzer I know of splices inputs by copying the last part of the input further out.

Using basic data structures like interval trees you could splice data without having to copy the remainder. There are so many improvements to fuzzer performance that it should not be a factor of benchmarks.

Case in point, AFL-qemu massively underperforms every fuzzer according to fuzz bench. There is no data provided, even in the raw CSV, to control for the performance differences.

We shouldn't be using overall performance to determine statistical significance of mutator/feedback design. We're bringing non-athletes to the olympics and trying to figure out which country is better. If we find a better fuzzing strategy, the perf likely can be solved.

TL;DR: We're comparing fuzzers with 50%+ performance differences, 20%+ same-fuzzer variance, with different feedback mechanisms, often on cloud nodes where I/O is affected by neighbors, with two schedulers (OS+hypervisor), and we're trying to find 5% improvements in mutators...

... using normal distribution statistical significance on non-normal data. And we're ignoring the first 15 minutes of the fuzzer where 95% of the coverage is found.

For sure, I'm still gathering evidence to make a much stronger case just on... getting a good feeling of fuzzers, I guess? I seem to hit some brick walls when trying to convince others that 5% improvements are quite minor, as well as just general cleanliness of test environs.

I'm trying to make a bit more comprehensive list of examples. Demonstrating what a bespoke fuzzer performance looks like compared to generic mutations (often 2-5x for structured inputs), how to evaluate fuzzers in various dimensions (speed, complexity, scalability, etc)

I feel like I'm getting close to being comfortable demonstrating many points that I've taken for granted as intuition for the past ~6 years of working with fuzzers.

@gamozolabs. We value your opinion and research in this area and using this issue as a meta bug for feature requests and suggestions.

This is a tough one. I have a few things that I think are critical, which I'll talk about at the end of this message.

First of all, I'd like to say I'm pretty strongly biased against benchmarking fuzzers in nearly any capacity. But, I'm more focused on deep bugs on hard targets, and I'm probably not fair against the wide and shallow approaches of generic fuzzers and many-target/CI-style fuzzing projects. It's just not a problem I run into, and thus, I'm probably not accurately valuing it.

I, first, have some major problems with single-result/ranking-style comparisons of nearly anything. They lead to a gamification of a problem which largely leads to people optimizing against a benchmark, not reality. This can lead to people overfitting to problems, or making tooling which is so focused on a single use case (eg. Linux-userland, single-file parsing, open-source, compile-able with modifications, seems to be such a focus of fuzzing and research, when it's ultimately one of the easiest to solve problems in fuzzing). When I've worked with organizations who've run competitions as a way to evaluate different companies and schools solutions to problems, there's often a breaking point where most people clearly start exploiting the game, and start to discard reason in favor of better funding/success. I'm highly skeptical that all of these afl-foobar modifications being made, are not largely being done against tools like fuzzbench directly, rather than being reasoned about.

Second, I think there are some pretty huge risks with comparing these fuzzers on a research level. At a research level we should be learning what these tools are doing at a much more microscopic scale than we do. Not only splitting up the feedback, compiler mods, mutators, corpus selection, queue weighting, etc as different variables to be studied in isolation... but also exploring how these perform in various situations, like early stages of fuzzing, late stages of fuzzing, state-machines, loops, etc.

The chase for coverage is really dangerous and I think has already resulted in overfitting to finding new coverage. Based on nearly every value that papers indicate, and what tools like fuzzbench indicate, a fuzzer which finds 5% more unique blocks, but explores 1/2 the state (eg. common code paths with different states), is a "much better fuzzer". This chase for just coverage seems to pretty quickly be unrealistic for anything but basic parsers.

Further, the results of fuzzbench, to me, all look nearly the same for almost every fuzzer. If they're within about 20% coverage, I typically don't really care much (which is typically 75% of all competing fuzzers). A generic mutator is pretty much always a last-ditch effort and will massively under-perform a targeted fuzzer. This is an issue I bring up a lot, and ultimately if we'd been spending our time on making a wireshark-scope coverage of different fuzzers for different protocols and formats I think we'd be in a much better place than having the extremely minor improvements we've seen to generic mutation strategies over the past decade. While there's always new code being made, it's unlikely it's doing something completely unique where there aren't aspects that could be borrowed from an existing JSON/HTML/TLV/closest related fuzzer, which would wildly outperform a basic splicer/bit flipper. At the end of the day, almost all mutation strategies we have distill down into some simple corrupt, insert, splice, repeat strategies. We're really just playing with the weights on these, which are often target specific.

Nevertheless, most of the strategies that are being brought to the table are not really mutation strategies, but input prioritization (queuing weights for example in AFL), and compiler modifications to get things like coverage, loop depth, and compare feedback. These sorts of strategies are amazing! But, they're really not specific at all to a fuzzer. To be honest, almost nothing is. The mutators, the feedback mechanisms, the compiler modifications, etc, can pretty much be ported to any environment.

This leads to situations where someone adds a mutation strategy to AFL which happens to play nice with the way AFL does feedback, which I'd argue, is often just by coincidence. In the current way we research, it's so hard to tell if the reason a mutation change is better is due to something more than just the mutation. We can theorize many situations where this can happen. For example, input splicing passes for fuzzers (often the most valuable), are almost directly related to feedback mechanisms and input saving. None of the benchmarking tools we have allow us to get introspection at a level where we can make decisions based on this, and instead, it's up to the researcher having the foresight to hopefully test this. But in a lot of situations, often due to the incentives of academia and just project adoption in general, these analysis are just ignored as the tool performed better in fuzzbench, and that's our standard. As someone who does it myself, sometimes you just get a result that is better, and you're excited to share the result or use it as a flex to make your tooling look better.

Some of the things that make me seriously questioning of large amounts of fuzzing research, are just basic things that I would expect would have been done by now. For example, using a log scale X axis when plotting coverage over time. Coverage over time is fundamentally exponentially harder to get, and thus displaying with linear axes is so lossy. I know it's stupid, but lets take a look.

On the left is a coverage over fuzz cases, on a linear scale. And on the right is the exact same data but with a log scale x axis. While this is ultimately just a way to visualize data, and everyone has their preferences, I seriously question how nearly every single coverage graph in benchmarking tools and papers is this same graph with a vertical line at t=0, and a flat line for the rest of the graph. Basic data visualization things like this make me question how much people are actually trying to analyze the data, versus just looking at which line is higher at t=final. There is SO much going on in fuzzers in the first few milliseconds-seconds of fuzzing, and almost every paper and tool I see completely discards these things. Just things like corruption amounts are so start/end polarized when it comes to fuzzing. Higher corruption, faster early coverage, but later on it ends up hurting you due to corrupting required bytes in the input. It's these sorts of takeaways I never really seen from these single-variable "which fuzzer had bigger number at the end of an arbitrary deadline" tools.

The constant use of means and standard deviations to prove that things are better or worse is painful as well, as most fuzzers have the exact same "shapes". They block on some branch which has yet to be randomly brute forced, some find it earlier, some later, but when they do, coverage avalanches and you see a huge spike in the coverage graph. By doing nearly all fuzzbench results in a time domain, with often similar time deadlines (24 hours), you end up biasing towards fuzzers which try to go wide fast... but maybe over time these start to fall behind as they corrupt too much and struggle to build upon earlier inputs.

These minor variables has such a strong affect on basic fuzzer behavior, and they can easily be seen in the first few minutes or even seconds of fuzzing. I'm often running a fuzzer over and over and over for about 1-2 minutes, comparing results, and trying to make conclusions. But, I'm not racing to make a fuzzer that wins in benchmarks, or publish a paper. I'm genuinely trying to find which tweaks cause which behaviors, and what that means for different targets. Without knowing what it is about a fuzzer that makes it better on libpng versus libjson, we're really just taking massive leaps where there are hundreds of variables at play. I'd like to see some major changes in the way we evaluate and quantify these results such that we can make predictions about which fuzzers will perform in which ways against certain targets. Lots of this is just about knowing which strategies (mutations, feedback, anything) have which effects on breadth, depth, loops, state machines, highly-structured inputs (binary formats/text formats/CRCs/etc).

Nevertheless, if the goal of fuzzbench is to just to find "which fuzzer you should use if you don't have time to decide, and it isn't meant for serious evaluation of performance of fuzzers", I'd strongly suggest that it is advertised as such and it no longer is used as a primary means in academic papers to prove effectiveness.

I recognize there's a value in someone who doesn't have the time to really be analytical about fuzzers to make decisions about what to fuzz, in which case, sure, fuzz bench provides some value. But I don't really think people in industry are making those decisions. Eg. the person using stock AFL is probably not checking latest papers and fuzzbench for what to apply. People using fuzzbench, are likely not people who are just spinning up fuzzers aimlessly on targets.

I really don't see myself working on fuzzbench to help, and I apologize for that, as I'm quite critical of these tools. But ultimately, I think fuzzing in general, including benchmarking, has departed far from some of the foundations I've relied on that I'd rather see some fresh starts than transforming lawnmowers into F1 cars piece-by-piece.

Personally, I think we need debuggers for fuzzers, not rankings. We need tooling that gives us introspection and allows us to make changes on the fly, even as a very entry level security researcher. The fact that from the beginner level, to the advanced level of fuzzing, pretty much everything is just "give an input folder, output folder, and then maybe look at a coverage graph", is kinda sad. Imagine joining a company that has zero capability to debug, profile, analyze, or tune the codebase as you're working on it. While I applaud the simplicity of a lot of these fuzzers, eventually, giving users some control and tooling to make decisions, we'd get much better results than trying to find a one-fuzzer-fits-all with dynamically-tuned-no-human-interaction tool.

Here's the things I'd like to see to really even start to get some value out of fuzzbench from the way I evaluate fuzzing:

Logscale x axis should be the standard for these graphs. Make it a slider button or side/by side if you want
The uses of means and standard deviations shouldn't be used when ultimately we're just comparing multiple clusters on graphs and which fuzzers hit them first.
From personal experience, there is much more value in comparing which fuzzers got identical coverage in a certain time. Personally, I never look at which fuzzers got more coverage, because I'd rather have a fuzzer that gets identical coverage 10x faster, than 5% more in some arbitrary deadline. Using 2 fuzzers which find 20 unique blocks each is a lot easier than trying to make one fuzzer that finds 40 unique blocks. I'm more interested in improvements in fuzzer density, not overall coverage in an arbitrary deadline. Ultimately, if your fuzzer saves 1000 inputs from coverage feedback, but has performed 10,000,000 fuzz cases, there's a theoretical 10,000x performance increase that is possible. Is it doable reasonably? No. But do I really care about 5% improvements to total coverage when there's 99.99% of fuzz cases being completely discarded with no new information learned?
The iteration count/fuzz case information should be the primary stat for graphing fuzzers, and the timeouts for fuzz runs should be in iterations and not time (with some reasonable timeout for things that are really just painfully slow), the way we fuzz has so many inefficiencies in performance that we should just comfortably assume that most minor improvements to fuzzers can be A. generically ported to almost any other fuzzer, and B. can be optimized to be the same perf as most other strategies. For complex strategies like concolic/taint tracking/etc, it's really a matter of if the performance hits can be mitigated with the algorithmic improvements to fuzzing. If we're only looking at wall clock time, we don't even see when something may be doing something good, because early prototypes are often inefficient in wall-clock time and in something like fuzzbench would immediately find their way to the bottom of the bucket.
Coverage really needs to not be so emphasized. There aren't really any good solutions to this, as bugs are just as easily (if not more) overfit to.
The data resolution needs to be significantly higher. I want to see stats every few seconds, ideally less (eg, generate CSV row on every coverage increase, regardless of timing). Looking at fuzzbench results, in many situations 90% of coverage is found in the first 15 minutes, and we don't even see any of that data as we see 15 minute samples.

I think it's a bit sad to look around the industry, and other fields in general, and you see hundreds of thousands of lines of code in specialized-and-broad tools like Wireshark, static analysis tools for almost any language, profilers, debuggers, optimizers, multiple real-time network attack visualizations, etc. Many of these tools are designed to provide humans with real time analysis to make decisions on a daily basis of how to react and change they way they code, fix bugs, and respond to situations.

Yet... in fuzzing, we have tools which take input folders, output folders, and provide no other introspection except for an 80x25 stat screen that maybe tells you your fuzzer is slow or that it has found some new coverage. Anything else you want to do? Gonna have to start modifying internals of traditionally quite hacky codebases. Maybe a power user decides to use AFL++ instead of bone-stock AFL, and maybe tries out a power schedule, but they have no reason to do so off of any data which is specific to their target, instead they just make decisions off of completely unrelated targets they maybe saw in a paper or fuzzbench.

Sorry for being critical, but I'd really like to see fuzzing advance beyond this dumb thing that you do without any human interaction. I've been tweaking fuzzers on a daily basis feedback loop for 8+ years now, and it pains me that I've almost never read a paper that really matches many of my experiences in fuzzing, even at a basic level like visualizing coverage, which I'd expect would just be natural after about ~2 minutes of looking at these near-worthless linear-scale graphs which are just flat lines. It really makes me question if people are actually trying to figure out how these things work.

Thanks @gamozolabs for looking at FuzzBench and for the feedback! This is very useful! And thanks @inferno-chromium for extracting it into an issue.

I think there might be some misunderstandings regarding the report and the statistical tests:

According to fuzzbench, the green line here is a better performing fuzzer. There's a 20x variance in the time-to-same-coverage between the exact same fuzzer, in our n=20 sample set.

According to FuzzBench a fuzzer should only be considered "better" if there's statistical significance. The reports provide plots to determine the statistical significance for each benchmark (example) and across benchmarks as well (see Critical difference diagram at the top of the report). For example in this one, the diagram shows that there’s no significant difference between afl++, hongfuzz, aflsmart, afl, mopt, entropic. But e.g., afl++optimal is significantly different from libfuzzer, according to the used test. You have a very good point that this is not emphasized enough in the report. We’ll try to make this more explicit.

Even though this is 20 of the exact same fuzzer, against the exact same binary, we see there is catastrophic variance. Further, we're using things like means and standard deviations to find statistical significance in these fuzzers.

The report doesn't use means and standard deviation to find statistical significance. The sample statistics table lists these standard stats (along median and other percentiles) to give a numerical description of the data. The statistical significance tests and plots however use the nonparametric Mann–Whitney U test on individual benchmarks, and Friedman/Nemenyi across benchmarks. We use these because - as you’ve mentioned the data isn’t normal distribution - and these tests don’t assume normal distribution. See more details in the doc: https://google.github.io/fuzzbench/reference/report/

Ultimately, I want to know the algorithmic performance of a fuzzer (eg, coverage over cases, not time), and from here, I can figure out if the slow-but-better algorithms can simply be upgraded with better code.

That's a good point! You mean coverage / test case execution, right? We’re planning to collect and include the execution number data as well. However, ultimately what matters in practice most often is coverage over time: for a symbolic execution engine one execution might take minutes, but it might find new coverage very often. For a coverage guided fuzzer, an execution might take just microseconds, but only very few executions find new coverage.

TL;DR: We're comparing fuzzers with 50%+ performance differences, 20%+ same-fuzzer variance, with different feedback mechanisms, often on cloud nodes where I/O is affected by neighbors, with two schedulers (OS+hypervisor), and we're trying to find 5% improvements in mutators... ... using normal distribution statistical significance on non-normal data. And we're ignoring the first 15 minutes of the fuzzer where 95% of the coverage is found.

These are valid concerns. Re. cloud node, so far we’ve found that our experiments are reproducible, meaning that running on cloud doesn’t affect the statistical results. And as mentioned above the stat significance tests don’t rely on normal distribution. But again, this is very useful feedback, and we'd love to hear more of your ideas on how we could improve FuzzBench and fuzzer benchmarking.

Oooh. I saw the Mann-Whitney U tests and was pretty impressed. It's actually really neat to see the clustering between the related AFL modes, and I apologize for bringing up means and standard deviations so much. What are the error bars which are displayed on the bar graphs, as well as the confidence intervals on the coverage-over-time graphs? I assumed those were stddevs. The U tests are nice, and yeah, I wish they were a bit more at the forefront!

First of all, I'd like to say I'm pretty strongly biased against benchmarking fuzzers in nearly any capacity. But, I'm more focused on deep bugs on hard targets, and I'm probably not fair against the wide and shallow approaches of generic fuzzers and many-target/CI-style fuzzing projects. It's just not a problem I run into, and thus, I'm probably not accurately valuing it.

This is part 2 of FuzzBench that is still in development (https://github.com/google/fuzzbench/issues/165). We want to have crash based benchmarking of hard to find bugs. We have historical data from OSS-Fuzz on real-world targets and bugs that took several months to hit and are still a mystery why. We are taking appropriate time to develop this feature accurately.

First of all, I'd like to say I'm pretty strongly biased against benchmarking fuzzers in nearly any capacity. But, I'm more focused on deep bugs on hard targets, and I'm probably not fair against the wide and shallow approaches of generic fuzzers and many-target/CI-style fuzzing projects. It's just not a problem I run into, and thus, I'm probably not accurately valuing it.

This is part 2 of FuzzBench that is still in development (#165). We want to have crash based benchmarking of hard to find bugs. We have historical data from OSS-Fuzz on real-world targets and bugs that took several months to hit and are still a mystery why. We are taking appropriate time to develop this feature accurately.

This sounds awesome! My concern here is largely that in hindsight a lot of these bugs are much easier to find, and thus, easier to overfit for. But, that's also kind-of what I'm arguing for. Making tools which help users identify what blocks/bugs were hard to hit and try to help guide them to make them hit faster. Personally, on new targets I get excited when I hit NULL derefs. Not because they're interesting bugs, but because it's the first tangible oracle I can start working towards finding sooner. I believe that almost any fuzzer can be compacted in to at least 1% of the time it took to run initially (eg, a 24 hour fuzz run can be analyzed and redone from scratch in 15 minutes when the fuzzer has been improved), and it's a matter of introspection and tinkering to find a way to make the fuzzer find those things faster. But often, the things that are needed are too target specific.

For example, when I found a DHCP stack overflow in Windows, it required a massive repetition of a sequence. A repetition way too large for me to reasonably have a generic fuzzer do, but I could special case that and start tuning my initially generic mutator, into a TLV-aware, edge-case aware mutator where I gave it that sweet sweet 0.1% chance of repeating a sequence way more than I ever would.

These sorts of things add up, and would allow for overfitting to even a pretty wide corpus, and unfortunately I don't think they'd really have general applicability when the bugs are being found and tuned in hindsight.

Edit: TBH, I think when it comes to benchmarking, coverage often is a bit harder to fudge than crashes.

Besides the obvious point the almost every respectable paper has been reporting U-Test and percentiles instead of p-test and std for a long time now, I'd also disagree about the "density" metric. Have a look at VUzzer if you need to see why. Optimizing for density leads to shity engineering that can't possibly scale. Optimizing for density is really much more useful to compare your own fuzzer against an older version of your own fuzzer. Lastly: Log plots. I don't like log plots. They blow all the noise completely out of proportion. A linear plot tells you very nicely where each fuzzer flatlines. We do want to see if the fuzzer was run long enough, and that we don't expect it to get better any time soon. I do strongly agree about state coverage. I know of a few projects where grammar based fuzzers failed to outperform AFL in terms of coverage, while obviously yielding more bugs. I think a more precise approach would be needed to identify such improvements.

Besides the obvious point the almost every respectable paper has been reporting U-Test and percentiles instead of p-test and std for a long time now, I'd also disagree about the "density" metric. Have a look at VUzzer if you need to see why. Optimizing for density leads to shity engineering that can't possibly scale. Optimizing for density is really much more useful to compare your own fuzzer against an older version of your own fuzzer. Lastly: Log plots. I don't like log plots. They blow all the noise completely out of proportion. A linear plot tells you very nicely where each fuzzer flatlines. We do want to see if the fuzzer was run long enough, and that we don't expect it to get better any time soon. I do strongly agree about state coverage. I know of a few projects where grammar based fuzzers failed to outperform AFL in terms of coverage, while obviously yielding more bugs. I think a more precise approach would be needed to identify such improvements.

I yield the U-test, I feel like I keep seeing papers with error bars and confidence intervals, but perhaps I've been misinterpreting them.

Could you elaborate on where optimizing for density can't possibly scale? Are we talking performance wise, or just human workload wise? I'm not arguing to make it generic, rather, make tools that allow people to optimize their tools to their target. The genericness would come from the development/tuning experience, not from the fuzzer itself.

As for log plots, I'm unsure where you're getting noise from, I feel like the noise doesn't really seem to be that significant, infact, I feel it decreases it when a 10x variance doesn't span the whole graph.

Oooh. I saw the Mann-Whitney U tests and was pretty impressed. It's actually really neat to see the clustering between the related AFL modes, and I apologize for bringing up means and standard deviations so much. What are the error bars which are displayed on the bar graphs, as well as the confidence intervals on the coverage-over-time graphs? I assumed those were stddevs.

Yes, the bar plots show median, while the coverage plots show mean, with error bands showing 95% confidence interval in both cases. The purpose of these plots - together with the violin plot of the reached coverage - is to give an easy to read visualization of the data and its distribution. Once again, these are independent from the statistical significance tests.

The U tests are nice, and yeah, I wish they were a bit more at the forefront!

Good point! Yes, originally the critical difference and the u test plots were a bit more at the forefront, but then we decided to put them under "collapsibles", as we've found that most practical users often want to see just a simple ranking first, and the stat significance only as a secondary step. We've had some discussions around this before. We can rethink this of course, please consider filing a separate issue for this, if you think it'd make sense.

There is absolutely nothing wrong with density. But for academic papers, you have to look at this from the point of view of a reviewer. If you optimize for density, papers such as VUzzer will get in that propose techniques that will be VERY difficult to implement in such a way that they beat any fuzzer on wall clock time. Density is nice if you don't have any problems with misaligned incentives. Unfortunately in publishing we don't have that luxury.

I'm all in favor of tools that help people to adapt fuzzers to their targets (Nautilus and IJON both fall into that category - and more to come). However this is also EXTREMELY difficult to measure - probably the main reason it's not done a lot in academia. Fuzzing eval is enough work as is - doing large scale user studies is pretty much out of reach. Where do you even find 20 people that have the right background to use a fuzzer effectively? How are you going to build solid experiments around this?

I'm all in favor of tools that help people to adapt fuzzers to their targets (Nautilus and IJON both fall into that category - and more to come). However this is also EXTREMELY difficult to measure - probably the main reason it's not done a lot in academia. Fuzzing eval is enough work as is - doing large scale user studies is pretty much out of reach. Where do you even find 20 people that have the right background to use a fuzzer effectively? How are you going to build solid experiments around this?

Huh, interesting thing that I never really think about. I just assume that given there's nothing there already, adding something would only be an improvement. I can't say I'm too concerned with measuring increased user effectiveness, when adoption and seeing people use a tool and blog about is good enough for me. In fact, I think going through what people have blogged about is a great way to see how tools are helping, hurting, and spreading. I recognize there's probably no academic merit to that approach, but maybe it's just not a problem for academia I guess :\

Here's the things I'd like to see to really even start to get some value out of fuzzbench from the way I evaluate fuzzing:

Thanks so much for these details suggestions, these are very valuable!

I, first, have some major problems with single-result/ranking-style comparisons of nearly anything. They lead to a gamification of a problem which largely leads to people optimizing against a benchmark, not reality.

Yes, we discuss this in the FAQ https://google.github.io/fuzzbench/faq/#how-can-you-prevent-researchers-from-optimizing-their-tools-only-for-these-benchmarks. Please let us know what you think!

Logscale x axis should be the standard for these graphs. Make it a slider button or side/by side if you want

Yes, currently log scale can only be enabled with a flag, but having both in the report would make a lot sense, e.g. where users can switch between the two with a click. We added https://github.com/google/fuzzbench/issues/655 to track this. I'd personally be fine showing the log scale the default, but would be nice to hear opinions from others as well.

The uses of means and standard deviations shouldn't be used when ultimately we're just comparing multiple clusters on graphs and which fuzzers hit them first.

I think this has been addressed above, please let us know if we're still missing something.

From personal experience, there is much more value in comparing which fuzzers got identical coverage in a certain time.

Yes, we can show the data that we measure that way as well: instead of picking a time and look at the coverages of fuzzers there, we can pick a coverage and look at the times there. We track adding this report type in https://github.com/google/fuzzbench/issues/327. One question was how to pick the coverage value. Please consider adding more suggestions/ideas there.

The iteration count/fuzz case information should be the primary stat for graphing fuzzers, and the timeouts for fuzz runs should be in iterations and not time

Yes, as mentioned above we are working on including iteration data as well. I'm not sure though about making it the default (eg. see the symbolic execution / coverage guided fuzzing example in https://github.com/google/fuzzbench/issues/654#issuecomment-671524386). If you still feel iteration should be the default, please consider filing an issue for that.

Coverage really needs to not be so emphasized. There aren't really any good solutions to this, as bugs are just as easily (if not more) overfit to.

Yes, as mentioned before, we are working on adding bugs in addition to coverage. I think coverage is still important because relying only on bugs can be problematic due to their sparsity: https://google.github.io/fuzzbench/faq/#why-are-you-measuring-coverage-isnt-the-point-of-fuzzing-to-find-bugs

The data resolution needs to be significantly higher. I want to see stats every few seconds, ideally less (eg, generate CSV row on every coverage increase, regardless of timing). Looking at fuzzbench results, in many situations 90% of coverage is found in the first 15 minutes, and we don't even see any of that data as we see 15 minute samples.

Yes, we've been kicking around this idea for a while to increase snapshot density in the beginning of trials. Added https://github.com/google/fuzzbench/issues/656 to track.

Sorry for being critical, but I'd really like to see fuzzing advance beyond this dumb thing that you do without any human interaction.

No worries at all, we really appreciate your input! Our goal is the same: to advance fuzzing, and these discussions help a lot in that!

Hello again Today!

So, I'd like to address a few things that I've thought of a bit more over time and want to emphasize.

Visualizations, and what I'm often looking for in data

When it comes to visualizations, I don't really mind much which graphs are displayed by default, linear vs logscale, time-based or per-case-based, but they should both be toggleable in the default. I'm not web dev, but having an interactive graph would be pretty nice, allowing for turning on and off of certain lines, zooming in and out, and changing scales/axes. But, I think we're in agreement here. I personally believe that logscale should be default, and I don't see how it's anything but better, unless you only care about seeing where things "flatten out". But in that case, it's just as visible in logscale, you just have to be logscale aware.

Here's an example of typically what I graph when I'm running and tuning a fuzzer. I'm using doing side-by-side comparisons of small fuzzer tweaks, to my prior best runs, and plotting both on a time domain and a fuzz case domain. I've included the linear-scale plots just for comparison with the way we currently do things, but I personally never use linear scale as I just find it to be worse in all aspects.

By using a linear scale, we're unable to see anything about what happens in the fuzzer in the first ~20 min or so. We just see a vertical line. In the log scale we see a lot more which is happening. This graph is comparing a fuzzer which does rand() % 20 rounds of mutation (medium corruption), versus rand % 5 rounds of the same corruption (low corruption). We can see that early on medium corruption has much better properties, as it explores more aggressively. But there's actually a point where they cross, and this is likely the point where the corruption becomes too great on average in the medium corruption, and ends up "ruining" previously good inputs, dramatically reducing the frequency we see good cases. It's important to note, that since the medium corruption is a superset of low corruption (eg, there's a chance to do low corruption), both graphs would eventually converge to the exact same value.

There's just so much information in this graph that stands out to me. I see that something about medium corruption performs well in the first ~100 seconds. There's a really good lead at the early few seconds, and it tapers off throughout. This gives me feedback on maybe when and where I should use this level of corruption.

Further, since I have both a fuzz case graph and a time graph, I can see that medium corruption early on actually has better performance than low corruption. Once again, this makes sense, the more corruption, the more likely you are to make a more invalid input which is parsed more shallow. But from the coverage over case, I see that this isn't a long term thing and eventually the performance seems to converge between the two. It's important to note, the intersection point of the two lines varies by quite a bit in both the case domain and the time domain. This tells me that even though I just changed the mutator, it has affected the performance, likely due to the depth of the average input in the corpus, really neat!

Example analysis conclusion

I see that medium corruption in this case is giving me about 10x speedup in time-to-same-coverage, and also some performance benefits early on. I should adopt a dynamic corruption model which tunes this corruption amount maybe against time, or ideally, some other metric I could extract from the target or stats. I see that long-term, the low corruption starts to win, and for something that I'd run for a week, I'd much rather run the low corruption.

Even though this program is very simple, these graphs could pretty arbitrary be stretched out to different time axis. If fuzzbench picks a deadline, for example, 1000 seconds, we would never know this about the fuzzer performance. I think this is likely what many fuzzers are now being tuned to, as the benchmarks often are 12/24/72 hour increments. Fuzzers often get some extra blips even deeper in the runs, and it's really hard to estimate if these crosses would ever occur.

The case for cases

I personally extract most information from graphs which are plotted against number of fuzz cases rather than time. By doing benchmarks in a time domain, you factor in the performance of the fuzzer. This is this ground truth, and what really matters at the end of the day with complete products. But it's not the ground truth for fuzzers in development. For example, if I wanted to prototype a new mutation strategy for AFL, I would be forced to do it in C, avoid inefficient copies, avoid mallocs, etc. I effectively have to make sure my mutator is at-or-better than existing AFL mutator performance to use benchmarks like this.

When you do development on fuzz cases, you can start to inspect the efficiency of the fuzzer in terms of quality of cases produced. I could prototype a mutator in python for all I care, and see if it performs better than the stock AFL mutators. This allows me to cut corners and spend 1 day trying out a mutator, rather than 1 month making a mutator and then doing complex optimizations to make it work. During early stages of development, I would expect a developer to understand the ramifications of making it faster, and to have a ballpark idea of where it could be of the O(n^3) logic was turned into O(log n), and whether it's possible.

Often times, the first pass of an attempt is going to be crude, and for no reason other than laziness (and not in a negative way)! There's a time and a place to polish and optimize a technique, and it's important that there can be information learned from very preliminary results. Most performance in standard feedback mechanisms and mutation strategies can be solved with a little bit of engineering, and most developers should be able to gauge the best-case big-O for their strategy, even if that's not the algorithmic complexity of their initial implementation.

Yep, looking at coverage over cases adds nuance, but I think we can handle it. Given most fuzzing tools, especially initial passes, are already so un-optimized, I'm really not worried about any performance differences in AFL/libfuzzer/etc when it comes to single-core performance.

Scaling

Scaling of performance is really missing from fuzzbench. At every company I've worked at, big and small, even for the most simple targets we're fuzzing we're running at least ~50-100 cores. I presume (I don't know for sure) that fuzzbench is comparing single core performance. That's great, it's a useful stat and one I often control for, as single-core, coverage/case is often controlled for both scaling and performance, leading to great introspection into the raw logic of the fuzzer.

However, in reality, the scaling of these tools is critical for actual use. If AFL is 20% faster single-core, that'll likely make it show up at the top of fuzzbench, given relative parity of mutation strategies. That's great, the performance takes engineering effort and should not be undervalued. In fact, most of my research is focused around making fuzzers fast, I've got multiple fuzzers that can handle 10s of billions of fuzz cases per second on a single machine. It's a lot of work to make these tools scale, much more so than single-core performance, which is often algorithmic fixes.

If AFL is 20% faster single-core, but bottlenecks on fork(), or write(), and thus only scales to 20-30 cores (often where I see AFL really fall apart, on medium size targets, 5-10 cores for small targets). But something like libfuzzer manages things in memory and can scale linearly with as many cores as you throw it, libfuzzer is going to blow away any 20% performance gains seen single-core.

This information is very hard to benchmark. Well, not hard, but costly. Effectively, I'd like to see benchmarks of fuzzers scaled to ~16 cores on a single server, and ~128 cores distributed across at least 4 servers. This benchmarks. A. the possibility the fuzzer can scale in the first place, if it can't that's a big hit to real-world usability. B. the possibility it can scale across servers (often, over the network). Things like AFL-over-SMB would have brutal scaling properties here. C. the scalability properties between cores on the same server, and how they transfer over the network.

I find it very unlikely that these fuzzers being benchmarked even remotely have similar scaling properties. AFL struggles to scale even on a single server, even in persistent mode, due to the heavy use of syscalls and blocking IPC every fuzz case (signal(), read(), write(), per case IIRC, ~3-4 syscalls).

Scaling also puts a lot of pressure on infeasible fuzzing strategies proposed in papers. We've all seen them, the high-introspection techniques which extract memory, register, taint state from a small program and promise great results. I don't disagree with the results, the more information you extract, pretty much directly correlates to an increase in coverage/case. But, eventually the data load gets very hard to share between cores, queue between servers, and even just process.

Measuring symbolic

Measuring symbolic was brought up a few times, as it would definitely have a much better coverage/case than a traditional fuzzer. But this nuance can easily be handled by looking at both coverage/case and coverage/time graphs. Learning what works well algorithmicly should drive our engineering efforts to solve problems. While symbolic may have huge performance issues, it's very likely, that many of the parts of it (eg. taint tracking) can be approximated with lossy algorithms and data capturing, and it's more about learning where it has strengths and weaknesses. Many of the analyses I've done on symbolic largely lead me to vectorized emulation, which allows for highly-compressed, approximated taint tracking, while still getting near-native (or even better) execution speeds.

The case against monolithic fuzzers

Learning what works is important to figure out where to invest our engineering time. Given the quality of code in fuzzing right now (often very poor), there's a lot of things that I'd hate to see us rule out because our current methodologies do not support them. I really care about my reset times of fuzz cases, (often: the fork() costs), as well as determinism. In a fully deterministic environment, with fast resets, a lot of approximate strategies can be used. Trying to approximate where bytes came from in an input, flipping the bytes because you have a branch target which is interesting, and then smashing those bytes in can give good information about the relation of those bytes to the input. Hell, with fast resets and forking, you can do partial fuzzing where you fork() and snapshot multiple times during a fuzz case, and you can progressively fuzz "from" different points in the parser. This works especially well for protocols where you can snapshot at each packet boundary.

These sorts of techniques and analyses don't really work when we have monolithic fuzzers. The performance of existing fuzzers is often quite poor (AFL fork(), etc), or does not support partial execution (persistent modes, libfuzzer, etc). This leads to us not being able to even research these techniques. As we keep bolting things onto existing fuzzers and treating them like big blobs, we'll get further and further from being able to learn the isolated properties of fuzzers and find the best places to apply certain strategies.

Why I don't care much about fuzzer performance for benchmarking

Reset speed

AFL fork() bottlenecks for me often around 10-20k execs/sec on a single core, and about 40-50k on the whole system, even with 96C/192T systems. This is largely due to just getting stuck on kernel allocations and locks. Spinning up processes is expensive, and largely out of our control. AFL allows access of the local system and kernel to the fuzz case, thus cases are not deterministic, nor are they isolated (in the case of fuzzing something with lock files). This requires using another abstraction layer like docker, which adds more overhead to the equation. My hypervisors that I use for fuzzing can reset a Windows VM 1 million times per second, and scale linearly with cores, while being deterministic. Why does this matter? Well, we're comparing tooling which isn't even remotely hitting the capabilities of the CPUs, rather they're bottlenecking on the kernel. These are solvable problems, and thus, as a consumer of good ideas but not tooling, I'm interested in what works well. I can make it go fast myself.

Determinism

Most fuzzers that we work with now are not deterministic. You cannot expect instruction-for-instruction determinism between cases. This makes it a lot harder to use complex fuzzing strategies which may rely on the results of a prior execution being identical to a future one. This is largely an engineering problem, and can be solved in both system-level and app-level targets.

Mutation performance

The performance of mutators is often not what it can be. For example, honggfuzz used (now fixed, cheers!) temporary allocations during multiple passes. During its mangle_MemSwap it made a copy of the chunk that was to be swapped, performing 3 memcpys and using a temporary allocation. This logic was able to be implemented using a single memcpy and without a dynamic allocation. This is not a criticism of honggfuzz, but more of an important note of how development often occurs. Early prototyping, success, and rare revisiting of what can be changed. What's my point here? Well, the mutation strategies in many fuzzers may introduce timing properties that are not fundamentally required to have identical behaviors. There's nothing wrong with this, but it is additional noise which factors into time-based benchmarks. This means a good strategy can be hurt by a bad implementation, or, just a naive one that was done early on. This is noise that I think is big to remove from analysis such that we can try to learn what ideas work, and engineer them later.

Further, I don't know if I'm any mutational fuzzer which doesn't mutate in-place. This means the multiple splices and removals from an input must end up memcpy()ing the remainder. This is a very common mutation pass. This means the fuzzer exponentially slows down WRT the input file size. Something we see almost every fuzzer put insane restrictions on (AFL has a fit if you give it anything but a tiny file).

There's nothing stopping us from making a tree-based fuzzer where a splice adds a node to the tree and updates metadata on other nodes. The input could be serialized once when it's ready to be consumed, or even better, serialized on-demand, only providing the parts of the file which actually were used during the fuzz case.

Example:

Initial input: "foobar", tree is [pointer to "foobar", length 6]
Splice "baz" at 3: [pointer to "foo", length 3][pointer to "baz", length 3][pointer to "bar", length 3]
Program read()s 3 bytes, return "foo" without serializing the rest
Program crashes, tree can be saved or even just what has read can be saved

In this, the cost is N updates to some basic metadata, where N is the number of mutations performed on that input (often 5-10). On a new fuzz case, you start with an initial input in one node of the tree, and you can once again split it up as needed. Pretty much no memcpys() need to be performed, nor allocations, as the input can be extended such that in-memory it's "foobarbaz", but the metadata describes that the "baz" should come between "foo", and "bar".

Restructuring the way we do mutations like this allows us to probably easily find 10x improvements in mutator performance (read, not overall fuzzer performance). Meaning, I don't really want the cost of the mutator to be part of the equation, because once again, it's likely a result of laziness or simplicity. If something really brings a strategy to the table that is excellent, we can likely make it work just as fast (but likely even faster), than existing strategies.

Not to note the value in potentially knowing which mutations were used during prior cases, and you could potentially mutate this tree (eg, change a splice from 5 bytes to 8 bytes, without changing the offset, just changing the node in the mutation tree). This could also be used as a mechanism to dynamically weight mutation strategies based on yields, while still getting a performance gain over the naive implementation.

Performance conclusion

From previous work with fuzzers, most of the reset, overhead, and corruption logic is likely not even within an order of magnitude of the possible performance. Thus, I'm far more interested in figuring out what and where strategies work, as the implementations of them are typically not indicative of their performance.

BUT! I recognize the value in treating them as whole systems. I'm a bit more on the hard-core engineering side of the problem. I'm interested in which strategies work, not which tools. There's definitely value in knowing which tools will work best, given you don't have the time to tweak or rebuild them yourself. That being said, I think scaling is much more important here, as I don't know of really anyone doing single-core fuzzing. The results of these fuzzers at scale is likely dramatically different from single-core, and would put some major pressure on some more theoretical ideas which produce way too much information to consume and handle.

Reconstructing the full picture from data

The data I would like to see fuzzbench give, and I'd give you some massive props for doing it, would be the raw, microsecond-timestamped information for each coverage gained.

This means, every time coverage increases, a new CSV record (or whatever format) is generated, including the time stamp when it was found (to the microsecond), as well as the fuzz iteration ID which indicates how many inputs have been run into the fuzzer. This should also include a unique identifier of the block which was found.

This means, in post, the entire progress of the fuzzer can be reconstructed. Every edge, which edges they were, the times they were found, and the case ID they were on when they were found allows comparing not only the raw "edge count" but also the differences between edges found. It's crazy that this information is not part of the benchmark, as almost all the fuzzers could be finding nearly the same coverage, but a fuzzer which finds less coverage, but completely unique edges, would be devalued.

This is the firehose of data, but since it's not collected on an interval, it very quickly turns into almost no data.

Hard problem: What is coverage?

This leads to a really hard problem. How do we compare coverage between tools? Can we safely create a unique block identifier which is universal between all the fuzzers and their targets. I have no idea how fuzzbench solves this (or even if it does). If fuzzbench is relying on the fuzzers to have roughly the same idea of what an edge is, I'd say the results are completely invalid. Having different passes which add different coverage gathering, compare information gathering, could easily affect the graphs. Even just non-determinism in clang (or whatever compiler) would make me uneasy about if afl-clang binaries have identical graph shapes to libfuzzer-clang binaries.

If fuzzbench does solve this problem, I'm curious as to how. I'd anticipate it would be through a coverage pass which is standardized between all targets. If this is the case, are they using the same binaries? If they're not, are the binaries deterministic, or can the fuzzers affect the benchmark coverage information due to adding their own compiler instrumentation.

Further, if this is the case, it makes it much harder to compare emulators or other tools which gather their own coverage in a unique way. For example, if my emulators, which get coverage for effectively free, had to run an instrumented binary for fuzzbench to get data, it's not a realistic comparison. My fuzzer would be penalized twice for coverage gathering, even though it doesn't need the instrumented binary.

Maybe someone solved this problem, and I'm curious what the solution is. TL;DR: Are we actually comparing the same binaries with identical graphs, and is this fair to fuzzers which do not need compile-time instrumentation.

Can't wait for more discussion. You have been very receptive even when I'm often a bit strongly opinion-ed. I respect that a lot.

Stay cute, gamozo

Thanks for sharing your thoughts!

Visualizations, and what I'm often looking for in data

I think we're in agreement here.

Yes, we’re in agreement.

The case for cases

I personally extract most information from graphs which are plotted against number of fuzz cases rather than time.

Yes, looking at coverage/cases can be very useful, e.g., when you A/B test a change in one fuzzer (like you described), where that change does not affect execution time. Doing this is nice because it eliminates the random overhead variable from the experiment. This, however, doesn’t work when we compare changes that affect overhead, or when we compare two different tools: e.g., one tool’s instrumentation might slow down the execution by 2x collecting coverage feedback, while the other one might slow it down 5x, because say it does taint tracking too. In these cases it’s not enough to look at only iterations, but the execution time needs to be taken into account as well. But again, fully agree that looking at coverage/cases can be useful sometimes, this is why we’re working on adding it (https://github.com/google/fuzzbench/pull/648), as mentioned before.

Scaling

Scaling of performance is really missing from fuzzbench.

In our experience, scaling fuzzing hasn’t really been an issue. (At Google, for example, we scale fuzzing horizontally to thousands of cores.) If I understand correctly, the issue you talk about is fuzzers being I/O bound -- which is also a good example why execution time needs to be taken into account. Also, with the PR linked above we’re planning to measure not just iteration counts, but real/user/sys times as well, to see e.g., how much time is spent in CPU and waiting for I/O, etc. Perhaps, that information will be helpful for the issue you're referring to? If not, let us know if there's a concrete example problem or question related to this that FuzzBench could help with.

Why I don't care much about fuzzer performance for benchmarking

I'm interested in which strategies work, not which tools.

We are interested in both! FuzzBench is used by fuzzer developers to find the best strategies all the time, e.g., libFuzzer devs noticed AFL did better on one benchmark and thought its handling of seeds might be responsible. So they added a patched version of libFuzzer implementing this strategy to see if libFuzzer benefits from this strategy. AFL++ devs continuously experiment and A/B test different strategies (configs). Honggfuzz went through a series of major improvements due to such FuzzBench experiments. These developments often happen by focusing on an individual benchmark first, evaluating a single change, similarly to your workflow described in the beginning of your post. One thing that FuzzBench enables is evaluating whether that change or strategy generalizes (and didn’t just happen to work for a single target). We can tell this by running the experiment on a wide, diverse set of benchmarks, with many trials, so proper statistical analysis and conclusions can be made.

Reset speed

AFL fork() bottlenecks for me often around 10-20k execs/sec on a single core, and about 40-50k on the whole system, even with 96C/192T systems.

Yes, fork()-ing is a huge bottleneck, that’s why FuzzBench runs AFL in persistent mode.

as a consumer of good ideas but not tooling, I'm interested in what works well

With FuzzBench you can figure out what strategies/features work well: if you’d like to eliminate factors such bottlenecks or engineering optimizations, you can do that by comparing two variants of a given fuzzer with the given strategy/feature enabled and disabled, just like many FuzzBench users do as described above.

Mutation performance

I don't really want the cost of the mutator to be part of the equation

Same here: you can eliminate it from the equation doing an A/B test described above.

Determinism

Most fuzzers that we work with now are not deterministic.

Not sure what issue you’re referring to, fuzzing is randomized by definition, no? Is there something FuzzBench can do related to this?

The case against monolithic fuzzers

we'll get further and further from being able to learn the isolated properties of fuzzers and find the best places to apply certain strategies.

Once again, FuzzBench is regularly used to evaluate isolated properties of fuzzers (i.e., to A/B test isolated properties, strategies, features). Is there something we can do in FuzzBench to help with this more?

Reconstructing the full picture from data

Every edge, which edges they were, the times they were found, and the case ID they were on when they were found allows comparing not only the raw "edge count" but also the differences between edges found.

For the evaluation we do (ie for the reports we generate) we do not need such details, however we save each corpus snapshot, so the exact edge information can be recovered for each snapshot when needed. (We run 24 hours trials, and currently we make snapshots every few minutes, but as mentioned we plan to increase the frequency in the beginning of trials for folks who might interested in very-short-term fuzzing effectiveness.)

It's crazy that this information is not part of the benchmark, as almost all the fuzzers could be finding nearly the same coverage, but a fuzzer which finds less coverage, but completely unique edges, would be devalued.

Yes, we’ve been working on this for a while and have a PR in review right now enabling exactly this. With that PR, reports will show unique and differential coverage.

Hard problem: What is coverage?

If fuzzbench does solve this problem, I'm curious as to how. TL;DR: Are we actually comparing the same binaries with identical graphs, and is this fair to fuzzers which do not need compile-time instrumentation.

Yes, FuzzBench solves this by using an independent coverage metric (clang coverage) to measure the coverage of the test cases generated by each fuzzer.

Can't wait for more discussion. You have been very receptive even when I'm often a bit strongly opinion-ed. I respect that a lot.

Thanks for sharing your opinion! Hope these answers address your concerns about fuzzer benchmarking. Let us know, or if you have questions, or if you have ideas on how we could make FuzzBench more useful. Also, if you have a fuzzer of your own, that you'd like to do A/B testing on, or perhaps have an improvement on an existing fuzzer (you've mentioned some optimization ideas), we’d really love to see it integrated and evaluated with FuzzBench to advance fuzzing research!

Here's some of the concerns I have. I got fuzzbench running locally for aflplusplus and did a docker copy to copy out exactly what was used inside of fuzzbench. In this case I'm using aflplusplus against libpng-1.2.56, just because that was the example in the usage of running the docker image. (https://google.github.io/fuzzbench/developing-fuzzbench/adding-a-new-benchmark/#testing-it-out)

Here's the results inside of docker (left), and outside of docker (right). It's running on the exact same machine.

Just in this alone, we can see there's about a 50-100% slowdown incurred from docker, even though AFL is in persistent mode and using shared memory. It's technically possible it's a libc difference between docker and my host, but I highly doubt it. This is likely a docker overhead on certain syscalls, likely the ones AFL uses for syncing. This isn't a cherry picked image, running it multiple times shows similar results.

This means, a fuzzer that does not use syscalls as heavily would have a major advantage as it would run at near-native speeds, where fuzzers that do hit syscalls frequently may pay a penalty that is docker specific. This is something I would see in my fuzzers which perform no syscalls during fuzzing, I would not have any penalty for running in docker and this give me about a 50-100% performance advantage that doesn't actually exist.

But that's not the end of performance. AFL really benefits from setting CPU affinity. Once again, on my local machine, I run AFL (left) with the exact invocation from fuzzbench, ./afl-fuzz -S 0 -i ./seeds -o ./corpus -d -m none -c ./cmplog/fuzz-target -x ./fuzz-target.dict -- ./fuzz-target 2147483647. On the right, I pin it to a core with taskset 1 ./afl-fuzz -S 0 -i ./seeds -o ./corpus -d -m none -c ./cmplog/fuzz-target -x ./fuzz-target.dict -- ./fuzz-target 2147483647

Here we demonstrate that AFL running outside of docker with affinity set gets 4.5x more performance than the exact same AFL binary running in docker in the configuration it runs in fuzzbench.

As for scaling, I really question that AFL is scaling with cores. Running on all cores and making your CPUs get hotter, sure, I don't doubt that, but AFL traditionally has catastrophic scaling performance (as do many fuzzers). For example, once again I docker cped out the exact image used for fuzzing, binaries and inputs, and here's what it looks like when used on N cores on the same machine (quad socket Intel 6252N, totalling 96C/192T).

This is absolute best case AFL scaling, all on one machine, no network scaling, no NFS/SMB based inputs/outputs, persistent mode, and shared memory mode enabled, which is a pretty new feature to AFL++. We can see that AFL fails to scale past 1M fuzz cases per second, and eventually there's just no reason to add more cores (seems to happen at about ~40 cores for libpng).

This is a pretty catastrophic failing to scale, and some fuzzers which do not have these scaling issues would not get the light of day in a benchmark like fuzzbench unless they were also competitive on a single core. On this machine, a fuzzer that runs at 1/3rd the speed of AFL, but scales linearly with cores added, would actually match AFL's performance. That's not even factoring in network scaling, which, I'd presume is often SMB mounted folders, which could lead to some serious performance issues. I don't think shared memory mode even supports networked AFL machines, but I'm not sure.

#!/usr/bin/env python3

import os, shutil, multiprocessing, time, subprocess

# Maximum number of threads to fuzz with AFL in parallel during benchmark
MAX_THREADS = 192

def fuzz_worker(thr_id):
    os.sched_setaffinity(0, [thr_id])
    sp=subprocess.Popen([f"./afl-fuzz -S {thr_id} -i ./seeds -o ./corpus -d -m none -c ./cmplog/fuzz-target -x ./fuzz-target.dict -- ./fuzz-target 2147483647"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE, shell=True)
    while sp.poll() == None:
        print(sp.stdout.read())
        print(sp.stderr.read())
        time.sleep(0.1)

def benchmark_afl(num_threads):
    assert num_threads > 0

    # Kill anything that may already have been running
    os.system("killall -9 ./fuzz-target 2> /dev/null")
    os.system("killall -9 ./cmplog/fuzz-target 2> /dev/null")

    # Kill active shared memory, AFL will run out and crash eventually
    # due to shmget() failures
    assert os.system("ipcrm -a") == 0

    # Set the environment variables we want so AFL behaves
    os.environ["AFL_NO_AFFINITY"] = "1"
    os.environ["AFL_MAP_SIZE"] = "524288"
    os.environ["AFL_I_DONT_CARE_ABOUT_MISSING_CRASHES"] = "1"
    os.environ["AFL_SKIP_CPUFREQ"] = "1"

    # Remove our output directory
    if os.path.exists("corpus"):
        shutil.rmtree("corpus")

    # Spawn threads
    threads = []
    for thr_id in range(num_threads):
        thread = multiprocessing.Process(target=fuzz_worker, args=[thr_id])
        thread.start()
        threads.append(thread)

    # Wait a bit for the benchmark
    time.sleep(15.0)

    # Kill all workers
    for thread in threads:
        thread.terminate()
        thread.join()

    # Kill the process, as `terminate` orphans processes, I'm just lazy this
    # could be done better
    os.system("killall -9 ./fuzz-target 2> /dev/null")
    os.system("killall -9 ./cmplog/fuzz-target 2> /dev/null")

    # Go through AFL status messages, pick the most recently logged line and
    # sum up the per-thread cases per second to get a total cases per second
    total_per_second = 0
    for thr_id in range(num_threads):
        with open(f"corpus/{thr_id}/plot_data", "r") as fd:
            # Get the last status line for this thread
            contents      = fd.read().splitlines()
            last_status   = contents[-1]
            iters_per_sec = float(last_status.split(",")[-1].strip())
            total_per_second += iters_per_sec

    print(f"{num_threads:6} {total_per_second:16.6f}")

# Benchmark AFL performance for each number of threads
for num_threads in range(1, MAX_THREADS + 1):
    benchmark_afl(num_threads)

fuzzbench runs each docker container in its own virtual machine so I expect this reduces the variance you are seeing in your experiment.

for the CPU affinity, all VMs have a single core (I think), and the docker container also only gets a single core, so I dont think this has any impact in the fuzzbench experiment.

the "a fuzzer with less syscalls has an advantage over those which do" is correct, and having docker increases the impact here. yes. I think (but not sure!) that libfuzzer is relatively syscall free as it only writes to disk on finding a new path. but looking at the statistics fuzzbench is not good. and even the much better variant "entropic" is beat by honggfuzz which is (AFAIK) the worst when it comes to syscall overhead. but that is rather anecdotal

I share your concern though and voiced that last week when me and dominik had a video call with the fuzzbench team. for that reason I had prepared a fuzzbench run on the weekend with three identical afl++ instances with the same -s seed and just basic usage, no cmplog etc. - which will help to show how much variance there is in just the VM + docker setup. AFAIK that experiment starts today.

I'm not sure what fuzzbench has for a docker configuration, but with my standard docker install and running the benchmark, here's what I see (stock on left, taskset 1 AFL on right):

It really comes down to how the docker install is configured for fuzzbench, and even then, the hypervisor scheduler of the guests could be round-robining cores even if the OS inside a VM is pinning to cores, as there's a disconnect between physical and virtual cores.

It's important to note that fuzzbench does give control of affinities to the docker container via the SYS_NICE capability. Thus, it's kind of on AFL to set the affinity. But, I think fuzzbench should be providing an affinity number to use to the container such that it knows what other processes have scheduled to on the system. I don't know if there's a good way from within the container to pick a free affinity if it's not being communicated to you.

@gamozolabs had forgotten to write: all afl variants (afl, afl++, laf-intel, mopt etc.) have AFL_NO_AFFINITY set in fuzzbench runs

the experiment has started => https://www.fuzzbench.com/reports/2020-08-12/index.html and we can see the aflplusplus_same1 .. aflplusplus_same3 instances to have quite a different coverage.

for several benchmarks this will even out as their overage saturate after 6 hours, so capturing this now is important.

most easy to see for curl, re2 and sqlite3:

fuzzer	fuzzer1	fuzzer2	fuzzer3
curl	17148.0	17161.5	17155.0
re2	3461.0	3455.0	3488.0
sqlite3	26895.0	26611.5	26420.5

as all PRNG values are the same and input selection based on same values this shows that the instances are running with different speeds.

the experiment has started => https://www.fuzzbench.com/reports/2020-08-12/index.html and we can see the aflplusplus_same1 .. aflplusplus_same3 instances to have quite a different coverage.

for several benchmarks this will even out as their overage saturate after 6 hours, so capturing this now is important.

most easy to see for curl, re2 and sqlite3:

fuzzer fuzzer1 fuzzer2 fuzzer3 curl 17148.0 17161.5 17155.0 re2 3461.0 3455.0 3488.0 sqlite3 26895.0 26611.5 26420.5 as all PRNG values are the same and input selection based on same values this shows that the instances are running with different speeds.

I think this is an important issue. Overall though, I think this is mostly a problem if the difference is statistically significant as some degree of variance is expected. Let's discuss this in https://github.com/google/fuzzbench/issues/667 this issue is already discussing a lot of different topics.

Here's some of the concerns I have.

Thanks for sharing! I think you've mentioned two things:

Re. docker overhead: The goal of FuzzBench is to evaluate how fuzzers perform in general compared to each other. This doesn't require running them in the most ideal environment. Docker would be an issue if we got different statistically significant end results running experiments inside and outside of docker.
Re. vertical scaling of AFL: This is very interesting, however directly evaluating how well a fuzzer was optimized for running on a single 96 core machine is not a priority for FuzzBench. But as mentioned before, we're hoping to help identifying i/o bound fuzzers with more resource usage stats in the future.

google / fuzzbench

Brandon Falk @gamozolabs FuzzBench feedback #654

From - Brandon Falk - @gamozolabs - Aug 8 (https://twitter.com/gamozolabs/status/1292324287324381184)

Visualizations, and what I'm often looking for in data

Example analysis conclusion

The case for cases

Scaling

Measuring symbolic

The case against monolithic fuzzers

Why I don't care much about fuzzer performance for benchmarking

Reset speed

Determinism

Mutation performance

Performance conclusion

Reconstructing the full picture from data

Hard problem: What is coverage?

Visualizations, and what I'm often looking for in data

The case for cases

Scaling

Why I don't care much about fuzzer performance for benchmarking

Reset speed

Mutation performance

Determinism

The case against monolithic fuzzers

Reconstructing the full picture from data

Hard problem: What is coverage?