google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Support tracks to compare classes of fuzzers (binary-only, source-only, hybrids, etc) #76

Open sangkilc opened 4 years ago

sangkilc commented 4 years ago

Thanks for maintaining an awesome project!

I am writing to discuss my concern about the unfairness of fuzzbench results due to the difference between binary-level and source-level fuzzers.

When I look at the current sample report, all the tools used here except Eclipser run with source-level instrumentation (with afl-cc). Eclipser, on the other hand, uses QEMU to instrument binaries.

It is well-known that binary-level instrumentation incurs significant overhead (several orders of magnitude) compared to source-level instrumentation. Therefore, comparing Eclipser with source-level fuzzers, e.g., AFL, is not entirely fair as they have different goals and uses. However, comparing Eclipser with AFL running in the QEMU mode (-Q option) would be fair, for example.

So I would like to suggest separating fuzzbench into two tracks: a binary track and a source track. In the binary track, we can include AFL-QEMU, Eclipser, VUzzer, etc. I believe showing two sets of graphs for each program would be enough; a rough sketch of what a binary-track entry could look like follows below. For your information, having multiple tracks when comparing tools is a common practice in other domains. For example, SMT-COMP currently has 6 tracks: https://smt-comp.github.io/2019/results.html.
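To make the binary-track idea concrete, here is a minimal sketch of what an AFL-QEMU entry could look like, loosely following the `build()`/`fuzz()` layout of FuzzBench's `fuzzer.py` integrations. This is not an existing integration: the build step, the binary paths, and the environment details are illustrative assumptions.

```python
# Hypothetical sketch of a binary-track fuzzer integration (not an
# existing FuzzBench fuzzer.py). Only the build()/fuzz() entry-point
# layout mirrors FuzzBench; everything else is illustrative.
import os
import subprocess


def build():
    """Build the benchmark WITHOUT compile-time instrumentation,
    since a binary-only fuzzer only needs the plain target binary."""
    os.environ['CC'] = 'clang'
    os.environ['CXX'] = 'clang++'
    subprocess.check_call(['make', 'all'])  # placeholder build step


def fuzz(input_corpus, output_corpus, target_binary):
    """Run AFL in QEMU mode (-Q), i.e. with binary-only instrumentation."""
    subprocess.check_call([
        './afl-fuzz',
        '-Q',                  # QEMU mode: no source instrumentation needed
        '-i', input_corpus,
        '-o', output_corpus,
        '--',
        target_binary, '@@',   # @@ is replaced with the input file path
    ])
```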

This way, people can better appreciate binary-level fuzzing research 😄 I truly believe this will benefit our community as well.

Thank you!

andreafioraldi commented 4 years ago

Not really true: in the sample evaluation, AFL-based fuzzers instrument the target using SanitizerCoverage, which is comparable to binary-only fuzzing. AFL++ QEMU is faster than SanCov on many targets, and hardware-based binary-only fuzzing is even faster. Regarding VUzzer, it needs a static analysis pass using IDA Pro, so I doubt it will ever be included here.

andreafioraldi commented 4 years ago

Btw, I have similar benchmarks performed months ago using only binary-only fuzzers, and the results are consistent: Eclipser's performance is poor compared to modern fuzzers (better than AFL, yes, but not better than AFL with a good dictionary or a good seed corpus).

sangkilc commented 4 years ago

@andreafioraldi

Whether a fuzzer uses an expensive instrumentation mechanism is not relevant here. If there were an equivalent implementation of SanCov at the binary level (with QEMU, Pintool, etc.), we would observe much higher overhead compared to source-based SanCov. Of course, fuzzers have their own instrumentation mechanisms, and we don't need to care how fast each of those mechanisms is.

Regardless of how fast the underlying instrumentation mechanism is, the fact that a fuzzer requires source code for instrumentation doesn't change.

In other words, binary-level fuzzers have their own needs and goals, and they are designed under a different assumption than source-based fuzzers. Therefore, it makes more sense to me to separate those two classes of fuzzers based on whether they need source code or not.

BTW, @jchoi2022 and I would be very interested to know how you performed the binary-only experiment. Can you share the detailed setup? Which fuzzers did you use? How did you modify source-level fuzzers? Can you share the modified versions somehow?

andreafioraldi commented 4 years ago

@sangkilc I cannot share it atm, but I didn't modify any fuzzer. For instance, almost all fuzzers here (except libFuzzer) can fuzz binaries with QEMU or with hardware-based tracing (honggfuzz).

lszekeres commented 4 years ago

Thanks @sangkilc for the suggestion! Yes, creating specific reports with a specific focus is indeed very useful. We are planning to support this by adding a feature soon where users can specify the set of fuzzers they would like to compare in a report: https://github.com/google/fuzzbench/issues/79. This will allow you to run e.g. generate_report --fuzzers afl-qemu eclipser to compare only afl-qemu and eclipser, or any other subset of fuzzers. You don't need to run your own experiment; you can use the raw data we publish. Here's the doc on how you can do that: Custom analysis and reports

Different users and researchers often have a different focus and are interested in different comparisons. Therefore, we will always publish the raw data of the experiments we run, together with a generic report, and encourage researchers to generate custom reports for their needs and focus.
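As an illustration, a "binary-only track" view can be derived from the published raw data with a few lines of pandas. This is only a sketch: the data.csv.gz file name, the fuzzer names in the track list, and the fuzzer/benchmark/time/edges_covered column names are assumptions about the released format.

```python
# Sketch: build a "binary-only track" view from downloaded raw data.
# The file name and the column names (fuzzer, benchmark, time,
# edges_covered) are assumptions about the published CSV format.
import pandas as pd

BINARY_ONLY_FUZZERS = ['afl_qemu', 'eclipser']  # hypothetical track members

df = pd.read_csv('data.csv.gz')
track = df[df['fuzzer'].isin(BINARY_ONLY_FUZZERS)]

# Median edge coverage per benchmark/fuzzer at the final snapshot time.
final = track[track['time'] == track['time'].max()]
summary = (final.groupby(['benchmark', 'fuzzer'])['edges_covered']
                .median()
                .unstack())
print(summary)
```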

jchoi2022 commented 4 years ago

Could you please clarify your point, @andreafioraldi? Do you mean that (1) AFL++ made some QEMU optimizations that make QEMU mode as fast as source instrumentation, or (2) AFL with SanCov is as slow as AFL in QEMU mode?

I presume you meant (2), but I believe your opinion conflicts with the prevailing belief in the research community. To support my claim, I prepared some benchmarks to evaluate the impact of source-based vs. binary-based instrumentation. First of all, I believe that an intuitive and reasonable experiment is to compare AFL with LLVM SanCov against AFL in QEMU mode. Both modes use the exact same logic and run on the exact same infrastructure; only the instrumentation differs. I prepared a Docker image to compare the two modes (https://github.com/jchoi2022/Bin-vs-Src-Instrument) with AFL-2.56b. In my environment, the execution speed (execs/sec) of QEMU-mode AFL was 3.2 ~ 4.8 times slower than that of AFL with SanCov instrumentation. Please correct me if there is any wrong configuration in the above repository. Could you provide a concrete environment (like a Docker image) to prove your observation?
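For reference, a speed comparison like the one above can be read off directly from AFL's fuzzer_stats files in the two output directories. Below is a minimal sketch of that measurement; the out-sancov/out-qemu directory names are placeholders for the two runs, not paths from the repository above.

```python
# Sketch: compare execs/sec between a SanCov-instrumented AFL run and a
# QEMU-mode AFL run by reading AFL's fuzzer_stats files. The output
# directory names below are placeholders for the two runs.
def execs_per_sec(stats_path):
    """Parse the execs_per_sec field from an AFL fuzzer_stats file."""
    with open(stats_path) as f:
        for line in f:
            key, _, value = line.partition(':')
            if key.strip() == 'execs_per_sec':
                return float(value.strip())
    raise ValueError('execs_per_sec not found in ' + stats_path)


sancov = execs_per_sec('out-sancov/fuzzer_stats')  # placeholder path
qemu = execs_per_sec('out-qemu/fuzzer_stats')      # placeholder path
print('QEMU-mode slowdown: %.1fx' % (sancov / qemu))
```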

Regarding the experiment you performed months ago, I believe it is not a fair comparison if you provide a good dictionary and seed corpus only to AFL, but not to Eclipser.

You also mentioned that your separate experiment is coherent with the fuzzbench result, but I would like to point out that the current fuzzbench report was created with a wrong configuration of Eclipser. Eclipser ran with an 8-byte maximum file length, which would have substantially degraded its coverage. This was fixed in PR #72 a few days ago, but the report has not been updated yet. When it gets updated, we hope Eclipser will achieve much higher coverage.

inferno-chromium commented 4 years ago

> You also mentioned that your separate experiment is coherent with the fuzzbench result, but I would like to point out that the current fuzzbench report was created with a wrong configuration of Eclipser. Eclipser ran with an 8-byte maximum file length, which would have substantially degraded its coverage. This was fixed in PR #72 a few days ago, but the report has not been updated yet. When it gets updated, we hope Eclipser will achieve much higher coverage.

FYI, that PR reduced the max file size from 10 MB to 1 MB; the sample report ran with a 10 MB file size (not the 8-byte default). With a 2-second test timeout and 1 MB file size [the changes you suggested], a new report should be generated soon.

Also, we will add more binary-only fuzzers in the near future and provide the tracks functionality via generate_report, so thanks for your input.

jchoi2022 commented 4 years ago

Thank you for the clarification, I misunderstood the PR. Then I think there may be some mistakes on the Eclipser side, since the coverage plots of some programs (openssl, irssi) look weird. I will take a deeper look and open a separate issue if any problem is found.