google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Sampling Initial Seed Corpus and Analysis #1489

Open dylanjwolff opened 2 years ago

dylanjwolff commented 2 years ago

To @jonathanmetzman @lszekeres, CC @mboehme @inferno-chromium

We have two related features, implemented on a private fork, that we'd like to integrate into FuzzBench. The first is the ability to sample from a larger pool of seeds to provide a unique corpus to each fuzzer per trial during a benchmarking run. The second consists of additional data analysis to give some insight into how various aspects of the initial corpora and programs under test might be affecting benchmarking outcomes.

The purpose of this issue is to establish the following:

  1. [Sampling] We currently have a script we've been using for local experiments that samples from, e.g., a project's OSS-Fuzz corpus to generate random initial corpora. We then mount those in the Docker containers of the runners. We also kick off the first measurer cycle before launching the fuzzer process to grab the initial coverage of the corpus. Are there other considerations, or another approach we should take, for adding this feature? (A rough sketch of the sampling step is included after this list.)
  2. [Properties] Which properties would you consider to be interesting? We currently have
    • seed-corpus: initial coverage, number of seeds, average seed exec time, average seed size
    • program: size (and others). Anything else that you would like to look at?
  3. [UI/UX] What is the interface that you want to present to users? For the seed sampling, probably additional field(s) in the YAML configuration file to select a sampling level / strategy? For the data presentation, we have produced several visualizations that show the relative impact of a particular property on the final ranking of a fuzzer or on its coverage. We are happy to share these separately and would welcome any feedback you might have on where and how to present this data in a FuzzBench report.
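
For concreteness, here is a minimal sketch of the sampling step from item 1, assuming a flat directory of seed files per project; the directory names, trial count, and sample size are placeholders rather than actual FuzzBench paths:

```python
import random
import shutil
from pathlib import Path


def sample_corpus(seed_pool, out_dir, num_seeds, rng):
    """Copy a random subset of the seed pool into a per-trial corpus directory."""
    seeds = sorted(p for p in Path(seed_pool).iterdir() if p.is_file())
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for seed in rng.sample(seeds, min(num_seeds, len(seeds))):
        shutil.copy(seed, out_dir / seed.name)


# One independently sampled corpus per trial; these directories are what get
# mounted into the runners' Docker containers.
rng = random.Random(0)  # fixed seed so a run is reproducible
for trial in range(20):
    sample_corpus("oss_fuzz_corpus/libxml2",
                  f"trial_corpora/libxml2/trial_{trial}",
                  num_seeds=100, rng=rng)
```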

Thanks!

mboehme commented 2 years ago

The key idea is essentially: instead of saying fuzzer A is the top fuzzer in general, we could say that fuzzer A is the top fuzzer under these circumstances while fuzzer B is the top fuzzer under those other circumstances. For any given benchmark run, a user could use a slider on those benchmark properties to see how the fuzzer ranking changes.
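
To make the idea concrete, a rough sketch of what moving that slider would compute (not the actual analysis code; the CSV name and columns are assumed):

```python
import pandas as pd

# Per-trial results with one initial-corpus property attached (assumed columns).
df = pd.read_csv("results.csv")  # fuzzer, benchmark, trial, edges_covered, initial_corpus_size


def ranking(trials):
    """Rank fuzzers by median final coverage over the selected trials."""
    return (trials.groupby("fuzzer")["edges_covered"]
                  .median()
                  .sort_values(ascending=False))


# Moving the "slider" = changing the filter and re-ranking.
print(ranking(df[df["initial_corpus_size"] <= 100]))  # ranking for small initial corpora
print(ranking(df[df["initial_corpus_size"] > 100]))   # ranking for large initial corpora
```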

jonathanmetzman commented 2 years ago

Sorry for the delay; I've had a bit of a crazy schedule with my holidays. I personally think the second might be more interesting, and it seems like less of a maintenance burden (the analysis just gets done at the end, right?). But I'm interested in seeing both.

jonathanmetzman commented 2 years ago

The UI/UX question is tricky; I don't have any answers yet, so let me think about it more. I'm happy to see your samples as well.

DonggeLiu commented 2 years ago

  1. [Properties] Which properties would you consider to be interesting? We currently have

    • seed-corpus: initial coverage, number of seeds, average seed exec time, average seed size
    • program: size (and others). Anything else that you would like to look at?

Would it make sense to compare different performances after:

  1. tuning the hyper-parameters assumed by the fuzzers (e.g., maximum input length; see the sketch after this list), or
  2. changing the default heuristic used by the fuzzers (e.g., libFuzzer can try to generate small inputs first)?
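
For option 1, one low-effort shape would be a fuzzer variant that reads the parameter from its environment. The sketch below assumes FuzzBench's usual `fuzzer.py` entry points and the existing libFuzzer integration's `run_fuzzer()` helper (treat the exact signature as an assumption); the `MAX_LEN` variable is made up:

```python
# Hypothetical libFuzzer variant whose maximum input length is set per trial.
import os

from fuzzers.libfuzzer import fuzzer as libfuzzer_fuzzer


def build():
    libfuzzer_fuzzer.build()


def fuzz(input_corpus, output_corpus, target_binary):
    # MAX_LEN would be injected per trial by the experiment config / runner environment.
    max_len = os.environ.get("MAX_LEN", "4096")
    libfuzzer_fuzzer.run_fuzzer(input_corpus, output_corpus, target_binary,
                                extra_flags=[f"-max_len={max_len}"])
```
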
DonggeLiu commented 2 years ago

Also, for fuzzers that can take an input keywords dictionary, maybe we could sample the items in the dictionary in the same way as sampling the initial corpus?
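
That seems compatible with the corpus-sampling approach. A rough sketch for AFL/libFuzzer-style `.dict` files (file names and the sampling fraction are placeholders):

```python
import random


def sample_dictionary(dict_path, out_path, fraction, rng):
    """Write a dictionary containing a random subset of the token entries."""
    with open(dict_path) as f:
        lines = f.readlines()
    # Token entries are the non-empty, non-comment lines of the .dict format.
    entries = [i for i, line in enumerate(lines)
               if line.strip() and not line.lstrip().startswith("#")]
    keep = rng.sample(entries, max(1, int(len(entries) * fraction)))
    with open(out_path, "w") as f:
        f.writelines(lines[i] for i in sorted(keep))


# e.g. one 50% sample of a benchmark's dictionary per trial.
sample_dictionary("xml.dict", "xml_trial_0.dict", fraction=0.5, rng=random.Random(0))
```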

mboehme commented 2 years ago

Would it make sense to compare different performances after:

  1. tuning the hyper-parameters assumed by the fuzzers (e.g., maximum input length), or
  2. changing the default heuristic used by the fuzzers (e.g., libFuzzer can try to generate small inputs first)?

Also, for fuzzers that can take an input keywords dictionary, maybe we could sample the items in the dictionary in the same way as sampling the initial corpus?

Absolutely! However, this might be more difficult to implement. You'll need to expose some API that the fuzzer developer can use to specify what to vary during the benchmarking.
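
Purely as a strawman for what that API could look like (nothing like this exists in FuzzBench today), an optional hook in `fuzzer.py` could declare the tunable parameters, and the runner would sample one value per trial and export it to the fuzzer, e.g. as environment variables:

```python
# Hypothetical, optional hook in a fuzzer.py: declare what may be varied per trial.
def variations():
    return {
        "MAX_LEN": [256, 1024, 4096, 65536],       # e.g. libFuzzer's -max_len
        "PREFER_SMALL": [0, 1],                    # e.g. toggle a small-inputs-first heuristic
        "DICT_SAMPLE_FRACTION": [0.25, 0.5, 1.0],  # ties in with dictionary sampling above
    }
```

The sampled values would then show up as additional per-trial properties in the analysis, alongside the corpus properties.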

dylanjwolff commented 2 years ago

I personally think the second might be more interesting, and it seems like less of a maintenance burden (the analysis just gets done at the end, right?). But I'm interested in seeing both.

Yup, the analysis portion is just some post-processing that can be run on something similar to the final report data CSV file. But without corpus sampling, you could only look at the effects of program properties, as the corpus would be constant across trials.
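
As a sketch of that post-processing (file and column names below are placeholders for whatever the report pipeline actually emits), it is essentially a join of per-trial corpus properties against the final coverage data:

```python
import pandas as pd

# Final per-trial coverage, similar to the report's data CSV (assumed columns).
coverage = pd.read_csv("final_coverage.csv")       # fuzzer, benchmark, trial_id, edges_covered
# Properties recorded when each trial's corpus was sampled.
properties = pd.read_csv("corpus_properties.csv")  # trial_id, num_seeds, initial_coverage, avg_seed_size

merged = coverage.merge(properties, on="trial_id")
# First cut: per-fuzzer correlation between each corpus property and final coverage.
cols = ["num_seeds", "initial_coverage", "avg_seed_size", "edges_covered"]
print(merged.groupby("fuzzer")[cols].corr()["edges_covered"])
```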

Adding on to @mboehme's and @Alan32Liu's comments about fuzzing parameters: it's a very interesting idea, but I agree the implementation (and maintenance) effort needed is probably quite high to get many different fuzzers to present a similar interface for various parameters. Dictionaries would be more doable, as that is at least already a consistent "interface" across fuzzers.

DonggeLiu commented 2 years ago

Please feel free to let me know if there is anything that I could help with : )