lszekeres opened this issue 4 years ago
This issue is fixed.
Not everything is fixed yet. There's also:
1) Support version/experiment tagging, e.g., --fuzzers afl:v2.5 afl:v2.4; and
2) a different report when we're comparing only two things:
when we only compare two fuzzers, we can even do more precise statistical tests with more specific visualizations. On the benchmark level, we don't need a pairwise comparison matrix of statistical significance, since we only have a single pair. On the experiment level, we don't need to use the Friedman/Nemenyi test (which compares more than two fuzzers, and is rather conservative), but we can use the Wilcoxon signed-rank test, which is specifically designed for comparing two things (i.e., matched samples for the different benchmarks).
Should we reopen this issue, or would you prefer creating separate issues for these?
My bad, reopening is fine.
One often wants to see a report that compares only two or just a few fuzzers, e.g., only two versions of the same fuzzer.
This can be supported by adding a --fuzzers flag to the report generator, where users can list the fuzzers they would like the report to compare. We should allow specifying fuzzers with different versions by tagging them with either their version number or perhaps an experiment name. E.g.,
generate_report --fuzzers afl:v2.5 afl:v2.4
or
generate_report --fuzzers afl:experiment-2020-01-15 afl:experiment-2020-02-15
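A minimal sketch of how such a fuzzer:tag argument could be split into its parts. This is not FuzzBench code; the function name parse_fuzzer_spec is made up, and the syntax assumed is exactly the fuzzer:tag form shown in the examples above, with the tag optional:

```python
def parse_fuzzer_spec(spec):
    """Split a spec like 'afl:v2.5' into ('afl', 'v2.5').

    A bare fuzzer name with no tag, e.g. 'afl', yields ('afl', None).
    """
    name, sep, tag = spec.partition(':')
    return name, (tag if sep else None)

print(parse_fuzzer_spec('afl:v2.5'))                   # ('afl', 'v2.5')
print(parse_fuzzer_spec('afl:experiment-2020-01-15'))  # ('afl', 'experiment-2020-01-15')
print(parse_fuzzer_spec('afl'))                        # ('afl', None)
```

With such a parser, the report generator could group results by (name, tag) pairs instead of by fuzzer name alone, so two versions of the same fuzzer appear as distinct columns.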
This is also useful to have because specifying a smaller/different subset of fuzzers can affect the result of the top-level statistical analysis (Friedman test, critical difference plot). Further, when we only compare two fuzzers, we can even do more precise statistical tests with more specific visualizations. On the benchmark level, we don't need a pairwise comparison matrix of statistical significance, since we only have a single pair. On the experiment level, we don't need to use the Friedman/Nemenyi test (which compares more than two fuzzers, and is rather conservative), but we can use the Wilcoxon signed-rank test, which is specifically designed for comparing two things (i.e., matched samples for the different benchmarks).
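To illustrate the two-fuzzer case, here is a minimal pure-Python sketch of the Wilcoxon signed-rank statistic over matched per-benchmark scores. This is not FuzzBench's implementation (which would more likely call a statistics library such as scipy), and the coverage numbers are made up for illustration:

```python
def wilcoxon_statistic(a, b):
    """Return W = min(W+, W-) for paired samples a and b.

    W+ (W-) is the sum of ranks of the absolute differences a[i] - b[i]
    that are positive (negative); zero differences are discarded and
    tied absolute differences share their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    # Indices of diffs, ordered by absolute difference.
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        while (j + 1 < len(ordered)
               and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# Illustrative (fabricated) median coverage per benchmark for two versions:
afl_v25 = [1200, 980, 1500, 760, 2100, 1330]
afl_v24 = [1150, 990, 1420, 700, 2050, 1300]
print(wilcoxon_statistic(afl_v25, afl_v24))  # prints 1.0
```

A small W relative to the critical value for the given number of benchmarks indicates a significant difference between the two fuzzers; with more than two fuzzers this pairing breaks down, which is why the report falls back to Friedman/Nemenyi there.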