lszekeres opened this issue 4 years ago
This issue is fixed.
Not everything is fixed yet. There's also:
1) Support version/experiment tagging, e.g., --fuzzers afl:v2.5 afl:v2.4; and
2) a different report when we're comparing only two things:
when we only compare two fuzzers, we can even do more precise statistical tests with more specific visualizations. On the benchmark level, we don't need a pairwise comparison matrix of statistical significance, since we only have a single pair. On the experiment level, we don't need to use the Friedman/Nemenyi test (which compares more than two fuzzers, and is rather conservative), but we can use the Wilcoxon signed-rank test, which is specifically designed for comparing two things (i.e., matched samples for the different benchmarks).
Should we reopen this issue, or would you prefer creating separate issues for these?
My bad, reopening is fine.
One often wants to see a report that compares only two or just a few fuzzers, e.g., only two versions of the same fuzzer.
This can be supported by adding a --fuzzers flag to the report generator, where users can list the fuzzers they would like the report to compare. We should allow specifying fuzzers with different versions by tagging them with either their version number or perhaps an experiment name. E.g.,
generate_report --fuzzers afl:v2.5 afl:v2.4
or
generate_report --fuzzers afl:experiment-2020-01-15 afl:experiment-2020-02-15
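A minimal sketch of how such a fuzzer:tag argument could be split into its parts. This is not FuzzBench code; the function name parse_fuzzer_spec is made up, and the syntax assumed is exactly the fuzzer:tag form shown in the examples above, with the tag optional:

```python
def parse_fuzzer_spec(spec):
    """Split a spec like 'afl:v2.5' into ('afl', 'v2.5').

    A bare fuzzer name with no tag, e.g. 'afl', yields ('afl', None).
    """
    name, sep, tag = spec.partition(':')
    return name, (tag if sep else None)

print(parse_fuzzer_spec('afl:v2.5'))                   # ('afl', 'v2.5')
print(parse_fuzzer_spec('afl:experiment-2020-01-15'))  # ('afl', 'experiment-2020-01-15')
print(parse_fuzzer_spec('afl'))                        # ('afl', None)
```

With such a parser, the report generator could group results by (name, tag) pairs instead of by fuzzer name alone, so two versions of the same fuzzer appear as distinct columns.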
This is also useful to have because specifying a smaller/different subset of fuzzers can affect the result of the top-level statistical analysis (Friedman test, critical difference plot). Further, when we only compare two fuzzers, we can even do more precise statistical tests with more specific visualizations. On the benchmark level, we don't need a pairwise comparison matrix of statistical significance, since we only have a single pair. On the experiment level, we don't need to use the Friedman/Nemenyi test (which compares more than two fuzzers, and is rather conservative), but we can use the Wilcoxon signed-rank test, which is specifically designed for comparing two things (i.e., matched samples for the different benchmarks).
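To illustrate the two-fuzzer case, here is a minimal pure-Python sketch of the Wilcoxon signed-rank statistic over matched per-benchmark scores. This is not FuzzBench's implementation (which would more likely call a statistics library such as scipy), and the coverage numbers are made up for illustration:

```python
def wilcoxon_statistic(a, b):
    """Return W = min(W+, W-) for paired samples a and b.

    W+ (W-) is the sum of ranks of the absolute differences a[i] - b[i]
    that are positive (negative); zero differences are discarded and
    tied absolute differences share their average rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    # Indices of diffs, ordered by absolute difference.
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        while (j + 1 < len(ordered)
               and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# Illustrative (fabricated) median coverage per benchmark for two versions:
afl_v25 = [1200, 980, 1500, 760, 2100, 1330]
afl_v24 = [1150, 990, 1420, 700, 2050, 1300]
print(wilcoxon_statistic(afl_v25, afl_v24))  # prints 1.0
```

A small W relative to the critical value for the given number of benchmarks indicates a significant difference between the two fuzzers; with more than two fuzzers this pairing breaks down, which is why the report falls back to Friedman/Nemenyi there.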