google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Number of benchmarks in critical difference plot #1417

Open jiradeto opened 2 years ago

jiradeto commented 2 years ago

I tried to understand how the critical difference plotting in FuzzBench works, and I see that the number of benchmarks is used when generating the plot. From what I understand, if N is meant to be the number of datasets, shouldn't it instead be the number of trials in our case?

For example, given two experiments (trials=10 and trials=100) that produce the same ranking, shouldn't the experiment with 100 trials have a smaller critical difference?

[Critical difference diagram for the experiment with trials = 10]

[Critical difference diagram for the experiment with trials = 100]

wuestholz commented 2 years ago

@jonathanmetzman Let me rephrase Jiradet's questions slightly. It seems like the number of trials does not affect the critical difference diagrams while the number of benchmarks does. Is this intentional? Intuitively, it seems like more trials should reduce the critical difference since it increases the confidence in the assigned ranks.

jonathanmetzman commented 2 years ago

I found this surprising too, but increasing the number of trials shouldn't affect the critical difference diagram. The diagram shows each fuzzer's average rank across benchmarks, where the rank on each benchmark is based on the median trial. So while adding trials to a benchmark will not change the result, increasing the number of benchmarks will. @lszekeres may be able to explain this better.
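Roughly, the computation looks like the sketch below (not FuzzBench's actual code; the column names "benchmark", "fuzzer", and "coverage" are placeholders for illustration):

```python
# Minimal sketch of how the inputs to a critical difference diagram are derived,
# following Demsar (2006). Not FuzzBench's implementation; column names assumed.
import math

import pandas as pd


def average_ranks_and_cd(df: pd.DataFrame, q_alpha: float):
    # Collapse trials first: one score per (benchmark, fuzzer) pair, the median
    # over its trials. This is why adding trials does not change the diagram.
    medians = df.groupby(['benchmark', 'fuzzer'])['coverage'].median().unstack()

    # Rank the fuzzers within each benchmark (rank 1 = highest median coverage).
    ranks = medians.rank(axis=1, ascending=False)
    avg_ranks = ranks.mean(axis=0)  # average rank of each fuzzer across benchmarks

    # Nemenyi critical difference: only k (#fuzzers) and N (#benchmarks) appear;
    # q_alpha comes from the Studentized range table in the paper.
    k, n_benchmarks = medians.shape[1], medians.shape[0]
    cd = q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_benchmarks))
    return avg_ranks, cd
```

The number of trials only enters through the median, so it can change which fuzzer ranks where, but not the width of the critical difference interval.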

wuestholz commented 2 years ago

@jonathanmetzman @lszekeres Thanks! Yeah, it would be great to understand this in more detail.

In the current setup, the confidence with respect to the avg. rank isn't taken into account for the critical difference diagram. However, it seems like this would be desirable.

Would the following change fix this? Instead of considering the avg. rank per benchmark, consider each benchmark and trial to be a separate "benchmark" with a single ranking. More concretely, if you have 10 benchmarks and 10 trials, consider 10*10 "benchmarks" instead of 10 benchmarks for computing the critical difference diagram.
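For illustration, a sketch of this alternative under the same assumed column names (plus a "trial" column); note that pairing trial i of one fuzzer with trial i of another is an arbitrary choice here:

```python
# Sketch of the proposed alternative: treat every (benchmark, trial) pair as
# its own "sub-benchmark", so N in the critical difference formula becomes
# num_benchmarks * num_trials instead of num_benchmarks. Column names assumed.
import pandas as pd


def average_ranks_per_trial(df: pd.DataFrame) -> pd.Series:
    scores = df.set_index(['benchmark', 'trial', 'fuzzer'])['coverage'].unstack('fuzzer')
    ranks = scores.rank(axis=1, ascending=False)  # rank fuzzers per (benchmark, trial)
    return ranks.mean(axis=0)  # average rank over all benchmark*trial sub-benchmarks
```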

What do you think? Does this make sense?

lszekeres commented 1 year ago

Thanks for looking into this! Would this change be consistent with the definition of "critical difference" in https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf ?

wuestholz commented 1 year ago

@lszekeres Thanks for taking a look! I'm not an expert in statistics, but to me it seems consistent with the definition in the paper you're referencing. This part is from their intro:

[Screenshot of a passage from the paper's introduction]

I think there's some room for interpretation when considering the last two sentences. One might ask whether the sub-benchmarks we propose (in https://github.com/google/fuzzbench/pull/1509) are "reliable", given that the performance of the fuzzer is only evaluated for a single campaign and there is nondeterminism (due to the source of randomness).

I would argue that a long-running fuzzing campaign collects a lot of data about the fuzzer's performance, given that the performance is based on millions of fuzzed inputs. Obviously, more trials per benchmark means that the average performance becomes more reliable since there is even more data. However, when I look at the box plots for different benchmarks in the main experiment results, there is usually not a lot of variance in a fuzzer's performance across the 20 campaigns. There would be no variance at all if we controlled the source of randomness.

Any thoughts?

lszekeres commented 1 year ago

Thanks!

My translation from the paper's terminology to fuzzbench's terminology is the following:

data set => benchmark (e.g., libxml)
classifier => fuzzer
training/testing samples drawn from a data set => trials (fuzzing campaigns) on a benchmark

In my mind, the following quotes from the paper specifically say that the number of "trials" does not directly affect the critical difference (only the precision of the scores):

"We will not record the variance of these scores, but will only assume that the measured results are “reliable”; to that end, we require that enough experiments were done on each data set"

"In our task, multiple resampling from each data set is used only to assess the performance score and not its variance. The sources of the variance are the differences in performance over (independent) data sets and not on (usually dependent) samples, so the elevated Type 1 error is not an issue. Since multiple resampling does not bias the score estimation, various types of crossvalidation or leave-one-out procedures can be used without any risk."

The authors stress this in the last quote:

"We should also stress that the “sample size” in the following section will refer to the number of data sets used, not to the number of training/testing samples drawn from each individual set or to the number of instances in each set. The sample size can therefore be as small as five and is usually well below 30."

I'm not a statistician either and I might be misunderstanding something. What do you think?

wuestholz commented 1 year ago

@lszekeres Thanks for the detailed response! I think my last comment could have been more clear. :)

In case this isn't clear, the proposed alternative would change the "translation" like this:

data set => fuzzing campaign on benchmark (e.g., libxml)

I assume we agree that a single fuzzing campaign collects a lot of data about the fuzzer's performance, given that millions of inputs are sampled during the fuzzing process. I think the main question is whether the "reliability" assumption holds even for a single campaign.

In general, there seem to be three cases to consider:

1) The fuzzer's performance is deterministic and there is no variance: in this case, there is no harm in using just a single fuzzing campaign.

2) The fuzzer's performance has some variance, but it is low: based on the box plots and the coverage plots from the main experiment, this seems to be the most common case. Again, this should satisfy the relatively fuzzy assumption of "reliability" in the paper even for a single fuzzing campaign. Note that any variance will still be captured in the results, since a fuzzer that performs very badly in one campaign will now have a very low ranking in the corresponding sub-benchmark.

3) The fuzzer's performance has a lot of variance: based on the box plots and coverage plots, this does not seem to be very common. In such a situation even the average/median is not very reliable. I would argue that the average actually hides very poor performance in this case. In contrast, using the unaggregated data for a single campaign will preserve the variance, and this will be reflected in the diagram.

What do you think?

lszekeres commented 1 year ago

the proposed alternative would change the "translation" like this:

data set => fuzzing campaign on benchmark (e.g., libxml)

Right. I don't think we can do that. Doing this means that different data sets (campaigns) will be on the same subject (benchmark), which means that our data sets won't be independent. The data sets being independent is an assumption of this statistical test. My understanding is that if we did what is being proposed, and used dependent data sets, we'd be using the statistical test incorrectly and would get incorrect results. I.e., the "elevated Type 1 error" mentioned in the quote above.

I think what you're proposing is specifically addressed in this section of the paper:

3.2.3 CONSIDERING MULTIPLE REPETITIONS OF EXPERIMENTS

There are variations of the ANOVA and the Friedman test which can consider multiple observations per cell provided that the observations are independent (Zar, 1998). This is not the case here, since training data in multiple random samples overlaps. We are not aware of any statistical test that could take this into account.

Let me know if this makes sense.

wuestholz commented 1 year ago

@lszekeres Thanks for pointing out the issue with independence! I see how two different fuzzing campaigns for the same benchmark would not satisfy the assumption of independence for the statistical test.

I'd be curious what you think about the following hypothetical variant: instead of starting all fuzzing campaigns from the same seed inputs (as is happening right now), we start each campaign from a different (randomly chosen) seed corpus. There is evidence (e.g., https://dl.acm.org/doi/pdf/10.1145/3460319.3464795 and https://dl.acm.org/doi/pdf/10.1145/3243734.3243804) that the seed corpus has a significant effect on fuzzing campaigns. I would therefore argue that this increases or perhaps "restores" independence. What do you think?

lszekeres commented 1 year ago

@wuestholz Thanks, I really appreciate that you're thinking about how to make this better!

we start each campaign from a different (randomly chosen) seed corpus

That's an interesting idea! My understanding, however, is that they'd still be dependent. Further, if we used a different seed corpus for each trial, we'd only have a single sample for each "data set", which means that our "scores" wouldn't be precise and reliable. If we want to increase the power of the cross-benchmark statistical test (always a good goal!), there's a straightforward way of doing that:

by increasing the number of benchmarks.

Even with the same amount of resources, I think we should consider halving the number of trials (from 20 to 10) and doubling the number of benchmarks (from ~20 to ~40). I think the medians going into the cross-benchmark test would still be reliable, and the per-benchmark tests (Mann-Whitney) might still be good enough. If we care about the cross-benchmark results the most -- which I believe is the case, as that's what we mean when we say fuzzer A is "better" than fuzzer B -- then I think this reallocation of resources would make a lot of sense.
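For a rough sense of the effect, here is a back-of-the-envelope sketch using the Nemenyi formula from the paper (the numbers of fuzzers and benchmarks below are illustrative):

```python
# Nemenyi critical difference CD = q_alpha * sqrt(k * (k + 1) / (6 * N)), where
# N is the number of benchmarks and k the number of fuzzers (Demsar, 2006).
# q_alpha depends only on k, so doubling N tightens CD by a factor of sqrt(2).
import math


def critical_difference(q_alpha: float, num_fuzzers: int, num_benchmarks: int) -> float:
    return q_alpha * math.sqrt(num_fuzzers * (num_fuzzers + 1) / (6.0 * num_benchmarks))


# The ratio is independent of q_alpha and k: ~1.41x tighter with 40 benchmarks
# instead of 20, regardless of how many trials each benchmark gets.
print(critical_difference(1.0, 10, 20) / critical_difference(1.0, 10, 40))  # ~1.414
```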

What do you all think?

wuestholz commented 1 year ago

@lszekeres Thanks for the suggestion! Yeah, increasing the number of benchmarks would definitely help. Is that easy to do? When I see the low variance for many benchmarks and fuzzers, I'm not even sure 10 trials per benchmark are strictly necessary (unless you care about per-benchmark tests).

I find it hard to assess whether the assumptions of this statistical test are met, since terms such as "dependent" or "reliable" do not seem to have a precise definition or a way of quantifying them.

For instance, I agree that two campaigns with different seeds will be dependent to a certain degree (certainly more dependent than two campaigns for two different programs, but definitely less dependent than two campaigns with the same seed). How do I quantify the degree of dependence and what degree is acceptable for the statistical test?

I struggle similarly with "reliable". For many benchmarks and fuzzers the variance seems to be very low. This is not surprising, since the fuzzer runs for 24 hours and each algorithmic component of the fuzzer (e.g., mutators and power schedules) is sampled millions of times (in contrast, for instance, to an ML classifier). What degree of variance is acceptable for the statistical test?