google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Measure time to achieve a given coverage (not vice versa) #327

Open mboehme opened 4 years ago

mboehme commented 4 years ago

For each new branch to be covered (or anything new to be [dis]covered, really), it takes exponentially more time (a write-up with experiments will be published shortly). This is why a comparison of the coverage achieved by each fuzzer within a fixed time budget gives very small deltas. This relationship might result in spurious differences when fuzzers are ranked in terms of their efficiency.

Instead, I would suggest measuring the time each fuzzer takes to achieve a given coverage. The target coverage could be chosen manually, or set to the coverage achieved in 24h by the fuzzer that achieved the smallest coverage in 24h. This should give larger differences across fuzzers and might facilitate a better assessment of relative fuzzer efficiency.
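As a minimal illustration of the idea (assuming a hypothetical `fuzzer -> [(seconds, edges_covered)]` layout for the coverage snapshots, which is not FuzzBench's actual data model), the measure could look like this:

```python
def time_to_threshold(coverage_over_time):
    """coverage_over_time: fuzzer -> list of (seconds, edges_covered), sorted by time.

    The threshold is the smallest coverage any fuzzer reached at its final snapshot;
    the result maps each fuzzer to the first time it hit that threshold.
    """
    threshold = min(series[-1][1] for series in coverage_over_time.values())
    times = {}
    for fuzzer, series in coverage_over_time.items():
        # Coverage is non-decreasing, so every fuzzer reaches the minimum final
        # coverage by its last snapshot at the latest.
        times[fuzzer] = next(seconds for seconds, covered in series if covered >= threshold)
    return threshold, times
```

Ranking fuzzers by these times (smaller is better) would then replace ranking by the coverage reached at the 24h mark.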

jonathanmetzman commented 4 years ago

This seems like it's worth trying but I'm personally not so sure about using time-to-coverage/bug as a way of comparing fuzzers. Definitely interested in the writeup.

This is why a comparison of the coverage achieved by each fuzzer within a fixed time budget gives very small deltas.

Could this also be a factor of the benchmarks we chose? I think we can get bigger deltas by picking more benchmarks like sqlite/freetype/libpcap and fewer like openssl. I wonder whether selecting for larger deltas would also mean more accurate results.

The given coverage could be determined manually or using the coverage achieved in 24h by that fuzzer which achieved the smallest coverage in 24h

Is this so that we only compare times on edges found by every fuzzer?

If I understand this correctly, I think it would produce odd results in some (extreme) cases. Take the libpcap benchmark, entropic's median coverage is 1771.5 while aflsmart's is 18. I don't think it matters which fuzzer finds those 18 edges faster; IMO I'm more likely to find bugs in this target with entropic than with aflsmart (using the setup of each fuzzer in FuzzBench).

But I could see how this results in better comparisons on benchmarks like re2 where most fuzzers cover within 100 edges of each other.

Maybe I'm misunderstanding something?

I might be making some changes to the measurer that will save the edges covered at each snapshot in GCS (currently we only store counts). If I do that, time-to-cover-edge should be possible to compute for each new experiment. It's probably worth starting with this and then seeing if FuzzBench should support time-to-cover-edge.
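As a rough sketch of what that computation could look like once per-snapshot edge sets are stored (the layout below is made up for illustration; today only counts are kept):

```python
def time_to_cover_each_edge(snapshot_edge_sets):
    """snapshot_edge_sets: list of (seconds, set_of_edge_ids) for one trial, ordered by time.

    Returns edge_id -> the first snapshot time at which that edge was covered,
    which is what a time-to-cover-edge report would aggregate over trials and fuzzers.
    """
    first_covered = {}
    for seconds, edges in snapshot_edge_sets:
        for edge in edges:
            # Only record the earliest snapshot in which the edge appears.
            first_covered.setdefault(edge, seconds)
    return first_covered
```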

mboehme commented 4 years ago

Here is why I think it is useful: suppose that in 24 hours Honggfuzz covers 4563 edges, AFL++ covers 4550 edges, and AFL also covers 4550 edges. In other words, Honggfuzz covers 0.3% more edges than AFL and AFL++, while AFL and AFL++ seem perfectly on par. However, Honggfuzz covers the same number of edges that AFL covers in 24 hours in only 6 hours, i.e., in 25% of the time (or 4 times faster), while AFL++ covers the same number of edges in only 18 hours, i.e., in 75% of the time.

This is why a comparison of the coverage achieved by each fuzzer within a fixed time budget gives very small deltas.

Could this also be a factor of the benchmarks we chose? I think we can get bigger deltas by picking more benchmarks like sqlite/freetype/libpcap and fewer like openssl. I wonder whether selecting for larger deltas would also mean more accurate results.

I don't think the problem is with our benchmarks but with our measures. You probably want to evaluate your fuzzers on all benchmarks.

Take the libpcap benchmark, entropic's median coverage is 1771.5 while aflsmart's is 18.

I see your point. Can this be corrected while taking into account the intuition I gave in the example above?

I might be making some changes to the measurer that will save the edges covered at each snapshot in GCS (currently we only store counts). If I do that, time-to-cover-edge should be possible to compute for each new experiment. It's probably worth starting with this and then seeing if FuzzBench should support time-to-cover-edge.

There is a lot happening at the beginning of the fuzzing campaign. Are you considering changing the constant interval (DEFAULT_SNAPSHOT_SECONDS=15min) to an exponentially increasing interval? For instance, for 100 snapshots in 24 hours, you could take the snapshots at 2^seq(0, log2(24*60*60), log2(24*60*60)/100) seconds (in R).
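For reference, a Python equivalent of that schedule (hypothetical; the current measurer uses the constant interval mentioned above), which makes snapshots dense early in the campaign and sparse later:

```python
import math

def exponential_snapshot_times(total_seconds=24 * 60 * 60, num_snapshots=100):
    """Snapshot times in seconds: 2^(i * log2(total_seconds) / num_snapshots) for i = 0..num_snapshots."""
    step = math.log2(total_seconds) / num_snapshots
    return [2 ** (i * step) for i in range(num_snapshots + 1)]

# exponential_snapshot_times() runs from 1 second up to 86400 seconds (24 hours),
# with roughly a third of the snapshots falling in the first minute of the campaign.
```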

jonathanmetzman commented 4 years ago

Here is why I think it is useful: suppose that in 24 hours Honggfuzz covers 4563 edges, AFL++ covers 4550 edges, and AFL also covers 4550 edges. In other words, Honggfuzz covers 0.3% more edges than AFL and AFL++, while AFL and AFL++ seem perfectly on par. However, Honggfuzz covers the same number of edges that AFL covers in 24 hours in only 6 hours, i.e., in 25% of the time (or 4 times faster), while AFL++ covers the same number of edges in only 18 hours, i.e., in 75% of the time.

Makes sense, I think I agree that the time here is a better measure of ability.

I don't think the problem is with our benchmarks but with our measures. You probably want to evaluate your fuzzers on all benchmarks.

Fair, this was similar to my initial thinking. I have some more thoughts, but they are off-topic, so I'll put them in https://github.com/google/fuzzbench/issues/48#issuecomment-627754082

Take the libpcap benchmark, entropic's median coverage is 1771.5 while aflsmart's is 18.

I see your point. Can this be corrected while taking into account the intuition I gave in the example above?

Do you have ideas on how we would correct for this? The first that comes to my mind is to only use time-to-coverage for fuzzers that reach relatively similar numbers of edges?

There is a lot happening at the beginning of the fuzzing campaign. Are you considering changing the constant interval (DEFAULT_SNAPSHOT_SECONDS=15min) to an exponentially increasing interval? For instance, for 100 snapshots in 24 hours, you could take the snapshots at 2^seq(0, log2(24*60*60), log2(24*60*60)/100) seconds (in R).

It is something that has been suggested (by @lszekeres, I believe), but I don't think anyone has plans to implement it right now. I think if we were to do this, we should include a mapping from snapshot archives to times in each experiment. Right now we don't, so processing corpora from old experiments becomes hard whenever we change the code.

I'll also note (mainly for myself so I remember) that this might conflict a little with some ideas I have about using preemptible instances.

mboehme commented 4 years ago

Take the libpcap benchmark, entropic's median coverage is 1771.5 while aflsmart's is 18.

I see your point. Can this be corrected while taking into account the intuition I gave in the example above?

Do you have ideas on how we would correct for this? The first that comes to my mind is to only use time-to-coverage for fuzzers that reach relatively similar numbers of edges?

Yes, this was my first thought, as well.

In any case, it would be interesting to add this to FuzzBench. In my papers, I prefer to report these numbers rather than the coverage improvement within a fixed time budget.

lszekeres commented 4 years ago

This can be added easily as an alternative report type under the analysis lib.

I.e., for each benchmark pick an edge coverage value, and then rank the fuzzers by the time it took each of them to reach that coverage level. Picking this value could be tricky, though: if we use the highest coverage that all fuzzers have reached during the experiment (i.e. the lowest final coverage reached), then in practice we'll get something like this ([coverage-over-time plot omitted]), where the majority of the fuzzers reach that coverage level in a matter of minutes. So we might want to pick a higher level according to some threshold, and assign time_to_reach=∞ to some of the worst-performing fuzzers. Wdyt?
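A hedged sketch of that report variant (the percentile-based threshold and the data layout are illustrative choices, not settled FuzzBench behaviour):

```python
import math
import statistics

def rank_by_time_to_threshold(coverage_over_time, final_coverage, percentile=75):
    """coverage_over_time: fuzzer -> list of (seconds, edges_covered), sorted by time.
    final_coverage: fuzzer -> edges covered at the end of the campaign.

    The threshold is a percentile of the final coverages (rather than the minimum),
    and fuzzers that never reach it are assigned an infinite time-to-reach.
    """
    cut_points = statistics.quantiles(final_coverage.values(), n=100)
    threshold = cut_points[percentile - 1]
    times = {}
    for fuzzer, series in coverage_over_time.items():
        reached = [seconds for seconds, covered in series if covered >= threshold]
        times[fuzzer] = min(reached) if reached else math.inf
    return threshold, sorted(times.items(), key=lambda item: item[1])
```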

mboehme commented 4 years ago

Sounds good to me.

Another option to determine a coverage value: you could allow the user to choose a baseline fuzzer and rank other fuzzers against that baseline. In this example, a user might choose libfuzzer and see how entropic (which extends libfuzzer) ranks against its baseline (~1200 edges in 9 hours vs 24 hours).

mboehme commented 4 years ago

Definitely interested in the writeup.

https://mboehme.github.io/paper/FSE20.EmpiricalLaw.pdf