More detailed output statistics.

WorksButNotTested commented 3 years ago

As an end user, it’s the number of defects found which is of interest and hence the number of paths discovered is the best proxy for this and duly show in the fuzzbench reports.

As a developer, there are two main ways to improve a fuzzer. Either increase its performance (iterations per second) or increase the yield (paths found per iteration, fractional) by introducing new techniques. As a developer, it would be really useful if these statistics could also be graphed in the fuzzbench output.

This could allow you to answer the following questions: Have my changes improved the speed of the fuzzer? Is the speed of the fuzzer consistent throughout the 23h run? Is it consistent between runs? Is the relative speed of my fuzzer better or worse for some benchmarks compared to another fuzzer? How much has my new fancy technique improved the yield? How bad has the associated performance overhead been?

Obviously these stats are meaningless on their own, having the fastest fuzzer means nothing if the yield is poor and viceversa. But I think these stats could provide very useful insights for developers.

Does this sound possible? What are your thoughts?

vanhauser-thc commented 3 years ago

I would like to have these stats as well, however I am not sure how to get them, as this is dependant on the fuzzer if it even generates that data and if so if it is made available and if so - how. for afl that is simple. for others ... not sure.

WorksButNotTested commented 3 years ago

Is there a common format for the fuzzers to share how many paths they have discovered over time? How are the current path statistics captured? Could this be extended?

vanhauser-thc commented 3 years ago

of course there is not (well, sadly). there is no standard, it is different devs, etc.

afl writes this information plot file so everything is there already.

this could be done for libfuzzer+entropic by analyzing the logs (for exec/s and edges found) as this is printed every minute to stdout and captured in the log file on fuzzbench.

for honggfuzz there is however nothing. it only prints a final stat on exit which shows the execs performed + edges found, but there is no way you could write a graph (to my knowledge, maybe it has hidden or not well documented options for that).

It would be great though if fuzzbench could gather that data for afl variants + libfuzzer variants as this is really tedious to do. maybe just writing it to a specific directory in the bucket. that would save a lot of time.

WorksButNotTested commented 3 years ago

How does it handle the number of paths found over time (which it does graph)? Is this a different means of collecting and a different format for each fuzzer too?

vanhauser-thc commented 3 years ago

this is done by analyzing the saved inputs (-o out data) in 15 min intervals. no log/print output of any fuzzer is assessed.

WorksButNotTested commented 3 years ago

this is done by analyzing the saved inputs (-o out data) in 15 min intervals. no log/print output of any fuzzer is assessed.

So actually, the results are how many paths the fuzzer thinks it has found (based on how many seeds there are)? So if a fuzzer has a lot of instability, or erroneously decides similar paths are in fact dissimilar, then the fuzzer could end up incorrectly with a higher score? So a defect in the fuzzer may end up with it scoring highly?

it seems like the answer to collecting these sorts of stats is actually fuzzer specific, and since I’m really only interested in afl++ variants right now, perhaps the solution should be afl++ specific and hence reside within that repo? Perhaps a modification to afl-plot to support aggregating different runs from the same fuzzer and compare them to runs of another fuzzer?

vanhauser-thc commented 3 years ago

no wrong. no paths. never. fuzzbench does statement coverage afaik, which is - as it comes down I think - the same as edge coverage.

WorksButNotTested commented 3 years ago

Nonetheless though, the fuzzer is expected to report the coverage itself (effectively marking its own homework) rather than it being independently verified (e.g. by a control)?

vanhauser-thc commented 3 years ago

but every fuzzer might instrument differently and therefore report a different coverage. or even be wrong about it :) how can you compare fuzzers if you do not do that independently?

jonathanmetzman commented 3 years ago

Sorry for the late reply

this is done by analyzing the saved inputs (-o out data) in 15 min intervals. no log/print output of any fuzzer is assessed.

+1

So actually, the results are how many paths the fuzzer thinks it has found (based on how many seeds there are)? So if a fuzzer has a lot of instability, or erroneously decides similar paths are in fact dissimilar, then the fuzzer could end up incorrectly with a higher score? So a defect in the fuzzer may end up with it scoring highly?

The way FuzzBench decides on number of edges or crashes found by a target is by running the targeted code on a fuzzer's output corpus. So yes, instability or a difference between what Clan'gs source based profiling thinks is a unique block and what the fuzzer thinks is unique can mean a fuzzer is penalized or rewarded.

If you care about the speed of a fuzz target or some other custom stat, you can use the custom stats feature. I haven't documented it yet, but if you are interested I can do so. I think I need to make a small change so the stats get output somewhere but that should be easy too. Just let me know. I think I can do this by mid-next week.

WorksButNotTested commented 3 years ago

Some docs for that would be awesome. It would be cool if fuzzbench could show the same type of graphs for execs as it does for paths. Not sure how easy that is?

But even if you can output a csv which collates all the data that would be awesome, then I can just Google a load of ms excel magic!

it would be better still if the stats has both the paths found and the execs.

Thanks.

Qwaz commented 3 years ago

Some docs for that would be awesome.

I think you can check the fuzz bench paper. It has a lot of details about the architecture, how it evaluates generalized performance of different fuzzers, and justification for the evaluation metric.

Some quotes from the paper that might be of your interest:

Note that beyond code and bug coverage, we allow tool integrators to export their own custom metrics (e.g., number of executions, RAM usage).

All reports make the raw data available so researchers can do their own custom analysis. For custom analyses, researchers can use FuzzBench’s analysis library for generating their own plots, tables, and statistical tests.

jonathanmetzman commented 3 years ago

Damn, I completely forgot about this. lemme try to do it next Friday.

WorksButNotTested commented 3 years ago

No worries. Easy to get distracted with other work. That would be great. Thanks.

jonathanmetzman commented 3 years ago

:-( this needs to be punted again because of some new deadlines I have. Sorry. I'll probably get to this next Monday.

WorksButNotTested commented 3 years ago

That's no worries. Just keep me posted how you get on!

google / fuzzbench

More detailed output statistics. #1253