google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Support measuring/reporting for "unmanaged" trials. #722

Open · jonathanmetzman opened 4 years ago

jonathanmetzman commented 4 years ago

FuzzBench assumes that it can build and run trials for everything it needs to measure. However, there are some fuzzers (e.g. @gamozolabs' tko) that FuzzBench probably can't build and run. I think it would be very interesting for FuzzBench to be able to measure the performance of these fuzzers on our benchmarks: someone who wants to benchmark such a fuzzer could just give us the corpus archives, and the measurer/reporter would treat it like any other fuzzer.

At least for now, all we would need to support this is some way of telling FuzzBench that a fuzzer is unmanaged (i.e. create DB entities for it, but don't try building or running it). The "unmanaged" trials could then simply put their corpus archives on Google Cloud Storage so they can be measured and compared against our other fuzzers.

As Brandon points out, this would make cheating easy, so the report probably needs some kind of asterisk and explanation. Even without intentional cheating, the comparison is problematic because the CPU family the unmanaged trials run on probably differs from the one we use for the "managed" (regular) trials. But I still think there is value here. Especially when we see gaps like the one on libpcap_fuzz_both, I'm sure that even double or triple the CPU time isn't going to make fuzzers like AFL do better than the others.
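To make the mechanism concrete, here's a rough sketch of the scheduler/measurer side. Every name in it (`UNMANAGED_FUZZERS`, the GCS path layout, etc.) is hypothetical rather than FuzzBench's actual API, and it assumes the google-cloud-storage client and credentials are available:

```python
# Hypothetical sketch -- none of these names are FuzzBench's real API.
# Idea: create DB entities for an unmanaged fuzzer as usual, skip the
# build/run step, and have the measurer list the corpus archives that
# the trial's owner uploaded to GCS out of band.
from google.cloud import storage

UNMANAGED_FUZZERS = {'tko'}  # declared by whoever requests the experiment


def should_build_and_run(fuzzer: str) -> bool:
    """Scheduler check: unmanaged trials get DB entities but no runner."""
    return fuzzer not in UNMANAGED_FUZZERS


def unmanaged_corpus_archives(bucket_name: str, experiment: str,
                              fuzzer: str, benchmark: str, trial_id: int):
    """List the corpus archives an unmanaged trial uploaded itself.

    The measurer could then consume these exactly like archives produced
    by a managed runner.
    """
    prefix = f'{experiment}/{benchmark}-{fuzzer}/trial-{trial_id}/corpus/'
    return list(storage.Client().list_blobs(bucket_name, prefix=prefix))
```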

gamozolabs commented 4 years ago

Oooh yeah, getting the time domains to mean anything here is going to be really hard; it might require coverage-per-case info to get a meaningful signal. Especially since there'd be no way to know whether the remote side is using threads (even unintentionally, just from a mistake when it was deployed!).

Other than using instruction counts, I don't know how to normalize performance properties between machines. Tbh, even on the same machine it's hard because of scheduling; maybe you get assigned to a hyperthread and you have a sad day.

This can be controlled somewhat by using x86 rdtsc cycles, since cycle counts are a portable metric of time between processors of the same uarch, even ones running different numbers of cores or different clock rates. Then again, turbo boost doesn't affect the TSC rate, so a turboed machine would run "faster" than its TSC suggests, and turbo rates vary by CPU model.

When it comes down to running benchmarks on multi-CPU systems (idk what FuzzBench actually runs on GCP), even things like NUMA locality and cache-coherency traffic could give some cores a 50% latency penalty on memory accesses, depending on the layout of certain structures in the kernel (e.g. which NUMA node holds kernel code and filesystem cache data).
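For what it's worth, the TSC is cheap to read even from Python. A minimal sketch, assuming Linux on x86-64, and purely illustrative of the primitive being discussed rather than anything FuzzBench does:

```python
import ctypes
import mmap

# Illustrative only (Linux, x86-64): map the raw rdtsc instruction into
# executable memory and call it from Python via ctypes.
_RDTSC_CODE = bytes([
    0x0F, 0x31,              # rdtsc          ; TSC -> EDX:EAX
    0x48, 0xC1, 0xE2, 0x20,  # shl rdx, 32    ; move high half up
    0x48, 0x09, 0xD0,        # or  rax, rdx   ; full 64-bit count in RAX
    0xC3,                    # ret
])

_buf = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
_buf.write(_RDTSC_CODE)
rdtsc = ctypes.CFUNCTYPE(ctypes.c_uint64)(
    ctypes.addressof(ctypes.c_char.from_buffer(_buf)))

start = rdtsc()
# ... run one fuzz case here ...
print('TSC cycles elapsed:', rdtsc() - start)
```

Even with that, the caveats above still apply: the raw cycle counts only compare cleanly across machines of the same uarch, and turbo means TSC cycles don't map linearly to actual work done.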

jonathanmetzman commented 4 years ago

Heh, yeah, normalizing for a fair comparison sounds tough. As I mentioned, though, there might be cases where the comparison is still valuable even if the machine specs aren't similar (even if it's not totally scientific). Take this benchmark, for example: https://www.fuzzbench.com/reports/2020-09-07/libpcap_fuzz_both_coverage_growth.svg

I'm pretty sure that even given 10x the CPU, AFLsmart would do worse than honggfuzz there. When some technique dominates a particular benchmark like that, doing the comparison might be useful even if it isn't exactly sound.