google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Variance in FuzzBench trials with same random seed #667

Open jonathanmetzman opened 4 years ago

jonathanmetzman commented 4 years ago

@vanhauser-thc wrote here:

the experiment has started => https://www.fuzzbench.com/reports/2020-08-12/index.html and we can see that the aflplusplus_same1 .. aflplusplus_same3 instances have quite different coverage.

for several benchmarks this will even out as their coverage saturates after 6 hours, so capturing this now is important.

easiest to see for curl, re2 and sqlite3:

| benchmark | aflplusplus_same1 | aflplusplus_same2 | aflplusplus_same3 |
|---|---|---|---|
| curl | 17148.0 | 17161.5 | 17155.0 |
| re2 | 3461.0 | 3455.0 | 3488.0 |
| sqlite3 | 26895.0 | 26611.5 | 26420.5 |

as all PRNG values are the same and input selection is based on the same values, this shows that the instances are running at different speeds.
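
For a rough sense of scale, here is a small illustrative Python sketch (not part of the FuzzBench tooling) that computes the per-benchmark spread of the three same-seed variants from the numbers above:

```python
# Illustrative only: relative spread of the three same-seed trials per benchmark,
# using the coverage values from the table above.
coverage = {
    "curl":    [17148.0, 17161.5, 17155.0],
    "re2":     [3461.0, 3455.0, 3488.0],
    "sqlite3": [26895.0, 26611.5, 26420.5],
}

for benchmark, values in coverage.items():
    spread = max(values) - min(values)
    median = sorted(values)[1]  # median of three values
    print(f"{benchmark}: spread = {spread:.1f} edges ({100 * spread / median:.2f}% of median)")
```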

jonathanmetzman commented 4 years ago

I ran the three examples you mention on my local machine. Here is curl: [screenshot: diverging runs]. On the left we have re2 and on the right we have sqlite: [Screenshot from 2020-08-12 10-22-43].

Though the times diverge, even at the exact same time the two runs don't have the same number of paths.

I did everything I could think of to make runs comparable, and I used this command for each run: `AFL_SKIP_CPUFREQ=1 ./afl-fuzz -d -s 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647` (fuzz-target was changed for different benchmarks).
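
As an illustration of the comparison being made here, a minimal sketch that lines up two local runs by wall-clock time using the plot_data file afl-fuzz writes into each output directory; the output paths are placeholders and the exact column names depend on the AFL/AFL++ version:

```python
# Illustrative sketch: compare the path counts of two local afl-fuzz runs at the
# same points in time, using each run's plot_data file. Column names differ
# between AFL/AFL++ versions, so the header line is parsed instead of hardcoded.
def paths_over_time(plot_data_path):
    with open(plot_data_path) as f:
        header = f.readline().lstrip("# ").strip().split(", ")
        time_col = "relative_time" if "relative_time" in header else "unix_time"
        paths_col = "corpus_count" if "corpus_count" in header else "paths_total"
        series = []
        for line in f:
            row = dict(zip(header, line.strip().split(", ")))
            series.append((int(row[time_col]), int(row[paths_col])))
    return series

# Placeholder paths; newer AFL++ versions put plot_data under a default/ subdir.
run_a = paths_over_time("/tmp/o1/plot_data")
run_b = paths_over_time("/tmp/o2/plot_data")
for (t_a, p_a), (_, p_b) in zip(run_a, run_b):
    print(f"t={t_a:>10}  run_a={p_a:>6}  run_b={p_b:>6}  diff={p_a - p_b:+d}")
```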

I don't really have the time to investigate this so thoroughly right now, but I plan on doing more when I have free time since it is an interesting question. But it seems like at least some of the variance isn't really due to fuzzbench itself.

One way we plan to address this is by recording things like CPU/RAM usage and allowing fuzzers to report their own custom stats.

peteroupc commented 4 years ago

Can you explain how the "random seed" is causing variance and how this is a problem? Does it cause variance in the inputs to be fuzzed?

I should note that there are many sources of non-reproducibility, and they all come down to "features or behavior outside the application's control". All the more so if third-party software is being tested for crashes. For example the following things can get in the way of reproducibility:

  • Floating-point numbers, due to differences in accuracy, rounding, and operation order. This includes parallel reductions.
  • Multithreading and dynamic task scheduling.
  • Nondeterministic sources (such as the file system or the system clock).
  • Undocumented, undefined, or implementation-dependent behavior or features.

jonathanmetzman commented 4 years ago

Can you explain how the "random seed" is causing variance and how this is a problem? Does it cause variance in the inputs to be fuzzed?

I don't think the seed is causing varying behavior. I think the problem is that setting the same random seed should ideally do the opposite: ideally, if I run afl-fuzz with the same inputs and the same random seed, it should produce the same results every time.

I should note that there are many sources of non-reproducibility, and they all come down to "features or behavior outside the application's control". All the more so if third-party software is being tested for crashes. For example the following things can get in the way of reproducibility:

  • Floating-point numbers, due to differences in accuracy, rounding, and operation order. This includes parallel reductions.
  • Multithreading and dynamic task scheduling.
  • Nondeterministic sources (such as the file system or the system clock).
  • Undocumented, undefined, or implementation-dependent behavior or features.

Yes I suspect some of these are the sources for the variance we are seeing.

vanhauser-thc commented 4 years ago

To compare, you should not look at the time but at the number of executions. As there is always randomness in how the processes are scheduled, that is what must be matched.

If there is a difference ... maybe there is still something I haven’t thought about.

But I would rather put my money on the not 100% stability. This will very likely lead to a different seed ranking and from there the diversity starts and grows.
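
A minimal sketch of the execution-aligned comparison suggested here; the (total_execs, edges_found) samples are assumed to come from periodic fuzzer stats snapshots, and the numbers below are placeholders:

```python
# Illustrative sketch: line up two runs by number of executions rather than
# wall-clock time, then compare coverage at matched execution counts.
import bisect

def coverage_at_execs(samples, execs):
    """samples: sorted list of (total_execs, edges_found); returns the coverage
    of the last sample taken at or before `execs`."""
    idx = bisect.bisect_right([e for e, _ in samples], execs) - 1
    return samples[idx][1] if idx >= 0 else 0

run_a = [(1_000_000, 1200), (2_000_000, 1350), (3_000_000, 1410)]  # example data
run_b = [(1_100_000, 1195), (2_050_000, 1352), (2_900_000, 1400)]  # example data

for execs in (1_000_000, 2_000_000, 3_000_000):
    print(execs, coverage_at_execs(run_a, execs), coverage_at_execs(run_b, execs))
```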

jaylinski commented 4 years ago

Sorry to chime in here, but do you have to provide a value for the -s param, or will it fall back to a predefined fixed seed?

jonathanmetzman commented 4 years ago

To compare, you should not look at the time but at the number of executions. As there is always randomness in how the processes are scheduled, that is what must be matched.

Good point!

But I would rather put my money on the not 100% stability. This will very likely lead to a different seed ranking and from there the diversity starts and grows.

IMO this is the most likely culprit. We can report stability when https://github.com/google/fuzzbench/pull/648 lands and try another experiment with fixed seeds and setting persistent executions to 1. I suspect this would boost determinism quite a bit (but maybe not enough, since I assume that if there is any non-determinism, seeds won't help).

jonathanmetzman commented 4 years ago

Sorry to chime in here, but do you have to provide a value for the -s param, or will it fall back to a predefined fixed seed?

Sorry, is this a question for me? I did provide a value; the command I used was `AFL_SKIP_CPUFREQ=1 ./afl-fuzz -d -s 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647`. I initially pasted `AFL_SKIP_CPUFREQ=1 ./afl-fuzz -s -d 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647` into my second comment. But that was not the command I used (and AFL++ would refuse to run given that), so I corrected my comment. Apologies!

jaylinski commented 4 years ago

Sorry, is this a question for me? I did provide a value; the command I used was `AFL_SKIP_CPUFREQ=1 ./afl-fuzz -d -s 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647`. I initially pasted `AFL_SKIP_CPUFREQ=1 ./afl-fuzz -s -d 1234567890 -mnone -i corpus/ -o /tmp/o ./fuzz-target 2147483647` into my second comment. But that was not the command I used (and AFL++ would refuse to run given that), so I corrected my comment. Apologies!

Ah, I see. Question answered :+1:

jonathanmetzman commented 4 years ago

We continued some of this discussion in discord. To summarize what was learned there, @gamozolabs tested the hypothesis that it's non-determinism in fuzz targets that is causing AFL++ to behave non-deterministically. From some eyeball experiments, it seems like this is definitely not the case.

vanhauser-thc commented 4 years ago

Using afl++ 2.66c or newer with llvm 9 or newer and afl-clang-fast/afl-clang-lto with the default instrumentation options produces deterministic results if the target has 100% stability. However, using -c cmplog makes it non-deterministic.

TLDR: the aflplusplus_same* variants have to produce the same results if run for the same number of executions on those benchmarks that have 100% stability. These are (at least): libpng, lcms, vorbis, woff2; those that do not (at least): bloaty_fuzz_target, freetype2-2017, harfbuzz-1.3.2, libpcap_fuzz_both, openssl_x509, openthread-2019-12-23, re2-2014-12-09, sqlite3_ossfuzz

this is just 15 of the 21 benchmarks though
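
A rough way to spot-check the "100% stability" property outside of afl-fuzz is sketched below: run afl-showmap twice on the same input and diff the coverage maps. This is not the same calculation as afl-fuzz's reported stability metric; it assumes a target that reads its input from stdin, and the paths are placeholders.

```python
# Illustrative stability spot-check using afl-showmap: execute the same input
# twice and compare the recorded coverage maps.
import subprocess

def showmap(target, input_path, map_path):
    with open(input_path, "rb") as inp:
        subprocess.run(["afl-showmap", "-q", "-o", map_path, "--", target], stdin=inp)
    with open(map_path) as f:
        return set(line.strip() for line in f if line.strip())

map1 = showmap("./fuzz-target", "corpus/seed", "/tmp/map1")
map2 = showmap("./fuzz-target", "corpus/seed", "/tmp/map2")
print("identical maps" if map1 == map2 else f"{len(map1 ^ map2)} differing entries")
```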

vanhauser-thc commented 4 years ago

Though the times diverge, at the exact same time,

All three of your targets have below 100% stability, and this statement shows that it is very likely the cause: if it diverges at the exact same time, then it means that a different seed was selected at that moment, the cause being the instability.

try with lcms, vorbis or woff2 instead, no cmplog

EDIT: which explains why these three benchmarks you tested were also the ones in the fuzzbench run with the highest differences among the three same variants. :)

vanhauser-thc commented 4 years ago

btw, is there any other fuzzer that can be configured to run deterministically? Because afl and its variants can't, honggfuzz can't... maybe libfuzzer? Not if it selects seeds based on a rand() over a list that was generated in readdir() order, or if the runtime is a factor for seed selection. It feels odd that this is not standard. How would you otherwise effectively test changes in a fuzzer?

jonathanmetzman commented 4 years ago

btw, is there any other fuzzer that can be configured to run deterministically? Because afl and its variants can't, honggfuzz can't... maybe libfuzzer? Not if it selects seeds based on a rand() over a list that was generated in readdir() order, or if the runtime is a factor for seed selection. It feels odd that this is not standard. How would you otherwise effectively test changes in a fuzzer?

libfuzzer supports -seed=N. I think it is not a standard because the original AFL doesn't support it and everyone copied this.

All three of your targets have below 100% stability, and this statement shows that it is very likely the cause: if it diverges at the exact same time, then it means that a different seed was selected at that moment, the cause being the instability.

try with lcms, vorbis or woff2 instead, no cmplog

Right, that makes sense.

Let's look at a report of the first 15 minutes of the experiment for the fixed seed fuzzers on the stable benchmarks (so time doesn't smooth out differences in behavior by reaching a plateau). http://commondatastorage.googleapis.com/fuzzbench-reports/stable-report/index.html

None of these fuzzers shows a statistically significant difference from another on any particular benchmark, or in aggregate.

While I would be more comfortable if every benchmark looked like libpng where every trial gets within 2 edges of one another, it seems like random environment differences aren't affecting results.

Even if we don't ignore the unstable benchmarks, in the first 15 minutes (http://commondatastorage.googleapis.com/fuzzbench-reports/unstable-report/index.html) the fuzzers in aggregate do pretty much the same: the "best" has a 99.91 average normalized score, while the worst has 99.62. This is also true for the rest of the experiment (https://www.fuzzbench.com/reports/2020-08-12/index.html): the best score is 89.06 and the worst is 88.58. Neither of these differences is critical according to the diagrams, and I personally wouldn't change fuzzers because one of them finds 0.48% more edges on average across 21 benchmarks.
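
For context, the kind of per-benchmark significance check behind these report statements can be sketched as a two-sided Mann-Whitney U test over the trials' final coverage; the numbers below are made-up placeholders, not data from the experiment.

```python
# Illustrative significance check: compare the final edge counts of two
# same-seed variants on one benchmark with a two-sided Mann-Whitney U test.
from scipy.stats import mannwhitneyu

same1 = [2470, 2468, 2471, 2469, 2470]  # final edge counts per trial (example)
same2 = [2469, 2471, 2470, 2468, 2472]  # final edge counts per trial (example)

stat, p_value = mannwhitneyu(same1, same2, alternative="two-sided")
print(f"U={stat}, p={p_value:.3f} -> "
      f"{'significant' if p_value < 0.05 else 'not significant'} at the 0.05 level")
```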

Does this still seem problematic?

a1xndr commented 4 years ago

maybe libfuzzer? not if they select seeds based on a rand() on a list that was generated by readdir() order or if the runtime is a factor for a seed.

IIRC libfuzzer will sort the input corpus by size by default (when -prefer_small=1) before executing it. I don't think there is a rand() problem when using -seed. Anecdotally, libFuzzer with the same seed is generally reproducible for me (of course, as long as state doesn't leak).

jonathanmetzman commented 4 years ago

Using afl++ 2.66c or newer with llvm 9 or newer and afl-clang-fast/afl-clang-lto with the default instrumentation options produces deterministic results if the target has 100% stability. However, using -c cmplog makes it non-deterministic.

TLDR: the aflplusplus_same* variants have to produce the same results if run for the same number of executions on those benchmarks that have 100% stability. These are (at least): libpng, lcms, vorbis, woff2; those that do not (at least): bloaty_fuzz_target, freetype2-2017, harfbuzz-1.3.2, libpcap_fuzz_both, openssl_x509, openthread-2019-12-23, re2-2014-12-09, sqlite3_ossfuzz

Yes, using the -E option as you recommended on discord allowed me to get results that looked deterministic with respect to a given seed. If the above is a problem, we might be able to support experiments that stop when the fuzzer decides to end them, so you could control for differences in execution speed among trials. This wouldn't help with comparing different tools, but it would allow for rock-solid comparisons of different changes to the same tool.
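
A sketch of the kind of fixed-execution-budget local comparison this enables: run two identical, seeded trials capped with -E as above and then diff a few fuzzer_stats fields. The paths and the budget are placeholders, and the fuzzer_stats field names and location vary with the AFL++ version (newer versions add a default/ subdirectory).

```python
# Illustrative sketch: two seeded trials with a fixed execution budget, then a
# comparison of selected fuzzer_stats fields.
import os
import subprocess

def run_trial(out_dir):
    subprocess.run(
        ["./afl-fuzz", "-d", "-s", "1234567890", "-E", "1000000", "-m", "none",
         "-i", "corpus/", "-o", out_dir, "./fuzz-target", "2147483647"],
        env={**os.environ, "AFL_SKIP_CPUFREQ": "1"},
    )

def read_stats(out_dir):
    stats = {}
    with open(os.path.join(out_dir, "fuzzer_stats")) as f:
        for line in f:
            key, _, value = line.partition(":")
            stats[key.strip()] = value.strip()
    return stats

for out in ("/tmp/same_a", "/tmp/same_b"):
    run_trial(out)

stats_a, stats_b = read_stats("/tmp/same_a"), read_stats("/tmp/same_b")
for field in ("execs_done", "paths_total", "bitmap_cvg"):
    print(field, stats_a.get(field), stats_b.get(field))
```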

vanhauser-thc commented 4 years ago

btw, is there any other fuzzer that can be configured to run deterministically? Because afl and its variants can't, honggfuzz can't... maybe libfuzzer? Not if it selects seeds based on a rand() over a list that was generated in readdir() order, or if the runtime is a factor for seed selection. It feels odd that this is not standard. How would you otherwise effectively test changes in a fuzzer?

libfuzzer supports -seed=N. I think it is not a standard because the original AFL doesn't support it and everyone copied this.

Yes, I know it does, though this is not the only source of entropy. For example, afl* and honggfuzz select seeds based on their runtime, where shorter is better, and the runtime is not deterministic. And honggfuzz seems to read the corpus directory with readdir without sorting (and also selects based on runtime).

I don't know the libfuzzer source (because I have never searched for it and the llvm-project source code is gigantic ;) so I don't know whether these things are also accounted for or not.

@a1xndr thanks for the info! -prefer_small=1 also seems to be the default. Do you know if libfuzzer takes the runtime of a seed into account for seed selection?

All three of your targets have below 100% stability, and this statement shows that it is very likely the cause: if it diverges at the exact same time, then it means that a different seed was selected at that moment, the cause being the instability. Try with lcms, vorbis or woff2 instead, no cmplog.

Right, that makes sense.

I will generate the rest of the targets over the day so we have a complete list of what is 100% stable and what is not. Can't hurt to document that somewhere in the project :)

None of these fuzzers shows a statistically significant difference from another on any particular benchmark, or in aggregate.

Does this still seem problematic?

Way too early to call. The experiment is restarted today, so let's see when the run is completed. It could be that after 15 minutes or 4 hours it is still within OK boundaries, but that after 12 or 24 hours this is not the case anymore. However, now we know that we have to concentrate on those targets that are 100% stable and disregard the rest. Let's see, that's what benchmark tests are for :)

Even if the difference in the end is not significant, it will show us what the variance is: we will have 60 totally deterministic runs on 4+ targets each. That will show us the bounds of variance that the fuzzbench runs have. Pretty sure @gamozolabs will be able to make a beautiful graph out of that!

vanhauser-thc commented 4 years ago

The complete list of 100% stability targets: jsoncpp_jsoncpp_fuzzer, lcms-2017-03-21, libpng-1.2.56, mbedtls_fuzz_dtlsclient, vorbis-2017-12-11, woff2-2016-05-06, zlib_zlib_uncompress_fuzzer

That is 1/3 which is enough IMHO for a proper analysis

jonathanmetzman commented 4 years ago

Yes, I know it does, though this is not the only source of entropy. For example, afl* and honggfuzz select seeds based on their runtime, where shorter is better, and the runtime is not deterministic.

I think this should be the only source of entropy, though @morehouse might be able to confirm.

And honggfuzz seems to read the corpus directory with readdir without sorting (and also selects based on runtime).

I don't know the libfuzzer source (because I have never searched for it and the llvm-project source code is gigantic ;) so I don't know whether these things are also accounted for or not.

The only code used to build libFuzzer is located in llvm-project/compiler-rt/lib/fuzzer

I will generate the rest of the targets over the day so we have a complete list of what is 100% stable and what is not. Can't hurt to document that somewhere in the project :)

We document details on each benchmark here. You can add stability info by editing this file

jonathanmetzman commented 4 years ago

The complete list of 100% stability targets: jsoncpp_jsoncpp_fuzzer, lcms-2017-03-21, libpng-1.2.56, mbedtls_fuzz_dtlsclient, vorbis-2017-12-11, woff2-2016-05-06, zlib_zlib_uncompress_fuzzer

That is 1/3 which is enough IMHO for a proper analysis

Funny, I guessed that zlib and jsoncpp were stable when doing my analysis last night since the performance of the three "sames" on these was exactly equal like with libpng.

vanhauser-thc commented 4 years ago

It is now running again with same1-3: https://www.fuzzbench.com/reports/2020-08-14/index.html Let's see on Monday how they fared. (I won't be on, I guess; my grandmother has her 100th birthday that day.)

vanhauser-thc commented 4 years ago

for the "real" analysis this would need to compare the execs_done value in the corpus/fuzzer_stats file for the six 100% stable benchmarks of all 20 runs of the three _same*

That would be 60 runs per benchmark, which will show how much CPU variance the fuzzbench setup introduces.
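
A sketch of this comparison: gather execs_done from each trial's fuzzer_stats and summarize it per _same* variant. The glob pattern assumes the corpus archives have already been unpacked so that each trial directory contains a corpus/fuzzer_stats file; the layout is only an assumption for illustration.

```python
# Illustrative sketch: per-variant summary of execs_done across trials.
import glob
import statistics

def read_execs_done(stats_path):
    with open(stats_path) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.strip() == "execs_done":
                return int(value.strip())
    raise ValueError(f"no execs_done in {stats_path}")

for variant in ("same1", "same2", "same3"):
    pattern = f"experiment-folders/*-aflplusplus_{variant}/trial-*/corpus/fuzzer_stats"
    execs = [read_execs_done(path) for path in glob.glob(pattern)]
    if execs:
        print(variant, "n =", len(execs),
              "median =", statistics.median(execs),
              "min =", min(execs), "max =", max(execs))
```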

morehouse commented 4 years ago

@vanhauser-thc Thanks for running this experiment!

Just discussed this with Jonathan and Laszlo today, and I think the experiment strongly validates FuzzBench's statistical analysis.

The experiment was run on 23 benchmarks, with 3 trials each (same1-3), for a total of 69 experiments. On only two benchmarks (vorbis, proj4) does FuzzBench report a statistically significant difference at the 0.05 level. This is in line with what we would expect at the 0.05 level: 69 * 0.05 = 3.45 expected incorrect conclusions.

@lszekeres expressed some concern that the statistical analysis was potentially "too aggressive", since it concluded significant differences for median coverage values that were only 1% and 0.1% apart for proj4 and vorbis, respectively. After thinking about this more, I don't think this is the case. Since we provided the same seed for each trial, variance was likely very low, which would cause the statistical test to make stronger statements about small differences in medians. So we would still expect 5% of the statistical statements to come to the wrong conclusion. If we wanted fewer wrong conclusions, we would need to look at the 0.01 level or below.
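
Spelling out the expected-false-conclusions arithmetic at a couple of significance levels, for the 69 benchmark-level comparisons in this experiment:

```python
# Expected number of wrong conclusions at different significance levels.
comparisons = 69
for alpha in (0.05, 0.01):
    print(f"alpha={alpha}: expected wrong conclusions = {comparisons * alpha:.2f}")
```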

vanhauser-thc commented 4 years ago

for the "real" analysis this would need to compare the execs_done value in the corpus/fuzzer_stats file for the six 100% stable benchmarks of all 20 runs of the three _same*

That would be 60 runs per benchmark, which will show how much CPU variance the fuzzbench setup introduces.

@morehouse --- ah no, what I quoted is what I wrote previously: the 3x20 execs_done values of the 6 100% stable benchmarks have to be compared. Can you please supply that data, or point to the archive where this can be extracted? Just looking at the stats in the report is not enough; that should be obvious.

vanhauser-thc commented 4 years ago

Rereading morehouse's comment, it seems it is unclear why the report does not answer the question about variance, at least not without a lot of effort.

What is the impact of varying CPU resources due to system activity, docker, etc.? It is the number of executions performed in the benchmark duration.

So how to measure that? Exactly by this: checking the variance in the number of executions performed for the benchmarks and fuzzers which are 100% deterministic, and the only fuzzers and benchmarks for which this is true are aflplusplus_same1-3 and jsoncpp_jsoncpp_fuzzer, lcms-2017-03-21, libpng-1.2.56, mbedtls_fuzz_dtlsclient, vorbis-2017-12-11, woff2-2016-05-06, zlib_zlib_uncompress_fuzzer.

You cannot use coverage, unless you perform extensive graph analysis. As the new edges found correlate with executions performed, not time, you can only use coverage for this analysis on a benchmark that still has findings at the end of the run, compare for each of the 20 runs what the difference at the end of the graphs is, and try to infer what the timing difference is. That is a lot of work, and only half of the benchmarks actually still have meaningful discovery at the end.

So: total executions performed for the named benchmarks and fuzzers is the only way to answer this question. All other analysis to answer the question of variance is ..... totally pointless and wrong :) Sorry, I am open to arguments of course :)

vanhauser-thc commented 4 years ago

OK I am done with my analysis and you will not like the results.

I took the data from the google storage bucket with gsutil, from gs://fuzzbench-data/2020-08-14/experiment-folders/lcms-2017-03-21-aflplusplus_same*/trial-*/corpus/corpus-archive-0093.tar.gz, for the benchmarks that are 100% stable and hence have the exact same random number sequences, seed selection, etc. (these are: jsoncpp_jsoncpp_fuzzer, lcms_2017_03_21, libpng_1.2.56, mbedtls_fuzz_dtlsclient, vorbis_2017_12_11, woff2_2016_05_06, zlib_zlib_uncompress_fuzzer).

First issue: the corpus snapshots are not taken at the same time. Even for *-0093.tar.gz it ranges between 22:59:01 and 22:59:59. That is minor; however, a very few fuzz runs have their last corpus snapshot at 22:24:20, e.g. gs://fuzzbench-data/2020-08-14/experiment-folders/mbedtls_fuzz_dtlsclient-aflplusplus_same2/trial-305914/corpus/corpus-archive-0091.tar.gz, which can easily be seen as it is a -0091 snapshot, yet it is the last one in that directory.

Then from every -0093 snapshot I extracted corpus/fuzzer_stats and from there extracted the execs_per_sec. Note that in afl++ this value is the total execs performed divided by the total run time (in vanilla afl it is computed over the last 60 seconds).
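
A sketch of this extraction step, assuming the corpus-archive tarballs have already been fetched locally with gsutil; the glob pattern is a placeholder for the downloaded experiment-folders layout.

```python
# Illustrative sketch: pull corpus/fuzzer_stats out of each downloaded
# corpus-archive-0093.tar.gz, collect execs_per_sec, and report min/median/max.
import glob
import statistics
import tarfile

def execs_per_sec(archive_path):
    with tarfile.open(archive_path) as tar:
        member = next(m for m in tar.getmembers() if m.name.endswith("fuzzer_stats"))
        for line in tar.extractfile(member).read().decode().splitlines():
            key, _, value = line.partition(":")
            if key.strip() == "execs_per_sec":
                return float(value.strip())
    raise ValueError(f"no execs_per_sec in {archive_path}")

values = sorted(
    execs_per_sec(path)
    for path in glob.glob("*aflplusplus_same*/trial-*/corpus/corpus-archive-0093.tar.gz")
)
print("min", values[0], "median", statistics.median(values), "max", values[-1])
```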

Well, and here it gets ugly. The MIN, MEDIAN and MAX over the 60 runs per benchmark (59 for jsoncpp and mbedtls, because one snapshot each was below 0093) are: [image: min_med_max]

And here is each benchmark plotted as a line of its sorted execs_per_sec numbers: [image: execs_per_sec]

These. are. huge. differences!

vanhauser-thc commented 4 years ago

attached is the raw data that I extracted

fb.txt

jonathanmetzman commented 4 years ago

OK I am done with my analysis and you will not like the results.

I took the data from the google storage bucket with gsutil, from gs://fuzzbench-data/2020-08-14/experiment-folders/lcms-2017-03-21-aflplusplus_same*/trial-*/corpus/corpus-archive-0093.tar.gz, for the benchmarks that are 100% stable and hence have the exact same random number sequences, seed selection, etc. (these are: jsoncpp_jsoncpp_fuzzer, lcms_2017_03_21, libpng_1.2.56, mbedtls_fuzz_dtlsclient, vorbis_2017_12_11, woff2_2016_05_06, zlib_zlib_uncompress_fuzzer).

First issue: the corpus snapshots are not taken at the same time. Even for *-0093.tar.gz it ranges between 22:59:01 and 22:59:59. That is minor; however, a very few fuzz runs have their last corpus snapshot at 22:24:20, e.g. gs://fuzzbench-data/2020-08-14/experiment-folders/mbedtls_fuzz_dtlsclient-aflplusplus_same2/trial-305914/corpus/corpus-archive-0091.tar.gz, which can easily be seen as it is a -0091 snapshot, yet it is the last one in that directory.

Then from every -0093 snapshot I extracted corpus/fuzzer_stats and from there extracted the execs_per_sec. Note that in afl++ this value is the total execs performed divided by the total run time (in vanilla afl it is computed over the last 60 seconds).

We don't use the 93rd.

Well, and here it gets ugly. The MIN, MEDIAN and MAX over the 60 runs per benchmark (59 for jsoncpp and mbedtls, because one snapshot each was below 0093) are: [image: min_med_max]

And here is each benchmark plotted as a line of its sorted execs_per_sec numbers: [image: execs_per_sec]

These. are. huge. differences!

We do 20 trials per fuzzer-benchmark pair, so the difference between the min and the max isn't that important, right? The interesting question is whether aflplusplus_same1 has a different distribution of execs than same2 or same3. In most cases, the median/avg execs per pair are very similar, within 5% of one another. There are some cases like bloaty where the difference is closer to 20% and I will look into those. As I explained privately when I shared how to get this data, I don't have time to look into/analyze this right now (I am moving out today and driving across the country at the end of the week). I'll investigate more and share my results when I get the chance.
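
One way to make the "different distribution of execs" question precise, sketched with placeholder numbers: compare per-variant medians and run a Kruskal-Wallis test across the three variants' per-trial exec counts for a single benchmark.

```python
# Placeholder data, illustrative only: per-trial exec counts for one benchmark,
# grouped by same-seed variant.
import statistics
from scipy.stats import kruskal

execs = {
    "same1": [96_000_000, 95_200_000, 97_100_000, 94_800_000],
    "same2": [91_500_000, 90_900_000, 92_300_000, 91_000_000],
    "same3": [93_400_000, 92_800_000, 94_100_000, 93_000_000],
}

medians = {variant: statistics.median(v) for variant, v in execs.items()}
spread = (max(medians.values()) - min(medians.values())) / min(medians.values())
stat, p = kruskal(*execs.values())
print(f"medians: {medians}")
print(f"relative spread of medians: {spread:.1%}, Kruskal-Wallis p = {p:.3f}")
```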

vanhauser-thc commented 4 years ago

@jonathanmetzman sadly no, the medians for same1, same2 and same3 are (I took the 11th entry after sorting the 20 entries):

same1: 22490.07 same2: 17949.59 same3: 18934.58

Or in other words, the total executions performed in same1 are 1861353704, whereas in same3 it is 1485660644 --- that is a 20% difference. This data is from zlib, as it has the widest variance in the graph.

jonathanmetzman commented 4 years ago

Yes I'm aware that the difference is ~20% in the worst cases. As I said:

There are some cases like bloaty where the difference is closer to 20% and I will look into those.

Those are interesting and I will look into them. They are bad, since one of the use cases I like most for fuzzbench is comparing two fuzzers on a single benchmark and improving one. But it still seems to me that the ranking based on the results of 21 benchmarks (where on average the execs were within 5% of each other) is sound.

vanhauser-thc commented 4 years ago

@all sorry for writing this already when there are few resources at fuzzbench to discuss it right now ... I overlooked the message :(

I think the impact on the overall ranking is there but small. It affects those benchmarks which still make progress in the last 2 hours of the run, but not the others.

Maybe for those where this is the case the number of runs could be increased to 40, and for the others reduced to 10.

jonathanmetzman commented 4 years ago

I think the impact on the overall ranking is there but small. It affects those benchmarks which still make progress in the last 2 hours of the run, but not the others.

I noticed this pattern too. bloaty (which is pretty close to stable, though not 100%) and lcms are examples of this (though I think zlib is the opposite). I suspect (and there's no real evidence for this since I haven't looked) that finding more coverage affects execution speed, so if we have two identical trials where one is getting 5% more clock time (ignoring other things like I/O speed), that trial won't necessarily end up with 5% more executions, since it can find testcases that are faster to execute and mutate those instead of slower ones. Intuitively I would think the later testcases tend to be slower (since they are probably deeper in the program), which doesn't explain why the benchmarks where coverage grows have wider differences in number of executions.

My analysis (I'll post a more complete version later) only used the 92nd corpus. The 92nd one is the last one we actually use: 92 cycles * 15 minutes/cycle / 60 minutes/hour = 23 hours. We don't make a 92nd archive if the corpus hasn't changed since the 91st, but we always make a 93rd, so it might be better to use the 93rd.