google / fuzzbench

FuzzBench - Fuzzer benchmarking as a service.
https://google.github.io/fuzzbench/
Apache License 2.0

Run fuzzing benchmarks - Halloween, Regular vs No seeds/No dicts, etc #814

Closed vanhauser-thc closed 3 years ago

vanhauser-thc commented 3 years ago

How about a huge fuzzing benchmark week for Halloween? Like all benchmark targets running for 50 hours with all core fuzzers? So far it has all been 24 or 23 hours and 20-21 targets. A big, long assessment is long overdue IMHO, and Halloween would be the perfect time :)

inferno-chromium commented 3 years ago

We do 23-hour experiments due to preemptible VMs.

But this is a great idea; a long experiment like that is overdue. Any particular reason you chose 50 :), or would you prefer longer?

Ping us here when you want to start; we will try to remember as well [have a reminder for Oct 30].

lszekeres commented 3 years ago

> But this is a great idea; a long experiment like that is overdue. Any particular reason you chose 50 :), or would you prefer longer?

+1 shooting for longer :)

vanhauser-thc commented 3 years ago

The 50 was just... 24h x 2 = 48... so let's make it 50 :)

Is there a limit on benchmarks? Because there is a limit on the number of fuzzers per run (14, at least via experiment-request.yaml).

inferno-chromium commented 3 years ago

No, we select all available benchmarks, or at least this list is kept up to date - https://github.com/google/fuzzbench/blob/master/service/automatic_run_experiment.py#L49

Do you have benchmark ideas? We can fix this list before this big experiment. We have 350+ projects in OSS-Fuzz, so we can bring any of those here.
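For reference, a quick way to see which benchmarks a checkout currently ships is to list the benchmarks/ directory; this is only a local sketch assuming the usual one-subdirectory-per-benchmark layout, not the selection logic in automatic_run_experiment.py linked above:

```python
# Sketch only: enumerate benchmarks shipped in a local fuzzbench checkout by
# listing benchmarks/ (assumed layout: one subdirectory per benchmark).
# This is not the service's actual selection code.
import os

BENCHMARKS_DIR = 'benchmarks'
benchmarks = sorted(
    name for name in os.listdir(BENCHMARKS_DIR)
    if os.path.isdir(os.path.join(BENCHMARKS_DIR, name)))
print(f'{len(benchmarks)} benchmarks: {", ".join(benchmarks)}')
```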

vanhauser-thc commented 3 years ago

I saw that libxslt is new and that php was gone. I think 25 targets for 50 hours of fuzzing sounds good :) Maybe a video and an audio target? And maybe a scripting language (afl++ is not so good there).

inferno-chromium commented 3 years ago

> I saw that libxslt is new and that php was gone. I think 25 targets for 50 hours of fuzzing sounds good :) Maybe a video and an audio target? And maybe a scripting language (afl++ is not so good there).

php has some issues with clang coverage; it OOMs like crazy. Didn't get time to investigate.

The audio target is vorbis. Yes, probably a video and a scripting language target.

jonathanmetzman commented 3 years ago

> I saw that libxslt is new and that php was gone. I think 25 targets for 50 hours of fuzzing sounds good :) Maybe a video and an audio target? And maybe a scripting language (afl++ is not so good there).
>
> php has some issues with clang coverage; it OOMs like crazy. Didn't get time to investigate.
>
> The audio target is vorbis. Yes, probably a video and a scripting language target.

Maybe dav1d (AV1 decoder) and tinyjs (Javascript implementation from QEMU's author) or Hermes (JS implementation from Facebook)?

wideglide commented 3 years ago

I have an alternative idea that might approximate 50 hours, or at least what happens after 23 hours.

What if there was an alternative experiment setting that used the culled queue from a previous experiment as the seeds? This way the fuzzers start with equal, saturated coverage. This experiment is closely related to running for 50 hours, in that you start from the results of a previous 23-hour run (thus effectively running hours 23-46), but it is normalized by having the fuzzers start with the same amount of coverage saturation.
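A rough sketch of what culling a previous run's corpus into a seed set could look like, using libFuzzer's -merge=1; the target path and corpus directories here are placeholders, not FuzzBench's actual seed-handling code:

```python
# Sketch only: minimize a previous trial's output corpus into a seed directory
# with libFuzzer's -merge=1, which copies over only inputs that add coverage.
# /out/fuzz_target and the directory names are placeholders.
import os
import subprocess

prev_corpus = 'prev_experiment_corpus'  # corpus saved from an earlier 23h run
seed_dir = 'culled_seeds'               # culled queue to seed the next run
os.makedirs(seed_dir, exist_ok=True)

subprocess.check_call(['/out/fuzz_target', '-merge=1', seed_dir, prev_corpus])
```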

inferno-chromium commented 3 years ago

> I have an alternative idea that might approximate 50 hours, or at least what happens after 23 hours.
>
> What if there was an alternative experiment setting that used the culled queue from a previous experiment as the seeds? This way the fuzzers start with equal, saturated coverage. This experiment is closely related to running for 50 hours, in that you start from the results of a previous 23-hour run (thus effectively running hours 23-46), but it is normalized by having the fuzzers start with the same amount of coverage saturation.

Since this is not a regular thing, I am OK with doing this higher-cost experiment a few times. But your idea is great, and just FYI, we are running a similar experiment on a saturated OSS-Fuzz corpus; will share on Discord once done.

vanhauser-thc commented 3 years ago

I would like the 50h experiment first. However, the idea of using a large saturated corpus is also good IMHO, so how about doing that one a few weeks later? We could even use the corpus from the 50h benchmark for it.

inferno-chromium commented 3 years ago

> I would like the 50h experiment first. However, the idea of using a large saturated corpus is also good IMHO, so how about doing that one a few weeks later? We could even use the corpus from the 50h benchmark for it.

Yes, definitely the 50h experiment then; the saturated-corpus one is already done, sharing soon.

vanhauser-thc commented 3 years ago

I would further recommend that all afl variants get the AFL_SHUFFLE_QUEUE=1 env var set. This will make the results more volatile, but @mboehme and I can use the results better to analyze schedule effectiveness. We can't use 23h setups for this, as some targets do not even complete a single queue cycle in that time.

vanhauser-thc commented 3 years ago

How is the planning coming along? I am excited :)

is AFL_SHUFFLE_QUEUE=1 a possibility?

inferno-chromium commented 3 years ago

> How is the planning coming along? I am excited :)

We plan to discuss this today. Please provide the list of fuzzers that you are interested in; we plan to use the current list in core-fuzzers.yaml. For benchmarks, I think we will run all of them except the non-interesting ones (where coverage usually saturates in < 1 hr). @jonathanmetzman @lszekeres

> is AFL_SHUFFLE_QUEUE=1 a possibility?

I think this can be made the default? Can you propose a PR to add this in afl.run_afl_fuzz and weizz's similar function? @jonathanmetzman - thoughts as well.
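A minimal sketch of what that default could look like, assuming run_afl_fuzz in fuzzers/afl/fuzzer.py already exports other AFL_* variables through os.environ (the exact insertion point is an assumption):

```python
# Sketch only: export AFL_SHUFFLE_QUEUE=1 before afl-fuzz is launched,
# alongside the other AFL_* environment variables the fuzzer module sets.
import os

os.environ['AFL_SHUFFLE_QUEUE'] = '1'  # randomize initial queue order per trial
```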

vanhauser-thc commented 3 years ago

> How is the planning coming along? I am excited :)
>
> We plan to discuss this today. Please provide the list of fuzzers that you are interested in; we plan to use the current list in core-fuzzers.yaml. For benchmarks, I think we will run all of them except the non-interesting ones (where coverage usually saturates in < 1 hr). @jonathanmetzman @lszekeres

From the afl++ core fuzzers, all except libaflfuzzer (and it can be removed; I will do that in the PR).

> is AFL_SHUFFLE_QUEUE=1 a possibility?
>
> I think this can be made the default? Can you propose a PR to add this in afl.run_afl_fuzz and weizz's similar function? @jonathanmetzman - thoughts as well.

OK, I am on it.

inferno-chromium commented 3 years ago

Plan:
- 12 fuzzers in core-fuzzers.yaml
- 15 benchmarks (removing the uninteresting ones - jsoncpp, openssl, re2, systemd, zlib - which saturate too quickly, in ~30 min-1 hr)
- 10 trials, running for 5 days
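For a rough sense of scale, assuming every trial runs the full 5 days (an assumption; only the fuzzer, benchmark, and trial counts come from the plan above):

```python
# Back-of-the-envelope scale of the planned experiment.
fuzzers = 12
benchmarks = 15
trials_per_pair = 10

total_trials = fuzzers * benchmarks * trials_per_pair  # 1800 trials
machine_hours = total_trials * 5 * 24                  # 216000 hours if each trial runs 5 days
print(total_trials, machine_hours)
```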

inferno-chromium commented 3 years ago

The Halloween experiment has started: https://www.fuzzbench.com/reports/2020-10-31-hlwn-long/index.html

vanhauser-thc commented 3 years ago

When it is done, it would also be interesting to have a crash analysis - which fuzzer found how many (real) unique crashes per target altogether.

inferno-chromium commented 3 years ago

> When it is done, it would also be interesting to have a crash analysis - which fuzzer found how many (real) unique crashes per target altogether.

Crash-based benchmarking is WIP; right now we just store the list of crashes in archives, so that would be manual analysis. We can re-run a similar experiment once crash-based benchmarking support is complete.

vanhauser-thc commented 3 years ago

Do you know why entropic is missing from bloaty and woff?

inferno-chromium commented 3 years ago

It seems like an issue with this custom patch; reopened https://github.com/google/fuzzbench/issues/801

https://storage.googleapis.com/fuzzbench-data/2020-10-31-hlwn-long/experiment-folders/bloaty_fuzz_target-entropic/trial-600056/results/fuzzer-log.txt

==3550648== ERROR: libFuzzer: deadly signal
No such file or directory: d_mask=3; exiting
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/src/fuzzers/entropic/fuzzer.py", line 43, in fuzz
    '-entropic_scale_per_exec_time=1'
  File "/src/fuzzers/libfuzzer/fuzzer.py", line 86, in run_fuzzer
    subprocess.check_call(command)
  File "/usr/local/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/out/fuzz_target', '-print_final_stats=1', '-close_fd_mask=3', '-fork=1', '-ignore_ooms=1', '-ignore_timeouts=1', '-ignore_crashes=1', '-detect_leaks=0', '-artifact_prefix=/out/corpus/crashes/', '-entropic=1', '-keep_seed=1', '-cross_over_uniform_dist=1', '-entropic_scale_per_exec_time=1', '/out/corpus/corpus', '/out/seeds']' returned non-zero exit status 1.
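
To check whether other trials failed the same way, the per-trial fuzzer-log.txt files can be pulled from the public bucket; a small sketch using the URL pattern from the link above (substitute other trial IDs as needed):

```python
# Sketch only: fetch a trial's fuzzer-log.txt from the public fuzzbench-data
# bucket, following the URL pattern of the link above.
import urllib.request

URL = ('https://storage.googleapis.com/fuzzbench-data/2020-10-31-hlwn-long/'
       'experiment-folders/bloaty_fuzz_target-entropic/trial-600056/results/'
       'fuzzer-log.txt')
with urllib.request.urlopen(URL) as resp:
    log = resp.read().decode('utf-8', errors='replace')
print(log[-2000:])  # tail of the log, where the error and traceback appear
```
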
mboehme commented 3 years ago

Interesting. d_mask=3 is part of the CLI parameters: -close_fd_mask=3. I'll look into this.

vanhauser-thc commented 3 years ago

Can someone create a current report? Updating of the web page seems to have stopped.