Save the corpus and use later as seed

dank-cruise commented 11 months ago

Hi there!

A. Is it possible to save or dump the corpus that's been found so far? E.g. when I terminate the fuzzing run, it should save the corpus that's been discovered so far. Presumably the corpus path would be a command line flag.

B. When I fuzz the same target again later, using the same Domains and all that, can I reuse a previously saved corpus?

Obviously, this is not a new idea. For example, Chromium fuzzing talks about it.

A on its own is useful, even if B isn't done. I think it would be very useful to take the corpus from A, create a unit test for every corpus element, and add those tests to continuous integration and pre-commit testing.

irowebbn commented 8 months ago

There appears to be a command line flag for this (I found it by running my test binary with the --helpfull flag).

--corpus_database (The directory containing all corpora for all fuzz tests
      in the project. For each test binary, there's a corresponding
      <binary_name> subdirectory in `corpus_database`, and the <binary_name>
      directory has the following structure: (1) For each fuzz test
      `SuiteName.TestName` in the binary, there's a sub-directory with the name
      of that test ('<binary_name>/SuiteName.TestName'). (2) For each fuzz test,
      there are three directories containing `regression`, `crashing`, and
      `coverage` directories. Files in the `regression` directory will always be
      used. Files in `crashing` directory will be used when
      --reproduce_findings_as_separate_tests flag is true. And finally, all
      files in `coverage` directory will be used when --replay_coverage_inputs flag is
      true.); default: "~/.cache/fuzztest";
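
The layout this help text describes can be spelled out as a small shell sketch. The function name `make_corpus_layout` and its arguments are made up for illustration; only the three directory names come from the flag description, and nothing here is taken from FuzzTest itself.

```shell
# Hypothetical helper: create the directory layout described in the
# --corpus_database help text for a single fuzz test.
#   $1: corpus database root (e.g. "$HOME/.cache/fuzztest")
#   $2: test binary name
#   $3: fuzz test name in the form SuiteName.TestName
make_corpus_layout() {
  db_root=$1
  binary_name=$2
  test_name=$3
  for kind in regression crashing coverage; do
    mkdir -p "$db_root/$binary_name/$test_name/$kind" || return 1
  done
}
```

For example, `make_corpus_layout "$HOME/.cache/fuzztest" my_fuzztest My.Test` would create `my_fuzztest/My.Test/regression`, `crashing`, and `coverage` under the database root.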

Unfortunately, I have not been able to get it to work.

racko commented 7 months ago

There is an undocumented environment variable that helps us along one step: FUZZTEST_TESTSUITE_OUT_DIR

$ FUZZTEST_TESTSUITE_OUT_DIR=/some/path my_fuzztest --fuzz My.Test

will create /some/path and create lots of beautiful corpus files in it.

FUZZTEST_TESTSUITE_IN_DIR could be used in the same way to reuse the corpus later. (This is a separate mechanism from the --corpus_database stuff.)

However, the directory structure described in the --corpus_database flag documentation is not created. As a workaround, you can create the directory structure yourself, e.g. by running

$ FUZZTEST_TESTSUITE_OUT_DIR=~/.cache/fuzztest/<binary_name>/SuiteName.TestName/coverage <binary_name> --fuzz SuiteName.TestName

Later, to use the corpus, run

$ <binary_name> --fuzz SuiteName.TestName --corpus_database ~/.cache/fuzztest --replay_coverage_inputs

You cannot skip the --corpus_database ~/.cache/fuzztest argument. fuzztest does try to use ~/.cache/fuzztest as a default, but this doesn't actually work because ~ is not resolved by the C++ library code. When you pass the path as an argument on the command line, your shell expands the ~ before fuzztest ever sees it.
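
To illustrate the difference, here is a minimal sketch of the substitution the shell performs for an unquoted leading `~`. A program that receives the literal string would have to do something like this itself. `expand_tilde` is a hypothetical helper, not part of FuzzTest, and it only handles the `~` and `~/...` forms.

```shell
# Hypothetical helper: mimic the shell's expansion of a leading "~" using $HOME.
# Handles "~" and "~/..." only; "~user" forms and all other paths are
# printed unchanged.
expand_tilde() {
  case $1 in
    '~')   printf '%s\n' "$HOME" ;;
    '~/'*) rest=${1#\~/}              # strip the literal "~/" prefix
           printf '%s\n' "$HOME/$rest" ;;
    *)     printf '%s\n' "$1" ;;
  esac
}
```

In practice, writing `"$HOME/.cache/fuzztest"` on the command line sidesteps the question entirely, since `$HOME` is expanded even inside double quotes.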

As far as I can tell, we cannot make fuzztest write samples to the --corpus_database just by passing the argument. The path is used exclusively in https://github.com/google/fuzztest/blob/4c3852b5205760af71534e193b9868a3d13b2713/fuzztest/init_fuzztest.cc#L154-L162 to create a CorpusDatabase object: https://github.com/google/fuzztest/blob/4c3852b5205760af71534e193b9868a3d13b2713/fuzztest/internal/configuration.h#L14-L41. As you can see, CorpusDatabase has no public API to get database_path_, which would be necessary to write the new corpus files to it.
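
Until fuzztest populates the database itself, the manual step can be scripted. The following is only a sketch: `import_corpus` is a hypothetical helper, and it assumes that the files written to FUZZTEST_TESTSUITE_OUT_DIR sit directly in that directory and are accepted as-is from the `coverage` directory, which is what the workaround above suggests but which I have not confirmed in the fuzztest source.

```shell
# Hypothetical helper: copy corpus files from a FUZZTEST_TESTSUITE_OUT_DIR
# output directory into the "coverage" directory of a corpus database.
#   $1: source directory      $2: corpus database root
#   $3: test binary name      $4: SuiteName.TestName
import_corpus() {
  src=$1
  dest=$2/$3/$4/coverage
  mkdir -p "$dest" || return 1
  for f in "$src"/*; do
    # Skip subdirectories and the unexpanded glob of an empty directory.
    # Hidden files are not matched by the glob and are therefore not copied.
    [ -f "$f" ] || continue
    cp -- "$f" "$dest/" || return 1
  done
}
```

A call such as `import_corpus /some/path "$HOME/.cache/fuzztest" my_fuzztest My.Test` would then put the saved inputs where the --replay_coverage_inputs run shown earlier looks for them.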

chandlerc commented 7 months ago

Some way of seeding with a corpus, and of minimizing a corpus of seeds, is really needed.

For example, these workflows are well supported with libFuzzer already:

https://github.com/google/fuzzing/blob/master/tutorial/libFuzzerTutorial.md#seed-corpus
https://github.com/google/fuzzing/blob/master/tutorial/libFuzzerTutorial.md#minimizing-a-corpus

I'm trying to migrate from libFuzzer to FuzzTest, and currently this is the biggest issue I'm facing.

davidben commented 4 months ago

Same. FuzzTest's model of putting all the fuzzers in one build target would be really attractive for BoringSSL (it would simplify keeping the same build across multiple build systems). But one of our workflows is that we record transcripts from our tests (a good sample of different TLS protocol flows and other hand-crafted interesting cases) and then minimize them as the starting corpus for the fuzzer, so it doesn't need to discover how the TLS protocol works from scratch.

google/fuzztest, issue #633