influxdata / flux

Flux is a lightweight scripting language for querying databases (like InfluxDB) and working with data. It's part of InfluxDB 1.7 and 2.0, but can be run independently of those.
https://influxdata.com
MIT License

Fuzz Flux Parser and Interpreter #235

Open nathanielc opened 5 years ago

nathanielc commented 5 years ago

@mark-rushakoff I know you did something around fuzzing Flux a while back. Is there anything public we can reference or incorporate?

mark-rushakoff commented 5 years ago

Yes, public repository at https://github.com/mark-rushakoff/flux-fuzz.

Anyone who picks this up should plan on doing a short call with me to walk through the idiosyncrasies and arts of fuzzing. It will be a lot easier as a conversation than as a writeup.

jsternberg commented 5 years ago

I talked with @mark-rushakoff and here's a general outline:

  1. Corpus files should probably be committed to a new repository.
  2. go-fuzz doesn't work for CI, but it might be useful to allocate a small amount of CPU and memory to fuzzing in Kubernetes as a separate project.
  3. I'm likely going to start by just testing the parser alone.
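As a sketch of what "testing the parser alone" with go-fuzz looks like: the harness exports a `Fuzz(data []byte) int` function that feeds each input to the parser and reports whether the input was interesting. The `parseSource` function below is a hypothetical stand-in for the real parser entry point in the flux repo.

```go
package main

import "fmt"

// parseSource stands in for the real Flux parser entry point
// (hypothetical here); it returns an error for uninteresting input
// and panics to simulate a parser bug.
func parseSource(src string) error {
	if src == "" {
		return fmt.Errorf("empty source")
	}
	if src == "boom" {
		panic("parser bug") // a crash go-fuzz would record as a crasher
	}
	return nil
}

// Fuzz is the go-fuzz entry point. Returning 1 tells go-fuzz the
// input was interesting (it parsed) and should be kept in the corpus;
// returning 0 deprioritizes it. go-fuzz itself catches panics and
// saves the offending input under crashers/.
func Fuzz(data []byte) int {
	if err := parseSource(string(data)); err != nil {
		return 0
	}
	return 1
}

func main() {
	fmt.Println(Fuzz([]byte(`from(bucket: "b")`)))
}
```

The integer return value is go-fuzz's coverage-guidance hint, not a pass/fail signal; crashes are detected via panics.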
jsternberg commented 5 years ago

I have made an initial fuzzer that's easy to run using Docker at https://github.com/influxdata/flux-fuzz.

For the additional work, we should spec that out and talk about it at standup. I am running it locally for now to collect generated tests. Running it on a server wouldn't be too hard, but it requires a few things:

  1. A process to run the command and monitor it.
  2. Have it update the copy of flux when master changes.
  3. Have it occasionally stop fuzzing, commit the contents (including the crashers), and push to GitHub.
  4. Look at the commits in flux-fuzz to see which inputs crash, and fix them.
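One cycle of that server process could be sketched as a sequence of shell commands driven by a small Go supervisor. The command strings below (flag names, paths, commit message) are illustrative, not a committed design; the real loop would run them with `os/exec` and handle failures.

```go
package main

import "fmt"

// fuzzCycle returns the ordered commands one iteration of the
// long-running fuzz supervisor would execute: update flux, fuzz for
// a bounded window, then commit and push the workdir contents.
// Command details here are assumptions for illustration.
func fuzzCycle(repo string) []string {
	return []string{
		"git -C " + repo + " pull origin master",                     // pick up new corpus state
		"go get -u github.com/influxdata/flux",                       // update the fuzzed dependency
		"timeout 30m go-fuzz -bin=parser-fuzz.zip -workdir=" + repo, // bounded fuzzing window
		"git -C " + repo + " add corpus crashers suppressions",       // stage generated inputs
		"git -C " + repo + " commit -m 'fuzz: update corpus and crashers'",
		"git -C " + repo + " push origin master",
	}
}

func main() {
	for _, cmd := range fuzzCycle("flux-fuzz") {
		fmt.Println(cmd)
	}
}
```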
jsternberg commented 5 years ago

After using this a few times while developing the parser, here's a revised list of what I would find useful and feasible.

While the continuous fuzzer above might be useful, I don't think it would actually help very much. Most of the time when I'm working with the fuzzer, I'm looking for crashes. Most crashes are pretty straightforward and are found within a few minutes. I've run the fuzzer for both 10 and 30 minutes, and when the number of crashes starts at zero, it usually stays zero. In the abstract, running it all the time could be a benefit, but I'm not sure it's practical, especially considering that the parser code isn't the most complex.

Instead, here's what I'm thinking.

  1. Add a CI workflow to run the fuzzer.
  2. The fuzzer will only run if the parser, scanner, or any of the dependent packages change. This is a fairly small list.
  3. The fuzzer will run for 5 or 10 minutes.
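The gating in step 2 amounts to checking whether any changed file falls under a watched package. A minimal sketch, assuming the watch list would really be derived from the parser's dependency graph (the prefixes below are placeholders):

```go
package main

import (
	"fmt"
	"strings"
)

// shouldFuzz reports whether any changed file lives under one of the
// packages whose changes should trigger a fuzz run. The watched
// prefixes here are illustrative stand-ins for the parser, scanner,
// and their dependent packages.
func shouldFuzz(changed []string) bool {
	watched := []string{"parser/", "internal/scanner/", "ast/", "internal/token/"}
	for _, f := range changed {
		for _, w := range watched {
			if strings.HasPrefix(f, w) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(shouldFuzz([]string{"parser/parser.go"}))     // parser changed: fuzz
	fmt.Println(shouldFuzz([]string{"stdlib/universe/map.go"})) // unrelated: skip
}
```

In CI this would typically consume the output of `git diff --name-only` against the merge base.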

The process of fuzzing will run on Jenkins. Circle's timeout will likely stop us from using that and, since it isn't a main part of CI, I don't think that we need to worry about contributors seeing the results.

The actual fuzzer would essentially do the following.

  1. Run go get with the commit to update the flux dependency.
  2. Check the parser against the entire corpus. If any panic, we're done.
  3. Check the parser against any crashers if we've committed any (shouldn't happen). If they now parse without crashing, remove them from crashers.
  4. Run the fuzzer using flux-fuzz. Continue running for 10 minutes.
  5. Archive the corpus, suppressions, and crashers.
  6. If there are any new crashers, set the build status to fail (skip for master).
  7. If on master, copy over the artifacts for any of the builds that were merged.
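Step 2 above (replaying the whole corpus before fuzzing) is just a panic-recovering loop over the saved inputs. A sketch, with `parse` as a hypothetical stand-in for the real parser entry point and the inputs loaded from the corpus directory in the real job:

```go
package main

import "fmt"

// replayInputs runs each saved corpus input through parse, recovering
// panics, and returns the names of the inputs that crashed. An empty
// result means the corpus check passed.
func replayInputs(inputs map[string]string, parse func(string)) []string {
	var crashed []string
	for name, src := range inputs {
		func() {
			defer func() {
				if recover() != nil {
					crashed = append(crashed, name)
				}
			}()
			parse(src)
		}()
	}
	return crashed
}

func main() {
	// parse simulates a parser with a bug on one input.
	parse := func(src string) {
		if src == "boom" {
			panic("parser bug")
		}
	}
	inputs := map[string]string{"a.flux": "ok", "b.flux": "boom"}
	fmt.Println(len(replayInputs(inputs, parse))) // number of crashing corpus files
}
```

If this returns a non-empty list, the job can stop before fuzzing at all, since a known-crashing corpus entry already fails the build.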

I'm a bit unclear on the last few steps, which are mostly about ensuring that we continue to build up the corpus. My preference isn't to run it against master to build the corpus, since each of the PRs will help build it up. But I'm also not really sure how to get the artifacts over. So maybe we just rerun the fuzzer on master and only commit those artifacts? That isn't my preference, though.

github-actions[bot] commented 6 days ago

This issue has had no recent activity and will be closed soon.