google / clusterfuzz

Scalable fuzzing infrastructure.
https://google.github.io/clusterfuzz
Apache License 2.0

new_edge and new_features stat may be incorrect (e.g. -max_len is used or -use_value_profile) #802

Open Dor1s opened 5 years ago

Dor1s commented 5 years ago

1) If we fuzz with some -max_len value, the initial_edge_coverage stat will not represent the actual edge coverage of the full corpus, because libFuzzer will ignore the contents past the first max_len bytes in every input.

2) During the merge, we reasonably remove -max_len argument, in order to avoid truncating the corpus and losing the coverage we have.

3) The resulting edge_coverage is extracted from the merge log and might be greater than the initial_edge_coverage even if we didn't generate any inputs, just because -max_len caused the initial coverage to be incomplete.

4) The resulting new_edge metric is calculated as the difference between these two (excluding the corpus subset strategy; there we realized the very same problem and fixed it some time ago):

https://github.com/google/clusterfuzz/blob/649d43cbd729acfd8a6342ad078b23fd191b9869/src/python/bot/fuzzers/libFuzzer/stats.py#L291
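Concretely, the computation in points 1-4 is just a difference of two `cov:` counters scraped from libFuzzer output. A minimal sketch of why `-max_len` inflates the result (the log lines and numbers are illustrative; this is not ClusterFuzz's actual parser):

```python
import re

# libFuzzer status lines look like: "#1024 INITED cov: 512 ft: 900 ..."
COV_RE = re.compile(r"\bcov:\s*(\d+)")

def edge_cov(log_line):
    """Extract the edge-coverage counter from a libFuzzer status line."""
    match = COV_RE.search(log_line)
    return int(match.group(1)) if match else None

# With -max_len, inputs are truncated while the initial corpus is read,
# so initial_edge_coverage under-counts the corpus's real coverage.
initial_edge_coverage = edge_cov("#2048 INITED cov: 480 ft: 700 corp: 512")

# The merge runs WITHOUT -max_len, so it sees the full inputs.
edge_coverage = edge_cov("#4096 DONE cov: 510 ft: 740 corp: 512")

# new_edges is 30 here even if fuzzing discovered nothing new: the entire
# gap can come from the truncation during the initial corpus read.
new_edges = max(0, edge_coverage - initial_edge_coverage)
```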

I don't have a good solution in mind. We could skip the new_edge metric for the random max length strategy, but that would mean losing a signal for the strategy.

My best idea so far is to make libFuzzer's merge give us all the numbers (i.e. initial coverage + new coverage), but that'll complicate libFuzzer's interface, so I don't really like this either.

UPD: we don't use value profiling during merge, which means we might be missing some features.

UPD 2: One more thing that came up during discussion with @jonathanmetzman today. With this new merge thing, we might want to do gsutil cp and gsutil rm after every fuzzing run, to essentially "prune" the corpus. That way, the recurring corpus pruning task would have much less of a load and would be more like a clean up / stats collection task, rather than a real heavy corpus pruning job.


Dor1s commented 5 years ago

@mbarbella-chromium @mukundv-chrome for the short term, should we exclude random max length strategy? What'll be the best option for your ongoing experiment(s)?

Another bad idea is to do an extra run with -runs=0 on the full corpus to get the real initial_edge_coverage. We can do it only for the runs where random max length is used, but that will be quite expensive and unwise anyway :(
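For reference, that extra run maps to a plain `-runs=0` invocation over the untruncated corpus: libFuzzer replays every input once and exits, so the final `cov:` counter is the true initial edge coverage. A sketch of the argument construction (the helper name is illustrative):

```python
def coverage_probe_args(target, corpus_dirs):
    """Build a libFuzzer invocation that only replays the corpus.

    -runs=0 executes each corpus input once and exits; since no -max_len
    is passed, nothing is truncated, so the final cov: counter reported
    by this run is the real initial_edge_coverage.
    """
    return [target, "-runs=0"] + list(corpus_dirs)
```

The cost objection above is that this doubles the corpus-replay work for every run using the random max length strategy.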

Dor1s commented 5 years ago

> @mbarbella-chromium @mukundv-chrome for the short term, should we exclude random max length strategy? What'll be the best option for your ongoing experiment(s)?

Discussed offline. The experimental data should be good for now, so I'll upload a CL to record new_edges = 0 when random max length strategy is used.

Dor1s commented 5 years ago

@inferno-chromium suggested a nice solution on the libFuzzer side (to print stats after reading the first corpus dir). I've uploaded a CL as https://reviews.llvm.org/D66020, but we probably need to choose some other keyword. INITED is supposed to indicate that ALL input files have been read, so in our case it's kinda PARTIALLY INITED :)
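Consuming that new stat would then look something like the sketch below: take the initial coverage from a status line printed after only the first corpus directory has been read, instead of from INITED, which only appears after all dirs are loaded. "LOADED" here is a placeholder keyword, since per the above the real name was still undecided:

```python
import re

COV_RE = re.compile(r"\bcov:\s*(\d+)")

def cov_after_first_dir(log_lines, marker="LOADED"):
    """Return edge coverage from the first status line carrying `marker`.

    `marker` is a placeholder for whatever keyword D66020 ends up printing
    after the first corpus directory is read.
    """
    for line in log_lines:
        if marker in line:
            match = COV_RE.search(line)
            if match:
                return int(match.group(1))
    return None

log = [
    "#512 LOADED cov: 300 ft: 450",   # first corpus dir read
    "#2048 INITED cov: 480 ft: 700",  # all corpus dirs read
]
```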

Dor1s commented 5 years ago

One more relevant problem: we don't use -use_value_profile=1 during merge, which means we discard some features that could've been discovered during fuzzing. That explains why the new_features stat so often has non-zero values.

I'll fix it as well when making the other merge changes.
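The fix amounts to keeping the merge invocation's flags consistent with the fuzzing run's. A hedged sketch of the argument building (the helper is illustrative; the real code lives under src/python/bot/fuzzers/libFuzzer/):

```python
def merge_args(target, corpus_dirs, use_value_profile):
    """Build a libFuzzer -merge=1 invocation (sketch, not ClusterFuzz code)."""
    args = [target, "-merge=1"]
    if use_value_profile:
        # Without this flag, inputs whose only contribution is a
        # value-profile feature look redundant to the merge and get
        # dropped -- the problem described above.
        args.append("-use_value_profile=1")
    return args + list(corpus_dirs)
```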

Dor1s commented 5 years ago

One more thing that came up during discussion with @jonathanmetzman today. With this new merge thing, we might want to do gsutil cp and gsutil rm after every fuzzing run, to essentially "prune" the corpus. That way, the recurring corpus pruning task would have much less of a load and would be more like a clean up / stats collection task, rather than a real heavy corpus pruning job.
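Mechanically, that per-run prune is a set difference between what the merge kept locally and what is in the GCS bucket; the gsutil cp / gsutil rm calls then act on the two lists. A sketch of the bookkeeping (the helper and its inputs are illustrative):

```python
def prune_plan(remote_files, merged_files):
    """Compute what a post-run 'gsutil cp' / 'gsutil rm' pass would do.

    remote_files: input names currently in the GCS corpus bucket.
    merged_files: input names the libFuzzer merge step kept locally.
    """
    to_upload = sorted(set(merged_files) - set(remote_files))  # gsutil cp
    to_delete = sorted(set(remote_files) - set(merged_files))  # gsutil rm
    return to_upload, to_delete
```

With this run after every fuzzing session, the recurring pruning task only has leftovers to clean up, which matches the lighter role described above.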

Dor1s commented 5 years ago

FTR, the most recent Clang roll in Chromium should include my -merge=1 change, so that is not a blocker anymore, but there are still some other things that can delay switching to the new merge logic.

Dor1s commented 4 years ago

I've checked BigQuery stats for the past 15 days for both Chromium ClusterFuzz and OSS-Fuzz.

The change has definitely had an effect -- I can share stats with Googlers.

One thing that is noticeable and expected is that the average new_edges value has dropped.

Some other movements are also evident, though hard to interpret, since the earlier stats were incomplete: we skipped recording them for certain strategies.

Dor1s commented 4 years ago

As per Abhishek's suggestion I've checked whether this new implementation affected the number of post-fuzzing merge timeouts. I've checked the logs for the past 7+ days and it looks like the overall number of timeouts is not affected, which is expected, as the overall complexity of performing a merge hasn't changed much.

# Chromium ClusterFuzz, number of "Merging new testcases timed out" errors a day.
2019-12-10 : 1423
2019-12-11 : 1246
2019-12-12 : 1435 # two step merge was deployed at the end of this day
2019-12-13 : 1497
2019-12-14 : 1330
2019-12-15 : 1363
2019-12-16 : 1446
2019-12-17 : 1128 # not a full day
Dor1s commented 4 years ago

Re-evaluating some strategies / metrics after the change has been deployed for ~2 weeks (access to the links is restricted for non-Googlers):

1) DFT on OSS-Fuzz: https://docs.google.com/spreadsheets/d/14hAEuRsfofSiInUTR7KA48SbGY4QPuxHOzObjVPInQk/edit#gid=1395928105

On average, the DFT strategy is still losing to the runs where it's not being used.

2) ML RNN on ClusterFuzz: https://docs.google.com/spreadsheets/d/13te_2ZcRQhbm25B_h2Xg8-TFV0TBn4TSSBmbTNLy2_E/edit#gid=0

ML RNN looks pretty good; it seems to be useful, as 66.7% of fuzz targets perform better with ML RNN, while only 32% of targets perform better without it (all on average).

3) MAB probabilities: https://datastudio.google.com/c/u/0/reporting/1crV7jDmoFh25j2ZMATEMa3X0HfsKqGOQ/page/QAox

The graphs seem to be broken, possibly due to https://github.com/google/clusterfuzz/issues/1304

I've run some queries myself in https://docs.google.com/spreadsheets/d/1dNgK9xg9zkdds6ckFalNWniVI2FtVZylkfcFKR52kX0/edit#gid=0

It might be a little hard to read, but an awesome observation here is that the fork strategy has been showing up in about half of the top strategies since Dec 16th. Before that, the fork strategy was never in the top 10, simply because we did not have good stats for it. The ML RNN probability has also jumped by ~10-20%, but that might be a temporary fluctuation.

Once the Data Studio charts are fixed, we should be able to observe even more changes.

nwellnhof commented 7 months ago

> One more relevant problem: we don't use -use_value_profile=1 during merge, that means we discard some features that could've been discovered during fuzzing. That explains why new_features stat so often has non-zero values.
>
> I'll fix it as well when making the other merge changes.

Has this been fixed?

Dor1s commented 7 months ago

> One more relevant problem: we don't use -use_value_profile=1 during merge, that means we discard some features that could've been discovered during fuzzing. That explains why new_features stat so often has non-zero values. I'll fix it as well when making the other merge changes.

> Has this been fixed?

I think so: https://github.com/google/clusterfuzz/pull/1915