Re-design / refactor strategy selection logic

Dor1s commented 4 years ago

Historically we've been adding new strategies (since the moment we added the first one, which was value profiling I believe), but now it's a pretty big feature and looks complicated.

Each strategy has up to three important components: probability, incompatible strategies, other preconditions (e.g. platform or presence of an auxiliary build). With the existing implementation (a sequence of if / else statements with nested conditions):

1) actual probabilities are not predictable: e.g. https://github.com/google/clusterfuzz/pull/1942 makes entropic strategy to be used in approximately (1 - dataflow_probability) * entropic_probability portion of cases instead of just entropic_probability). If we factor in additional conditions for dataflow strategy to be used, the formula becomes even less predictable

2) the strategy exclusion isn't generic as each if statement has (may have) extra checks for other strategies; it's also not immediately obvious which strategies are not compatible

3) when we use MAB selected probabilities instead of the default one, their effect is also unclear due to 1)

Additional issue is that we don't verify that some strategies were actually used. For example, for dataflow strategy we should be checking that INFO: AUTOFOCUS: ... line is present in the logs. Ideally, I'd like to make libFuzzer (and other fuzzing engines) to explicitly output that a certain mode was enabled. Otherwise, given the variety of CF deployments, we might be having incorrect stats recorded. That lead to an incorrect input provided to the MAB selection algorithm, which makes point 3) even worse.

oliverchang commented 4 years ago

Do you have suggestions on how to do this? We can certainly design some way to specify exclusions in a much better way, but the predictability of probabilities will still be an issue.

Dor1s commented 4 years ago

Only half-baked thoughts. I was also hoping there is a known algorithm for choosing actions with certain probabilities and mutual exclusion, but I personally don't know any.

Perhaps we can have an API where we first initialize a strategy pool by adding there all supported strategies with their probabilities and incompatibility rules.

Strategies without restrictions are easy, we need only their probabilities, and for each of them we roll the dice independently.
Mutually exclusive strategies can be grouped into pairs (or bigger groups, if there are more than two incompatible with each other), then we'll roll the dice to decide whether we use any of those (with probability being equal to ???*). If yes, then we'll roll it once again to see which particular strategy should be used (will need to normalize the probabilities by their total sum for this).
As for the additional conditions, we may either leave those as is but make sure we explicitly confirm the use of such strategies OR we may try checking all the conditions beforehand (i.e. check fuzz target binary to ensure it supports entropic, check the presence of dataflow traces, etc.).

* ??? not sure what it should be of the top of my head. Maybe we should set explicit probability for mutual exclusive strategies, e.g. we can say "in 60% of runs we want either entropic or dataflow, with dataflow being used as twice as entropic". But that would mean we'll need to have some tree-looking config for the strategies.

google / clusterfuzz

Re-design / refactor strategy selection logic #1943