prog: smarter mutations

dvyukov commented 6 years ago

Brain dump of ideas for improving mutation efficiency:

argument mutation priorities (int64 should be mutated more frequently than bool8)
call insertion/mutation priorities (IPT_SO_SET_REPLACE should be inserted/mutated more frequently than sched_yield) based on argument complexity
special mutation for binary flags (if flags values are all single bits, we should combine random sets of them)
resource-centeric generation/mutation
program priority in corpus based on amount of new coverage (e.g. 1 new basic block vs large chunk of code covered)
better estimate state of the system before each syscall (e.g. if we created a file "foo", it makes sense to try to open it, otherwise there is little sense in invoking syscalls that require existence of file "foo"; if we opened a socket on port X, it makes sense to connect to that port, etc)

lcytxw commented 6 years ago

Hi, dvyukov. I have some questions about the signal returned by the function execute1(). In syzkaller, new input added to corpus should contain new signals, this will cause problems to some extent. For example, if there is a input exercise first path with signal [A, C, D], and a input exercise second path with signal [B, C, E]. when a new input exercise a path with signal [A, C, E], this path will not be considered a new path by syzkaller. Is this not a problem?

dvyukov commented 6 years ago

I don't see how this is a problem. Why do you think it could be a problem?

lcytxw commented 6 years ago

Sorry, I just remember the path calculation method in afl and you are absolute right .

dvyukov commented 6 years ago

I see what you mean. But it's not that simple. There is spectrum of feedback signal sensitivity. One one end of the spectrum we have plain basic-block code coverage, then we have edges, longer paths, counters, stacks, pairs of basic-blocks, and ultimately using whole trace as signal (i.e. hash of all traced PCs), or even hash of input (i.e. different inputs are always memorized in corpus). The more sensitive the signal, the more useful inputs we add to the corpus. But at the same time we add even more unuseful inputs. Too bloated corpus starts giving negative effect at some point. What's the sweet spot here is a hard question. syz-manager has a special benchmarking mode (-bench flag). If you can tune signal and prove with numbers that one option is better then another, that would be great.

dvyukov commented 6 years ago

from @lcytxw

I collect every call series form the corpus, for example: form corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
make them as a Markov matrix, we can choose any length function from
this Markov chain. It is very useful because the coverage increased 30%
through this method.

4 _cb rxx ys4f x sv

daydayup40 commented 5 years ago

from @lcytxw

I collect every call series form the corpus, for example: form corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
make them as a Markov matrix, we can choose any length function from
this Markov chain. It is very useful because the coverage increased 30%
through this method.

4 _cb rxx ys4f x sv

I see that this is same as the implementation method of dynamic prios now. Right?

dvyukov commented 5 years ago

Yes, dynamic prios are similar.

dvyukov commented 5 years ago

A bit on resource-centeric generation/mutation from a mailing list:

I have a long-standing idea to rewrite program generation/mutation
around the concept of resources, rather that syscalls. Say, if we have
a socket created in a program, we consider what else we can do with
that socket (and have some tables as to what can be done with
sockets). This also allows much smarter program splicing. For example,
we can take an initialized resource from one program and plug it (with
the whole syscall sequences that initializes it) into another program.
This should give us much more efficient generation/mutation, because
what matters in the end is resources (that's the unit of state
accumulation in kernel).
I think this should also solve A->B->C problem you mentioned. For
example, syscall A creates a resource of type X, and then we know that
B and C are syscalls that act on resource of type X, so we will
consider adding them with higher probability.

xairy commented 5 years ago

An idea on how to quickly get an estimate for mutation parameters (priorities). Instead of doing some incremental changes to the parameters and monitoring the difference in coverage size (or number of crashes) after fuzzing for a certain period of time, we can compute those parameters based on the corpus that we already have on syzbot. The idea is to split the corpus into two parts in some way, and try to figure out what are the most optimal values for mutation parameters to generate the first part of the corpus from the other one via mutations as efficient as possible.

For example to calculate arguments mutation priorities we can do the following. Split all programs in the corpus into syscalls, group all syscalls into buckets based on syscall type (name), and then calculate the shortest sequence of mutations for each pair of syscalls in each group. Then take some kind of average (proportionally to the number of pairs in each bucket?) of the mutation types that we had to do, and use the result values as mutation priorities.

harperchen commented 4 years ago

from @lcytxw

I collect every call series form the corpus, for example: form corpus
{A->B->C->D, A->E->C, B->A->B, C->E->A, C->D->D, D->E->C, E->C},
mark the series as {(A->B, 2), (A->E, 1), (B->A, 1), (B->C, 1), (C->E, 1),
(C->D, 2), (D->C, 1), (D->D, 1), (D->E, 1), (E->A, 1), (E->C, 3)}, and then
make them as a Markov matrix, we can choose any length function from
this Markov chain. It is very useful because the coverage increased 30%
through this method.

4 _cb rxx ys4f x sv

Hi, May I ask how to reproduce the 30% increase after introducing the dynamic priority?

google / syzkaller

prog: smarter mutations #534