google / syzkaller

syzkaller is an unsupervised coverage-guided kernel fuzzer
Apache License 2.0

repro: bad features and timeout handling #5012

Open xairy opened 3 months ago

xairy commented 3 months ago

There's a set of issues/behaviors in pkg/repro and related packages that together make reproduction take unnecessarily long and also cause most repros to have the repeat flag set.

When reproducing a bug, pkg/repro first uses syz-execprog to execute the program and see if it can trigger the bug and also decide on the appropriate timeout.

Issue 1. The first (minor) issue is that pkg/instance/execprog.go does not respect the enabled features when spawning syz-execprog: the tool is always spawned with all features enabled (i.e., no -enable or -disable flags are provided).

This can likely be easily fixed when pkg/repro and thus pkg/instance/execprog.go are used from the manager, as the information about enabled features is already passed to RunSyzProg via opts.

Fixing this when pkg/repro is used from tools/syz-repro seems to require more work: we would need to teach syz-repro to detect enabled features first and then pass those to pkg/repro.

Fixing this will likely make no difference to syzbot, as I believe its instances have all/most features enabled anyway. But it would improve the time it takes to generate syz repros for people running custom instances (due to the start-up time of syz-execprog, see issue 2). And it also should speed up generating C repros, as the simplification code won't need to go through all of the features but only through the ones that are enabled.

But let's assume we run a full-blown instance with all features enabled. This brings us to:

Issue 2. pkg/repro does not account for the fact that spawning syz-execprog with all features enabled takes a very long time.

The feature that takes particularly long to set up is net_dev. swap is also somewhat long.

As syz-execprog takes a long time to set up, when spawned from pkg/repro, it rarely gets to executing programs before the first reproducing timeout (3 * cfg.Timeouts.Program == 15 seconds) is over. Thus, pkg/repro often switches to the second timeout (20 * cfg.Timeouts.Program == 1 minute). This happens even for programs that take little time to trigger bugs.

As a result, reproducing a bug takes unnecessarily long.

I noticed this issue when reproducing a bug on my machine, but I suspect syzbot is affected as well (I couldn't check the logs due to the issue that #5011 should resolve). (On my machine, spawning syz-execprog in QEMU with KVM enabled takes ~23 seconds. However, for pkg/repro, I need to increase the first timeout to around twice that, probably because pkg/repro starts counting time even before the execution of syz-execprog starts.)

I think the proper solution here would be to start the timer only after syz-execprog starts executing programs. I.e., ignore the time it takes to set up the features. But I'm not sure how difficult it would be to implement this.

Considerably speeding up the feature setup process should also work, if that's possible. But I suspect this won't be a lasting solution, as at some point more features will likely get added.

(I don't know if guilty commit bisection on syzbot takes the same timeout as was used for reproducing, but if so, this issue also makes bisection time out more often.)

Issue 3 (or, arguably, just a consequence of issue 2). As the reproducing process often decides to use the large 1 minute timeout, an attempt to remove the repeat from the reproducer always fails on the checkOpts check in pkg/repro.

This is what causes most reproducers to have the repeat flag set.

I initially noticed this on the USB syzbot manager, and it didn't make sense, as I could reproduce most of the bugs without repeat. But it appears that this issue affects all syzbot instances.

While this is not a problem for syzbot's intended purpose by itself, it might confuse people looking at the reproducers. At least for me, seeing the repeat flag set makes me think that the bug is related to some timing/racing issue.

Resolving issue 2 with the approach I mentioned should resolve this one as well. If a different approach is taken, this issue needs to be addressed separately.

a-nogikh commented 3 months ago

Regarding Issue 1.

Indeed, we don't pass the feature list to tools/syz-execprog, and it would be correct to do so, but I think it won't give any noticeable improvement when the reproductions are run by a syz-manager (even when not on syzbot). The features are enabled unconditionally (if the kernel turns out to support them), so e.g. the slow netdev setup functionality will always be enabled.

Also, during the repro generation, we may only drop the features at a very late stage -- we must already have a program that reliably crashes the kernel, which is the longest part. So most of the iterations would have to happen with all features anyway.

Where it can surely make a difference is if we made tools/syz-repro accept the set of enabled/disabled features and someone then manually crafted a minimal feature list. That could really optimize the process, but it's a very special use case.

xairy commented 3 months ago

What we can do is to extend the manager config to allow selectively disabling features. There's already the experimental remote_cover option that disables flatrpc.FeatureExtraCoverage. And both syz-manager and syz-repro take this config as an argument. But we will still need to teach syz-repro to respect these config options when spawning syz-execprog.

Does this sound acceptable?

a-nogikh commented 1 month ago

Yes, that sounds reasonable.

Regarding issue 2 -- this is not a solution yet, but at least the timeouts have now become bigger.

Regarding issue 3 (just for the record): the 1 minute timeout and Repeat conflict because the code explicitly makes them do so:

https://github.com/google/syzkaller/blob/1eda0d1459e5ff07903ffa2f8cedf55ae7b24af0/pkg/repro/repro.go#L554-L558