google / syzkaller

syzkaller is an unsupervised coverage-guided kernel fuzzer

pkg/coveragedb: lost commits problem #5293

Open tarasmadan opened 1 week ago

tarasmadan commented 1 week ago

Is your feature request related to a problem? Please describe. Every fuzzing session has a target kernel commit attribute. To merge the fuzzing signals from this commit we need the content of the related kernel files. The problem: some commits live only for a week, a month, or a quarter. Once they are gone, we can't recover the file content and can't merge the coverage signals.

Describe the solution you'd like What we really need is the commit at which the file was actually changed. Most of the kernel is stable. Having the "last changed" commit hash will also speed up the aggregation logic: it is a much better merge base than the currently used kernel commit version. If a file wasn't changed for 3 years, all 3 years of aggregation for that file can be done by the DB engine.

Do you have any implementation in mind for this feature? A Spanner DB table with filepath, kernel_commit, last_change_commit and primary_key(filepath, kernel_commit) should be enough. The data federation engine (BigQuery + Spanner) allows this Spanner table to be used for coverage preaggregation on the BigQuery side.
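Roughly, something like the sketch below (a minimal illustration assuming the cloud.google.com/go/spanner client; the table name, column sizes, and helper are placeholders, not the actual coveragedb schema):

```go
package coveragedb

import (
	"context"

	"cloud.google.com/go/spanner"
)

// Proposed schema (applied once via the Spanner admin API):
//
//	CREATE TABLE file_last_change (
//	  filepath           STRING(MAX) NOT NULL,
//	  kernel_commit      STRING(40)  NOT NULL,
//	  last_change_commit STRING(40)  NOT NULL,
//	) PRIMARY KEY (filepath, kernel_commit);

// saveLastChange upserts one (filepath, kernel_commit) -> last_change_commit row.
func saveLastChange(ctx context.Context, client *spanner.Client,
	filepath, kernelCommit, lastChangeCommit string) error {
	_, err := client.Apply(ctx, []*spanner.Mutation{
		spanner.InsertOrUpdate("file_last_change",
			[]string{"filepath", "kernel_commit", "last_change_commit"},
			[]interface{}{filepath, kernelCommit, lastChangeCommit}),
	})
	return err
}
```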

Additional context Alternatively, we can store all the kernel sources we're fuzzing. That looks too heavy as a solution to this specific problem, but if more use cases appear we can consider creating kernel source code storage.

Plan

tarasmadan commented 1 week ago

Let's link it to #4911.

tarasmadan commented 1 week ago

@a-nogikh @dvyukov wdyt? CC @ramosian-glider

dvyukov commented 1 week ago

Are there any alternatives?

Where do these commits come from? Coverage reports look like the lesser problem if commits disappear. It also means we report bugs on commits that nobody can find, can't bisect, etc.

dvyukov commented 1 week ago

I remember we also discussed pushing all tested commits to a single git repo, which would preserve them for other developers, bisection, etc.

How bad is the problem? Another possible solution is to do nothing for now.

tarasmadan commented 1 week ago

It is mostly about git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git. But there are others like: https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git, ac6a205c5bef39d65ecd9f5dd2c1d75652c35405

I tried to explain the flakiest parts of the coverage.

The worst example I have is `syz-cover -for-file sound/soc/codecs/adau7002.c -from 2024-05-01 -to 2024-05-31`. In May, 5 commits provided coverage signals for sound/soc/codecs/adau7002.c. All 5 commits were lost:

git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, 78186bd77b478c474e719409c0569ce48eb73a57
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, 6a71d290942741edc158620aa5b0950ddd4cbc9e
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, c4d6737dc9dacb2b774216c0441a827230691446
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, fda5695d692cf6a82fceb174583923fda049799f
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, 1c9135d29e9ec681b8c6abadf80a7f3721c20f7c

> Another possible solution is to do nothing for now.

It also enables me to land #5239. TL;DR: I can't download 10+ file versions from the web git (it throttles me), but knowing that all 10+ versions are equivalent makes file coverage visualization trivial. It will work for most files. The alternatives are:

  1. Store the source files at coverage merging time or at syz-ci kernel build time with some TTL (1 year). In this case GCS will store everything we need for file coverage rendering.
  2. As you proposed, "pushing all tested commits to a single git repo". Android and modules may be the problem here.

tarasmadan commented 1 week ago

The GCS way seems to be the easiest way to solve both the commit absence and the file coverage visualization problems. It will cost approximately 10 new commits/day * 365 days * 100 MB (kernel source) ≈ 365 GB of annual storage. GCS costs $0.10 per GB-month, which gives 365 GB * 12 months * $0.10 ≈ $400 every year for the storage. With compression we can cut this number by at least 10x and make the upload/download operations the most expensive part of this idea.

Knowing the last_change_commit, I can offload a lot of the coverage merging to the DB side (I expect something like a 10x speedup), but it is not a problem right now because we batch the data.
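For illustration, the DB-side merge could look roughly like this sketch (assuming a BigQuery connection to the Spanner table above; the connection ID, dataset, and coverage table/column names are placeholders):

```go
package coveragedb

import (
	"context"

	"cloud.google.com/go/bigquery"
)

// aggregateByLastChange asks BigQuery to pre-merge coverage rows per
// (filepath, last_change_commit), joining with the Spanner table via a
// federated EXTERNAL_QUERY call.
func aggregateByLastChange(ctx context.Context, client *bigquery.Client) (*bigquery.RowIterator, error) {
	q := client.Query(`
		SELECT fc.filepath, fc.last_change_commit, SUM(cov.hit_count) AS hits
		FROM my_dataset.file_coverage AS cov
		JOIN EXTERNAL_QUERY(
		  'my-project.us.spanner-conn',
		  'SELECT filepath, kernel_commit, last_change_commit FROM file_last_change'
		) AS fc
		  ON cov.filepath = fc.filepath AND cov.kernel_commit = fc.kernel_commit
		GROUP BY fc.filepath, fc.last_change_commit`)
	return q.Read(ctx)
}
```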

dvyukov commented 1 week ago

> Android and modules may be the problem here.

How do modules affect this? They are mostly build/runtime artifacts. At the source code level it's not even possible to say whether a particular file belongs to a module or not.

> The GCS way seems to be the easiest way to solve both the commit absence and the file coverage visualization problems.

What do you mean by the commit absence problem? If you mean what I mentioned (developers can't check out/build the mentioned revision, we can't do bisection, etc.), I am not sure how it's resolved by a copy of each file on GCS.

dvyukov commented 1 week ago

Overall, the custom GCS solution looks like lots of new custom code written specifically for this (we will need to upload all files to GCS, add downloading code, extract the last modification commit, send it to the dashboard, extend the datastore, store that info in the datastore, provide an API to fetch this info back, and update coverage aggregation to fetch metadata from the dashboard and files from GCS). It won't seamlessly integrate with any existing logic (coverage, bisection, web links from syzbot, etc.), so that will be even more code.

Uploading to git looks like less code now (e.g. no involvement from the dashboard, and coverage aggregation just needs to do 1 pull before aggregating, with no additional changes), plus it will work out-of-the-box with bisection, we can provide web links to source files in reports, etc. And it looks like less maintenance in the future as we have fewer custom parts.

Additionally, patch fuzzing could use it as well (it's useful to publish the exact repo state that we test, to reference in any reports). cc @a-nogikh

tarasmadan commented 1 week ago

> Overall, the custom GCS solution looks like lots of new custom code written specifically for this (we will need to upload all files to GCS, add downloading code, extract the last modification commit, send it to the dashboard, extend the datastore, store that info in the datastore, provide an API to fetch this info back, and update coverage aggregation to fetch metadata from the dashboard and files from GCS).

We'll need to upload all files to GCS to something like gcs://bucket/commit/file_path and access them by GCS path (see the sketch below). Point by point:

  - Downloading code: we already have a GCS read function.
  - Extract the last modification commit: no, I don't need it if the file itself is available.
  - Send to dashboard: no. The dashboard will read it directly from GCS when needed.
  - Extend the datastore: why? Store that info in the datastore: no. Why do you think we need it?
  - Provide an API: no. We can read it from GCS directly.
  - Update coverage aggregation to fetch metadata from the dashboard and files from GCS: instead of the web I'll read the files from GCS. What metadata do you think may be needed?
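Roughly, the read path could be something like this sketch (the bucket name, object layout, and helper are placeholders):

```go
package coveragedb

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

// readKernelFile fetches one kernel source file stored at
// gs://<bucket>/<kernel_commit>/<file_path> during the kernel build.
func readKernelFile(ctx context.Context, bucket, kernelCommit, filePath string) ([]byte, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	defer client.Close()
	r, err := client.Bucket(bucket).Object(kernelCommit + "/" + filePath).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}
```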

Where do you think the git repo should be hosted? VM or something like github? GCS looks cheaper to me from the maintenance point of view.

tarasmadan commented 1 week ago

After 30 minutes:

  1. It should be ok to have a VM with a git super-repo.
  2. It should be ok to expose a web API to read the files from AppEngine.
  3. It looks dangerous to make this repo accessible from the internet.

That will cover the coverage use cases.

dvyukov commented 1 week ago

> Send to dashboard: no. The dashboard will read it directly from GCS when needed.

I thought what you propose still includes the "What we really need is the commit at which the file was actually changed" part.

> Where do you think the git repo should be hosted? VM or something like github? GCS looks cheaper to me from the maintenance point of view.

Own VMs are a pain. https://cloud.google.com/source-repositories would be ideal: it should provide an easy way to auth syz-ci, and then syz-ci just needs to do a push.
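For illustration, the push could be as small as this sketch (assuming syz-ci shells out to git; the mirror URL handling and the refs/archive/ naming scheme are made-up details):

```go
package main

import (
	"fmt"
	"os/exec"
)

// pushTestedCommit preserves the tested tree by pushing its HEAD to a
// mirror repo under a ref derived from the commit hash. Refs under
// refs/archive/ keep every tested commit reachable without touching the
// mirror's branches.
func pushTestedCommit(kernelDir, mirrorURL, commit string) error {
	ref := fmt.Sprintf("HEAD:refs/archive/%s", commit)
	cmd := exec.Command("git", "push", mirrorURL, ref)
	cmd.Dir = kernelDir
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("git push failed: %v\n%s", err, out)
	}
	return nil
}
```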

dvyukov commented 1 week ago

> It looks dangerous to make this repo accessible from the internet.

Why? It looks like the least valuable information we have. In fact, we already downloaded the contents from public repos.

tarasmadan commented 1 week ago

> Own VMs are a pain. https://cloud.google.com/source-repositories would be ideal

I considered it previously. It looks deprecated.

> I thought what you propose still includes the "What we really need is the commit at which the file was actually changed" part.

No, I need all the files; that is the better solution. Alternatively, I need each file's "last_change_commit". I hope that for 99% of files it will be just 1 file version.

tarasmadan commented 1 week ago

> It looks dangerous to make this repo accessible from the internet.

> Why? It looks like the least valuable information we have. In fact, we already downloaded the contents from public repos.

I remember articles like https://www.tarlogic.com/blog/cve-2024-32002-vulnerability-git/. I'm not worried about the data; the potential RCE and access to the VM are my main concern. Such a VM will require careful configuration. Git as a service looks good if that option is available.

tarasmadan commented 1 week ago

https://cloud.google.com/secure-source-manager/docs/overview seems to be the option.

tarasmadan commented 1 week ago

#5302, because not all the git repos we use have an HTTP plain-text API, so they can't be queried directly by AppEngine.

Let's wait for the GCP SSM repo availability to solve the lost commits problem.