tarasmadan opened this issue 2 months ago
Let's link it to #4911.
@a-nogikh @dvyukov wdyt? CC @ramosian-glider
Are there any alternatives?
Where do these commits come from? Coverage reports look like the lesser problem if commits disappear. It also means we report bugs on commits that nobody can find, can't bisect, etc.
I remember we also discussed pushing all tested commits to a single git repo, which would preserve them for other developers, bisection, etc.
How bad is the problem? Another possible solution is to do nothing for now.
It is mostly about git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git. But there are others, like https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git, commit ac6a205c5bef39d65ecd9f5dd2c1d75652c35405.
I tried to explain the flakiest parts of the coverage.
The worst example I have is `syz-cover -for-file sound/soc/codecs/adau7002.c -from 2024-05-01 -to 2024-05-31`. In May, 5 commits provided coverage signals for sound/soc/codecs/adau7002.c. All 5 commits were lost:
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, 78186bd77b478c474e719409c0569ce48eb73a57
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, 6a71d290942741edc158620aa5b0950ddd4cbc9e
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, c4d6737dc9dacb2b774216c0441a827230691446
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, fda5695d692cf6a82fceb174583923fda049799f
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git, 1c9135d29e9ec681b8c6abadf80a7f3721c20f7c
> Another possible solution is to do nothing for now.
It would also enable me to land #5239. TL;DR: I can't download 10+ versions of a file from the webgit (they throttle me), but knowing that all 10+ versions are equivalent, file coverage visualization becomes trivial. It will work for most files. The alternatives are:
The GCS way seems to be the easiest way to solve both the commit absence and the file coverage visualization problems. It will cost approximately 10 new commits/day × 365 days × 100 MB (kernel sources) ≈ 365 GB of storage per year. GCS costs ~$0.1 per GB·month, which gives 365 GB × 12 months × $0.1 ≈ $400 per year for the storage. With compression we can cut this number by at least 10×, making upload/download operations the most expensive part of this idea.
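A back-of-the-envelope check of that estimate (a minimal sketch in Go; the inputs are the assumed figures above, not measured values):

```go
package main

import "fmt"

func main() {
	// Assumed inputs from the discussion above, not measured values.
	const (
		commitsPerDay   = 10
		daysPerYear     = 365
		sourceSizeGB    = 0.1 // ~100 MB of kernel sources per commit
		pricePerGBMonth = 0.1 // assumed GCS storage price, $/GB/month
		monthsPerYear   = 12
	)
	annualStorageGB := commitsPerDay * daysPerYear * sourceSizeGB      // ≈ 365 GB
	annualCostUSD := annualStorageGB * pricePerGBMonth * monthsPerYear // ≈ $438
	fmt.Printf("storage: ~%.0f GB/year, cost: ~$%.0f/year\n", annualStorageGB, annualCostUSD)
	// With ~10x compression the storage cost drops to roughly a tenth of that.
	fmt.Printf("with 10x compression: ~$%.0f/year\n", annualCostUSD/10)
}
```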
Knowing the last_edit_commit, I can offload a lot of coverage merging to the DB side (I expect something like 10x speedup), but it is not a problem now because we batch data.
Android and modules may be the problem here.
How do modules affect this? They are mostly build/runtime artifacts. At the source code level it's not even possible to say whether a particular file belongs to a module or not.
> The GCS way seems to be the easiest way to solve both the commit absence and the file coverage visualization problems.
What do you mean by the commit absence problem? If you mean what I mentioned (developers can't check out/build the mentioned revision, we can't do bisection, etc.), I am not sure how it's resolved with a copy of each file on GCS.
Overall, a custom GCS solution looks like lots of new custom code written specifically for this (we will need to upload all files to GCS, add downloading code, extract the last modification commit, send it to the dashboard, extend the datastore, store that info in the datastore, provide an API to fetch this info back, and update coverage aggregation to fetch metadata from the dashboard and files from GCS). It won't seamlessly integrate with any existing logic (coverage, bisection, web links from syzbot, etc.), so that will be even more code.
Uploading to git looks like less code now (e.g. no involvement from the dashboard; coverage aggregation just needs to do one pull before aggregating, and no additional changes are needed). Plus it will work out-of-the-box with bisection, we can provide web links to source files in reports, etc. And it looks like less maintenance in the future, since we have fewer custom parts.
Additionally, patch fuzzing could use it as well (it's useful to publish the exact repo state that we test, to reference in any reports) cc @a-nogikh
> Overall, a custom GCS solution looks like lots of new custom code written specifically for this (we will need to upload all files to GCS, add downloading code, extract the last modification commit, send it to the dashboard, extend the datastore, store that info in the datastore, provide an API to fetch this info back, and update coverage aggregation to fetch metadata from the dashboard and files from GCS).
- Upload all files to GCS: yes, to something like gcs://bucket/commit/file_path, and access them by GCS path.
- Add downloading code: we already have a GCS read function.
- Extract the last modification commit: no, I don't need it if the file is available.
- Send it to the dashboard: no. The dashboard will read it directly from GCS when needed.
- Extend the datastore: why? Store that info in the datastore: no. Why do you think we need it?
- Provide an API: no. We can read it from GCS directly.
- Update coverage aggregation to fetch metadata from the dashboard and files from GCS: instead of the web, I'll read the files from GCS. What metadata do you think may be needed?
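For illustration, a minimal sketch of the read side under the gcs://bucket/commit/file_path layout above, using the standard cloud.google.com/go/storage client; the bucket name and helper are hypothetical, not existing syzkaller code:

```go
package main

import (
	"context"
	"fmt"
	"io"

	"cloud.google.com/go/storage"
)

// readKernelFile fetches one stored source file, assuming objects are laid out
// as <kernel_commit>/<file_path> inside the bucket (the layout is illustrative).
func readKernelFile(ctx context.Context, bucket, commit, filePath string) ([]byte, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	defer client.Close()
	r, err := client.Bucket(bucket).Object(commit + "/" + filePath).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	data, err := readKernelFile(context.Background(),
		"syzbot-kernel-sources", // hypothetical bucket name
		"78186bd77b478c474e719409c0569ce48eb73a57",
		"sound/soc/codecs/adau7002.c")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes\n", len(data))
}
```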
Where do you think the git repo should be hosted? A VM or something like GitHub? GCS looks cheaper to me from the maintenance point of view.
After 30 minutes:
It will cover the coverage use-cases.
> Send it to the dashboard: no. The dashboard will read it directly from GCS when needed.
I thought what you propose still includes the "What we really need is a commit when the file was actually changed" part.
> Where do you think the git repo should be hosted? A VM or something like GitHub? GCS looks cheaper to me from the maintenance point of view.
Own VMs are a pain. https://cloud.google.com/source-repositories would be ideal; it should provide an easy way to auth syz-ci, and then it just needs to do a push.
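A rough sketch of what that push could look like from syz-ci's side, assuming a local checkout and write access to whichever hosted remote we pick; the remote URL, paths, and ref naming scheme are illustrative only:

```go
package main

import (
	"fmt"
	"os/exec"
)

// pushTestedCommit pushes the given (already checked-out) commit from repoDir
// to the archive remote under a per-commit branch so it cannot be
// garbage-collected. The remote and ref naming scheme are illustrative only.
func pushTestedCommit(repoDir, remote, commit string) error {
	ref := fmt.Sprintf("%s:refs/heads/tested/%s", commit, commit)
	cmd := exec.Command("git", "push", remote, ref)
	cmd.Dir = repoDir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("git push failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	// Hypothetical paths/URLs for illustration.
	err := pushTestedCommit("/syzkaller/managers/ci-upstream/kernel",
		"https://source.example.com/syzbot-archive.git",
		"78186bd77b478c474e719409c0569ce48eb73a57")
	if err != nil {
		panic(err)
	}
}
```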
It looks dangerous to make this repo accessible from the internet.
Why? It looks like the least valuable information we have. In fact, we already downloaded the contents from public repos.
> Own VMs are a pain. https://cloud.google.com/source-repositories would be ideal
I considered it previously. It looks deprecated.
> I thought what you propose still includes the "What we really need is a commit when the file was actually changed" part.
No, I need all the files; it is a better solution. Or, alternatively, I need each file's last_changed_commit. I hope that for 99% of files it will be 1 file.
> It looks dangerous to make this repo accessible from the internet.

> Why? It looks like the least valuable information we have. In fact, we already downloaded the contents from public repos.
I remember some articles like https://www.tarlogic.com/blog/cve-2024-32002-vulnerability-git/. I'm not worried about the data. The potential RCE and access to the VM is my main concern. Such a VM would require careful configuration. Git as a service looks good if that option is available.
https://cloud.google.com/secure-source-manager/docs/overview seems to be the option.
Let's wait for the GCP SSM repo availability to solve the lost commits problem.
Once we have a git repo with all commits, ideally we fix web links as well. For example, looking at "WARNING in vfs_removexattr (2)": https://syzkaller.appspot.com/bug?extid=ad9ca5fa6f83171e3bb9
source links point to: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/tree/fs/xattr.c?id=df54f4a16f82b1722593ff8ec2451fdefa467cd0#n577
which produces an "Invalid commit reference" error.
Is your feature request related to a problem? Please describe.
Every fuzzing session has a target kernel commit attribute. To merge the fuzzing signals from this commit we need the content of the related kernel files. The problem: some commits live for only a week, a month, or a quarter. It means we can't recover the file content and can't merge the coverage signals.
Describe the solution you'd like
What we really need is the commit at which the file was actually changed. Most of the kernel is stable. Having the "last changed" commit hash, we'll also speed up the aggregation logic; it is a much better merge base than the currently used kernel commit version. If the file wasn't changed for 3 years, all 3 years' worth of aggregation for this file can be done by the DB engine.
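A minimal sketch of extracting the "last changed" commit from a local checkout with plain git log; the helper is illustrative, not existing syzkaller code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// lastChangeCommit returns the hash of the last commit that touched filePath
// in the checkout located at repoDir. Illustrative helper, not existing code.
func lastChangeCommit(repoDir, filePath string) (string, error) {
	cmd := exec.Command("git", "log", "-1", "--format=%H", "--", filePath)
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// Hypothetical checkout path for illustration.
	commit, err := lastChangeCommit("/path/to/linux", "sound/soc/codecs/adau7002.c")
	if err != nil {
		panic(err)
	}
	fmt.Println(commit)
}
```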
Do you have any implementation in mind for this feature?
A Spanner DB table: filepath, kernel_commit, last_change_commit, primary key (filepath, kernel_commit) should be enough. The data federation engine (BigQuery + Spanner) allows using this Spanner table for the coverage preaggregation on the BigQuery side.
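For illustration, the proposed table could look roughly like the following Spanner DDL, embedded in a Go constant; the table and column names are hypothetical:

```go
// Package coverage holds a sketch of the proposed schema; all names are illustrative.
package coverage

// fileLastChangeDDL describes the proposed Spanner table: one row per
// (file path, kernel commit) pair, pointing at the commit that last touched the file.
const fileLastChangeDDL = `
CREATE TABLE FileLastChange (
  FilePath         STRING(MAX) NOT NULL,
  KernelCommit     STRING(64)  NOT NULL,
  LastChangeCommit STRING(64)  NOT NULL,
) PRIMARY KEY (FilePath, KernelCommit)`
```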
Additional context
Alternatively, we can store all the kernel sources we're fuzzing. It looks too heavy just to solve this specific problem. But given more use cases, we can consider creating a kernel source code storage.
Plan