tarasmadan commented 2 months ago

Motivation:

LPC customers asked for better access to reproducers.
Distros need them to improve their own CIs.
github.com/ksteuck wants to get reproducers by filter.

tarasmadan commented 2 months ago

We have a tools/syz-reprolist created for this purpose. It currently uses the dashAPI and requires client names + access keys.

The proposal is to:

Export required data as a jsonAPI. We're already doing it for bugs.
Switch reprolist.go from dashapi to jsonAPI.
Let reprolist.go download "upstream" namespace reproducers if called w/o any parameters
Add filter support to get only specific reproducers (e.g. for one subsystem).
Switch reprolist.go authentication to "gcloud auth login".
Let reprolist.go download reproducers from selected namespace.

tarasmadan commented 2 months ago

@dvyukov @a-nogikh wdyt?

dvyukov commented 2 months ago

Do you consider doing just raw export, or something that does regression testing out-of-the-box? I would assume that raw export won't be too useful for most users. They won't be able to use them, or will use them incorrectly. An end-to-end solution that distros can use for testing should also include build/run wrappers that check the kernel config, run tests in parallel with timeouts, and monitor dmesg output for bugs, plus docs on how to use all this.
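
To make the "run tests in parallel with timeouts" part concrete, here is a minimal sketch in Python. It is not part of syzkaller: the names `run_one` and `run_all` are hypothetical, and it assumes each reproducer has already been built into a command that can be executed directly. Exit status alone says little about whether the kernel hit a bug, so a real wrapper would pair this with dmesg monitoring.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd, timeout_s):
    """Run one reproducer command; return 'ok', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


def run_all(cmds, timeout_s=10.0, workers=4):
    """Run commands in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: run_one(c, timeout_s), cmds))
```

Threads are enough here because each worker just waits on a child process. Note that only the direct child is killed on timeout; reproducers that fork would need process-group handling on top of this.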

If we export them (which is required for export from non-public namespaces), then the current auth can work as well. "gcloud auth login" is a bit handier, but not a game changer. What would be a game changer is fully automated periodic export.

+there is an unresolved problem with missing C repros in lots of cases. syz-reprolist is slow and unreliable (may be broken already). I think we should keep C repros in datastore rather than re-create.

dvyukov commented 2 months ago

For filtering purposes we could also annotate exported reproducers with some metadata (subsystem, expected running time, bug type, etc). There will be lots of reproducers (tens of thousands), so users may want to invoke some subsets of tests (faster ones, or for more critical bug types only). The runner program could accept these filters and run the corresponding subsets.
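
A rough sketch of what filter-based selection over annotated reproducers could look like, in Python. The `ReproMeta` record and its field names are assumptions made for illustration; the thread does not define an annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReproMeta:
    # All fields are hypothetical annotations, not an existing syzbot schema.
    path: str                # location of the .c file in the export
    subsystems: tuple        # e.g. ("ext4",)
    bug_type: str            # e.g. "KASAN"
    expected_seconds: float  # expected running time


def select(repros, subsystem=None, bug_types=None, max_seconds=None):
    """Return the reproducers matching every filter that was given."""
    out = []
    for r in repros:
        if subsystem is not None and subsystem not in r.subsystems:
            continue
        if bug_types is not None and r.bug_type not in bug_types:
            continue
        if max_seconds is not None and r.expected_seconds > max_seconds:
            continue
        out.append(r)
    return out
```

For example, `select(all_repros, subsystem="ext4", max_seconds=5)` would pick the fast ext4 tests, and a runner could expose the same three filters as command-line flags.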

tarasmadan commented 2 months ago

Do you consider doing just raw export, or something that does regression testing out-of-the-box?

I want the user to get a C reproducers collection like https://github.com/dvyukov/syzkaller-repros.

What would be a game changer is fully automated periodic export.

What do you mean? I want every syz-reprolist call to create the latest snapshot.

a-nogikh commented 2 months ago

What would be a game changer is fully automated periodic export.

+1. Maybe even to some git repository exactly like it was done manually before.

I think we should keep C repros in datastore rather than re-create.

But for older ones we'd still have to invoke older syz-prog2c versions, right? Or, probably, just ignore the syz repro bugs in this export? There are not too many of them.

tarasmadan commented 2 months ago

to some git repository exactly like it was done manually before

Pro:

It offloads the traffic to git repo.
It makes the results reachable for robots.
Generally looks easier to do.
Some access to the per-bug historical repro data out of the box.

Contra:

What about private namespaces? More git repos?
How to track usage?
The filter based selection looks more complex.

dvyukov commented 2 months ago

What do you mean? I want every syz-reprolist call to create the latest snapshot.

Is it OK to export tens of thousands of reproducers each time? I was thinking of checking them into a git repo.

dvyukov commented 2 months ago

But for older ones we'd still have to invoke older syz-prog2c versions, right? Or, probably, just ignore the syz repro bugs in this export? There are not too many of them.

Yes, either ignore, or upload once what we can easily recover. syz-reprolist may run for days, but it's fine if done once.

dvyukov commented 2 months ago

What about private namespaces? More git repos?

I would export into a single repo all reproducers that were obtained on kernels with public source code.

How to track usage?

Don't track. I'm not sure the raw number of API invocations is very important. Users may still cache the result on their side, and then the number will be low. Or they can pull it every minute, but what's the impact of that?

The filter based selection looks more complex.

I would concentrate on end user use cases. This looks like a minor impl detail. Sacrificing user experience and adoption to avoid writing several dozen lines of code does not look like a good tradeoff.

tarasmadan commented 2 months ago

What do you mean? I want every syz-reprolist call to create the latest snapshot.

Is it OK to export tens of thousands of reproducers each time? I was thinking of checking them into a git repo.

Tens of thousands is doable if we have good benefits.

6k_repros.tar.gz from https://github.com/dvyukov/syzkaller-repros is 28 megabytes. But it is a 2-year-old repo. We added the filesystems... and want to scale fuzzing. It can take hundreds of megabytes in a few years. Agreed, git looks better from this perspective. Combined with repro annotations it covers any scenario I can think of.

dvyukov commented 2 months ago

@gkennedy12 also periodically asks for updates (which unfortunately slipped through the cracks).

tarasmadan commented 2 months ago

Thanks for the inputs. Let's try once more! For every public namespace we want to mirror ReproC files from the datastore to some public git repository.

Something like this:

repo
- upstream
  - bug1
    - repro1.c
    - repro2.c
- android-6-1
  - bug1
    - repro1.c
    - repro2.c
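
Assuming the `namespace/bug/reproN.c` layout sketched above, a consumer could enumerate the mirrored reproducers with a few lines of Python. The function name `list_repros` is hypothetical, and the layout itself is still only a proposal at this point in the thread.

```python
from pathlib import Path


def list_repros(repo_root):
    """Map (namespace, bug) to the sorted list of C reproducer file names."""
    result = {}
    # Matches <repo_root>/<namespace>/<bug>/<file>.c, per the proposed layout.
    for c_file in sorted(Path(repo_root).glob("*/*/*.c")):
        namespace, bug = c_file.parts[-3], c_file.parts[-2]
        result.setdefault((namespace, bug), []).append(c_file.name)
    return result
```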

dvyukov commented 2 months ago

What's the use case for separating them by namespace? We can also export from non-public but open-source kernels (that's what I used to do).

store all tentative C repros in the datastore
easy way to build, properly run, and monitor these reproducers

tarasmadan commented 2 months ago

and monitor these reproducers

What is it about?

dvyukov commented 2 months ago

and monitor these reproducers

What is it about?

Detect that they triggered a bug. Lots of kernel test suites just run tests and then ignore actual bugs they provoked in the kernel, so tests look like passing.
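
As a rough illustration of this kind of monitoring, the Python sketch below scans captured dmesg text for a handful of well-known kernel report headers. The pattern list is a small illustrative subset chosen for this example, and `find_kernel_bugs` is a hypothetical name; syzkaller's own report parsing is far more thorough.

```python
import re

# Illustrative subset of kernel report headers; not an exhaustive list.
BUG_PATTERNS = re.compile(
    r"BUG: "
    r"|WARNING: "
    r"|UBSAN: "
    r"|INFO: task .* blocked for more than"
    r"|general protection fault"
    r"|Kernel panic - not syncing"
)


def find_kernel_bugs(dmesg_text):
    """Return the dmesg lines that look like kernel bug reports."""
    return [line for line in dmesg_text.splitlines() if BUG_PATTERNS.search(line)]
```

A test wrapper could capture dmesg before and after each reproducer and mark the test as failed whenever the new output yields a non-empty result, regardless of the reproducer's exit code.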

syzbot-noreply commented 1 month ago

https://github.com/syzbot-noreply is now registered to perform the bot operations.

tarasmadan commented 1 month ago

https://github.com/syzbot-noreply is now registered to perform the bot operations.

It was me.

dvyukov commented 1 month ago

Detect that they triggered a bug. Lots of kernel test suites just run tests and then ignore actual bugs they provoked in the kernel, so tests look like passing.

We have lots of the required logic in syzkaller already. It could be a new syz-manager/execprog mode. But on the other hand, it may complicate things for users. Not sure what's the right balance.

tarasmadan commented 1 month ago

#5374 to continuously export the reproducers

tarasmadan commented 3 weeks ago

#5465 to download and build reproducers

#5458 to document repro-export tool

google / syzkaller

syzbot: users need better access to reproducers #5345
