sixty-north / cosmic-ray

Mutation testing for Python
MIT License

Parallelising work in CI systems #505

Open tomato42 opened 3 years ago

tomato42 commented 3 years ago

Both GitLab CI and GitHub Actions allow having jobs that depend on each other: https://docs.github.com/en/free-pro-team@latest/actions/learn-github-actions/migrating-from-gitlab-cicd-to-github-actions#dependencies-between-jobs

It would be nice to have the ability to:

  1. prepare work for N workers in one job
  2. start N workers to process the mutation runs, each getting a single file with mutations to execute
  3. have a summary task that combines results from the N workers and provides the overall mutation score
abingham commented 3 years ago

There are a few ways you might approach this that come to mind. First, you could have each worker handle a particular module (or subset of modules). For each worker, the cosmic-ray.module-path config option would tell it which modules to mutate/test. If you wanted to get a unified result at the end, you'd need some method to combine their WorkDBs afterward; this shouldn't be difficult, and might be a generally useful tool for CR to have.
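As a very rough illustration of that first approach, the snippet below just generates one config per worker, each pointing cosmic-ray.module-path at a different module. The module list, file names, and test command are placeholders, and the exact set of config keys differs between cosmic-ray versions, so treat this as a sketch rather than a working setup:

```python
# Sketch: one cosmic-ray config per worker, differing only in module-path.
# The modules, file names and test command are made-up placeholders; check
# your cosmic-ray version's docs for the exact config keys it expects.
from pathlib import Path

CONFIG_TEMPLATE = """\
[cosmic-ray]
module-path = "{module}"
timeout = 30.0
excluded-modules = []
test-command = "pytest -x tests/"

[cosmic-ray.distributor]
name = "local"
"""

MODULES = ["mypkg/core.py", "mypkg/io.py", "mypkg/cli.py"]  # hypothetical split

for index, module in enumerate(MODULES):
    Path(f"cosmic-ray.worker{index}.toml").write_text(
        CONFIG_TEMPLATE.format(module=module)
    )
    # each CI job would then run cosmic-ray init/exec against its own config,
    # producing one session (WorkDB) per worker
```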

Another option is to give each worker access to the entire set of mutations for your project, but to have them only actually perform a subset of them. So if a worker knew, for example, that it was number 3 out of 5, then it would only work on the third fifth of all mutations, or something like that; I'm glossing over the details. As before, you might want some way to combine all of the results to get a unified report.
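A rough sketch of that striping idea, assuming each worker starts from its own copy of the session file: the worker marks everything outside its stripe as skipped and then runs the normal exec step, which is essentially how the existing filter tools prune a session. The WorkDB/WorkResult names below come from cosmic-ray's internal work_db and work_item modules and may not match every release, and the environment variables are invented for the example:

```python
# Sketch: worker k of N keeps only every N-th work item and marks the rest
# as SKIPPED, so a plain `cosmic-ray exec` on this session only runs its share.
import os

from cosmic_ray.work_db import use_db, WorkDB
from cosmic_ray.work_item import WorkResult, WorkerOutcome

WORKER_INDEX = int(os.environ["WORKER_INDEX"])  # 0-based, hypothetical CI variable
WORKER_COUNT = int(os.environ["WORKER_COUNT"])  # e.g. 5, hypothetical CI variable

with use_db("session.sqlite", WorkDB.Mode.open) as db:
    # Sort for a stable assignment across workers, then skip the items that
    # belong to other stripes.
    for position, item in enumerate(sorted(db.work_items, key=lambda i: i.job_id)):
        if position % WORKER_COUNT != WORKER_INDEX:
            db.set_result(
                item.job_id,
                WorkResult(worker_outcome=WorkerOutcome.SKIPPED),
            )
```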

Of course, if the workers are actually able to communicate with one another, you could also use e.g. the celery execution engine to distribute work among them. I'm not sure if this is possible or not.

So I think we already have most of what you need to do this. It'll require a little creativity, and we might find that there are even better ways, e.g. perhaps a new execution engine. I'm happy to help you work on a solution (though I don't have much bandwidth to actually implement something right now).

tomato42 commented 3 years ago

Doesn't celery require real-time access between the controller and the runners? I'm thinking of files, as those are typically handled well by CI systems (as build artefacts), so the jobs would be runnable even if the workers don't have network access.

I'm thinking that the split should happen on a single machine, with each job file holding a subset of mutations to execute. We probably want to preserve the runner's behaviour of executing mutations in random order, so that we can kill workers after a specific amount of time rather than when they finish the whole job (for CI we want results quickly, even if they are incomplete).

While using filtering and module-path would work, modules are rarely the same size or complexity, so a split like that would be rather rough. More of a crutch than a solution.

Build artefact handling is also why I'm thinking that the combining step should use files as inputs.

So basically, I think we need something that splits the sqlite file into N files, each with a random subset of mutations that can be executed by the existing runner, and then something that takes all the files after the runners are done with them and melds them together.
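Something along these lines, perhaps: a sketch of the split step that reads one initialised session, shuffles the pending mutations, and deals them round-robin into N new session files that the existing runner can execute independently. The WorkDB API names are assumptions taken from cosmic-ray's internals and may differ between versions, and newer sessions also store the config, which would need copying into each part as well:

```python
# Sketch: split one session into N parts with a random assignment of mutations.
import random
import sys

from cosmic_ray.work_db import use_db, WorkDB


def split_session(master_path, num_parts, seed=None):
    with use_db(master_path, WorkDB.Mode.open) as master:
        pending = list(master.pending_work_items)

    # Random assignment, so that each part is a representative sample and
    # partial results still give a meaningful mutation score.
    random.Random(seed).shuffle(pending)

    for part_index in range(num_parts):
        part_path = f"{master_path}.part{part_index}"
        with use_db(part_path, WorkDB.Mode.create) as part:
            part.add_work_items(pending[part_index::num_parts])


if __name__ == "__main__":
    # e.g. python split_session.py session.sqlite 4
    split_session(sys.argv[1], int(sys.argv[2]))
```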

abingham commented 3 years ago

> Doesn't celery require real-time access between the controller and the runners?

That's right, hence the caveat about the workers needing to be able to communicate. I figured this was not likely to be possible, but I thought I should include it for completeness.

> More of a crutch than a solution.

I agree, this is a pretty crude approach. Its primary benefit is its simplicity, but it's not so much simpler than the other approaches that I'd try it first.

> I think we need something that splits the sqlite file into N files, each with a random subset of mutations that can be executed by the existing runner, and then something that takes all the files after the runners are done with them and melds them together.

Right, I think this is the best way to start. I think we even have most of the parts we need. I'm not sure what channels there are for communicating between the workers, but I guess we'll need some way of serializing WorkDBs or WorkItems between them. This should be pretty straightforward.
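For the meld step, a sketch along the same lines could copy the work items and results from each per-worker session back into one combined session, which cr-report could then read as usual. As with the split sketch above, the WorkDB method names are assumptions based on cosmic-ray's internal work_db module and may need adjusting for a given release:

```python
# Sketch: merge several per-worker sessions into one combined session.
import sys

from cosmic_ray.work_db import use_db, WorkDB


def merge_sessions(combined_path, part_paths):
    with use_db(combined_path, WorkDB.Mode.create) as combined:
        for part_path in part_paths:
            with use_db(part_path, WorkDB.Mode.open) as part:
                combined.add_work_items(list(part.work_items))
                for job_id, result in part.results:
                    combined.set_result(job_id, result)


if __name__ == "__main__":
    # e.g. python merge_sessions.py combined.sqlite session.sqlite.part*
    merge_sessions(sys.argv[1], sys.argv[2:])
```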