vogler opened this issue 3 years ago
Now that this issue exists, I'll write down one thought. Maybe we could just use a GitHub Actions self-hosted runner for this: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners. I haven't looked into it, but it looks like it already has a built-in job queue system etc., so it would avoid a lot of reinventing the wheel.
> Each workflow run is limited to 72 hours.
This limit should be sufficiently high that we can run big jobs that the free GitHub-hosted runners probably don't allow.
GitHub Actions can also schedule jobs à la cron instead of running them on each push: https://docs.github.com/en/actions/reference/events-that-trigger-workflows#schedule. And there even seems to be a way to trigger jobs manually.
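For illustration, a minimal sketch of such a workflow (job name, script path, and the 72 h timeout value are assumptions, not an existing file in the repo):

```yaml
name: nightly-benchmarks
on:
  schedule:
    - cron: '0 2 * * *'    # every night at 02:00 UTC
  workflow_dispatch:       # enables manual triggering from the Actions tab
jobs:
  benchmark:
    runs-on: self-hosted   # our own machine, not a GitHub-hosted runner
    timeout-minutes: 4320  # 72 h, matching the documented per-run limit
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-benchmarks.sh  # hypothetical benchmark script
```

The `schedule` and `workflow_dispatch` triggers can be combined freely, so nightly runs and one-off manual runs would go through the same job definition.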
And of course the integration would be minimal: no need to build a properly authenticated HTTPS webhook server to handle GitHub hooks into testing-framework or whatever.
Does it make sense to look at something like https://www.jenkins.io/ or do we make our own?
A simple implementation would probably be some Node.js server acting as an endpoint that reacts to the GitHub commit hook. There are libraries for job queues with priorities and web interfaces: https://github.com/Automattic/kue, https://github.com/OptimalBits/bull
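Both of those libraries are backed by Redis, so they aren't shown here directly; instead, here is a dependency-free sketch of the core idea they provide (a priority job queue that processes benchmark runs one at a time). All names, commit hashes, and priority values are illustrative, not the libraries' actual API:

```javascript
// Minimal in-memory sketch of what kue/bull provide: a priority queue of
// jobs, drained one at a time by a single worker (only one benchmark run
// can occupy the machine at once).
class JobQueue {
  constructor(worker) {
    this.worker = worker;   // async function that runs one job
    this.jobs = [];
    this.running = false;
  }
  add(job, priority = 0) {  // higher priority runs earlier
    this.jobs.push({ job, priority });
    this.jobs.sort((a, b) => b.priority - a.priority);
    this.drain();
  }
  async drain() {
    if (this.running) return;  // a run is already active; it can't be preempted
    this.running = true;
    while (this.jobs.length > 0) {
      const { job } = this.jobs.shift();
      await this.worker(job);
    }
    this.running = false;
  }
}

// Example: commits arriving from a (hypothetical) GitHub webhook handler.
const done = [];
const q = new JobQueue(async (commit) => { done.push(commit); });
q.add('abc123');       // push #1, starts immediately
q.add('def456');       // push #2, queued
q.add('0a1b2c', 10);   // manually requested run, jumps ahead of push #2
// processed order: abc123, 0a1b2c, def456
```

A real queue would persist jobs (as kue/bull do via Redis) so a server restart doesn't lose pending benchmark runs.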
Ok, that looks like an easy option. Just need to make sure the limits are fine for the selected benchmarks:
> Each job for self-hosted runners can be queued for a maximum of 24 hours. If a self-hosted runner does not start executing the job within this limit, the job is terminated and fails to complete.
If we do our own, there'd be no limits, and one could think about more sophisticated prioritization strategies. What's the GitHub behavior? Start if nothing is running, ignore subsequent commits until the run is done, and then start accepting again? Ideally you'd have the same, but then start bisecting while idle if there are changes above a certain threshold.
> If we do our own, there'd be no limits and one could think about more sophisticated prioritization strategies.
I would be very cautious about trying to roll something decent from scratch. If we really need something beyond those limits, it might still be worth looking at Jenkins or something else existing and mature. For example, Jenkins even seems to have a plugin for bisecting, although I'm not sure how necessary such functionality would be: if we already do nightly benchmarks, there's probably not that much to bisect. And even if there is a need, one can bisect a single benchmark (or a handful) locally by hand instead of having to bisect with an entire 12h suite or whatever.
Yea, just some greenfield thinking, but likely the devil is in the details 😄 Bisect is also good for looking back to see what changes had a big (unexpected) impact.
Moved it over here, as it seems more appropriate here.
We now have a minimum working version of this running on server01 and reporting to Zulip.
GitHub Actions are fine for running the regression tests, but we also want something to track performance (and precision) for long-running benchmarks.
Originally posted by @michael-schwarz in https://github.com/goblint/analyzer/pull/234#issuecomment-850362116:
Something along these lines was supposed to be the outcome. Basing it on this benchexec framework has the advantage that it is the same setup as for SV-Comp, so all those tests work out of the box and our own tests can be integrated without too many issues. Also, this tablegen tool would in theory give us a nice diff of what changed between runs (or configurations) that could simply be served at some URL to look at the results without having to ssh to the machine. One probably wants some glue code so that this is not all shell scripts but a bit more robust. But the idea was exactly this.
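For reference, producing such a diff should be a single invocation of BenchExec's `table-generator` (the result file names below are hypothetical); when given more than one result file, it writes a combined HTML table and, as far as I can tell, an additional `*.diff.*` table containing only the rows whose results changed:

```sh
# Hypothetical file names; table-generator ships with BenchExec.
table-generator -o tables/ \
    results/goblint.old.results.xml.bz2 \
    results/goblint.new.results.xml.bz2
```

Serving the `tables/` directory over HTTP would then give exactly the "look at results at some URL" workflow described above.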
This is a bit optimistic, given that one of these runs will likely take >12h (at least for SV-Comp), even on the new hardware.