Fizzixnerd opened this issue 2 years ago
- `git` should be used to track historical data.
- A dedicated `liquidhaskell-timing-data` repo should hold the data. Branches in the `.timing-data` repo could be automated to match branch names in the main repo, with exact commit names recorded in the git log for each association and for searching.
- `git` is easy to access programmatically, having bindings in many languages (notably Haskell) and a shell command interface. This lends itself to easy automated scripting, or to more complicated manual readings/manipulations of the history, should one wish.
- A Haskell script should be provided to compare two commits, and then used in the CI against the `timing-data` repo (rough sketch below). More thought is required here.
- Although the CI will not provide reliable point-to-point comparisons, the timing data can still be useful for seeing general trends over the history of the project.
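To make the comparison-script idea concrete, here is a minimal sketch of the shape it could take. It assumes the timing data is committed as a `timings.csv` of `name,seconds` lines -- both the file name and the format are placeholders for illustration, not an actual layout -- and it reads that file at two commits via `git show` and prints the per-benchmark change:

```haskell
-- Sketch only: compare per-benchmark timings between two commits of a
-- timing-data repo, assuming each commit contains a "timings.csv" with
-- one "name,seconds" entry per line.
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)
import System.Process (readProcess)
import Text.Read (readMaybe)

-- Read one historical version of the timing file out of git.
timingsAt :: String -> FilePath -> IO (Map.Map String Double)
timingsAt commit path = do
  out <- readProcess "git" ["show", commit ++ ":" ++ path] ""
  pure (Map.fromList (concatMap parseLine (lines out)))
  where
    parseLine l = case break (== ',') l of
      (name, ',' : secs) | Just t <- readMaybe secs -> [(name, t)]
      _                  -> []

main :: IO ()
main = do
  [old, new] <- getArgs
  before <- timingsAt old "timings.csv"
  after  <- timingsAt new "timings.csv"
  -- Report the change for every benchmark present in both commits.
  mapM_ report (Map.toList (Map.intersectionWith (,) before after))
  where
    report (name, (b, a)) =
      putStrLn (name ++ ": " ++ show b ++ "s -> " ++ show a ++ "s ("
                ++ show (round ((a - b) / b * 100) :: Int) ++ "%)")
```

It would be run from inside a checkout of the timing-data repo, passing the old and new commit hashes as arguments.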
cc @facundominguez
Oh, that's a good thought about storing the timing data in its own repo.
I've been working on building a nicer UX for test results with GitHub Actions on my fork of REST: https://github.com/ConnorBaker/rest/actions/runs/2230708515.
I've got a matrix build set up to test against a number of combinations of Z3 versions and operating systems.
On my fork, I refactored the test suite to use tasty, with tasty-json outputting the test results as JSON.
I wrote a GitHub action (https://github.com/ConnorBaker/tasty-json-to-markdown-action) which takes that JSON and produces a nice markdown table. An example from that run is here: https://github.com/ConnorBaker/rest/runs/6187936424?check_suite_focus=true.
My workflow then uploads both the JSON output and the generated markdown as an artifact.
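For anyone curious what the tasty refactor looks like structurally, here is a rough sketch of how an extra reporter ingredient slots in next to the console reporter. I'm using a placeholder where tasty-json's actual ingredient would go, since I haven't checked its exact export; treat those names as assumptions rather than the real API:

```haskell
-- Sketch of a tasty driver that keeps the normal console output while also
-- running a second reporter ingredient (e.g. a JSON reporter).
import Test.Tasty (TestTree, defaultMainWithIngredients, testGroup)
import Test.Tasty.HUnit (testCase, (@?=))
import Test.Tasty.Ingredients (Ingredient, composeReporters)
import Test.Tasty.Ingredients.Basic (consoleTestReporter, listingTests)

-- Placeholder: swap in the ingredient actually exported by tasty-json here.
jsonReporter :: Ingredient
jsonReporter = consoleTestReporter

tests :: TestTree
tests = testGroup "liquidhaskell"
  [ testCase "placeholder test" ((1 + 1) @?= (2 :: Int)) ]

main :: IO ()
main =
  defaultMainWithIngredients
    [ listingTests
      -- composeReporters runs both reporters on the same test run, so the
      -- console output is preserved while the extra output is produced too.
    , composeReporters consoleTestReporter jsonReporter
    ]
    tests
```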
Having a separate repo would definitely provide a single, consolidated place to store the timing data.
Additionally, IIRC GitHub clears out old artifacts after 90 days, so keeping the data in its own repo also avoids any data-retention worries.
I know that someone linked this earlier, but I don't remember who: https://github.com/rust-lang (rendered: https://perf.rust-lang.org).
Having something like that for Liquid Haskell would be amazing. Maybe the Rust code is reusable? I saw they've got a GitHub bot to trigger on-demand benchmarking.
EDIT: At any rate, having a single location where structured benchmark results are available would go a long way towards increasing observability.
Additionally, I've been looking at low-cost ways of running less noisy benchmarks.
I saw that Philips has a Terraform module (among other things) to provision and scale spot instances as self-hosted GitHub runners: https://github.com/philips-labs/terraform-aws-github-runner.
I also saw a GitHub Action which would allow you to do much the same, but with Lambda: https://github.com/nwestfall/lambda-github-runner.
Spot instances are dirt-cheap (cheaper even than Lambdas per vCPU-second), but there is a delay while they spin up and are provisioned, and they can't scale out as far as Lambda can -- though at that point we're talking thousands of cores, and I doubt we'd run up against that limit.
I was speaking with a colleague, and they brought up the possible complication of rebases with respect to the git workflow, but I think it's still workable with a little bit of effort. I think having an indexed history of the benchmarks outweighs complications arising from this, but I'd be interested to hear others' opinions.
@ranjitjhala would have to comment on the spot-instances plan. At minimum, I imagine we would need an estimate of how much it would cost per month at the current rate of development. I'm also personally a little unsure about how much value less noisy benchmarks would bring. Comparing two benchmarks point-to-point is a dangerous game to play on the best of days, while for long-term trend data the noise should more or less average out over time (I would think?). I guess I just don't have a clear picture of what useful data we would gain from less noisy benchmarks (and they have a non-zero monetary cost, which may make them harder to justify) -- maybe you could help clarify? Also, I'm not sure about cache-sharing between the main CircleCI run and the spot instances -- we'd need to make sure we don't have to rebuild the whole thing just for the benchmarks... but that's likely solvable.
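(For what it's worth, the back-of-the-envelope version of "the noise averages out": if each CI run measures the true time plus independent noise with standard deviation $\sigma$, then the noise on a mean over $n$ runs -- or, roughly, on a trend read off $n$ consecutive commits -- shrinks like

$$\mathrm{SE}(\bar{t}_n) = \frac{\sigma}{\sqrt{n}},$$

so a trend across tens of data points is much less sensitive to per-run noise than any single point-to-point comparison. The independence assumption is doing real work there, of course -- a systematic shift in CI hardware wouldn't average out.)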
I know @facundominguez already has some gnuplot-based graphing code for comparing two summaries of timing data for the main liquidhaskell repo; maybe he has thoughts on the Rust graphs and on what exactly he wants to glean from such charts.
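To make the trend-chart idea a bit more concrete (this is not @facundominguez's existing script, just a sketch, and it reuses the made-up `timings.csv` layout from the comparison sketch above): walk the timing-data repo's history, sum the per-benchmark times at each commit, and emit a two-column file that gnuplot can plot directly.

```haskell
-- Sketch only: turn the timing-data repo's history into a gnuplot-able
-- series, one line per commit with the summed benchmark time.
import System.Process (readProcess)
import Text.Read (readMaybe)

-- Total benchmark time recorded at one commit, assuming a "name,seconds"
-- CSV called timings.csv (a placeholder layout, not the real one).
totalAt :: String -> IO Double
totalAt commit = do
  out <- readProcess "git" ["show", commit ++ ":timings.csv"] ""
  pure (sum [t | l <- lines out
               , Just t <- [readMaybe (drop 1 (dropWhile (/= ',') l))]])

main :: IO ()
main = do
  -- Oldest commit first, so the x-axis runs left to right in time.
  commits <- reverse . lines <$> readProcess "git" ["log", "--format=%H"] ""
  totals  <- mapM totalAt commits
  -- Two columns ("index total-seconds"); plot with:
  --   gnuplot> plot "trend.dat" using 1:2 with lines
  writeFile "trend.dat" (unlines (zipWith row [1 :: Int ..] totals))
  where
    row i t = show i ++ " " ++ show t
```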
EDIT: The markdown chart is really cool! I wonder if there is a way to integrate tasty with the cabal tests...
@ranjitjhala and @ConnorBaker had a conversation where it was brought up that it would be nice to have historical data of benchmark timing runs in the CI. The purpose of this issue is to collect ideas and discussion for how this might be done.