Fizzixnerd opened this issue 2 years ago
- `git` should be used to track historical data.
- A dedicated `liquidhaskell-timing-data` repo should hold the data. Branches in the `.timing-data` repo could be automated to match branch names in the main repo, with exact commit names recorded in the git log for each association and for searching.
- `git` is easy to access programmatically, having bindings in many languages (notably Haskell) and a shell command interface. This lends itself to easy automated scripting, or to more complicated manual readings/manipulations of the history, should one wish.
- A Haskell script should be provided to compare two commits, and then used in the CI against the `timing-data` repo (rough sketch below). More thought is required here.
- Although the CI will not provide reliable point-to-point comparisons, the timing data can still be useful for seeing general trends over the history of the project.
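To make the comparison-script idea concrete, here is a minimal sketch of the shape it could take. It assumes the timing data is committed as a `timings.csv` of `name,seconds` lines -- both the file name and the format are placeholders for illustration, not an actual layout -- and it reads that file at two commits via `git show` and prints the per-benchmark change:

```haskell
-- Sketch only: compare per-benchmark timings between two commits of a
-- timing-data repo, assuming each commit contains a "timings.csv" with
-- one "name,seconds" entry per line.
import qualified Data.Map.Strict as Map
import System.Environment (getArgs)
import System.Process (readProcess)
import Text.Read (readMaybe)

-- Read one historical version of the timing file out of git.
timingsAt :: String -> FilePath -> IO (Map.Map String Double)
timingsAt commit path = do
  out <- readProcess "git" ["show", commit ++ ":" ++ path] ""
  pure (Map.fromList (concatMap parseLine (lines out)))
  where
    parseLine l = case break (== ',') l of
      (name, ',' : secs) | Just t <- readMaybe secs -> [(name, t)]
      _                  -> []

main :: IO ()
main = do
  [old, new] <- getArgs
  before <- timingsAt old "timings.csv"
  after  <- timingsAt new "timings.csv"
  -- Report the change for every benchmark present in both commits.
  mapM_ report (Map.toList (Map.intersectionWith (,) before after))
  where
    report (name, (b, a)) =
      putStrLn (name ++ ": " ++ show b ++ "s -> " ++ show a ++ "s ("
                ++ show (round ((a - b) / b * 100) :: Int) ++ "%)")
```

It would be run from inside a checkout of the timing-data repo, passing the old and new commit hashes as arguments.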
cc @facundominguez
Oh, that's a good thought about storing the timing data in its own repo.
I've been working on building a nicer UX for test results with GitHub Actions on my fork of REST: https://github.com/ConnorBaker/rest/actions/runs/2230708515.
I've got a matrix build set up to test against a number of combinations of Z3 versions and operating systems.
On my fork, I refactored the test suite to use tasty, with tasty-json outputting the test results as JSON.
I wrote a GitHub action (https://github.com/ConnorBaker/tasty-json-to-markdown-action) which takes that JSON and produces a nice markdown table. An example from that run is here: https://github.com/ConnorBaker/rest/runs/6187936424?check_suite_focus=true.
My workflow then uploads both the JSON output and the generated markdown as an artifact.
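For anyone curious what the tasty refactor looks like structurally, here is a rough sketch of how an extra reporter ingredient slots in next to the console reporter. I'm using a placeholder where tasty-json's actual ingredient would go, since I haven't checked its exact export; treat those names as assumptions rather than the real API:

```haskell
-- Sketch of a tasty driver that keeps the normal console output while also
-- running a second reporter ingredient (e.g. a JSON reporter).
import Test.Tasty (TestTree, defaultMainWithIngredients, testGroup)
import Test.Tasty.HUnit (testCase, (@?=))
import Test.Tasty.Ingredients (Ingredient, composeReporters)
import Test.Tasty.Ingredients.Basic (consoleTestReporter, listingTests)

-- Placeholder: swap in the ingredient actually exported by tasty-json here.
jsonReporter :: Ingredient
jsonReporter = consoleTestReporter

tests :: TestTree
tests = testGroup "liquidhaskell"
  [ testCase "placeholder test" ((1 + 1) @?= (2 :: Int)) ]

main :: IO ()
main =
  defaultMainWithIngredients
    [ listingTests
      -- composeReporters runs both reporters on the same test run, so the
      -- console output is preserved while the extra output is produced too.
    , composeReporters consoleTestReporter jsonReporter
    ]
    tests
```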
Having a separate repo would definitely provide a single, consolidated place to store the timing data.
Additionally, IIRC GitHub clears out old artifacts after 90 days, so keeping the data in its own repo also avoids any data-retention worries.
I know that someone linked this earlier, but I don't remember who: https://github.com/rust-lang (rendered: https://perf.rust-lang.org).
Having something like that for Liquid Haskell would be amazing. Maybe the Rust code is reusable? I saw they've got a GitHub bot to trigger on-demand benchmarking.
EDIT: At any rate, having a single location where structured benchmark results are available would go a long way towards increasing observability.
Additionally, I've been looking at low-cost ways of running less noisy benchmarks.
I saw that Philips has a Terraform module (among other things) to provision and scale spot instances as self-hosted GitHub runners: https://github.com/philips-labs/terraform-aws-github-runner.
I also saw a GitHub Action which would allow you to do much the same, but with Lambda: https://github.com/nwestfall/lambda-github-runner.
Spot instances are dirt-cheap (cheaper even than Lambdas per vCPU-second), but there is a delay while they spin up and are provisioned, and they can't scale out as far as Lambda can -- though at that point we're talking thousands of cores, and I doubt we'd run up against that limit.
I was speaking with a colleague, and they brought up the possible complication of rebases with respect to the git workflow, but I think it's still workable with a little bit of effort. I think having an indexed history of the benchmarks outweighs complications arising from this, but I'd be interested to hear others' opinions.
@ranjitjhala would have to comment on the spot-instances plan. At minimum, I imagine we would need an estimate of how much it would cost per month at the current rate of development. I'm also personally a little unsure about how much value less noisy benchmarks would bring. Comparing two benchmarks point-to-point is a dangerous game to play on the best of days, while for long-term trend data the noise should more or less average out over time (I would think?). I guess I just don't have a clear picture of what useful data we would gain from less noisy benchmarks (and they have a non-zero monetary cost, which may make them harder to justify) -- maybe you could help clarify? Also, I'm not sure about cache-sharing between the main CircleCI run and the spot instances -- we'd need to make sure we don't have to rebuild the whole thing just for the benchmarks... but that's likely solvable.
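(For what it's worth, the back-of-the-envelope version of "the noise averages out": if each CI run measures the true time plus independent noise with standard deviation $\sigma$, then the noise on a mean over $n$ runs -- or, roughly, on a trend read off $n$ consecutive commits -- shrinks like

$$\mathrm{SE}(\bar{t}_n) = \frac{\sigma}{\sqrt{n}},$$

so a trend across tens of data points is much less sensitive to per-run noise than any single point-to-point comparison. The independence assumption is doing real work there, of course -- a systematic shift in CI hardware wouldn't average out.)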
I know @facundominguez already has some gnuplot-based graphing code for comparing two summaries of timing data for the main liquidhaskell repo; maybe he has thoughts on the Rust graphs and on what exactly he wants to glean from such charts.
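To make the trend-chart idea a bit more concrete (this is not @facundominguez's existing script, just a sketch, and it reuses the made-up `timings.csv` layout from the comparison sketch above): walk the timing-data repo's history, sum the per-benchmark times at each commit, and emit a two-column file that gnuplot can plot directly.

```haskell
-- Sketch only: turn the timing-data repo's history into a gnuplot-able
-- series, one line per commit with the summed benchmark time.
import System.Process (readProcess)
import Text.Read (readMaybe)

-- Total benchmark time recorded at one commit, assuming a "name,seconds"
-- CSV called timings.csv (a placeholder layout, not the real one).
totalAt :: String -> IO Double
totalAt commit = do
  out <- readProcess "git" ["show", commit ++ ":timings.csv"] ""
  pure (sum [t | l <- lines out
               , Just t <- [readMaybe (drop 1 (dropWhile (/= ',') l))]])

main :: IO ()
main = do
  -- Oldest commit first, so the x-axis runs left to right in time.
  commits <- reverse . lines <$> readProcess "git" ["log", "--format=%H"] ""
  totals  <- mapM totalAt commits
  -- Two columns ("index total-seconds"); plot with:
  --   gnuplot> plot "trend.dat" using 1:2 with lines
  writeFile "trend.dat" (unlines (zipWith row [1 :: Int ..] totals))
  where
    row i t = show i ++ " " ++ show t
```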
EDIT: The markdown chart is really cool! I wonder if there is a way to integrate tasty with the cabal tests...
@ranjitjhala and @ConnorBaker had a conversation where it was brought up that it would be nice to have historical data of benchmark timing runs in the CI. The purpose of this issue is to collect ideas and discussion for how this might be done.