rems-project / cn-tutorial


Collect timing information in CI scripts, & detect regressions #54

Open · septract opened 1 month ago

septract commented 1 month ago

Per @cp526 in https://github.com/rems-project/cerberus/pull/365#issuecomment-2232911423

We would like the CI to find regressions in timing behaviour, though. So we'll probably want to extend our scripts to collect timing statistics and "publish" the results in a way that allows for easy comparison across commits, and fail if the overall sum exceeds some sensible limit.

septract commented 1 month ago

I talked with my colleague Kevin Quick about how this should work. Here's one possible approach:

1. For generating logs in a single CI run:
   - Modify the check.sh script to output per-example run times in a .csv file (see the first sketch below)
   - Save this file as an artifact
2. For detecting regressions in a PR:
   - Generate the current performance numbers
   - Pull the most recent N .csv files through the artifact API
   - Compute a rolling average / trajectory / whatever we need to identify a regression (see the second sketch below)
   - If there has been a regression, apply a label
   - Do whatever else we want to raise the salience of the regression: auto-comment on the PR, block the merge, ...
3. For long-term logging:
   - Run a nightly CI job that pulls all the artifacts from the current day
   - Aggregate into a daily log file
   - Generate a performance graph using gnuplot
   - Commit the log and performance graph to the repo
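
A minimal sketch of what the check.sh change in item 1 might look like, assuming the script loops over the tutorial examples; the directory layout and the `cn` invocation are placeholders for whatever check.sh actually runs:

```bash
#!/usr/bin/env bash
# Sketch only: time each example and append "example,seconds" rows to a CSV.
# (No -e: a failing example should still get timed and recorded.)
set -uo pipefail

csv=timings.csv
echo "example,seconds" > "$csv"

for file in src/examples/*.c; do          # placeholder path
  start=$(date +%s.%N)
  cn verify "$file"                       # placeholder for the real check command
  end=$(date +%s.%N)
  printf '%s,%s\n' "$file" "$(echo "$end - $start" | bc)" >> "$csv"
done
```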
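
And for the comparison in item 2, once the recent .csv artifacts have been fetched, something along these lines could compute the rolling average and flag regressions; the prev/ layout and the 1.5x threshold are made up for illustration:

```bash
# Compare the current run (timings.csv) against the average of the
# previous runs (prev/*.csv); all files are "example,seconds" with a header.
awk -F, '
  FNR == 1 { next }                            # skip CSV headers
  FILENAME != "timings.csv" { sum[$1] += $2; n[$1]++; next }
  n[$1] && $2 > 1.5 * sum[$1] / n[$1] {        # flag anything >1.5x its average
    printf "regression: %s took %ss (avg %.2fs)\n", $1, $2, sum[$1] / n[$1]
    bad = 1
  }
  END { exit bad }
' prev/*.csv timings.csv
```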

PeterSewell commented 1 month ago

Sounds good to me. For simplicity I might skip the aggregation - just keep the commit timestamp in the csv and ask gnuplot to regen the graph on each commit (with x axis either commit index or time).
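
A rough sketch of that regeneration step, assuming each run appends `unix-timestamp,example,seconds` rows to a rolling all-timings.csv (file names illustrative):

```bash
# Sum each run's examples into a per-commit total, then plot against time.
awk -F, '{ total[$1] += $3 } END { for (t in total) print t "," total[t] }' \
  all-timings.csv | sort -t, -k1,1n > totals.csv

gnuplot <<'EOF'
set datafile separator ","
set terminal pngcairo size 900,500
set output "timings.png"
set xdata time
set timefmt "%s"
set format x "%Y-%m-%d"
set xlabel "commit time"
set ylabel "total verification time (s)"
plot "totals.csv" using 1:2 with linespoints notitle
EOF
```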

septract commented 1 month ago

GitHub artifacts are deleted after 90 days, so it depends whether we care about long-term retention of performance logs. If we do, we'll need some other approach to aggregation and storage.

septract commented 1 month ago

Oh, maybe you mean we just keep a rolling csv artifact with all the previous logs? That could work.
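
One way the rolling artifact could work with the `gh` CLI (the artifact name and paths are guesses; the re-upload itself would be an actions/upload-artifact step in the workflow):

```bash
# Fetch the rolling log from the latest successful master run and append
# this run's timings; tolerate the very first run having no history.
last=$(gh run list --branch master --status success --limit 1 \
         --json databaseId --jq '.[0].databaseId')
gh run download "$last" --name timing-log --dir prev || true
mkdir -p prev && touch prev/all-timings.csv
cat prev/all-timings.csv timings.csv > all-timings.csv
```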

PeterSewell commented 1 month ago

Ideally I think we'd end up with all the logs and the current graphs checked in someplace rather than relying on that transient thing - but I don't speak this much github to have an opinion how.
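
For what it's worth, the usual GitHub Actions idiom is to have the scheduled job commit the files back itself; the identity and paths here are illustrative:

```bash
# Nightly job: commit the updated log and graph back to the repo.
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add perf/all-timings.csv perf/timings.png
git commit -m "CI: update timing log and graph [skip ci]" || true  # no-op if unchanged
git push
```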

cp526 commented 1 month ago

> I talked with my colleague Kevin Quick about how this should work. Here's one possible approach: [...]

That sounds good to me.

And it would indeed be very useful to keep long-term data. However exactly we implement items 2 and 3, I think it's good to have the data from item 3 committed to the repository, to reduce the risk of losing it.

jprider63 commented 2 weeks ago

I've opened a PR that runs benchmarks on every update to master.

Next I'll work on running benchmarks on every PR so that we can detect potential regressions. Ideally this will add a comment with graphs of the updated performance numbers. If we can't get that working quickly, the alternative is to have the (optional) benchmarking CI job fail and add a comment if performance degradation exceeds a threshold (2x-3x slower?).
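
A sketch of that fallback, assuming a baseline CSV fetched from master and the PR number in the environment (both hypothetical names):

```bash
# Fail the job and comment when the PR total exceeds 2x the master baseline.
base=$(awk -F, 'FNR > 1 { s += $2 } END { print s }' baseline-timings.csv)
cur=$(awk -F, 'FNR > 1 { s += $2 } END { print s }' timings.csv)

if awk -v c="$cur" -v b="$base" 'BEGIN { exit !(c > 2 * b) }'; then
  gh pr comment "$PR_NUMBER" --body \
    "Benchmark regression: total ${cur}s vs ${base}s on master (more than 2x)."
  exit 1
fi
```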

For the future, it would also be nice to compare cvc5 vs. Z3 timings.