rems-project / cn-tutorial


Collect timing information in CI scripts, & detect regressions #54

Open · septract opened 1 month ago

septract commented 1 month ago

Per @cp526 in https://github.com/rems-project/cerberus/pull/365#issuecomment-2232911423

We would like the CI to find regressions in timing behaviour, though. So we'll probably want to extend our scripts to collect timing statistics and "publish" the results in a way that allows for easy comparison across commits, and fail if the overall sum exceeds some sensible limit.

septract commented 1 month ago

I talked with my colleague Kevin Quick about how this should work. Here's one possible approach:

1. For generating logs in a single CI run:
   - Modify the check.sh script to output per-example run times in a .csv file (see the first sketch below)
   - Save this file as an artifact
2. For detecting regressions in a PR:
   - Generate the current performance numbers
   - Pull the most recent N .csv files through the artifact API
   - Compute a rolling average / trajectory / whatever we need to identify a regression (see the second sketch below)
   - If there has been a regression, apply a label
   - Do whatever else we want to raise the salience of the regression: auto-comment on the PR, block the merge, ...
3. For long-term logging:
   - Run a nightly CI job that pulls all the artifacts from the current day
   - Aggregate into a daily log file
   - Generate a performance graph using gnuplot
   - Commit the log and performance graph to the repo
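
A minimal sketch of what the check.sh change in item 1 might look like, assuming the script loops over the tutorial examples; the directory layout and the `cn` invocation are placeholders for whatever check.sh actually runs:

```bash
#!/usr/bin/env bash
# Sketch only: time each example and append "example,seconds" rows to a CSV.
# (No -e: a failing example should still get timed and recorded.)
set -uo pipefail

csv=timings.csv
echo "example,seconds" > "$csv"

for file in src/examples/*.c; do          # placeholder path
  start=$(date +%s.%N)
  cn verify "$file"                       # placeholder for the real check command
  end=$(date +%s.%N)
  printf '%s,%s\n' "$file" "$(echo "$end - $start" | bc)" >> "$csv"
done
```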
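
And for the comparison in item 2, once the recent .csv artifacts have been fetched, something along these lines could compute the rolling average and flag regressions; the prev/ layout and the 1.5x threshold are made up for illustration:

```bash
# Compare the current run (timings.csv) against the average of the
# previous runs (prev/*.csv); all files are "example,seconds" with a header.
awk -F, '
  FNR == 1 { next }                            # skip CSV headers
  FILENAME != "timings.csv" { sum[$1] += $2; n[$1]++; next }
  n[$1] && $2 > 1.5 * sum[$1] / n[$1] {        # flag anything >1.5x its average
    printf "regression: %s took %ss (avg %.2fs)\n", $1, $2, sum[$1] / n[$1]
    bad = 1
  }
  END { exit bad }
' prev/*.csv timings.csv
```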

PeterSewell commented 1 month ago

Sounds good to me. For simplicity I might skip the aggregation - just keep the commit timestamp in the csv and ask gnuplot to regen the graph on each commit (with x axis either commit index or time).
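
A rough sketch of that regeneration step, assuming each run appends `unix-timestamp,example,seconds` rows to a rolling all-timings.csv (file names illustrative):

```bash
# Sum each run's examples into a per-commit total, then plot against time.
awk -F, '{ total[$1] += $3 } END { for (t in total) print t "," total[t] }' \
  all-timings.csv | sort -t, -k1,1n > totals.csv

gnuplot <<'EOF'
set datafile separator ","
set terminal pngcairo size 900,500
set output "timings.png"
set xdata time
set timefmt "%s"
set format x "%Y-%m-%d"
set xlabel "commit time"
set ylabel "total verification time (s)"
plot "totals.csv" using 1:2 with linespoints notitle
EOF
```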

septract commented 1 month ago

GitHub artifacts are deleted after 90 days, so it depends whether we care about long-term retention of performance logs. If we do, we'll need some other approach to aggregation and storage.

septract commented 1 month ago

Oh, maybe you mean we just keep a rolling csv artifact with all the previous logs? That could work.
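
One way the rolling artifact could work with the `gh` CLI (the artifact name and paths are guesses; the re-upload itself would be an actions/upload-artifact step in the workflow):

```bash
# Fetch the rolling log from the latest successful master run and append
# this run's timings; tolerate the very first run having no history.
last=$(gh run list --branch master --status success --limit 1 \
         --json databaseId --jq '.[0].databaseId')
gh run download "$last" --name timing-log --dir prev || true
mkdir -p prev && touch prev/all-timings.csv
cat prev/all-timings.csv timings.csv > all-timings.csv
```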

PeterSewell commented 1 month ago

Ideally I think we'd end up with all the logs and the current graphs checked in someplace rather than relying on that transient thing - but I don't speak this much github to have an opinion how.
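
For what it's worth, the usual GitHub Actions idiom is to have the scheduled job commit the files back itself; the identity and paths here are illustrative:

```bash
# Nightly job: commit the updated log and graph back to the repo.
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add perf/all-timings.csv perf/timings.png
git commit -m "CI: update timing log and graph [skip ci]" || true  # no-op if unchanged
git push
```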

cp526 commented 1 month ago

> I talked with my colleague Kevin Quick about how this should work. Here's one possible approach: [...]

That sounds good to me.

And it would indeed be very useful to keep long-term data. However exactly we implement items 2 and 3, I think it's good to have the data from item 3 committed to the repository, to reduce the risk of losing it.

jprider63 commented 2 weeks ago

I've opened a PR that runs benchmarks on every update to master.

Next I'll work on running benchmarks on every PR so that we can detect potential regressions. Ideally this will add a comment with graphs of the updated performance numbers. If we can't get that working quickly, the alternative is to have the (optional) benchmarking CI job fail and add a comment if performance degradation exceeds a threshold (2x-3x slower?).
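
A sketch of that fallback, assuming a baseline CSV fetched from master and the PR number in the environment (both hypothetical names):

```bash
# Fail the job and comment when the PR total exceeds 2x the master baseline.
base=$(awk -F, 'FNR > 1 { s += $2 } END { print s }' baseline-timings.csv)
cur=$(awk -F, 'FNR > 1 { s += $2 } END { print s }' timings.csv)

if awk -v c="$cur" -v b="$base" 'BEGIN { exit !(c > 2 * b) }'; then
  gh pr comment "$PR_NUMBER" --body \
    "Benchmark regression: total ${cur}s vs ${base}s on master (more than 2x)."
  exit 1
fi
```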

For the future, it would also be nice to compare cvc5 vs. Z3 timings.