mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

Performance/stress CI rework #869

Open qinsoon opened 1 year ago

qinsoon commented 1 year ago

This issue discusses what we need and what we are going to do for performance regression CI and stress test CI. They share infrastructure so I put them together.

Requirements

Non-goals

Design

Job triggering

Job execution

Results storage

Visualization

qinsoon commented 1 year ago

Related issues:

k-sareen commented 1 year ago

This is what Rust uses: https://github.com/rust-lang/rustc-perf

If we want to use their frontend, we would have to output results in a compatible format. I am not sure what that format is.

qinsoon commented 1 year ago

> This is what Rust uses: https://github.com/rust-lang/rustc-perf
>
> If we want to use their frontend, we would have to output results in a compatible format. I am not sure what that format is.

I noticed that. But the project seems tightly coupled to rustc and not suitable for us.

caizixian commented 1 year ago

Here's an architecture I discussed with @tianleq and @wenyuzhao

We build a lightweight API server backed by some sort of database (Firebase, or SQL on VPS, etc.). We just need one table with columns (commit metadata, date, metric, benchmark, configuration, data).

We expose two very generic HTTP endpoints.

POST /query

Body: {metric: str, benchmarks: [str], configurations: [str], repo: Option[str], pr: Option[str], branch: Option[str], commits: Option[[str]]}

POST /insert

Body: {metric: str, benchmark: str, configuration: str, repo: str, pr: Option[str], branch: Option[str], commit: str, data: Any}

The /query endpoint returns an array of datapoints. Both endpoints should be easy to implement with SELECT and INSERT respectively.
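
The two endpoints above could be sketched as follows, using SQLite as a stand-in for whatever database is chosen. The table layout follows the columns proposed above; the function names and the exact datapoint shape are illustrative assumptions, not a settled API.

```python
# Hypothetical sketch of the proposed /insert and /query handlers,
# backed by an in-memory SQLite table with the columns proposed above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE results (
        commit_hash TEXT, date TEXT, metric TEXT, benchmark TEXT,
        configuration TEXT, repo TEXT, pr TEXT, branch TEXT, data REAL
    )
""")

def insert(body):
    # POST /insert: store one parsed datapoint per completed
    # configuration/benchmark run.
    conn.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (body["commit"], body.get("date"), body["metric"],
         body["benchmark"], body["configuration"], body["repo"],
         body.get("pr"), body.get("branch"), body["data"]),
    )

def query(body):
    # POST /query: filter on the mandatory fields plus whichever
    # optional fields are present, and return matching datapoints.
    sql = "SELECT * FROM results WHERE metric = ?"
    args = [body["metric"]]
    sql += " AND benchmark IN (%s)" % ",".join("?" * len(body["benchmarks"]))
    args += body["benchmarks"]
    sql += " AND configuration IN (%s)" % ",".join("?" * len(body["configurations"]))
    args += body["configurations"]
    for field in ("repo", "pr", "branch"):
        if body.get(field) is not None:
            sql += f" AND {field} = ?"
            args.append(body[field])
    if body.get("commits"):
        sql += " AND commit_hash IN (%s)" % ",".join("?" * len(body["commits"]))
        args += body["commits"]
    return conn.execute(sql, args).fetchall()
```

Parameterised placeholders keep the handlers safe against injection even though the filters are assembled dynamically.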

During benchmark runs, for each completed configuration/benchmark, we do POSTs to insert parsed data into the database, and then do another POST to store the log in object storage.

The visualization frontend can just be a static webpage that talks to the backend. We can also have other text-based frontends (such as GitHub bot) that comment on PRs.

Some example HTTP requests.

Performance regression for the same configuration on multiple benchmarks: {metric: "total_time", benchmarks: [fop, lusearch], configurations: [OpenJDK_SemiSpace], repo: mmtk/mmtk-core, pr: None, branch: master, commit: None}

Performance comparison before merging PR: {metric: "total_time", benchmarks: [fop, lusearch], configurations: [OpenJDK_SemiSpace], repo: mmtk/mmtk-core, pr: 42, commit: None}

Get performance for a single commit: {metric: "total_time", benchmarks: [fop, lusearch], configurations: [OpenJDK_SemiSpace], repo: mmtk/mmtk-core, commit: deadbeef}

Performance comparison against baseline: {metric: "total_time", benchmarks: [fop], configurations: [OpenJDK_SemiSpace, OpenJDK_Parallel], repo: mmtk/mmtk-core, pr: None, branch: master, commit: None}
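
For the "comparison before merging a PR" case, the datapoints returned for the PR and for master would then be reduced on the client side. A minimal sketch, assuming datapoints arrive as dicts with `benchmark` and `data` keys (an assumption, not a fixed schema):

```python
# Toy client-side reduction: per-benchmark PR/master ratio of mean
# metric values, computed from two /query responses. The datapoint
# field names are hypothetical.
from collections import defaultdict
from statistics import mean

def compare(pr_points, master_points):
    """Return the PR/master ratio of the mean metric value per benchmark."""
    def by_benchmark(points):
        groups = defaultdict(list)
        for p in points:
            groups[p["benchmark"]].append(p["data"])
        return {b: mean(vs) for b, vs in groups.items()}
    pr, master = by_benchmark(pr_points), by_benchmark(master_points)
    # Only benchmarks present on both sides are comparable.
    return {b: pr[b] / master[b] for b in pr if b in master}
```

A ratio above 1.0 on a time metric would flag a slowdown worth commenting on the PR.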

qinsoon commented 1 year ago

That looks like what codespeed does. Should we use codespeed rather than reinventing the wheel?

caizixian commented 1 year ago

> That looks like what codespeed does. Should we use codespeed rather than reinventing the wheel?

Main problems are

  1. Codespeed only supports numerical data. We might want to store histograms from bpftrace, etc.
  2. Codespeed's timeline view only supports viewing different benchmarks in different graphs and comparing different executables in the same graph. We want to support viewing different benchmarks of the same executable in the same graph, so that we can see how the performance trend differs depending on the workload.
  3. It is unclear how to support multiple invocations and error bars.

caizixian commented 1 year ago

Also, it seems like the API I proposed above is too narrow. We probably need something plotty-esque. Essentially, we need four generic fields: run, scenario, metric, and value.

We assume that the database backend will just need to perform filtering and retrieval, and the analysis logic will be implemented on the client side. It seems like a document DB such as MongoDB or Elasticsearch could be a good choice for such unstructured data.

We are mostly interested in two types of queries: comparing two runs, or tracking the trend of specific scenarios over time. So we need some sort of indices on run, scenario, and metric.
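
The generic record layout and the two query shapes can be sketched in plain Python, with dicts standing in for the indices a document DB would maintain (all names here are illustrative only):

```python
# In-memory sketch of the {run, scenario, metric, value} record layout
# and the two indices suggested above. A real backend (MongoDB,
# Elasticsearch, ...) would maintain these indices itself.
from collections import defaultdict

by_run = defaultdict(list)              # index for "compare two runs"
by_scenario_metric = defaultdict(list)  # index for "trend over time"

def add(record):
    # Every record is indexed both ways on insertion.
    by_run[record["run"]].append(record)
    by_scenario_metric[(record["scenario"], record["metric"])].append(record)

def trend(scenario, metric):
    """All values recorded for one scenario/metric, in insertion order."""
    return [r["value"] for r in by_scenario_metric[(scenario, metric)]]
```

With retrieval this cheap, normalisation and plotting can indeed stay on the client side.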

Client-side analysis and visualization should be feasible given today's web stack and machine performance.

This might eventually replace plotty, so that we can share the same workflow for performance regression and day-to-day analysis.

It might be possible to do a lot of the analysis and build a dashboard in, e.g., Kibana (the normalization algorithm used by plotty is really hard to implement as database queries). https://www.elastic.co/guide/en/kibana/current/lens.html

qinsoon commented 1 year ago

Zixian mentioned this blog post https://www.mongodb.com/blog/post/using-change-point-detection-find-performance-regressions. The post itself does not include much information, but there is a list of papers and talks at the end.
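
As a toy illustration of the change point detection idea (not the algorithm the post describes, which is considerably more robust): a single mean-shift search picks the split that maximises the gap between the mean before and after it.

```python
# Toy change point detector: return the index where the mean of the
# series shifts the most. Production regression-detection systems use
# statistically grounded methods; this only illustrates the idea.
from statistics import mean

def change_point(series):
    """Index of the split maximising |mean(before) - mean(after)|."""
    best, best_gap = None, 0.0
    for i in range(1, len(series)):
        gap = abs(mean(series[:i]) - mean(series[i:]))
        if gap > best_gap:
            best, best_gap = i, gap
    return best
```

Run on a per-commit timeline of a metric, the returned index would point at the commit where performance shifted.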