open-telemetry / opentelemetry-go

OpenTelemetry Go API and SDK
https://opentelemetry.io/
Apache License 2.0

Add continuous benchmarks #4537

Open MadVikingGod opened 11 months ago

MadVikingGod commented 11 months ago

Problem Statement

I want to understand how the Go SDK's performance has changed over time.

Proposed Solution

Use the newly available dedicated environments in the OTel org to run benchmarks automatically.

Tasks to make this happen

[1]: Currently, there are approximately 300 benchmarks with 3 data points each (ns/op, allocs/op, and B/op). The current tools will make a graph for each one, which is overwhelming.

[2]: This can be done via:

git checkout --orphan benchmarks
git commit --allow-empty -m "Init"
git push origin benchmarks

Other Notes

This can't be run against every PR. There isn't a way to securely run against a PR and not have it be open for exploration. Instead, this should be run against releases.

dmathieu commented 11 months ago

See also https://github.com/open-telemetry/opentelemetry-go/issues/2791

We already run go test -bench=. in a GitHub Action on pushes to main: https://github.com/open-telemetry/opentelemetry-go/blob/main/.github/workflows/benchmark.yml

However, that run is rather unreliable, as the runs may have noisy neighbors. Switching to a dedicated environment would definitely help.

XSAM commented 2 months ago

This can't be run against every PR. There isn't a way to securely run against a PR and not have it be open for exploration. Instead, this should be run against releases.

@MadVikingGod, I missed the context here. Could you please elaborate on the security issue of running a benchmark CI against every PR? Thank you

dmathieu commented 2 months ago

Also note that pushing the benchmark results to GitHub pages seems like outside the acceptable use for GitHub apps/extensions, since bots should only be reading data, not writing to the repository.

https://github.com/open-telemetry/community/blob/main/docs/using-github-extensions.md

MadVikingGod commented 2 months ago

This can't be run against every PR. There isn't a way to securely run against a PR and not have it be open for exploration. Instead, this should be run against releases.

@MadVikingGod, I missed the context here. Could you please elaborate on the security issue of running a benchmark CI against every PR? Thank you

We accept PRs from everyone, and because those PRs can change what commands are run on the dedicated hardware, it is a security risk to run them on every PR. GitHub does not recommend using self-hosted runners with a public repository.

XSAM commented 2 months ago

@MadVikingGod, is it OK if I assign this to myself? I want to push this issue.

XSAM commented 2 months ago

To make the continuous benchmark work, we at least need to solve the following issues:

Proposal:


Other optional goals:

dmathieu commented 2 months ago

Enable self-hosted runner to have a dedicated environment to run benchmarks and gain stable results.

There have been several discussions about this at the community level, aiming to set this up in an automated way where no single person holds the keys. No definitive solution has arisen yet (the closest one would be actuated, but those are not dedicated runners, AFAIK). Related: https://github.com/open-telemetry/community/issues/1162

XSAM commented 2 months ago

IIUC, we have two types of GitHub runners available to use:

Related issue:


I can try actuated's machines first to see whether they produce stable benchmark results. If they don't, I can try a self-hosted runner as a backup.

yurishkuro commented 2 months ago

The OTel collector uses this type of runner for testing

testing, but not benchmarks?

XSAM commented 2 months ago

The OTel collector uses this type of runner for testing

testing, but not benchmarks?

Yes, not benchmarks, only testing. https://github.com/open-telemetry/opentelemetry-collector/blob/227fb823dbf33e6a62b842aef5a70e10ed1db84f/.github/workflows/build-and-test-arm.yml#L27

IIUC, the collector CI is not running any benchmark.

XSAM commented 1 month ago

Update: The actuated runner cannot run for a forked repository https://github.com/XSAM/opentelemetry-rust/actions/runs/9968339009/job/27543447028 (the job will wait forever for a runner to pick it up).

XSAM commented 1 month ago

Tasks

MrAlias commented 1 month ago

@XSAM the resource limits I was thinking about in the SIG meeting were actually just Go flags: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/4301d25c8fbe97a8153eca91acc370b990011af2/.github/workflows/build-and-test-windows.yml#L53-L54

I'm not sure that will have an impact on the consistency of benchmarks; you can likely just disregard it.

XSAM commented 3 weeks ago

benchmark-action does not recognize the Go package name; thus, it treats benchmarks with the same name from different packages as identical. This makes the benchmark comparison in the summary incorrect, as it compares the wrong benchmarks.

For example, BenchmarkBool/AsBool from the https://github.com/open-telemetry/opentelemetry-go/actions/runs/10424262745 action run appears twice:

| Benchmark suite | Current: 2db4ef2c032cd2390d18d01feeb37b71a7ca9469 | Previous: f5872418a6297a9b0b2806b84d3d66707b1a0e37 | Ratio |
| --- | --- | --- | --- |
| BenchmarkBool/AsBool | 0.9276 ns/op, 0 B/op, 0 allocs/op | 1.001 ns/op, 0 B/op, 0 allocs/op | 0.93 |
| BenchmarkBool/AsBool | 2.087 ns/op, 0 B/op, 0 allocs/op | 1.001 ns/op, 0 B/op, 0 allocs/op | 2.08 |

It might be worth introducing a group field to the result of benchmark-action.