CI Benchmark discussion

MadVikingGod commented 2 years ago

The Benchmark job is not giving us consistent results and is generally failing.

We should discuss if this is adding value enough to improve it, or if it should be removed.

Some datapoints:

Of the last 20 runs 2 passed
The benchmark names are not sufficient, they don't provide what package they are from.. eg: BenchmarkInt64Slice/KeyValue, BenchmarkString/KeyValue, and BenchmarkIterator_4
The failing test isn't consistent, of the last 5 runs it failed on:
- BenchmarkHistogramSearchFloat64_1024
- BenchmarkString/KeyValue
- BenchmarkIterator_4
- BenchmarkGaugeObserverObservationFloat64
- BenchmarkInt64Slice/KeyValue
- BenchmarkBoolSlice/Emit
- BenchmarkInt/KeyValue
- BenchmarkHistogramSearchInt64_1

TylerHelmuth commented 2 years ago

I think benchmark tests are super important and are a step in the right direction towards being able to answer customer questions on expected performance and resource requirements. Performance of open telemetry was always a big topic when I was a customer of open telemetry, and it feels like it is still an open question; we ended up stress testing components ourselves.

But if the tests in their current state aren't meeting that goal and are actively prohibiting speedy development on the project then maybe it would be best to temporarily disable (or let them run but don't throw up a red x) while we re-evaluate what to do.

A similar issue was recently opened in the Collector Contrib. There was a suggestion to publish benchmark reports for every build that could then be reviewed instead of failing builds. The python repo has something similar that could be used as an example. Running on Github hardware , though, the python reports also suffer from noisy neighbors like ours do, so I don't think it is a perfect solution, but it is a step in the right direction.

@tigrannajaryan mentioned here that CNCF has provided private compute space to run the tests on, which, if its still true, would solve our noisy neighbor problem. Assuming we have our own space with which to run the tests, I think it would be great to get the benchmarks running in that space and reporting visual, reviewable results every run.

Long term and outside the scope of Go, it would be awesome to have a OpenTelemetry Project-wide performance report that could be shared. If we can standardize as a Project on a benchmark reporting solution, maybe we could have a single link with all the different OpenTelemetry components that are reporting performance results and resource requirements. With more components in OpenTelemetry becoming stable/putting out major releases, I think we need to have solid answers for customers when they ask the performance of traces/metrics/logs instrumentation or the collector.

I would love to work on this problem, whatever direction we decide to go.

dmathieu commented 2 years ago

I agree with @TylerHelmuth entirely. As a short term solution, we could stop failing them on pushes to main. PRs can be retried (though for some reason, I'm never seeing any failures in PRs).

TylerHelmuth commented 2 years ago

For perspective, I went looking at what other SIGs are doing with benchmarks and this is what I found.

Java: Has benchmarks but not part of any CI.
Dotnet: Has benchmarks but not part of any CI.
Collector: Has benchmarks but not part of any CI.

I know Java and Dotnet ask for runs on a PR by PR basis. Dotnet publishes the results as comments in the code.

It wouldn't be unprecedented if we remove the benchmarks from our CI and instead ask for them to be run on a per-PR basis. If we do that I think we should add to the contributing guide the reasons when they would need to be run.

MrAlias commented 2 years ago

It wouldn't be unprecedented if we remove the benchmarks from our CI and instead ask for them to be run on a per-PR basis. If we do that I think we should add to the contributing guide the reasons when they would need to be run.

I'm reluctant to add overhead to the development process that in theory could be automated and performed by a computer. Especially since the benchmarks will likely not be comparable due to the unique hardware/OS the benchmarks will be run on for each developer.

If we could get the tests to run on the dedicated EC2 instance mentioned that would be amazing and the solution I would strive for. I wonder if there is a cron job we could setup on the instance and periodically run/publish benchmarks. Maybe from that a trigger could be devised so the job runs when a merge happens.

TylerHelmuth commented 2 years ago

If we could get the tests to run on the dedicated EC2 instance mentioned that would be amazing and the solution I would strive for

Agreed, I think this is the best end state. In the meantime stopping the benchmarks from blocking CI/CD does feel ok.

MrAlias commented 2 years ago

@TylerHelmuth can you get a PR together to stop the benchmark workflow from failing the CI report?

At least for the meantime, while the long-term solution is explored.

TylerHelmuth commented 2 years ago

Yes I can get that done tomorrow

TylerHelmuth commented 2 years ago

During the Collector SIG today it was confirmed that the project does have hardware that could be used to run benchmarks. Based on @Aneurysm9's recommendation, I am going to investigate if it is feasible to use the boxes as private Github runners.

TylerHelmuth commented 2 years ago

So the good news is that self-hosted runners are free, and we could use our boxes to run actions and it sounds like the integration and process for hooking into a self-hosted runner is pretty good.

The horrible news is that GitHub does not recommend using self-hosted runners for public repositories, because forks of the repo will also have their actions run on the self-hosted runner, which is a security risk. There may be ways to protect ourselves, but we're gonna have to get creative if we want to use self-hosted runners.

TylerHelmuth commented 2 years ago

I think to use the boxes as self-hosted runners the workflows would need to live in a private repository. It seems possible to trigger workflows on one repository from a different repository, so it seems feasible that a workflow in opentelemetry-go could trigger benchmarks (and only benchmarks) in the private repository which would then run the benchmark tests. After the tests run I bet we could use the github API to post the results back to the main repository.

Seems like a lot of overhead to run benchmarks, but at the same time I want the benchmarks run lol. We should also look into options other than self-hosted runners and see if there are any good alternatives.

tigrannajaryan commented 2 years ago

@jpkrohling can you tell more about the bare-metal boxes you said may be available? How many there, who owns them, etc.

dmathieu commented 2 years ago

Running benchmarks on a private repository seems like a bad approach. Contributors wouldn't be able to know what we run, or when. Nor would they be able to add new benchmarks when a code change requires it.

An alternative solution could be to enable approval of PR workflows for either first-time contributors, or all of them. That would reduce the risk of having malicious actors running something inappropriate in the custom runner.

TylerHelmuth commented 2 years ago

I think we could design the private repository in such a way that the Benchmark tests themselves are still in this repo and only the workflow definition would be in the private repo. The private repo workflow could clone this repo to get access to everything it needs, and all the results of the runs could be posted back to this repo somewhere.

But I agree, it's less than optimal. It would be much harder to maintain has only certain people would be able to troubleshoot and make changes to the job. Also would be a pretty complicated setup without a lot of transparency.

There is a section in this blog post that has an interesting read on self-hosted runners. The approval solution could work, but I think we'd have to do it for all runs; a second -time contributor could be malicious as well. I think the approval solution has a lot of human error opportunity though, like if a run-approver misses something malicious. Spread that across all the SIGs and it gets more likely that a mistake will be made.

TylerHelmuth commented 2 years ago

As discussed during the Go SIG today, before continuing to solve how to use the boxes, we'd like to validate that the boxes will solve what we expect is a noisy neighbor problem. @jpkrohling and @tigrannajaryan, is it possible for me and/or @dmathieu to get temporary access to the boxes so that we can run the benchmark tests for Go (and load tests for collector-contrib) and verify that the benchmarks report consistently?

As part of this benchmark work I am also going to investigate what other large open source projects are doing for benchmarks, starting with Prometheus, which recently detected a memory increase in a new version using their benchmarks.

jpkrohling commented 2 years ago

I have used a set of bare metal machines from Equinix that could be provisioned on-demand, sponsored by the CNCF. While I can try to get you all temporary access to the project I have there, the appropriate way to request access is by creating an issue here: https://github.com/cncf/cluster

Here's an issue you can use as reference: https://github.com/cncf/cluster/issues/182

tigrannajaryan commented 2 years ago

It is probably useful to have some preliminary thoughts on these before starting the actual implementation:

Who is going to be responsible for the ongoing maintenance? Is the answer "all core and contrib maintainers/approvers" or we will have more restrictions in place?
What kind of access there will be for maintainers/approvers/other contributors? Who can SSH into the machines and who cannot?

TylerHelmuth commented 2 years ago

@jpkrohling is there a way to use this request to get access to the boxes that OpenTelemetry already has? I don't wan't to accidentally request more machines.

jpkrohling commented 2 years ago

The machines are provisioned manually, on demand. At this moment, we have the ability to have bare metal machines, but we have none in active use.

TylerHelmuth commented 2 years ago

Who is going to be responsible for the ongoing maintenance? Is the answer "all core and contrib maintainers/approvers" or we will have more restrictions in place?

It would be great to be able to identify a handful of maintainers of the machines, who are responsible for ensuring the boxes are available and capable of running tests. I don't think the maintainers can be responsible for ensuring any particular SIG's tool for benchmark tests are on the machine. Flushing out what "maintain" means in this context will also be important.

What kind of access there will be for maintainers/approvers/other contributors? Who can SSH into the machines and who cannot?

I think the answer to this will depend on whether or not we can find a way to use the machines as part of a workflow. If we can trigger the tests remotely then I think we should limit the number of people who can ssh in. In this scenario, I think only a few people from each SIG (maybe maintainers, maybe some combo of maintainers and approvers) should be able to ssh in to ensure the boxes have the tools necessary for the tests to trigger remotely. If we can't run the tests remotely then maybe more people will need access. I think limiting it only to approvers/maintainers is the safest though.

TylerHelmuth commented 2 years ago

Followup on how others do benchmarking:

Prometheus uses prombench, which is a tool created for the prometheus project. The setup seems pretty complicated, but in short the tests are run against a dedicated test cluster in GKE. I think the tests can only be triggered manually by maintainers..
Other examples I found are also spinning up clusters on demand and running tests there.
I also found examples of repos that have benchmarking tools, but don't have them hooked up to CI. I think etcd is this way.

I found no examples of using github self-hosted runners (other than the blog I already posted). Other's investigations on this topic would be welcome.

open-telemetry / opentelemetry-go

CI Benchmark discussion #2791