ajnavarro opened this issue 1 week ago
The option "Execute Benchmarks on Pull Requests" can be reconsidered by disabling the failOnAlert option.
Consider #2915
Run Benchmarks Only on Modified Packages: This approach requires further thought, as it might not be the most effective solution.
Yes, I'd rather we start off with a small set of "core" benchmarks which are run on PRs.
Comparisons with master using benchstat would be nice.
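For what it's worth, a rough sketch of what "run benchmarks only on modified packages" could look like; everything here (running from the repo root, master as the base branch, the helper name) is an assumption for illustration, not an existing script:

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
)

// changedPackages maps files touched since the base branch to their
// Go package directories. Sketch only: assumes a git checkout and
// that it runs at the repository root.
func changedPackages(base string) ([]string, error) {
	out, err := exec.Command("git", "diff", "--name-only", base+"...HEAD").Output()
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	var pkgs []string
	for _, f := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if !strings.HasSuffix(f, ".go") {
			continue
		}
		dir := "./" + filepath.Dir(f)
		if !seen[dir] {
			seen[dir] = true
			pkgs = append(pkgs, dir)
		}
	}
	return pkgs, nil
}

func main() {
	pkgs, err := changedPackages("master")
	if err != nil {
		panic(err)
	}
	// A CI job could then run, e.g.:
	//   go test -run=^$ -bench=. <pkgs...>
	// and compare the output against master's with benchstat.
	fmt.Println(pkgs)
}
```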
Run a Subset of Benchmarks: Use the -short flag to execute a smaller set of benchmarks. While this won't completely prevent regressions in the master branch, it will reduce their frequency.
IMO this could also be just a list in the workflow YAML file; using -short
means that benchmarks are by default included, which is not ideal because I'll definitely forget it when I review a PR adding a benchmark. Instead, having a list of benchmarks ensures they have to be explicitly added.
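To illustrate the opt-out behavior being described: with -short, a benchmark only stays out of the run if its author remembered to add a guard like the one below (the package and benchmark names are made up):

```go
package gnovm_test // hypothetical package name

import "testing"

func BenchmarkSomethingExpensive(b *testing.B) { // hypothetical benchmark
	if testing.Short() {
		// The author has to remember this guard; without it the
		// benchmark is included in every -short run by default.
		b.Skip("skipping expensive benchmark in -short mode")
	}
	for i := 0; i < b.N; i++ {
		// ... expensive work ...
	}
}
```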
The non-essential benchmarks we can disable with build tags or flags maybe?
Rest LGTM. :)
IMO this could also be just a list in the workflow YAML file
That might be an extra step that is difficult for all devs to know about. That is how it was implemented before: you needed to add your benchmark to a YAML file. To reduce the hassle to the minimum and avoid repeating the same mistakes, I propose a specific flag, similar to -short, used to indicate whether that benchmark will run regularly or not. It could be -ci.
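A minimal sketch of how such a dedicated -ci flag could work, assuming CI invokes go test with it and heavy benchmarks opt out; the flag name, helper, and benchmark are illustrative, not an existing API:

```go
package bench_test // hypothetical package name

import (
	"flag"
	"testing"
)

// ci would be set by the CI workflow, e.g.:
//   go test -bench=. ./... -args -ci
var ci = flag.Bool("ci", false, "indicates a regular CI benchmark run")

// skipInCI excludes a benchmark from the regular CI benchmark set.
func skipInCI(b *testing.B) {
	b.Helper()
	if *ci {
		b.Skip("not part of the regular CI benchmark set")
	}
}

func BenchmarkGoLevelDBWrites(b *testing.B) { // hypothetical heavy benchmark
	skipInCI(b)
	for i := 0; i < b.N; i++ {
		// ... database-heavy work ...
	}
}
```

Note that, like -short, this is still opt-out per benchmark; an explicit list in the workflow, as suggested above, would make inclusion opt-in instead.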
@thehowl WDYT?
I believe we should establish two levels of benchmarks:
The first level is enforced quietly and applies to all developers, including those writing contracts or fixing typos in a README. There should be no noise unless a PR triggers a global benchmark issue. In that case, the PR will become a topic of discussion regarding benchmarks.

We could implement high-level benchmarks, such as running a blockchain node and processing N transactions. While this may not identify the exact cause of any slowdown, knowing that overall performance has decreased is sufficient for the core team to pause a PR until we investigate further.

Developers should not have to check a box or worry about benchmarks; they should be able to continue their work without concern until a CI check or the core team blocks the merge for performance reasons.
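A rough sketch of what such a level-1, end-to-end benchmark could look like; everything here (the processor interface, the no-op stand-in, the names) is hypothetical scaffolding for illustration, not an existing gno API:

```go
package e2e_test // hypothetical package name

import "testing"

// txProcessor stands in for whatever component ends up applying
// transactions (a node, a mempool+app pair, ...). Hypothetical.
type txProcessor interface {
	DeliverTx(tx []byte) error
}

// nopProcessor is a placeholder so the sketch compiles; a real
// benchmark would start an in-process node here instead.
type nopProcessor struct{}

func (nopProcessor) DeliverTx([]byte) error { return nil }

func BenchmarkProcessNTransactions(b *testing.B) {
	var node txProcessor = nopProcessor{}
	// Pre-build a fixed batch of transactions outside the timed loop.
	const n = 1000
	txs := make([][]byte, n)
	for i := range txs {
		txs[i] = []byte("hypothetical signed tx bytes")
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for _, tx := range txs {
			if err := node.DeliverTx(tx); err != nil {
				b.Fatal(err)
			}
		}
	}
}
```

The point is the granularity: one number for "N transactions end to end", rather than per-function profiling.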
The second level is for PRs that introduce features we want to benchmark from the outset, primarily in tm2 and gnovm, or for those focused on adding benchmarks rather than features. These PRs should prioritize benchmark configuration management, including defaults for CI and local setups.
However, upon reviewing the PR history, it’s unclear if we can conclude that everyone ignored benchmarks; most PRs were simply unrelated to them (case 1). Ignoring benchmarks is definitely more problematic in case 2.
The quick wins that I suggest:
Let's summarize all the requests:
- Benchmark tooling: github-action-benchmark. If none of the existing tools fit all requirements, consider developing a custom benchmark tool.
- Benchmark selection (rather than the -short flag): I would rather prefer to specify a new flag, like -ci. See my previous comment.

Ping me if I missed something or some point needs more clarification / the output I got is wrong.
Working on it right now. Let me get a PoC ready and then we’ll figure out a specific rule to meet this need. :)
Description
Related issues: Benchmark Results
Now that we have a Benchmark MVP working, I'd like to enumerate the requests from several parties regarding the next steps:
Tasks
[ ] Execute Benchmarks on Pull Requests. We had to roll back this feature due to concerns from developers about the time it takes for benchmarks to complete. Here are some proposals to mitigate this issue:
  - Run a Subset of Benchmarks: use the -short flag to execute a smaller set of benchmarks. While this won't completely prevent regressions in the master branch, it will reduce their frequency.
[ ] Disable Non-Essential Benchmarks via Flags. Some benchmarks, like those testing goleveldb (which are testing the database itself), don't need to run on every PR; see the build-tag sketch after this list. Instead, we could focus on specific benchmarks that test particular cases. Here are some examples:
  - the examples package
  - Benchdata performance
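One possible shape for disabling the non-essential benchmarks, as a sketch: guard the heavy benchmark files with a build tag so they are excluded from the default go test run (the tag, package, and benchmark names are assumptions):

```go
//go:build heavybench

package db_test // hypothetical package name

import "testing"

// Only compiled when the (hypothetical) heavybench tag is set:
//   go test -tags heavybench -bench=. ./...
func BenchmarkGoLevelDBRandomWrites(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// ... exercise the goleveldb backend here ...
	}
}
```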
Benchmark Tools
Currently, two GitHub Actions can run Go benchmarks:
If none of these tools meets all our requirements, we should consider developing a new action fitting our use cases.
This is a call for comments (cc @moul and @thehowl).