neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

run pagebench performance tests on predictable, developer-reproducible infrastructure #6297

Closed (problame closed 3 months ago)

problame commented 9 months ago

https://github.com/neondatabase/neon/pull/6214 adds a pagebench benchmark.

In order to make benchmark results actionable, the results that we record during the nightly runs should be easily reproducible by Pageserver developers.

DoD

Implementation Ideas

  1. Run the nightly pagebench benchmarks on the same EC2 instance type as we use in production.
  2. Move the code that sets up the instance (e.g., partitioning the instance store volume, setting kernel tunables) into a script that is usable both by the nightly benchmarks and by a developer who has manually provisioned the same EC2 instance type.
  3. Teach/document/remind pageserver developers how to run the benchmark.
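
Idea 2 could be sketched roughly as below: a single Python entry point that builds the list of setup commands and only executes them when asked, so the nightly job and a developer on a manually provisioned instance go through the same code path. This is a minimal sketch, not the script from any of the linked PRs. The function names (build_setup_plan, run_plan), the choice of ext4, and any device path, mount point, or tunable passed in are hypothetical placeholders.

```python
import shlex
import subprocess


def build_setup_plan(device, mount_point, sysctls):
    """Return the setup commands as a list of argv lists, without running them.

    All arguments are placeholders supplied by the caller; nothing here
    reflects the values actually used in production.
    """
    plan = [
        ["mkfs.ext4", "-F", device],      # format the instance store volume (ext4 is an assumption)
        ["mkdir", "-p", mount_point],     # make sure the mount point exists
        ["mount", device, mount_point],   # mount the freshly formatted volume
    ]
    # One `sysctl -w key=value` call per kernel tunable, in a stable order.
    for key in sorted(sysctls):
        plan.append(["sysctl", "-w", f"{key}={sysctls[key]}"])
    return plan


def run_plan(plan, dry_run=True):
    """Print every command; execute it only when dry_run is False."""
    rendered = []
    for argv in plan:
        line = shlex.join(argv)
        rendered.append(line)
        print(line)
        if not dry_run:
            # Requires root on the target instance; stops at the first failure.
            subprocess.run(argv, check=True)
    return rendered
```

With dry_run left at its default, a developer can review exactly what the nightly job would do to the instance before running anything destructive.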

Related Issues

https://github.com/neondatabase/neon/pull/6214 adds the first pagebench benchmark.

#6295 enables fast iteration (measure -> change -> build cycle) using overlayfs.

This is not useful in the nightly benchmarks, which run the benchmark just once.

https://github.com/neondatabase/neon/pull/6350 adds a script to set up the EC2 instance store.

Bodobolero commented 4 months ago

Reactivated pagebench in CI (see https://github.com/neondatabase/neon/pull/8023 ). However, the test is still flaky for other reasons (see https://github.com/neondatabase/neon/issues/8070 and https://neondb.slack.com/archives/C060CNA47S9/p1718358720474039 ).

This flakiness has now been fixed by https://github.com/neondatabase/neon/pull/8079 and the pagebench test case runs for each commit on main in the neon.git repo.

The results are reported in this dashboard:

https://neonprod.grafana.net/d/DGKBm9Jnz/perf-test3a-unit-perf-tests?orgId=1&var-test_name=test_runner%2Fperformance%2Fpageserver%2Fpagebench%2Ftest_pageserver_max_throughput_getpage_at_latest_lsn.py%3A%3Atest_pageserver_max_throughput_getpage_at_latest_lsn%5Brelease-pg14-github-actions-selfhosted-1-13-30%5D&var-test_name=test_runner%2Fperformance%2Fpageserver%2Fpagebench%2Ftest_pageserver_max_throughput_getpage_at_latest_lsn.py%3A%3Atest_pageserver_max_throughput_getpage_at_latest_lsn%5Brelease-pg14-github-actions-selfhosted-1-6-30%5D&var-test_name=test_runner%2Fperformance%2Fpageserver%2Fpagebench%2Ftest_pageserver_max_throughput_getpage_at_latest_lsn.py%3A%3Atest_pageserver_max_throughput_getpage_at_latest_lsn%5Brelease-pg14-github-actions-selfhosted-10-13-30%5D&var-test_name=test_runner%2Fperformance%2Fpageserver%2Fpagebench%2Ftest_pageserver_max_throughput_getpage_at_latest_lsn.py%3A%3Atest_pageserver_max_throughput_getpage_at_latest_lsn%5Brelease-pg14-github-actions-selfhosted-10-6-30%5D&var-platform=github-actions-selfhosted&from=1716715476558&to=1719307476558

Bodobolero commented 3 months ago

The dedicated EC2 instance runner is now in Terraform; see the issue linked above. One piece is still missing: https://github.com/neondatabase/cloud/issues/15053 Until then, I am running the workflow from https://github.com/neondatabase/ec2_test_runner/blob/main/.github/workflows/periodic_pagebench.yml

The driver code running on the instance runner is here:

https://github.com/neondatabase/ec2_test_runner/tree/main/pagebench

The GitHub workflow in the neon repo and the test cases are in this PR: https://github.com/neondatabase/neon/pull/8233

The Grafana dashboard is here: https://neonprod.grafana.net/d/ddqtbfykfqfi8d/afd0fdec-f44d-5f2c-a2c0-b738d4ce3d32?orgId=1

However, you can also use the generic perf dashboard:

https://neonprod.grafana.net/d/DGKBm9Jnz/perf-test3a-unit-perf-tests?orgId=1&var-test_name=All&var-platform=ec2-test-runner-1.eu-central-1.aws.neon.build&from=1717249056185&to=1719841056185

The GitHub workflow has a manual dispatch where you can enter the full commit hash of a Neon repo commit and run the test against that specific commit, which lets you bisect a regression.

You can also connect to the EC2 instance runner through Teleport during the run with tsh ssh ec2-test-runner-1.eu-central-1.aws.neon.build
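
For illustration, bisecting with the manual dispatch could be driven by logic like the sketch below. It checks that an input is a full hash and then narrows a known-good/known-bad range one run at a time. The helper names are hypothetical. The 40-character lowercase hex format assumes SHA-1 commit ids, and is_bad stands in for whatever a developer does to dispatch the workflow for one commit and judge the result against the dashboard.

```python
import re

# Assumption: commit ids are full 40-character lowercase SHA-1 hashes.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(value):
    """True only for a full hash; abbreviated hashes are rejected."""
    return FULL_SHA.fullmatch(value) is not None


def first_bad_commit(commits, is_bad):
    """Binary-search for the first regressed commit.

    `commits` is ordered oldest to newest, commits[0] is known good and
    commits[-1] is known bad. `is_bad(commit)` is a caller-supplied
    callable (hypothetical here) that runs the benchmark for one commit
    and decides whether it shows the regression.
    """
    for commit in commits:
        if not is_full_commit_hash(commit):
            raise ValueError(f"not a full commit hash: {commit!r}")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

This needs about log2(n) benchmark runs for a range of n commits, and it assumes that once the regression appears it persists in all later commits.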

Bodobolero commented 3 months ago

https://neondb.slack.com/archives/C033RQ5SPDH/p1720158590249679

Bodobolero commented 3 months ago

The test was failing and needed some rework:

https://github.com/neondatabase/neon/pull/8382#event-13552246867