neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.48k stars 420 forks source link

allow storage_controller error during pagebench #8109

Closed Bodobolero closed 3 months ago

Bodobolero commented 3 months ago

Problem

test_pageserver_max_throughput_getpage_at_latest_lsn is a pagebench testcase which creates several tenants/timelines to verify pageserver performance.

The test swaps environments around in the tenant duplication stage, so the storage controller uses two separate db instances (one in the duplication stage and another one in the benchmarking stage). In the benchmarking stage, the storage controller starts without any knowledge of nodes, but with knowledge of tenants (via attachments.json). When we re-attach and attempt to update the scheduler stats, the scheduler rightfully complains about the node not being known. The setup should preserve the storage controller across the two envs, but i think it's fine to just allow list the error in this case.

Summary of changes

add the error message

`2024-06-19T09:38:27.866085Z ERROR Scheduler missing node 1``

to the list of allowed errors for storage_controller

github-actions[bot] commented 3 months ago

3228 tests run: 3111 passed, 0 failed, 117 skipped (full report)


Code coverage* (full report)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
4e30745f424539d3816b821c09fe7733c446c226 at 2024-06-19T13:47:44.905Z :recycle:
Bodobolero commented 3 months ago

This allows the error in all tests, right? I think that's too permissive. Let's only allow it in the context of the max tput test like so:

diff --git a/test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py b/test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
index 1a0012397..b0c65af69 100644
--- a/test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
+++ b/test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
@@ -206,3 +206,11 @@ def run_benchmark_max_throughput_latest_lsn(
             unit="ms",
             report=MetricReport.LOWER_IS_BETTER,
         )
+
+    env.storage_controller.allowed_errors.append(
+        # The test setup swaps NeonEnv instances, hence different
+        # pg instances are used for the storage controller db. This means
+        # the storage controller doesn't know about the nodes mentioned
+        # in attachments.json at start-up.
+        ".* Scheduler missing node 1",
+    )

good idea to do it per testcase