neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
15.22k stars, 444 forks

pageserver: revise metrics lifetime for SecondaryTenant #9818

Status: Closed (jcsp closed 1 day ago)

jcsp commented 2 days ago

Problem

We saw a scale test failure when one shard went secondary->attached->secondary in a short period of time: the shard's metrics failed a validation assertion that is meant to ensure the size metric matches the sum of layer sizes in the SecondaryDetail struct.

This appears to be due to two SecondaryTenants being alive at the same time: the first one had been shut down, but its contributions were still counted in the metrics.
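
The failure mode described above can be illustrated with a minimal sketch. It is not the pageserver code: `SizeGauge`, `SecondaryTenantSketch`, `State`, and the `shutdown` method are hypothetical stand-ins, and it assumes a single shared gauge per shard. It shows one possible way to keep the invariant: a tenant withdraws its contribution when it is shut down, rather than whenever the last reference to it is dropped, so a replacement tenant starts from a clean gauge.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a per-shard size gauge (not Neon's metrics API).
#[derive(Default)]
struct SizeGauge(AtomicU64);

impl SizeGauge {
    fn add(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn sub(&self, n: u64) {
        self.0.fetch_sub(n, Ordering::Relaxed);
    }
    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Layer sizes plus a flag recording whether shutdown already ran.
#[derive(Default)]
struct State {
    layers: HashMap<String, u64>,
    shut_down: bool,
}

/// Hypothetical, heavily simplified secondary tenant for one shard.
struct SecondaryTenantSketch {
    gauge: Arc<SizeGauge>,
    state: Mutex<State>,
}

impl SecondaryTenantSketch {
    fn new(gauge: Arc<SizeGauge>) -> Self {
        Self { gauge, state: Mutex::new(State::default()) }
    }

    /// Record a layer and mirror its size into the shared gauge.
    fn add_layer(&self, name: &str, size: u64) {
        let mut state = self.state.lock().unwrap();
        if state.shut_down {
            return; // a shut-down tenant must not touch the gauge again
        }
        if let Some(old) = state.layers.insert(name.to_string(), size) {
            self.gauge.sub(old);
        }
        self.gauge.add(size);
    }

    /// Sum of layer sizes, the value the gauge is validated against.
    fn layer_sum(&self) -> u64 {
        self.state.lock().unwrap().layers.values().sum()
    }

    /// Withdraw this tenant's contribution at shutdown (idempotent).
    fn shutdown(&self) {
        let mut state = self.state.lock().unwrap();
        if state.shut_down {
            return;
        }
        state.shut_down = true;
        let total: u64 = state.layers.values().sum();
        self.gauge.sub(total);
        state.layers.clear();
    }
}

fn main() {
    let gauge = Arc::new(SizeGauge::default());

    // The shard is secondary, then gets attached: the first tenant is shut down
    // but stays alive because something still holds a reference to it.
    let first = SecondaryTenantSketch::new(gauge.clone());
    first.add_layer("layer-a", 100);
    first.shutdown();

    // The shard goes back to secondary while `first` has not been dropped yet.
    let second = SecondaryTenantSketch::new(gauge.clone());
    second.add_layer("layer-b", 40);

    // The validation holds because `first` no longer contributes.
    assert_eq!(gauge.get(), second.layer_sum());
    println!("gauge = {}", gauge.get());
    drop(first);
}
```

If `shutdown` did not subtract the first tenant's total, the gauge in this sketch would read 140 while the second tenant's layers sum to 40, which is the kind of mismatch the validation assertion is meant to catch.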

Closes: https://github.com/neondatabase/neon/issues/9628

Summary of changes

github-actions[bot] commented 2 days ago

5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)


Flaky tests (2)

#### Postgres 17

- `test_ondemand_wal_download_in_replication_slot_funcs`: [release-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9818/11948640280/index.html#suites/180444c850d4a41d41eb0a410dc16d84/168106639bff6c54/retries)
- `test_cli_start_stop`: [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9818/11948640280/index.html#suites/7c2541b6822795aacf99a72eb660b5b7/3e6739636b23a89d/retries)

Code coverage* (full report)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
52d9d4a58355424a48c56cb9ba9670a073f618b9 at 2024-11-21T08:34:30.584Z :recycle: