neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.48k stars 420 forks source link

storage controller: make proxying of GETs to pageservers more robust #9065

Closed jcsp closed 4 days ago

jcsp commented 1 week ago

Problem

These commits are split off from https://github.com/neondatabase/neon/pull/8971/commits where I was fixing this to make a better scale test pass -- Vlad also independently recognized these issues with cloudbench in https://github.com/neondatabase/neon/issues/9062.

  1. The storage controller proxies GET requests to pageservers based on their intent, not the ground truth of where they're really attached.
  2. Proxied requests can race with scheduling to tenants, resulting in 404 responses if the request hits the wrong pageserver.

Closes: https://github.com/neondatabase/neon/issues/9062

Summary of changes

  1. If a shard has a running reconciler, then use the database generation_pageserver to decide who to proxy the request to
  2. If such a request gets a 404 response and its scheduled node has changed since the request was dispatched.

Checklist before requesting a review

Checklist before merging

github-actions[bot] commented 1 week ago

5125 tests run: 4961 passed, 0 failed, 164 skipped (full report)


Flaky tests (17) #### Postgres 17 - `test_hot_standby_feedback`: [debug-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/950eff205b552e248417890b8b8f189e/c41a6cd28fde8b4d/retries) - `test_ondemand_wal_download_in_replication_slot_funcs`: [release-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/180444c850d4a41d41eb0a410dc16d84/b046763e9e780b81/retries), [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/180444c850d4a41d41eb0a410dc16d84/81a4bf90202d9c78/retries) - `test_neon_cli_basics`: [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/48b4046d39093f7675bf477e070db277/cf870a40229246b4/retries) - `test_pg_regress[None]`: [debug-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/158be07438eb5188d40b466b6acfaeb3/1f6e91af2f9a7e09/retries) - `test_readonly_node_gc`: [debug-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/3ccffb1d100105b98aed3dc19b717917/ee87137d25a150c1/retries) - `test_scrubber_physical_gc[4]`: [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/616e84f65c91fe4bc748db7447d35268/7aee82e087a095e2/retries), [debug-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/616e84f65c91fe4bc748db7447d35268/27d34f9f43db7193/retries) - `test_subscriber_restart`: [debug-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/8be0c222d5601535470e7e5978bbfb03/9c2cdf06ac28451/retries) #### Postgres 16 - `test_scrubber_physical_gc[4]`: [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/616e84f65c91fe4bc748db7447d35268/185933d436cca63d/retries) #### Postgres 15 - `test_scrubber_physical_gc[4]`: [release-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/616e84f65c91fe4bc748db7447d35268/edc41f69db954e68/retries), [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/616e84f65c91fe4bc748db7447d35268/19d35a569a99ed6e/retries) - `test_subscriber_restart`: [release-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/8be0c222d5601535470e7e5978bbfb03/c5f8009268374448/retries) - `test_peer_recovery`: [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/82004ab4e3720b47bf78f312dabe7c55/20928add8190136b/retries) #### Postgres 14 - `test_ondemand_wal_download_in_replication_slot_funcs`: [release-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/180444c850d4a41d41eb0a410dc16d84/c916c01e0e9b7974/retries) - `test_replica_start_scan_clog_crashed_xids`: [release-x86-64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/a26094bd3e2f863ec62f3b6538f3f78b/f9c2e427932c94a6/retries) - `test_scrubber_physical_gc[4]`: [release-arm64](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9065/10960582380/index.html#suites/616e84f65c91fe4bc748db7447d35268/a0efe473fcca3bb1/retries)

Code coverage* (full report)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2b0730d767f560e20b6748f57465922aa8bb805e at 2024-09-25T14:08:19.210Z :recycle:
jcsp commented 1 week ago

In theory we can add a python test for this behaviour, but not sure it's worth the failpoints. Up to you.

Yeah, seems worthwhile. Added test test_storage_controller_proxy_during_migration

VladLazar commented 1 week ago

In theory we can add a python test for this behaviour, but not sure it's worth the failpoints. Up to you.

Yeah, seams worthwhile. Added test test_storage_controller_proxy_during_migration

Nice test!