
pageserver: stuck detach operation #7427

Closed: problame closed this issue 2 weeks ago

problame commented 4 months ago

A /location_config operation got stuck while transitioning from attached to secondary (i.e. in Tenant::shutdown), which is clearly a pageserver bug. We mitigated by restarting the pageserver, but we should debug this.

https://neondb.slack.com/archives/C03H1K0PGKH/p1713453288707669?thread_ts=1713444234.032949&cid=C03H1K0PGKH

This occurred during a /location_config call transitioning the tenant's location on this pageserver from attached to secondary.

Root Cause Analysis

https://github.com/neondatabase/neon/issues/7427#issuecomment-2220523013

tl;dr: no direct smoking gun, but the symptoms so far point to page_service keeping the gate open for too long.

Fixing It

Refactor page_service so that the gate guard's lifetime is easy to understand and is obviously bounded to the requests that are in flight at the time of timeline shutdown.

Sketch: https://github.com/neondatabase/neon/pull/8286
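
For illustration, a minimal sketch of the intended shape (hypothetical names only; `RequestGuard`, `serve_one_request`, and `do_request_work` are illustrative, not the actual page_service types), using tokio and tokio_util's `CancellationToken`: the guard is scoped to a single request, and the request body races against the shutdown token, so no guard can outlive the requests that are in flight at shutdown.

```rust
use tokio_util::sync::CancellationToken;

// Stand-in for the real gate guard; in the pageserver this would be the
// guard obtained by entering the timeline's gate.
struct RequestGuard;

async fn serve_one_request(cancel: &CancellationToken) {
    // The guard lives exactly as long as this function body.
    let _guard = RequestGuard;

    tokio::select! {
        // If shutdown is requested, stop immediately and drop the guard.
        _ = cancel.cancelled() => {}
        // Otherwise do the actual request work (elided here).
        _ = do_request_work() => {}
    }
    // _guard is dropped here: the gate can close as soon as in-flight
    // requests finish or observe cancellation.
}

async fn do_request_work() {
    // Placeholder for serving a page request.
    tokio::task::yield_now().await;
}

#[tokio::main]
async fn main() {
    let cancel = CancellationToken::new();
    serve_one_request(&cancel).await;
}
```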

### Tasks
- [ ] https://github.com/neondatabase/neon/pull/8224
- [ ] https://github.com/neondatabase/neon/pull/8227
- [ ] fix dashboards that used `pageserver_live_connections`
- [ ] https://github.com/neondatabase/neon/pull/8296
- [ ] https://github.com/neondatabase/neon/pull/8295
- [ ] https://github.com/neondatabase/neon/pull/8292
- [ ] https://github.com/neondatabase/neon/pull/8338
- [ ] https://github.com/neondatabase/neon/pull/8244
- [ ] https://github.com/neondatabase/neon/pull/8372
- [ ] https://github.com/neondatabase/neon/pull/8449
- [ ] https://github.com/neondatabase/neon/pull/8339
- [x] rollout to staging
- [x] monitor cancellation behavior / number of live connections in staging
- [x] rollout to prod
- [ ] https://github.com/neondatabase/neon/pull/8650
- [ ] https://github.com/neondatabase/neon/pull/8632
- [ ] larger-scale testing of migrations to ensure it's truly fixed
- [ ] https://github.com/neondatabase/helm-charts/pull/96
- [ ] https://github.com/neondatabase/infra/pull/1746
- [x] deploy chaosInterval to staging https://neondb.slack.com/archives/C059ZC138NR/p1724079789211089
- [x] announce larger-scale testing to team https://neondb.slack.com/archives/C033A2WE6BZ/p1724080458246829
- [ ] give it a few days, analyze logs https://neondb.slack.com/archives/C033RQ5SPDH/p1724142868137099
### Follow-Ups
- [ ] https://github.com/neondatabase/neon/pull/8462
problame commented 4 months ago

Happened again during that migration.

jcsp commented 4 months ago

Suspected regression from recent changes around shutdown: https://github.com/neondatabase/neon/pull/7233

jcsp commented 4 months ago

Next steps: look carefully at timeline shutdown code.

jcsp commented 2 months ago

We have plenty of examples of `kept the gate from closing` messages in the field.

We see them in `handle_get_page_at_lsn_request` and in `initial_size_calculation{...}:logical_size_calculation_task:get_or_maybe_download`.

Those are signals that something in those code paths is not promptly respecting a cancellation token. However, they may not be responsible for the actual hangs: that message is printed when a gate guard eventually releases the gate, not when it holds the gate forever.
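
To illustrate why, here is a minimal sketch of a gate-like primitive (invented for illustration, not the real `utils::sync::gate` code), mirroring the behavior described above: the warning fires when a guard is dropped after shutdown already started, i.e. only when the guard is eventually released; a guard that is never dropped produces no message at all.

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct GateInner {
    open_guards: usize,
    closing: bool,
}

#[derive(Clone, Default)]
struct Gate {
    inner: Arc<Mutex<GateInner>>,
}

struct GateGuard {
    inner: Arc<Mutex<GateInner>>,
}

impl Gate {
    fn enter(&self) -> GateGuard {
        self.inner.lock().unwrap().open_guards += 1;
        GateGuard { inner: Arc::clone(&self.inner) }
    }

    // The real close() waits until all guards are gone; here we only
    // record that shutdown has started.
    fn close(&self) {
        self.inner.lock().unwrap().closing = true;
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        let mut inner = self.inner.lock().unwrap();
        inner.open_guards -= 1;
        if inner.closing {
            // The situation the log line describes: the guard did release
            // the gate in the end, just later than shutdown wanted.
            eprintln!("kept the gate from closing");
        }
    }
}

fn main() {
    let gate = Gate::default();
    let guard = gate.enter();
    gate.close();  // shutdown starts while a request is still in flight
    drop(guard);   // prints the warning; a guard that leaks forever never would
}
```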

jcsp commented 2 months ago

Let's invest some time next week in looking for issues that might have been introduced in https://github.com/neondatabase/neon/pull/7233 -- I think a couple of hours of quality time staring at the code might yield results.

This week I spent some time writing a test (https://github.com/neondatabase/neon/tree/jcsp/shutdown-under-load-test) to try to reproduce shutdown hangs, without any success. This isn't completely surprising: in the field we see roughly one hang per ~10,000 migrations, so it's clearly something pretty niche.

problame commented 2 months ago

After a lengthy investigation, I have no definitive smoking gun, but the hottest lead based on log-message correlation is that HandlerTimeline holds the Timeline::gate open.

More details here (internal doc): https://www.notion.so/neondatabase/2024-07-01-stuck-detach-investigation-f0d4ae0247b347ab9bff95355fce1b25?pvs=4#c93179d978bc41d1a47f4b3821b12b81

problame commented 2 months ago

Also, the check against the first tenant shard (as in ShardSelector::First) in handle_pagerequests is insufficient if a pageserver hosts multiple shards of a tenant. For example, if we're hosting shard 1 and shard 2, ::First resolved to shard 1, and we want to detach shard 2, then we're watching the wrong cancellation token.
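
A hypothetical illustration of that mismatch (the `Shard` struct and the `watched_token_*` helpers are invented names, not the pageserver's actual types): if the handler captures only the token of the shard that `::First` resolved to, detaching a different shard cancels a token nobody is watching.

```rust
use tokio_util::sync::CancellationToken;

struct Shard {
    number: u8,
    cancel: CancellationToken,
}

// Buggy shape: the token is chosen once, from the first-resolved shard.
fn watched_token_buggy(shards: &[Shard]) -> CancellationToken {
    shards[0].cancel.clone()
}

// Intended shape: use the token of the shard actually serving this request.
fn watched_token_correct(shards: &[Shard], serving: u8) -> CancellationToken {
    shards
        .iter()
        .find(|s| s.number == serving)
        .expect("request routed to a shard we host")
        .cancel
        .clone()
}

fn main() {
    let shards = vec![
        Shard { number: 1, cancel: CancellationToken::new() },
        Shard { number: 2, cancel: CancellationToken::new() },
    ];

    // Detach shard 2: its token fires, but the buggy handler watches shard 1's.
    shards[1].cancel.cancel();

    assert!(!watched_token_buggy(&shards).is_cancelled());      // misses the shutdown
    assert!(watched_token_correct(&shards, 2).is_cancelled());  // sees it
}
```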

problame commented 1 month ago

This week: finish the preliminary refactorings and get the big PR ready for review, so we can soak it in staging next week.

problame commented 1 month ago

This week:

Next week:

Week after:

problame commented 3 weeks ago

Status update:

This week:

problame commented 2 weeks ago

Status update:

problame commented 2 weeks ago

Calling victory on this. There were some minor issues exposed by enabling chaosInterval, but none related to slow detaches.