neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

pageserver: tenant hang with "Tenant is being modified concurrently" when creation is retried during restart #6423

Closed jcsp closed 9 months ago

jcsp commented 9 months ago

In quick succession:

So: some piece of code that holds a TenantSlotGuard is getting stuck.

This is likely related to one or both of:

Backref: https://neondb.slack.com/archives/C03F5SM1N02/p1705915089864759?thread_ts=1705847167.713309&cid=C03F5SM1N02

jcsp commented 9 months ago

I think this bug is being exposed now because the control plane used to call /attach in this case, and would have got an error (attach is not idempotent) because of the already-attached tenant. Now the location_conf API is correctly trying to shut down the original Tenant and create a new one, so we're hitting some bug in the shutdown path.

jcsp commented 9 months ago

Diagnosed the hang: Tenant::shutdown calls set_stopping with allow_transition_from_attaching=false, but the tenant is left in the Attaching state by Tenant::spawn when it sees the cancellation token while waiting for concurrent_tenant_warmup.
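To make the diagnosis concrete, here is a toy model of the interaction, not the pageserver code. ToyTenant, State, and the timeout parameter are hypothetical, and the sketch assumes that set_stopping with allow_transition_from_attaching=false waits for the state to leave Attaching before it proceeds. The real code is async; this version uses a std Mutex and Condvar, with a timeout standing in for the hang.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

// Hypothetical, simplified tenant states for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Attaching,
    Active,
    Stopping,
}

// Toy stand-in for a tenant: a state plus a way to wait for state changes.
struct ToyTenant {
    state: Mutex<State>,
    changed: Condvar,
}

impl ToyTenant {
    fn new() -> Self {
        ToyTenant {
            state: Mutex::new(State::Attaching),
            changed: Condvar::new(),
        }
    }

    fn state(&self) -> State {
        *self.state.lock().unwrap()
    }

    // Models the spawn path. If cancellation is observed while waiting for
    // warmup, the function returns early. With `leave_attaching_on_cancel`
    // set to false it returns without touching the state, which is the
    // behaviour described in the diagnosis.
    fn spawn(&self, cancelled: bool, leave_attaching_on_cancel: bool) {
        let mut s = self.state.lock().unwrap();
        if cancelled {
            if leave_attaching_on_cancel {
                *s = State::Stopping;
                self.changed.notify_all();
            }
            return;
        }
        *s = State::Active;
        self.changed.notify_all();
    }

    // Models set_stopping. When transitions from Attaching are not allowed,
    // wait (up to `timeout`) for the state to leave Attaching first.
    // Returns false if the wait timed out, i.e. the caller would have hung.
    fn set_stopping(&self, allow_transition_from_attaching: bool, timeout: Duration) -> bool {
        let mut guard = self.state.lock().unwrap();
        if !allow_transition_from_attaching {
            let (g, res) = self
                .changed
                .wait_timeout_while(guard, timeout, |s| *s == State::Attaching)
                .unwrap();
            guard = g;
            if res.timed_out() {
                return false;
            }
        }
        *guard = State::Stopping;
        self.changed.notify_all();
        true
    }
}
```

In this model, calling spawn(true, false) followed by set_stopping(false, ..) times out, because nothing ever moves the state out of Attaching. Either allowing the transition from Attaching or having the cancelled spawn path update the state lets shutdown complete. Which of these the actual fix should use is not settled by this sketch.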