neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

storcon: re-attach can race with heartbeats and result in tenants not getting re-attached #8044

Closed by VladLazar 2 months ago

VladLazar commented 2 months ago

This was observed in staging, where the infra team stopped the pageservers for around 20 minutes. When a pageserver restarted, its re-attach response was processed before the heartbeats marked the node active. The heartbeats did detect the node coming back online (they store node state separately), but this ordering inhibited the heartbeat handler from re-attaching the tenants (`Service::node_activate_reconcile`).
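To make the ordering concrete, here is a toy model of one possible reading of the race. It is a sketch, not the storage controller's code: `Controller`, `handle_re_attach`, `handle_heartbeat`, `run`, and the tenant names are all hypothetical. It assumes that processing a re-attach marks the node active in the service's view, and that the heartbeat handler only reconciles when it sees the service's view change from offline to active.

```rust
// Toy model only: every name here is hypothetical and does not mirror
// the storage controller's real types or functions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Availability {
    Offline,
    Active,
}

struct Controller {
    // Node availability as seen by the service (assumed to be set by re-attach).
    service_view: Availability,
    // Node availability as stored separately by the heartbeat logic.
    heartbeat_view: Availability,
    // Tenants currently attached to the node.
    attached: Vec<&'static str>,
    // Tenants that should return to the node once it is active again.
    pending: Vec<&'static str>,
}

impl Controller {
    // Assumption: handling re-attach marks the node active in the service's
    // view but does not itself move the pending tenants back.
    fn handle_re_attach(&mut self) {
        self.service_view = Availability::Active;
    }

    // Assumption: the heartbeat handler reconciles only when the service's
    // view goes from Offline to Active. Returns true if it reconciled.
    fn handle_heartbeat(&mut self) -> bool {
        let came_online = self.heartbeat_view == Availability::Offline;
        self.heartbeat_view = Availability::Active;
        if came_online && self.service_view == Availability::Offline {
            self.service_view = Availability::Active;
            // Stand-in for the reconcile step that re-attaches tenants.
            self.attached.append(&mut self.pending);
            true
        } else {
            false
        }
    }
}

// Runs both handlers in the given order and returns the attached tenants.
fn run(re_attach_first: bool) -> Vec<&'static str> {
    let mut c = Controller {
        service_view: Availability::Offline,
        heartbeat_view: Availability::Offline,
        attached: Vec::new(),
        pending: vec!["tenant-a", "tenant-b"],
    };
    if re_attach_first {
        c.handle_re_attach();
        c.handle_heartbeat();
    } else {
        c.handle_heartbeat();
        c.handle_re_attach();
    }
    c.attached
}
```

In this model, `run(false)` (heartbeat first) ends with both tenants attached, while `run(true)` (re-attach first) leaves the node with none. That matches the symptom described above, under the stated assumptions.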

https://neondb.slack.com/archives/C060CNA47S9/p1718270673517979

VladLazar commented 2 months ago

I fixed this manually in staging by restarting the storage controller.