neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Safekeepers should gossip `remote_consistent_lsn` via broker #8148

Open petuhovskiy opened 4 days ago

petuhovskiy commented 4 days ago

I saw this happening in tests:

  1. pageserver updates remote_consistent_lsn to match last_record_lsn
  2. connected safekeeper receives new LSN
  3. safekeeper updates broker_is_active to false
  4. it stops pushing updates to broker
  5. other safekeepers have no chance to learn new remote_consistent_lsn from broker
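
The sequence above can be reproduced with a toy model. This is only a sketch: the `Broker` and `Safekeeper` classes are hypothetical stand-ins, and the activity condition (`remote_consistent_lsn < commit_lsn`) is a simplifying assumption, not the real safekeeper logic.

```python
class Broker:
    """Toy stand-in for the broker: keeps the highest LSN pushed to it."""

    def __init__(self):
        self.remote_consistent_lsn = 0


class Safekeeper:
    """Hypothetical, heavily simplified safekeeper."""

    def __init__(self, name, commit_lsn):
        self.name = name
        self.commit_lsn = commit_lsn
        self.remote_consistent_lsn = 0

    def broker_is_active(self):
        # Assumed condition: active while the pageserver still lags behind.
        return self.remote_consistent_lsn < self.commit_lsn

    def push(self, broker):
        # An inactive safekeeper no longer pushes updates (step 4).
        if self.broker_is_active():
            broker.remote_consistent_lsn = max(
                broker.remote_consistent_lsn, self.remote_consistent_lsn
            )

    def pull(self, broker):
        self.remote_consistent_lsn = max(
            self.remote_consistent_lsn, broker.remote_consistent_lsn
        )


def simulate():
    broker = Broker()
    connected = Safekeeper("sk1", commit_lsn=100)
    peers = [Safekeeper("sk2", commit_lsn=100), Safekeeper("sk3", commit_lsn=100)]

    # Steps 1-2: the pageserver reports the new LSN to the connected safekeeper.
    connected.remote_consistent_lsn = 100
    # Steps 3-4: it is now inactive, so this push is a no-op.
    connected.push(broker)
    # Step 5: the peers pull from the broker but learn nothing new.
    for peer in peers:
        peer.pull(broker)
    return broker, connected, peers
```

In this model the peers keep `remote_consistent_lsn == 0` and therefore stay active, which matches the stale state described in the steps.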

The fix is to delay timeline deactivation for some time (30s), so that safekeepers have a chance to broadcast the remote_consistent_lsn update to peers. It's not a solution for 100% of cases, but it should work well enough.
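
A minimal sketch of the delayed deactivation, assuming a simplified activity condition. `TimelineActivity`, `DEACTIVATION_DELAY_SECS`, and the `caught_up` check are hypothetical names for illustration, not the actual safekeeper code.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed grace period; the proposal above suggests about 30 seconds.
DEACTIVATION_DELAY_SECS = 30.0


@dataclass
class TimelineActivity:
    """Toy model of one safekeeper's view of a timeline (hypothetical)."""

    commit_lsn: int = 0
    remote_consistent_lsn: int = 0
    # Time at which the timeline first looked idle, or None while it has work.
    idle_since: Optional[float] = None

    def caught_up(self) -> bool:
        # Assumption: idle once the pageserver has uploaded everything committed.
        return self.remote_consistent_lsn >= self.commit_lsn

    def broker_is_active(self, now: float) -> bool:
        """Decide whether to keep pushing updates to the broker at time `now`."""
        if not self.caught_up():
            # New work arrived: reset the idle timer and stay active.
            self.idle_since = None
            return True
        if self.idle_since is None:
            self.idle_since = now
        # Stay active through the grace period so peers can see the final LSN.
        return now - self.idle_since < DEACTIVATION_DELAY_SECS
```

With this shape, a safekeeper that has just received the final remote_consistent_lsn keeps publishing for the grace period, then goes inactive. As noted above, peers that miss the update within that window would still be left stale.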

jcsp commented 3 days ago

(notes from chatting with Arthur)

Impact: interferes with writing clean tests. Currently, if a safekeeper has a stale remote_consistent_lsn for long enough, it will remain active and the pageserver will eventually connect to it. Once the pageserver connects, the safekeeper will eventually learn the current remote_consistent_lsn.

More generally: should we reconsider using remote_consistent_lsn in the safekeeper in our condition for broker_is_active?