neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Occasional failure to start up after pg_switch_wal() #9079

Open hlinnaka opened 2 hours ago

hlinnaka commented 2 hours ago

> is this a known issue?

I haven't seen exactly this sequence with an xlog switch, but it is kind of expected behaviour.

I was able to reproduce this locally with:

    # Repeatedly: create a branch, start a compute, switch WAL, kill the
    # compute, and immediately restart a compute on the same branch.
    for i in range(1, 2000):
        branch_name = f"test_twophase{i}"
        env.neon_cli.create_branch(branch_name)
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()

It failed after ~100 iterations.

The CI failure is missing the first endpoint's logs (it is better to use endpoint.stop(mode="immediate") instead of endpoint.stop_and_destroy() to preserve them), but I think the following happens:

1. SELECT pg_switch_wal() is executed on the first compute;

2. the first compute is SIGKILL'ed;

3. the second compute starts, observes that the safekeeper position is 0/14EE280, decides there is nothing to sync, and fetches a basebackup at that position;

4. the safekeeper then receives the WAL switch record from the leftover TCP stream and bumps its flush position to 0/2000000;

5. the basebackup is now spoiled because it was taken at the wrong LSN.
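The race in steps 3–5 can be sketched as a toy timeline; the variable names and the final assertion are illustrative, not the real safekeeper API:

```python
# Toy model of the race: the safekeeper's flush LSN advances *after* the
# second compute has already chosen its basebackup LSN.
flush_lsn = 0x14EE280        # position the second compute observes (0/14EE280)

# Step 3: the second compute sees nothing to sync and picks this LSN.
basebackup_lsn = flush_lsn

# Step 4: the killed compute's leftover TCP connection delivers the
# WAL switch record, and the safekeeper bumps its flush position.
flush_lsn = 0x2000000        # 0/2000000

# Step 5: the basebackup LSN is now stale relative to the safekeeper.
assert basebackup_lsn < flush_lsn
```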

So I'd suggest adding a sleep/retry until the endpoint starts successfully. Alternatively, the safekeeper exposes the list of walreceivers, and we can poll the safekeeper until its walreceiver is gone.
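A minimal sketch of the retry approach, reusing the `env.endpoints.create_start` API from the reproducer above; the helper name, attempt count, and broad exception handling are illustrative choices, not part of the test framework:

```python
import time


def create_start_with_retry(env, branch_name, attempts=5, delay=1.0):
    """Start an endpoint, retrying on failure.

    A stale WAL switch record delivered to the safekeeper after the
    basebackup LSN was chosen can make the first start attempt fail.
    """
    for attempt in range(attempts):
        try:
            endpoint = env.endpoints.create_start(branch_name)
            # Sanity check that the endpoint actually accepts queries.
            endpoint.safe_psql("SELECT 1")
            return endpoint
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```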

Hmm, isn't this a potential problem in production too?

Originally posted by @hlinnaka in https://github.com/neondatabase/neon/issues/8914#issuecomment-2363793939

arssher commented 2 hours ago

It is just an instance of generic "we can't have two writing computes at the same time", one of them would panic. This particular case can be optimized/eliminated by forcing compute_ctl to bump term during sync-safekeepers check, but I don't see much value in it.

hlinnaka commented 2 hours ago

> It is just an instance of generic "we can't have two writing computes at the same time", one of them would panic. This particular case can be optimized/eliminated by forcing compute_ctl to bump term during sync-safekeepers check, but I don't see much value in it.

Hmm, there are no two computes running at the same time here. Or do you think there's a delay between sending SIGKILL to the old compute and its processes actually exiting, such that the old compute is still running when the new one starts?

arssher commented 2 hours ago

It is not running, but there is a leftover TCP connection from it, which delivers the xlog switch record after the new compute has checked the need for sync-safekeepers (and decided on the basebackup LSN).

hlinnaka commented 2 hours ago

Hmm, so the process has been killed, but the WAL is already in the safekeeper's TCP receive window; the safekeeper just hasn't processed it yet. OK, makes sense. To test that hypothesis, a small delay in the test after killing postgres should make the problem disappear.
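A minimal sketch of that hypothesis test, wrapping the `stop_and_destroy()` call from the reproducer; the helper name and the one-second default are illustrative, chosen only to give the safekeeper time to drain the dead connection's buffered WAL:

```python
import time


def stop_and_wait(endpoint, drain_seconds=1.0):
    """Stop the compute, then wait so the safekeeper can consume any WAL
    (e.g. the xlog switch record) still buffered in its TCP receive
    window before the next compute chooses a basebackup LSN."""
    endpoint.stop_and_destroy()
    time.sleep(drain_seconds)
```

If the hypothesis is right, substituting this for the bare `endpoint.stop_and_destroy()` calls in the loop should make the failure disappear.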