Open hlinnaka opened 2 hours ago
It is just an instance of the generic "we can't have two writing computes at the same time" rule; one of them would panic. This particular case could be optimized away by forcing compute_ctl to bump the term during the sync-safekeepers check, but I don't see much value in it.
Hmm, there are no two computes running at the same time. Or do you think there's a delay between sending SIGKILL to the old compute and the processes actually exiting, such that the old compute is still running when the new one starts?
It is not running, but there is a leftover TCP connection from it, which delivers this xlog switch after the new compute has checked whether sync-safekeepers is needed (and decided on the basebackup LSN).
Hmm, so the process has been killed, but the WAL is already in the safekeeper's TCP receive window; the safekeeper just hasn't processed it yet. Ok, makes sense. To test that hypothesis, a small delay in the test after killing postgres should make the problem disappear.
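For illustration, the race described above can be sketched as a toy model in Python. This is not Neon code: `ToySafekeeper`, `simulate`, the `(term, end_lsn)` wire format, and the LSN values are all hypothetical, and a local socket pair stands in for the TCP connection. It only shows that bytes sent by a process that has since exited can still be delivered, and how draining first (the "small delay") or a term bump would change the outcome.

```python
import socket
import struct

# Hypothetical wire format for this sketch: (term, end_lsn) as two u64s.
RECORD = struct.Struct("!QQ")


class ToySafekeeper:
    """Toy stand-in for a safekeeper; not the real protocol."""

    def __init__(self, conn, term, flush_lsn):
        self.conn = conn
        self.term = term
        self.flush_lsn = flush_lsn

    def process_pending(self):
        """Read until EOF and apply every complete record received."""
        buf = b""
        while True:
            chunk = self.conn.recv(4096)
            if not chunk:
                break
            buf += chunk
        for offset in range(0, len(buf) - RECORD.size + 1, RECORD.size):
            term, end_lsn = RECORD.unpack_from(buf, offset)
            if term < self.term:
                continue  # record from a stale term is rejected
            self.flush_lsn = max(self.flush_lsn, end_lsn)


def simulate(bump_term=False, drain_first=False):
    """Return (basebackup_lsn, final_flush_lsn) for one restart."""
    old_compute, sk_side = socket.socketpair()
    sk = ToySafekeeper(sk_side, term=1, flush_lsn=100)
    try:
        # Old compute sends one last record, then is "killed".
        old_compute.sendall(RECORD.pack(1, 200))
        old_compute.close()
        if drain_first:
            sk.process_pending()  # models the delay after the kill
        if bump_term:
            sk.term += 1  # models bumping the term during the check
        basebackup_lsn = sk.flush_lsn  # new compute decides here
        sk.process_pending()  # leftover bytes arrive only now
        return basebackup_lsn, sk.flush_lsn
    finally:
        old_compute.close()
        sk_side.close()
```

In this model, `simulate()` returns a basebackup LSN that is behind the final flush LSN, which is the inconsistent state under discussion. `simulate(drain_first=True)` and `simulate(bump_term=True)` both return matching values, the first because the record is applied before the check and the second because the record is rejected.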
I was able to reproduce this locally with:
It failed after about 100 iterations.
Hmm, isn't this a potential problem in production too?
Originally posted by @hlinnaka in https://github.com/neondatabase/neon/issues/8914#issuecomment-2363793939