While investigating Pageserver logs from the cases where systemd hangs during shutdown (https://github.com/neondatabase/cloud/issues/11387), I noticed that even if Pageserver shuts down cleanly[^1], there are lingering walredo processes.
[^1]: Meaning, Pageserver finishes its shutdown procedure and calls exit(0) on its own terms, instead of hitting the systemd unit's TimeoutSec= limit and getting SIGKILLed.
While systemd should never lock up like it does, maybe we can avoid hitting that bug by cleaning up properly.
Changes
This PR adds a shutdown method to WalRedoManager and hooks it up to tenant shutdown.
We keep track of the intent to shut down through the new enum ProcessOnceCell stored inside the pre-existing redo_process field.
A gate is added to keep track of running processes, using the new type struct Process.
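For readers who want the shape of the change without opening the diff, here is a rough, std-only sketch of the idea. It is not the code in this PR: the variant names (`Empty`, `Spawned`, `ManagerShutDown`), the `Gate`/`GateGuard` counter, `get_or_launch`, and `RedoError` are all hypothetical stand-ins, and the sketch is synchronous and does not spawn a real child process.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical minimal gate: counts how many guards are still alive.
#[derive(Default)]
struct Gate {
    live: Arc<Mutex<usize>>,
}

/// Held by each running process; dropping it tells the gate the process is gone.
struct GateGuard {
    live: Arc<Mutex<usize>>,
}

impl Gate {
    fn enter(&self) -> GateGuard {
        *self.live.lock().unwrap() += 1;
        GateGuard { live: Arc::clone(&self.live) }
    }

    fn live_count(&self) -> usize {
        *self.live.lock().unwrap()
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        *self.live.lock().unwrap() -= 1;
    }
}

/// Stand-in for a walredo process. A real one would own a child process handle.
struct Process {
    _guard: GateGuard,
}

/// State of the `redo_process` field. Variant names are assumptions.
enum ProcessOnceCell {
    Empty,
    Spawned(Arc<Process>),
    ManagerShutDown,
}

#[derive(Debug, PartialEq)]
enum RedoError {
    Cancelled,
}

struct WalRedoManager {
    redo_process: Mutex<ProcessOnceCell>,
    gate: Gate,
}

impl WalRedoManager {
    fn new() -> Self {
        Self {
            redo_process: Mutex::new(ProcessOnceCell::Empty),
            gate: Gate::default(),
        }
    }

    /// Returns the running process, launching one lazily.
    /// Fails once shutdown has been requested.
    fn get_or_launch(&self) -> Result<Arc<Process>, RedoError> {
        let mut cell = self.redo_process.lock().unwrap();
        match &*cell {
            ProcessOnceCell::Spawned(p) => Ok(Arc::clone(p)),
            ProcessOnceCell::ManagerShutDown => Err(RedoError::Cancelled),
            ProcessOnceCell::Empty => {
                let p = Arc::new(Process { _guard: self.gate.enter() });
                *cell = ProcessOnceCell::Spawned(Arc::clone(&p));
                Ok(p)
            }
        }
    }

    /// Records the intent to shut down and drops the manager's reference to
    /// the process. Returns whether a process was running.
    fn shutdown(&self) -> bool {
        let prev = std::mem::replace(
            &mut *self.redo_process.lock().unwrap(),
            ProcessOnceCell::ManagerShutDown,
        );
        matches!(prev, ProcessOnceCell::Spawned(_))
    }
}

fn main() {
    let mgr = WalRedoManager::new();
    let p = mgr.get_or_launch().unwrap();
    assert_eq!(mgr.gate.live_count(), 1);
    drop(p); // the in-flight request finishes
    assert!(mgr.shutdown());
    assert_eq!(mgr.gate.live_count(), 0); // no lingering process
    assert_eq!(mgr.get_or_launch().err(), Some(RedoError::Cancelled));
}
```

In this sketch `shutdown` does not wait: if a caller still holds an `Arc<Process>`, the gate count stays above zero until that caller drops it. An actual implementation would presumably wait for the gate to drain before returning.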
Future Work
Requests that don't need the redo process will not observe the shutdown (see doc comment).
Doing so would be nice for completeness' sake, but it wouldn't provide much benefit because Tenant and Timeline already shut down all walredo users.
Testing
I did manual testing to confirm that the problem exists before this PR and that it's gone after.
Setup:
neon_local with a single tenant, create some data using pgbench
strace -e kill,wait4 -f -p "$(pgrep pageserver)"
neon_local pageserver stop
With this PR, we always observe
Before this PR, we'd usually observe just
Refs
refs https://github.com/neondatabase/cloud/issues/11387