neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
13.18k stars 367 forks source link

fix: shutdown does not kill walredo processes #8150

Closed problame closed 1 day ago

problame commented 4 days ago

While investigating Pageserver logs from the cases where systemd hangs during shutdown (https://github.com/neondatabase/cloud/issues/11387), I noticed that even if Pageserver shuts down cleanly[^1], there are lingering walredo processes.

[^1]: Meaning, pageserver finishes its shutdown procedure and calls exit(0) on its own terms, instead of hitting the systemd unit's TimeoutSec= limit and getting SIGKILLed.

While systemd should never lock up like it does, maybe we can avoid hitting that bug by cleaning up properly.

Changes

This PR adds a shutdown method to WalRedoManager and hooks it up to tenant shutdown.

We keep track of intent to shutdown through the new enum ProcessOnceCell stored inside the pre-existing redo_process field. A gate is added to keep track of running processes, using the new type struct Process.

Future Work

Requests that don't need the redo process will not observe the shutdown (see doc comment). Doing so would be nice for completeness sake, but doesn't provide much benefit because Tenant and Timeline already shut down all walredo users.

Testing

I did manual testing to confirm that the problem exists before this PR and that it's gone after. Setup:

With this PR, we always observe

$ strace -e kill,wait4 -f -p "$(pgrep pageserver)"
...
[pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} ---
[pid 591134] kill(591174, SIGKILL)      = 0
[pid 591134] wait4(591174,  <unfinished ...>
[pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} ---
[pid 591134] <... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174
...
+++ exited with 0 +++

Before this PR, we'd usually observe just

...
[pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} ---
...
+++ exited with 0 +++

Refs

refs https://github.com/neondatabase/cloud/issues/11387

github-actions[bot] commented 4 days ago

2922 tests run: 2805 passed, 0 failed, 117 skipped (full report)


Flaky tests (1) #### Postgres 15 - `test_subscriber_restart`: [release](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8150/9660168473/index.html#suites/8be0c222d5601535470e7e5978bbfb03/f0a73108d9170bfb/retries)

Code coverage* (full report)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
e402c46de0a007db6b48dddbde450ddbb92e6ceb at 2024-06-25T10:33:54.354Z :recycle: