aome510 closed this issue 1 year ago
The Slack discussion, inlined:
Hmm, I think what happens here is that when you killall -9 postgres, it kills the postgres processes as you expected, but it also kills the WAL redo postgres process that's launched by the pageserver. The pageserver doesn't notice that it's dead until it tries to send a message to it. That's when you get the Broken pipe error. It re-launches it after that, but that one unlucky request gets the error. We could probably detect the broken pipe when we try to send a message, and re-launch the process without passing the error to the client.
It's pretty minor: you only get that error if the WAL redo process dies unexpectedly, and it does heal itself for the next request. But still it would be nice to detect the dead WAL redo process earlier and avoid propagating the error to the client.
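For illustration, a minimal sketch of what detect-and-relaunch on a broken pipe could look like. This is not the actual pageserver code: `WalRedoProcess`, `apply_record`, and the launch arguments are hypothetical stand-ins.

```rust
use std::io::{self, Write};
use std::process::{Child, ChildStdin, Command, Stdio};

// Hypothetical wrapper around the WAL redo child process.
struct WalRedoProcess {
    child: Child,
    stdin: ChildStdin,
}

impl WalRedoProcess {
    fn launch() -> io::Result<Self> {
        let mut child = Command::new("postgres")
            .arg("--wal-redo") // illustrative launch flag
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let stdin = child.stdin.take().expect("stdin was piped");
        Ok(WalRedoProcess { child, stdin })
    }

    fn apply_record(&mut self, record: &[u8]) -> io::Result<()> {
        match self.stdin.write_all(record) {
            // A child that died earlier surfaces as BrokenPipe on the next
            // write: relaunch and retry this one request instead of
            // bubbling the error up to the client.
            Err(e) if e.kind() == io::ErrorKind::BrokenPipe => {
                let _ = self.child.wait(); // reap the dead process
                *self = Self::launch()?;
                self.stdin.write_all(record)
            }
            other => other,
        }
    }
}
```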
@koivunej / @knizhnik maybe you can fix this before / along with your changes around wal-redo perf work
Oki, this was one of the questions I had: wondering if it should support transparent relaunching. Thanks for the ping.
With request pipelining this does become quite complicated. What we'd essentially want is to keep a backup of all inflight requests to be retried immediately upon restart, and do the restart right away.
I guess such a thing could happen with the OOM killer but ... the ensuing thundering herd of re-launching multiple processes automatically and retrying could turn out to be quite dangerous.
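For what it's worth, the usual way to defang that thundering herd is to cap relaunch attempts and back off between them. A hedged sketch; the constant, the `launch` closure, and the backoff schedule are assumptions, not anything from the codebase:

```rust
use std::time::Duration;

// Illustrative cap; a real value would need tuning.
const MAX_RELAUNCH_ATTEMPTS: u32 = 3;

fn relaunch_with_backoff<T>(
    mut launch: impl FnMut() -> std::io::Result<T>,
) -> std::io::Result<T> {
    let mut attempt = 0;
    loop {
        match launch() {
            Ok(p) => return Ok(p),
            Err(e) if attempt < MAX_RELAUNCH_ATTEMPTS => {
                eprintln!("wal-redo launch attempt {} failed: {e}", attempt + 1);
                // Exponential backoff (100ms, 200ms, 400ms) spreads out
                // restarts so many dead processes don't relaunch at once.
                std::thread::sleep(Duration::from_millis(100 << attempt));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```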
@problame @koivunej @hlinnaka I followed the steps to reproduce:
killall resulted in:
./target/debug/neon_local pg list
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 78ee64b6b85a5ff56d11f94ed02909f6 main 0/16ED6F0 crashed
and then:
./target/debug/neon_local pg start main
Starting existing postgres main...
Safekeepers synced on 0/16ED6F0
Extracting base backup to create postgres instance: path=.neon/pgdatadirs/tenants/9e41b3a4fa757867b90396f50409b359/main port=55432
Starting postgres node at 'host=127.0.0.1 port=55432 user=cloud_admin dbname=postgres'
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
psql (15.2, server 14.6)
Type "help" for help.
postgres=# select * from t;
key | value
-----+-------
1 | 1
(1 row)
postgres=#
so this doesn't repro for me.
I would like to close it unless there is anything else needed.
This hasn't been fixed. Here's a script to reproduce it:
(Note: it does this:
# Clean up after previous iterations
killall -9 postgres pageserver safekeeper storage_broker
rm -rf .neon
so don't run it if you have any valuable data in .neon)
Yes, with this script it fails for me.
I see three possible workarounds:
... use std::process::Command rather than tokio::process::Command. I failed to find a way to asynchronously wait for its status.
3) seems to be the simplest one - I have implemented it in #3739. We may also want to limit the number of retry attempts, to avoid infinitely restarting the WAL redo process in case of some error or an incorrect WAL record (which may consume a lot of pageserver resources).
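For illustration, a sketch of what option 3) with a capped retry count might look like. Note that std's `Child::try_wait()` only polls the exit status without blocking (it is not an async wait, which is what made `tokio::process` attractive); `spawn_redo`, `apply`, and `MAX_RETRIES` are hypothetical, not the actual #3739 implementation:

```rust
use std::io;
use std::process::Child;

// Illustrative cap on relaunch attempts.
const MAX_RETRIES: u32 = 3;

fn apply_with_retries(
    child: &mut Child,
    spawn_redo: impl Fn() -> io::Result<Child>,
    apply: impl Fn(&mut Child) -> io::Result<()>,
) -> io::Result<()> {
    for attempt in 0..=MAX_RETRIES {
        // try_wait() returns Ok(Some(_)) once the child has exited and
        // Ok(None) while it is still running; it never blocks.
        if child.try_wait()?.is_some() {
            eprintln!("wal-redo process died, relaunching (attempt {attempt})");
            *child = spawn_redo()?;
        }
        match apply(child) {
            Ok(()) => return Ok(()),
            // The death may only become visible as a failed write; loop
            // around so the try_wait() check above relaunches the child.
            Err(e) if e.kind() == io::ErrorKind::BrokenPipe => continue,
            Err(e) => return Err(e),
        }
    }
    Err(io::Error::new(
        io::ErrorKind::Other,
        "wal-redo process kept failing; giving up after retries",
    ))
}
```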
@knizhnik can you close this one?
Fixed by PR #3739
If the WAL redo process dies unexpectedly, reading the data will return a "Broken pipe" error, because the pageserver doesn't notice the WAL redo process is dead. To reproduce this error, follow the steps below:
It would be better to detect the "broken pipe" error internally when it occurs and then re-launch the process without passing the error to the client.
Additional context: https://neondb.slack.com/archives/C036U0GRMRB/p1652466529681229