neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Broken pipe reading error if the WAL redo process dies unexpectedly #1700

Closed: aome510 closed this issue 1 year ago

aome510 commented 2 years ago

If the WAL redo process dies unexpectedly, reading data returns a "Broken pipe" error because the pageserver doesn't notice that the WAL redo process is dead.

To reproduce this error, follow the steps below:

> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres

postgres=# select * from t;
 key | value
-----+-------
(0 rows)

postgres=# insert into t values(1,1);
INSERT 0 1

# kill all PG clients and the associated WAL redo processes
> killall -9 postgres

> ./target/debug/neon_local pg start main

> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres

postgres=# select * from t;
ERROR:  could not read block 0 in rel 1663/13236/16384.0 from page server at lsn 0/0171CA40
DETAIL:  page server returned error: Broken pipe (os error 32)
postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

It would be better to detect the "broken pipe" error internally, re-launch the process, and retry the request without passing the error to the client.

Additional context: https://neondb.slack.com/archives/C036U0GRMRB/p1652466529681229

hlinnaka commented 2 years ago

The slack discussion inlined:

Hmm, I think what happens here is that when you killall -9 postgres, it kills the postgres process as you expected, but it also kills the WAL redo postgres process that's launched by the pageserver. The pageserver doesn't notice that it's dead until it tries to send a message to it. That's when you get the Broken pipe error. It re-launches the process after that, but that one unlucky request gets the error. We could probably detect the broken pipe when we try to send a message, and re-launch the process without passing the error to the client.

It's pretty minor: you only get that error if the WAL redo process dies unexpectedly, and it does heal itself for the next request. But still it would be nice to detect the dead WAL redo process earlier and avoid propagating the error to the client.
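
A minimal sketch of that idea in Rust (the WalRedoProcess wrapper, the "postgres --wal-redo" invocation, and the fixed 8192-byte reply are illustrative assumptions, not the actual pageserver code): treat a BrokenPipe error on the write as "the child has died", re-launch it, and retry the same request once instead of returning the error to the client.

use std::io::{self, ErrorKind, Read, Write};
use std::process::{Child, Command, Stdio};

// Illustrative stand-in for the pageserver's WAL-redo child process wrapper.
struct WalRedoProcess {
    child: Child,
}

impl WalRedoProcess {
    fn launch() -> io::Result<Self> {
        // Hypothetical invocation; the real pageserver sets up more state here.
        let child = Command::new("postgres")
            .arg("--wal-redo")
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        Ok(Self { child })
    }

    // Send one request and read a fixed-size reply. If the child has died,
    // the write fails with ErrorKind::BrokenPipe (os error 32).
    fn apply(&mut self, request: &[u8]) -> io::Result<Vec<u8>> {
        let stdin = self.child.stdin.as_mut().expect("stdin is piped");
        stdin.write_all(request)?;
        stdin.flush()?;
        let mut page = vec![0u8; 8192]; // assumed page size, for illustration
        self.child
            .stdout
            .as_mut()
            .expect("stdout is piped")
            .read_exact(&mut page)?;
        Ok(page)
    }
}

// Apply a request, transparently re-launching the child once if it has died,
// instead of propagating "Broken pipe" to the client.
fn apply_with_relaunch(proc_slot: &mut WalRedoProcess, request: &[u8]) -> io::Result<Vec<u8>> {
    match proc_slot.apply(request) {
        Err(e) if e.kind() == ErrorKind::BrokenPipe => {
            *proc_slot = WalRedoProcess::launch()?;
            proc_slot.apply(request)
        }
        other => other,
    }
}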

problame commented 1 year ago

@koivunej / @knizhnik maybe you can fix this before / along with your changes around wal-redo perf work

koivunej commented 1 year ago

Oki, this was one of the questions I had: whether it should support transparent relaunching. Thanks for the ping.

koivunej commented 1 year ago

With request pipelining this becomes quite complicated. What we'd essentially want is to keep a backup of all in-flight requests to be retried immediately upon restart, and to do the restart right away.

I guess such a thing could happen with the OOM killer, but the ensuing thundering herd of automatically re-launching multiple processes and retrying could turn out to be quite dangerous.
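
A rough sketch of the bookkeeping that transparent restart with pipelining would need, under the assumption that replies come back in request order (types and names here are illustrative, not the pageserver's): every request stays in an in-flight queue until its reply arrives, and on child death the whole queue is replayed into a freshly launched process.

use std::collections::VecDeque;

// Illustrative bookkeeping for pipelined WAL-redo requests.
struct PipelinedRedo {
    // Requests written to the child whose replies have not arrived yet,
    // in the order they were sent.
    inflight: VecDeque<Vec<u8>>,
}

impl PipelinedRedo {
    fn send(&mut self, request: Vec<u8>) {
        // Keep a copy before writing it to the child's stdin, so it can be
        // replayed if the child dies mid-pipeline.
        self.inflight.push_back(request);
        // ... write self.inflight.back() to the child ...
    }

    fn on_reply(&mut self) {
        // Replies arrive in request order, so the oldest in-flight request
        // is the one that just completed.
        self.inflight.pop_front();
    }

    fn on_child_death(&mut self) {
        // Re-launch the child, then re-send every still-in-flight request in
        // order. This is where the "thundering herd" worry applies: an
        // OOM-killed process plus automatic relaunch-and-retry amplifies load
        // exactly when the host is already under pressure.
        for _request in &self.inflight {
            // ... write _request to the new child's stdin ...
        }
    }
}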

shanyp commented 1 year ago

@problame @koivunej @hlinnaka I followed the reproduction steps:

killall resulted in:

./target/debug/neon_local pg list
 NODE  ADDRESS          TIMELINE                          BRANCH NAME  LSN        STATUS
 main  127.0.0.1:55432  78ee64b6b85a5ff56d11f94ed02909f6  main         0/16ED6F0  crashed

and then:

./target/debug/neon_local pg start main
Starting existing postgres main...
Safekeepers synced on 0/16ED6F0
Extracting base backup to create postgres instance: path=.neon/pgdatadirs/tenants/9e41b3a4fa757867b90396f50409b359/main port=55432
Starting postgres node at 'host=127.0.0.1 port=55432 user=cloud_admin dbname=postgres'
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
psql (15.2, server 14.6)
Type "help" for help.

postgres=# select * from t;
 key | value
-----+-------
   1 | 1
(1 row)

postgres=#

so this doesn't repro for me.

I would like to close this unless anything else is needed.

hlinnaka commented 1 year ago

This hasn't been fixed. Here's a script to reproduce it:

repro-1700.sh.txt

(Note: it does this:

# Clean up after previous iterations
killall -9 postgres pageserver safekeeper storage_broker
rm -rf .neon

so don't run it if you have any valuable data in .neon)

shanyp commented 1 year ago

Yes, with this script it fails for me.

knizhnik commented 1 year ago

I see three possible workarounds:

  1. Somehow listen for the state of the walredo process and restart it in case of failure. Since we do not want to increase the number of physical threads in the pageserver, this can only be done with a Tokio task. But the walredo process is spawned with std::process::Command rather than tokio::process::Command, and I failed to find a way to asynchronously wait for its status.
  2. Check the status of the process before applying WAL records. I am afraid that this may have a negative effect on performance.
  3. Handle the error, restart the process, and retry the walredo request.

Option 3 seems to be the simplest one; I have implemented it in #3739. We may also want to limit the number of retry attempts to avoid infinitely restarting the WAL redo process in case of a persistent error or an incorrect WAL record (which may consume a lot of pageserver resources).

https://github.com/neondatabase/neon/pull/3739
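
A sketch of what such a retry budget on top of option 3 could look like, reusing the hypothetical WalRedoProcess wrapper from the earlier sketch (the limit value and whether #3739 actually adds one are assumptions, not taken from the PR):

use std::io::{self, ErrorKind};

// Hypothetical cap on automatic relaunch attempts per request.
const MAX_WALREDO_RETRIES: u32 = 2;

// On "Broken pipe", restart the process and retry, but give up after
// MAX_WALREDO_RETRIES relaunches so a WAL record that reliably crashes the
// redo process cannot keep the pageserver in a restart loop.
fn apply_with_bounded_retry(
    proc_slot: &mut WalRedoProcess,
    request: &[u8],
) -> io::Result<Vec<u8>> {
    let mut attempts = 0;
    loop {
        match proc_slot.apply(request) {
            Err(e) if e.kind() == ErrorKind::BrokenPipe && attempts < MAX_WALREDO_RETRIES => {
                attempts += 1;
                *proc_slot = WalRedoProcess::launch()?;
            }
            other => return other,
        }
    }
}

Once the budget is exhausted, the error is returned to the caller rather than retried forever.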

shanyp commented 1 year ago

@knizhnik can you close this one?

knizhnik commented 1 year ago

Fixed by PR #3739