Open KarolLipnicki opened 3 months ago
2024-03-13 11:32:31.993: PostgreSQL JDBC Driver pid 3937693: LOG: received degenerate backend request for node_id: 0 from pid [3937693]
Looks like a pgpool child process (pid 3937693) found something wrong and decided to raise a failover request. To determine the cause, I need the pgpool logs from before this happened. Can you share the logs?
Sure, logs attached: pgpool_logs.zip
The cause of the failover was this:
pgpool-server03.log:2024-03-13 11:32:31.989: PostgreSQL JDBC Driver pid 3937693: LOG: Error message from backend: DB node id: 0 message: "canceling statement due to conflict with recovery"
In this case the standby disconnected from pgpool, pgpool failed to read data from the standby, and pgpool triggered a failover. If you want to avoid failover in this case, you can set:
failover_on_backend_error = off
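Separately, the underlying message ("canceling statement due to conflict with recovery") is a standard PostgreSQL recovery conflict: WAL replay on the standby cancelled a running query. Besides telling pgpool not to fail over on backend errors, the conflicts themselves can often be reduced on the standby side. A sketch for the standby's postgresql.conf (values are illustrative, not taken from this cluster, and both settings have trade-offs such as table bloat on the primary or increased replay lag):

```
# postgresql.conf on the standby (illustrative values)
hot_standby_feedback = on          # report the standby's oldest xmin to the primary,
                                   # so vacuum does not remove rows queries still need
max_standby_streaming_delay = 60s  # let WAL replay wait up to 60s before
                                   # cancelling conflicting standby queries
```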
OK, I'll try setting this parameter and I'll let you know if it helps.
So far everything is OK :)
Great!
I have a cluster of Pgpool-II and PostgreSQL (you can find the configuration here: https://github.com/pgpool/pgpool2/issues/43).
Sometimes one of the nodes (the standby) goes down, and in the pgpool logs I see the following errors:
2024-03-13 11:32:31.993: PostgreSQL JDBC Driver pid 3937693: LOG: received degenerate backend request for node_id: 0 from pid [3937693]
2024-03-13 11:32:32.003: watchdog pid 3937631: LOG: watchdog received the failover command from local pgpool-II on IPC interface
2024-03-13 11:32:32.016: watchdog pid 3937631: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2024-03-13 11:32:32.017: watchdog pid 3937631: LOG: we have got the consensus to perform the failover
2024-03-13 11:32:32.017: watchdog pid 3937631: DETAIL: 1 node(s) voted in the favor
2024-03-13 11:32:32.017: PostgreSQL JDBC Driver pid 3937693: LOG: signal_user1_to_parent_with_reason(0)
2024-03-13 11:32:32.017: main pid 3937628: LOG: Pgpool-II parent process received SIGUSR1
2024-03-13 11:32:32.017: main pid 3937628: LOG: Pgpool-II parent process has received failover request
2024-03-13 11:32:32.018: watchdog pid 3937631: LOG: received the failover indication from Pgpool-II on IPC interface
2024-03-13 11:32:32.018: watchdog pid 3937631: LOG: watchdog is informed of failover start by the main process
2024-03-13 11:32:32.018: main pid 3937628: LOG: === Starting degeneration. shutdown host server03(5432) ===
2024-03-13 11:32:32.022: main pid 3937628: LOG: Do not restart children because we are switching over node id 0 host: server03 port: 5432 and we are in streaming replication mode
2024-03-13 11:32:32.049: main pid 3937628: LOG: child pid 3937649 needs to restart because pool 0 uses backend 0
2024-03-13 11:32:32.082: main pid 3937628: LOG: child pid 9943 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.128: main pid 3937628: LOG: child pid 3937693 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.231: main pid 3937628: LOG: child pid 3937732 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.293: main pid 3937628: LOG: child pid 3937762 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.307: main pid 3937628: LOG: child pid 3953147 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.386: main pid 3937628: LOG: child pid 3937809 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.425: main pid 3937628: LOG: child pid 9505 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.478: main pid 3937628: LOG: execute command: /etc/pgpool-II/failover.sh 0 server03 5432 /var/lib/pgsql/15/data 1 server04 0 1 5432 /var/lib/pgsql/15/data server04 5432
+ FAILED_NODE_ID=0
+ FAILED_NODE_HOST=server03
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/var/lib/pgsql/15/data
+ NEW_MAIN_NODE_ID=1
+ NEW_MAIN_NODE_HOST=server04
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=1
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/var/lib/pgsql/15/data
+ OLD_PRIMARY_NODE_HOST=server04
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/usr/pgsql-15
++ tr -- -. _
++ echo server03
+ REPL_SLOT_NAME=server03
+ POSTGRESQL_STARTUP_USER=postgres
+ SSH_KEY_FILE=id_rsa_pgpool
+ SSH_OPTIONS='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/id_rsa_pgpool'
+ echo failover.sh: start: failed_node_id=0 failed_host=server03 old_primary_node_id=1 new_main_node_id=1 new_main_host=server04
failover.sh: start: failed_node_id=0 failed_host=server03 old_primary_node_id=1 new_main_node_id=1 new_main_host=server04
+ '[' 1 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i '~/.ssh/id_rsa_pgpool' postgres@server04 ls /tmp
Warning: Permanently added 'server04' (ED25519) to the list of known hosts.
IPv4: 10.128.61.15
+ '[' 0 -ne 0 ']'
+ '[' 1 '!=' -1 -a 0 '!=' 1 ']'
+ /usr/pgsql-15/bin/psql -h server04 -p 5432 postgres -c 'SELECT pg_drop_replication_slot('\''server03'\'');'
+ '[' 1 -ne 0 ']'
+ echo ERROR: failover.sh: drop replication slot '"server03"' failed. You may need to drop replication slot manually.
ERROR: failover.sh: drop replication slot "server03" failed. You may need to drop replication slot manually.
+ echo failover.sh: end: standby node is down. Skipping failover.
failover.sh: end: standby node is down. Skipping failover.
+ exit 0
2024-03-13 11:32:32.975: main pid 3937628: LOG: failover: set new primary node: 1
2024-03-13 11:32:32.975: main pid 3937628: LOG: failover: set new main node: 1
2024-03-13 11:32:32.975: main pid 3937628: LOG: child pid 3937649 needs to restart because pool 0 uses backend 0
2024-03-13 11:32:32.975: main pid 3937628: LOG: child pid 9943 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.976: main pid 3937628: LOG: child pid 3937693 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.976: main pid 3937628: LOG: child pid 3937732 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.976: main pid 3937628: LOG: child pid 3937762 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.977: main pid 3937628: LOG: child pid 3953147 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.978: main pid 3937628: LOG: child pid 3937809 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.979: main pid 3937628: LOG: child pid 9505 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.989: watchdog pid 3937631: LOG: received the failover indication from Pgpool-II on IPC interface
2024-03-13 11:32:32.989: watchdog pid 3937631: LOG: watchdog is informed of failover end by the main process
2024-03-13 11:32:32.989: main pid 3937628: LOG: === Failover done. shutdown host server03(5432) ===
2024-03-13 11:32:32.985: sr_check_worker pid 3937844: LOG: worker process received restart request
2024-03-13 11:32:33.002: [unknown] pid 3937739: LOG: failover or failback event detected
2024-03-13 11:32:33.002: [unknown] pid 3937739: DETAIL: restarting myself
2024-03-13 11:32:33.002: [unknown] pid 3937772: LOG: failover or failback event detected
2024-03-13 11:32:33.002: [unknown] pid 3937772: DETAIL: restarting myself
2024-03-13 11:32:33.004: [unknown] pid 3937814: LOG: failover or failback event detected
2024-03-13 11:32:33.004: [unknown] pid 3937814: DETAIL: restarting myself
2024-03-13 11:32:33.006: [unknown] pid 11907: LOG: selecting backend connection
2024-03-13 11:32:33.006: [unknown] pid 11907: DETAIL: failover or failback event detected, discarding existing connections
2024-03-13 11:32:33.340: [unknown] pid 3937688: LOG: failover or failback event detected
2024-03-13 11:32:33.340: [unknown] pid 3937688: DETAIL: restarting myself
2024-03-13 11:32:33.442: [unknown] pid 3937787: LOG: failover or failback event detected
2024-03-13 11:32:33.442: [unknown] pid 3937787: DETAIL: restarting myself
2024-03-13 11:32:33.442: [unknown] pid 3937806: LOG: failover or failback event detected
2024-03-13 11:32:33.442: [unknown] pid 3937806: DETAIL: restarting myself
2024-03-13 11:32:33.443: [unknown] pid 4177626: LOG: selecting backend connection
2024-03-13 11:32:33.443: [unknown] pid 4177626: DETAIL: failover or failback event detected, discarding existing connections
2024-03-13 11:32:33.456: [unknown] pid 4043199: LOG: failover or failback event detected
2024-03-13 11:32:33.456: [unknown] pid 4043199: DETAIL: restarting myself
2024-03-13 11:32:33.457: [unknown] pid 3937824: LOG: failover or failback event detected
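The failover.sh xtrace embedded in the log above follows the logic of the sample failover script shipped with Pgpool-II: when the failed node is a standby rather than the old primary, the script only tries to drop the standby's replication slot and skips promotion entirely. A minimal sketch of that decision (the function name and return strings are my own for illustration, not from the real script):

```shell
#!/bin/sh
# Sketch of the decision seen in the xtrace above; names and output
# strings are illustrative, not taken from the actual failover.sh.
decide_failover_action() {
    failed_node_id=$1        # failover.sh argument %d
    old_primary_node_id=$2   # failover.sh argument %P
    new_main_node_id=$3      # failover.sh argument %m

    # No healthy node left: nothing can be promoted.
    if [ "$new_main_node_id" -lt 0 ]; then
        echo "all-backends-down"
        return
    fi

    # A standby failed (a primary existed and it is not the failed node):
    # drop the standby's replication slot and skip promotion.
    if [ "$old_primary_node_id" != "-1" ] && \
       [ "$failed_node_id" != "$old_primary_node_id" ]; then
        echo "standby-down-skip-promotion"
        return
    fi

    # The primary itself failed: promote the new main node.
    echo "promote-node-$new_main_node_id"
}
```

With the values from the trace (failed node 0, old primary 1, new main 1) this takes the standby-down branch, which matches the "standby node is down. Skipping failover." message and the `exit 0` in the log.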
On the primary node, in the pgpool logs, I see:
2024-03-13 11:32:32.992: watchdog pid 1351024: LOG: signal_user1_to_parent_with_reason(2)
2024-03-13 11:32:33.019: main pid 1351021: LOG: Pgpool-II parent process received SIGUSR1
2024-03-13 11:32:33.029: main pid 1351021: LOG: Pgpool-II parent process received sync backend signal from watchdog
2024-03-13 11:32:33.031: main pid 1351021: LOG: leader watchdog has performed failover
2024-03-13 11:32:33.031: main pid 1351021: DETAIL: syncing the backend states from the LEADER watchdog node
2024-03-13 11:32:33.031: watchdog pid 1351024: LOG: received the get data request from local pgpool-II on IPC interface
2024-03-13 11:32:33.032: watchdog pid 1351024: LOG: get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "server03:9999 Linux server03.luxmed.pl"
2024-03-13 11:32:33.032: watchdog pid 1351024: DETAIL: waiting for the reply...
2024-03-13 11:32:33.034: main pid 1351021: LOG: leader watchdog node "server03:9999 Linux server03.luxmed.pl" returned status for 2 backend nodes
2024-03-13 11:32:33.034: main pid 1351021: LOG: backend:0 is set to down status
2024-03-13 11:32:33.034: main pid 1351021: DETAIL: backend:0 is DOWN on cluster leader "server03:9999 Linux server03.luxmed.pl"
2024-03-13 11:32:33.034: main pid 1351021: LOG: 1 backend node(s) were detached because of backend status sync from "server03:9999 Linux server03.luxmed.pl"
2024-03-13 11:32:33.034: main pid 1351021: DETAIL: restarting the children processes
2024-03-13 11:32:33.034: main pid 1351021: LOG: Node 1 is not down (status: 1)
2024-03-13 11:32:33.044: sr_check_worker pid 1356663: LOG: worker process received restart request
:( Then my cluster looks like this:
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | l000r00pgdb03 | 5432 | down | up | 0.500000 | standby | standby | 622513 | true | 7440 | streaming | async | 2024-03-13 11:46:48
1 | l000r00pgdb04 | 5432 | up | up | 0.500000 | primary | primary | 3442065 | false | 0 | | | 2024-03-12 13:18:34
This occurred yesterday and today at the same time (around 11:33 am). In this case, running the following helped:
pcp_attach_node -h localhost -p 9898 -U pgpool 0
In postgresql.log around this time I found:
2024-03-13 11:32:32.020 CET [38413] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.054 CET [42027] LOG: could not receive data from client: Connection reset by peer
The uptime of the PostgreSQL database shows:
pg_postmaster_start_time
------------------------------
2024-03-12 11:55:32.61799+01
(1 row)
Could this be related to the database size, which is 100 GB? It is also interesting how the connection count statistics look after the error occurs, and how they grow.
When searching further, I found the following on the node which went down:
2024-03-13 11:32:30.594 CET [1352969] LOG: checkpoint starting: wal
2024-03-13 11:32:32.025 CET [1685398] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.055 CET [1684166] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.076 CET [1669071] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.082 CET [1684074] LOG: unexpected EOF on client connection with an open transaction
2024-03-13 11:32:32.326 CET [1527416] LOG: unexpected EOF on client connection with an open transaction
2024-03-13 11:32:32.973 CET [1689966] ERROR: replication slot "l000r00pgdb03" is active for PID 1356662
2024-03-13 11:32:32.973 CET [1689966] STATEMENT: SELECT pg_drop_replication_slot('l000r00pgdb03');
2024-03-13 11:32:44.122 CET [1690077] ERROR: Próba zmiany statusu na ST, inny proces już podjął akcję
(Translation: "Attempt to change the status to ST; another process has already taken action.")
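The `pg_drop_replication_slot` call in failover.sh failed because the slot was still active: another process (PID 1356662, presumably a walsender still streaming to the standby) held it. If the slot really needs to be removed manually, the process using it has to stop first. A hedged sketch, to be run only after confirming the standby should no longer stream from this slot:

```sql
-- Check who holds the slot:
SELECT slot_name, active, active_pid
  FROM pg_replication_slots
 WHERE slot_name = 'l000r00pgdb03';

-- Terminate the process using it, then drop the slot:
SELECT pg_terminate_backend(active_pid)
  FROM pg_replication_slots
 WHERE slot_name = 'l000r00pgdb03' AND active;

SELECT pg_drop_replication_slot('l000r00pgdb03');
```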