Open KarolLipnicki opened 3 months ago
2024-03-13 11:32:31.993: PostgreSQL JDBC Driver pid 3937693: LOG: received degenerate backend request for node_id: 0 from pid [3937693]
Looks like a pgpool child process (pid 3937693) found something wrong and decided to raise a failover request. To determine the cause, I need the pgpool logs from before this happened. Can you share the logs?
Sure, logs attached: pgpool_logs.zip
The cause of the failover was this:
pgpool-server03.log:2024-03-13 11:32:31.989: PostgreSQL JDBC Driver pid 3937693: LOG: Error message from backend: DB node id: 0 message: "canceling statement due to conflict with recovery"
In this case the standby disconnected from pgpool, pgpool failed to read data from the standby, and pgpool triggered a failover. If you want to avoid failover in this case, you can set:
failover_on_backend_error = off
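Separately, the underlying message ("canceling statement due to conflict with recovery") is a standard PostgreSQL recovery conflict: WAL replay on the standby cancelled a running query. Besides telling pgpool not to fail over on backend errors, the conflicts themselves can often be reduced on the standby side. A sketch for the standby's postgresql.conf (values are illustrative, not taken from this cluster, and both settings have trade-offs such as table bloat on the primary or increased replay lag):

```
# postgresql.conf on the standby (illustrative values)
hot_standby_feedback = on          # report the standby's oldest xmin to the primary,
                                   # so vacuum does not remove rows queries still need
max_standby_streaming_delay = 60s  # let WAL replay wait up to 60s before
                                   # cancelling conflicting standby queries
```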
OK, I'll try setting this parameter and I'll let you know if it helps.
So far everything is OK :)
Great!
I have a cluster of Pgpool-II and PostgreSQL (you can find the configuration here: https://github.com/pgpool/pgpool2/issues/43).
Sometimes one of the nodes (the standby) goes down, and in the pgpool logs I see the following errors:
2024-03-13 11:32:31.993: PostgreSQL JDBC Driver pid 3937693: LOG: received degenerate backend request for node_id: 0 from pid [3937693]
2024-03-13 11:32:32.003: watchdog pid 3937631: LOG: watchdog received the failover command from local pgpool-II on IPC interface
2024-03-13 11:32:32.016: watchdog pid 3937631: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2024-03-13 11:32:32.017: watchdog pid 3937631: LOG: we have got the consensus to perform the failover
2024-03-13 11:32:32.017: watchdog pid 3937631: DETAIL: 1 node(s) voted in the favor
2024-03-13 11:32:32.017: PostgreSQL JDBC Driver pid 3937693: LOG: signal_user1_to_parent_with_reason(0)
2024-03-13 11:32:32.017: main pid 3937628: LOG: Pgpool-II parent process received SIGUSR1
2024-03-13 11:32:32.017: main pid 3937628: LOG: Pgpool-II parent process has received failover request
2024-03-13 11:32:32.018: watchdog pid 3937631: LOG: received the failover indication from Pgpool-II on IPC interface
2024-03-13 11:32:32.018: watchdog pid 3937631: LOG: watchdog is informed of failover start by the main process
2024-03-13 11:32:32.018: main pid 3937628: LOG: === Starting degeneration. shutdown host server03(5432) ===
2024-03-13 11:32:32.022: main pid 3937628: LOG: Do not restart children because we are switching over node id 0 host: server03 port: 5432 and we are in streaming replication mode
2024-03-13 11:32:32.049: main pid 3937628: LOG: child pid 3937649 needs to restart because pool 0 uses backend 0
2024-03-13 11:32:32.082: main pid 3937628: LOG: child pid 9943 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.128: main pid 3937628: LOG: child pid 3937693 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.231: main pid 3937628: LOG: child pid 3937732 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.293: main pid 3937628: LOG: child pid 3937762 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.307: main pid 3937628: LOG: child pid 3953147 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.386: main pid 3937628: LOG: child pid 3937809 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.425: main pid 3937628: LOG: child pid 9505 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.478: main pid 3937628: LOG: execute command: /etc/pgpool-II/failover.sh 0 server03 5432 /var/lib/pgsql/15/data 1 server04 0 1 5432 /var/lib/pgsql/15/data server04 5432
+ FAILED_NODE_ID=0
+ FAILED_NODE_HOST=server03
+ FAILED_NODE_PORT=5432
+ FAILED_NODE_PGDATA=/var/lib/pgsql/15/data
+ NEW_MAIN_NODE_ID=1
+ NEW_MAIN_NODE_HOST=server04
+ OLD_MAIN_NODE_ID=0
+ OLD_PRIMARY_NODE_ID=1
+ NEW_MAIN_NODE_PORT=5432
+ NEW_MAIN_NODE_PGDATA=/var/lib/pgsql/15/data
+ OLD_PRIMARY_NODE_HOST=server04
+ OLD_PRIMARY_NODE_PORT=5432
+ PGHOME=/usr/pgsql-15
++ tr -- -. _
++ echo server03
+ REPL_SLOT_NAME=server03
+ POSTGRESQL_STARTUP_USER=postgres
+ SSH_KEY_FILE=id_rsa_pgpool
+ SSH_OPTIONS='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/id_rsa_pgpool'
+ echo failover.sh: start: failed_node_id=0 failed_host=server03 old_primary_node_id=1 new_main_node_id=1 new_main_host=server04
failover.sh: start: failed_node_id=0 failed_host=server03 old_primary_node_id=1 new_main_node_id=1 new_main_host=server04
+ '[' 1 -lt 0 ']'
+ ssh -T -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i '~/.ssh/id_rsa_pgpool' postgres@server04 ls /tmp
Warning: Permanently added 'server04' (ED25519) to the list of known hosts.
IPv4: 10.128.61.15
+ '[' 0 -ne 0 ']'
+ '[' 1 '!=' -1 -a 0 '!=' 1 ']'
+ /usr/pgsql-15/bin/psql -h server04 -p 5432 postgres -c 'SELECT pg_drop_replication_slot('\''server03'\'');'
+ '[' 1 -ne 0 ']'
+ echo ERROR: failover.sh: drop replication slot '"server03"' failed. You may need to drop replication slot manually.
ERROR: failover.sh: drop replication slot "server03" failed. You may need to drop replication slot manually.
+ echo failover.sh: end: standby node is down. Skipping failover.
failover.sh: end: standby node is down. Skipping failover.
+ exit 0
2024-03-13 11:32:32.975: main pid 3937628: LOG: failover: set new primary node: 1
2024-03-13 11:32:32.975: main pid 3937628: LOG: failover: set new main node: 1
2024-03-13 11:32:32.975: main pid 3937628: LOG: child pid 3937649 needs to restart because pool 0 uses backend 0
2024-03-13 11:32:32.975: main pid 3937628: LOG: child pid 9943 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.976: main pid 3937628: LOG: child pid 3937693 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.976: main pid 3937628: LOG: child pid 3937732 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.976: main pid 3937628: LOG: child pid 3937762 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.977: main pid 3937628: LOG: child pid 3953147 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.978: main pid 3937628: LOG: child pid 3937809 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.979: main pid 3937628: LOG: child pid 9505 needs to restart because pool 1 uses backend 0
2024-03-13 11:32:32.989: watchdog pid 3937631: LOG: received the failover indication from Pgpool-II on IPC interface
2024-03-13 11:32:32.989: watchdog pid 3937631: LOG: watchdog is informed of failover end by the main process
2024-03-13 11:32:32.989: main pid 3937628: LOG: === Failover done. shutdown host server03(5432) ===
2024-03-13 11:32:32.985: sr_check_worker pid 3937844: LOG: worker process received restart request
2024-03-13 11:32:33.002: [unknown] pid 3937739: LOG: failover or failback event detected
2024-03-13 11:32:33.002: [unknown] pid 3937739: DETAIL: restarting myself
2024-03-13 11:32:33.002: [unknown] pid 3937772: LOG: failover or failback event detected
2024-03-13 11:32:33.002: [unknown] pid 3937772: DETAIL: restarting myself
2024-03-13 11:32:33.004: [unknown] pid 3937814: LOG: failover or failback event detected
2024-03-13 11:32:33.004: [unknown] pid 3937814: DETAIL: restarting myself
2024-03-13 11:32:33.006: [unknown] pid 11907: LOG: selecting backend connection
2024-03-13 11:32:33.006: [unknown] pid 11907: DETAIL: failover or failback event detected, discarding existing connections
2024-03-13 11:32:33.340: [unknown] pid 3937688: LOG: failover or failback event detected
2024-03-13 11:32:33.340: [unknown] pid 3937688: DETAIL: restarting myself
2024-03-13 11:32:33.442: [unknown] pid 3937787: LOG: failover or failback event detected
2024-03-13 11:32:33.442: [unknown] pid 3937787: DETAIL: restarting myself
2024-03-13 11:32:33.442: [unknown] pid 3937806: LOG: failover or failback event detected
2024-03-13 11:32:33.442: [unknown] pid 3937806: DETAIL: restarting myself
2024-03-13 11:32:33.443: [unknown] pid 4177626: LOG: selecting backend connection
2024-03-13 11:32:33.443: [unknown] pid 4177626: DETAIL: failover or failback event detected, discarding existing connections
2024-03-13 11:32:33.456: [unknown] pid 4043199: LOG: failover or failback event detected
2024-03-13 11:32:33.456: [unknown] pid 4043199: DETAIL: restarting myself
2024-03-13 11:32:33.457: [unknown] pid 3937824: LOG: failover or failback event detected
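The failover.sh xtrace embedded in the log above follows the logic of the sample failover script shipped with Pgpool-II: when the failed node is a standby rather than the old primary, the script only tries to drop the standby's replication slot and skips promotion entirely. A minimal sketch of that decision (the function name and return strings are my own for illustration, not from the real script):

```shell
#!/bin/sh
# Sketch of the decision seen in the xtrace above; names and output
# strings are illustrative, not taken from the actual failover.sh.
decide_failover_action() {
    failed_node_id=$1        # failover.sh argument %d
    old_primary_node_id=$2   # failover.sh argument %P
    new_main_node_id=$3      # failover.sh argument %m

    # No healthy node left: nothing can be promoted.
    if [ "$new_main_node_id" -lt 0 ]; then
        echo "all-backends-down"
        return
    fi

    # A standby failed (a primary existed and it is not the failed node):
    # drop the standby's replication slot and skip promotion.
    if [ "$old_primary_node_id" != "-1" ] && \
       [ "$failed_node_id" != "$old_primary_node_id" ]; then
        echo "standby-down-skip-promotion"
        return
    fi

    # The primary itself failed: promote the new main node.
    echo "promote-node-$new_main_node_id"
}
```

With the values from the trace (failed node 0, old primary 1, new main 1) this takes the standby-down branch, which matches the "standby node is down. Skipping failover." message and the `exit 0` in the log.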
On the primary node, in the pgpool logs, I see:
2024-03-13 11:32:32.992: watchdog pid 1351024: LOG: signal_user1_to_parent_with_reason(2)
2024-03-13 11:32:33.019: main pid 1351021: LOG: Pgpool-II parent process received SIGUSR1
2024-03-13 11:32:33.029: main pid 1351021: LOG: Pgpool-II parent process received sync backend signal from watchdog
2024-03-13 11:32:33.031: main pid 1351021: LOG: leader watchdog has performed failover
2024-03-13 11:32:33.031: main pid 1351021: DETAIL: syncing the backend states from the LEADER watchdog node
2024-03-13 11:32:33.031: watchdog pid 1351024: LOG: received the get data request from local pgpool-II on IPC interface
2024-03-13 11:32:33.032: watchdog pid 1351024: LOG: get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "server03:9999 Linux server03.luxmed.pl"
2024-03-13 11:32:33.032: watchdog pid 1351024: DETAIL: waiting for the reply...
2024-03-13 11:32:33.034: main pid 1351021: LOG: leader watchdog node "server03:9999 Linux server03.luxmed.pl" returned status for 2 backend nodes
2024-03-13 11:32:33.034: main pid 1351021: LOG: backend:0 is set to down status
2024-03-13 11:32:33.034: main pid 1351021: DETAIL: backend:0 is DOWN on cluster leader "server03:9999 Linux server03.luxmed.pl"
2024-03-13 11:32:33.034: main pid 1351021: LOG: 1 backend node(s) were detached because of backend status sync from "server03:9999 Linux server03.luxmed.pl"
2024-03-13 11:32:33.034: main pid 1351021: DETAIL: restarting the children processes
2024-03-13 11:32:33.034: main pid 1351021: LOG: Node 1 is not down (status: 1)
2024-03-13 11:32:33.044: sr_check_worker pid 1356663: LOG: worker process received restart request
:( Then my cluster looks like this:
node_id | hostname | port | status | pg_status | lb_weight | role | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | l000r00pgdb03 | 5432 | down | up | 0.500000 | standby | standby | 622513 | true | 7440 | streaming | async | 2024-03-13 11:46:48
1 | l000r00pgdb04 | 5432 | up | up | 0.500000 | primary | primary | 3442065 | false | 0 | | | 2024-03-12 13:18:34
This occurred yesterday and today at the same time (around 11:33 am). In this case, running the following helped:
pcp_attach_node -h localhost -p 9898 -U pgpool 0
In postgresql.log around this time I found:
2024-03-13 11:32:32.020 CET [38413] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.054 CET [42027] LOG: could not receive data from client: Connection reset by peer
The uptime of the PostgreSQL database shows:
pg_postmaster_start_time
------------------------------
2024-03-12 11:55:32.61799+01
(1 row)
Could this be related to the database size, which is 100 GB? It is also interesting how the connection count statistics look after the error occurs, and how they grow.
When searching further, I found the following on the node which went down:
2024-03-13 11:32:30.594 CET [1352969] LOG: checkpoint starting: wal
2024-03-13 11:32:32.025 CET [1685398] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.055 CET [1684166] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.076 CET [1669071] LOG: could not receive data from client: Connection reset by peer
2024-03-13 11:32:32.082 CET [1684074] LOG: unexpected EOF on client connection with an open transaction
2024-03-13 11:32:32.326 CET [1527416] LOG: unexpected EOF on client connection with an open transaction
2024-03-13 11:32:32.973 CET [1689966] ERROR: replication slot "l000r00pgdb03" is active for PID 1356662
2024-03-13 11:32:32.973 CET [1689966] STATEMENT: SELECT pg_drop_replication_slot('l000r00pgdb03');
2024-03-13 11:32:44.122 CET [1690077] ERROR: Próba zmiany statusu na ST, inny proces już podjął akcję
(Translation: "Attempt to change the status to ST; another process has already taken action.")
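The `pg_drop_replication_slot` call in failover.sh failed because the slot was still active: another process (PID 1356662, presumably a walsender still streaming to the standby) held it. If the slot really needs to be removed manually, the process using it has to stop first. A hedged sketch, to be run only after confirming the standby should no longer stream from this slot:

```sql
-- Check who holds the slot:
SELECT slot_name, active, active_pid
  FROM pg_replication_slots
 WHERE slot_name = 'l000r00pgdb03';

-- Terminate the process using it, then drop the slot:
SELECT pg_terminate_backend(active_pid)
  FROM pg_replication_slots
 WHERE slot_name = 'l000r00pgdb03' AND active;

SELECT pg_drop_replication_slot('l000r00pgdb03');
```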