Open wasiualhasib opened 5 days ago
We have checked all the configurations on all pgpool nodes and found them OK. But one thing I noticed is that on node1 the pgpool version is 4.2.3, while on the other two nodes it is 4.2.2. Is that the reason for this mismatch of information?
I don't think the minor version differences are causing this issue.
Could you ensure that the two standby servers are connected to the primary and receiving WAL from it? Please connect to the primary PostgreSQL and run the following command:
select * from pg_stat_replication;
Also, to show replication_state and replication_sync_state, the user specified in sr_check_user needs to be a PostgreSQL superuser or a member of the pg_monitor group.
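A minimal sketch of granting that membership on the primary, assuming the role in sr_check_user is named pgpool (a placeholder; substitute your own):

```shell
# Run on the primary as a superuser.
# 'pgpool' is a placeholder for whatever role sr_check_user is set to.
psql -U postgres -c 'GRANT pg_monitor TO pgpool;'
```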
At the backend it was working smoothly; there was no issue with replication, it was OK. The issue was at pgpool: pgpool was unable to find the primary node. Basically we did a security update using "yum update --security" on node2 and node3, which caused this line to appear:
1284 [/usr/lib/tmpfiles.d/pgpool-II-pg13.conf:1] Line references path below legacy directory /var/run/, updating /var/run/pgpool → /run/pgpool; please update the tmpfiles.d/ d
Inside that file (/usr/lib/tmpfiles.d/pgpool-II-pg13.conf) we found the line below:
d /var/run/pgpool 0755 postgres postgres
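One way to address that warning, assuming only the legacy /var/run path needs updating (the file name follows the packaged pgpool-II-pg13.conf; verify against your distribution's packaging):

```shell
# Drop an override in /etc/tmpfiles.d, which takes precedence over the
# packaged file in /usr/lib/tmpfiles.d, using /run instead of /var/run.
echo 'd /run/pgpool 0755 postgres postgres' > /etc/tmpfiles.d/pgpool-II-pg13.conf
# Create the directory now rather than waiting for the next boot.
systemd-tmpfiles --create pgpool-II-pg13.conf
```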
We followed the steps below for the security update. For node3:
For node2:
When we tried to attach node2, it got stuck and exited with a pgpool "terminating connection" error. We also observed that port 9898 did not come up instantly after restarting pgpool.
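To see whether the PCP port actually came up after the restart, something like the following can help (9898 is pgpool's default PCP port; host and user are example values):

```shell
# Check that something is listening on the PCP port.
ss -ltnp | grep 9898
# Once it is up, ask pgpool what it thinks node 0's status is.
pcp_node_info -h localhost -p 9898 -U pgpool -n 0
```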
This is the output of pg_stat_replication:
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------+-------------------------------
44017 | 16385 | repl | pgpool2 | 192.168.43.100 | | 42938 | 2024-10-07 20:24:14.238575+00 | | streaming | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 00:00:00.00044 | 00:00:00.001959 | 00:00:00.002018 | 0 | async | 2024-10-09 03:20:07.405041+00
47187 | 16385 | repl | pgpool3 | 192.168.43.101 | | 59070 | 2024-10-07 20:29:11.300226+00 | | streaming | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 00:00:00.001313 | 00:00:00.003643 | 00:00:00.003715 | 0 | async | 2024-10-09 03:20:07.406941+00
I found that pgpool suddenly showed pg_role as standby for all nodes, but from the database side we found node1 is primary and node2 and node3 are standby.
I tried removing the pgpool status file, but that did not work either. Later I stopped all services, removed all status files, and started the pgpool service again on all nodes. Then the primary showed as primary in pg_role, but when I attached a standby node it terminated the pgpool connection for a long time. After a few minutes I observed it was again showing all nodes as standby. Even pcp_promote_node did not work.
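For reference, attaching and promoting are normally done through the PCP interface; a sketch, with host, port, user, and node ids as example values:

```shell
# Re-attach a backend that pgpool has marked down (node id 1 here).
pcp_attach_node -h localhost -p 9898 -U pgpool -n 1
# Ask pgpool to promote a standby backend (node id 2 here).
pcp_promote_node -h localhost -p 9898 -U pgpool -n 2
```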
Finally, when we recovered node2 and node3, the issue was solved, but I found that replication_state and replication_sync_state were still not showing.
Where does pgpool store that pg_role status information?
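As far as I know, pgpool persists the backend up/down status across restarts in the pgpool_status file under the directory set by logdir in pgpool.conf; a quick check (the path below is only a common default, not necessarily yours):

```shell
# Inspect pgpool's persisted backend status. /var/log/pgpool is only a
# common logdir default; check logdir in pgpool.conf for the real path.
cat /var/log/pgpool/pgpool_status
```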
@pengbo0328
@codeforall @chen-5033
Pgpool at stable state
At the time of pgpool2 error
After all pgpool service stop , remove all status file and later restart pgpool service.
After recovery pgpool2, where replication_state and replication_sync_state is missing