Open wasiualhasib opened 5 days ago
We have checked all the configurations on all pgpool nodes and found them OK. But one thing I noticed is that on node1 the pgpool version is 4.2.3, while on the other two nodes it is 4.2.2. Is that the reason for this mismatch of information?
I don't think the minor version differences are causing this issue.
Could you ensure that the two standby servers are connected to the primary and receiving WAL from it? Please connect to the primary PostgreSQL and run the following command:
select * from pg_stat_replication;
Also, to show replication_state and replication_sync_state, the user specified in sr_check_user needs to be a PostgreSQL superuser or a member of the pg_monitor group.
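A minimal sketch of granting that membership on the primary, assuming the role in sr_check_user is named pgpool (a placeholder; substitute your own):

```shell
# Run on the primary as a superuser.
# 'pgpool' is a placeholder for whatever role sr_check_user is set to.
psql -U postgres -c 'GRANT pg_monitor TO pgpool;'
```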
At the backend it was working smoothly; there was no issue with replication, it was OK. The issue was at pgpool: pgpool was unable to find the primary node. Basically we did a security update using "yum update --security" on node2 and node3, which caused this line to appear:
1284 [/usr/lib/tmpfiles.d/pgpool-II-pg13.conf:1] Line references path below legacy directory /var/run/, updating /var/run/pgpool → /run/pgpool; please update the tmpfiles.d/ d
Inside that file (/usr/lib/tmpfiles.d/pgpool-II-pg13.conf) we found the line below:
d /var/run/pgpool 0755 postgres postgres
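One way to address that warning, assuming only the legacy /var/run path needs updating (the file name follows the packaged pgpool-II-pg13.conf; verify against your distribution's packaging):

```shell
# Drop an override in /etc/tmpfiles.d, which takes precedence over the
# packaged file in /usr/lib/tmpfiles.d, using /run instead of /var/run.
echo 'd /run/pgpool 0755 postgres postgres' > /etc/tmpfiles.d/pgpool-II-pg13.conf
# Create the directory now rather than waiting for the next boot.
systemd-tmpfiles --create pgpool-II-pg13.conf
```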
We followed the steps below for the security update. For node3:
For node2:
When we tried to attach node2, it got stuck and exited with a pgpool "terminating connection" error. We also observed that port 9898 did not come up instantly after restarting pgpool.
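To see whether the PCP port actually came up after the restart, something like the following can help (9898 is pgpool's default PCP port; host and user are example values):

```shell
# Check that something is listening on the PCP port.
ss -ltnp | grep 9898
# Once it is up, ask pgpool what it thinks node 0's status is.
pcp_node_info -h localhost -p 9898 -U pgpool -n 0
```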
This is the output of pg_stat_replication:
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+--------------+--------------+--------------+--------------+-----------------+-----------------+-----------------+---------------+------------+-------------------------------
44017 | 16385 | repl | pgpool2 | 192.168.43.100 | | 42938 | 2024-10-07 20:24:14.238575+00 | | streaming | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 00:00:00.00044 | 00:00:00.001959 | 00:00:00.002018 | 0 | async | 2024-10-09 03:20:07.405041+00
47187 | 16385 | repl | pgpool3 | 192.168.43.101 | | 59070 | 2024-10-07 20:29:11.300226+00 | | streaming | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 215/6D0DCD18 | 00:00:00.001313 | 00:00:00.003643 | 00:00:00.003715 | 0 | async | 2024-10-09 03:20:07.406941+00
I found that pgpool suddenly showed pg_role as standby for all nodes, but from the database side we found node1 is primary and node2 and node3 are standby.
I tried removing the pgpool status file, but that did not work either. Later I stopped all services, removed all status files, and started the pgpool service again on all nodes. Then the primary showed as primary in pg_role, but when I attached a standby node it terminated the pgpool connection for a long time. After a few minutes I observed it was again showing all nodes as standby. Even pcp_promote_node did not work.
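For reference, attaching and promoting are normally done through the PCP interface; a sketch, with host, port, user, and node ids as example values:

```shell
# Re-attach a backend that pgpool has marked down (node id 1 here).
pcp_attach_node -h localhost -p 9898 -U pgpool -n 1
# Ask pgpool to promote a standby backend (node id 2 here).
pcp_promote_node -h localhost -p 9898 -U pgpool -n 2
```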
Finally, when we recovered node2 and node3, the issue was solved, but I found that replication_state and replication_sync_state were still not showing.
Where does pgpool store that pg_role status information?
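As far as I know, pgpool persists the backend up/down status across restarts in the pgpool_status file under the directory set by logdir in pgpool.conf; a quick check (the path below is only a common default, not necessarily yours):

```shell
# Inspect pgpool's persisted backend status. /var/log/pgpool is only a
# common logdir default; check logdir in pgpool.conf for the real path.
cat /var/log/pgpool/pgpool_status
```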
@pengbo0328
@codeforall @chen-5033
Pgpool at stable state
At the time of pgpool2 error
After all pgpool service stop , remove all status file and later restart pgpool service.
After recovery pgpool2, where replication_state and replication_sync_state is missing