pgpool / pgpool2

This is the official mirror of git://git.postgresql.org/git/pgpool2.git. Note that this is just a *mirror* - we don't work with pull requests on GitHub. Please subscribe to the pgpool-hackers mailing list from our website and submit your patch to that mailing list.
https://www.pgpool.net

Pgpool Version 4.4.4 #64

Open pradeepp-ai opened 3 months ago

pradeepp-ai commented 3 months ago

I need help or suggestions on the pgpool-II issue below.

Example:

- Node 1: Primary (PostgreSQL & pgpool installed)
- Node 2: Standby (PostgreSQL & pgpool installed)
- Node 3: Witness (pgpool installed)

Incident:

  1. In the pgpool cluster, node 1 is primary, node 2 is standby, and node 3 is the witness.
  2. Node 1 becomes unresponsive, failover is triggered, and node 2 becomes the new primary.
  3. After 15 minutes node 1 becomes responsive again and still considers itself primary; node 2 is also primary, so a split-brain situation occurs.
  4. Based on its own (stale) watchdog record and the larger number of standby nodes it believes are connected, node 1 decides it is the true leader/coordinator and asks node 2 to step down.
  5. During those 15 minutes data was written to node 2 (the new primary); once node 1 becomes responsive again, writes start going to node 1, so the data becomes inconsistent.

Please provide a suggestion in this case.

NOTE: We use pgpool only for auto-failover, not auto-failback, so auto-failback is disabled.
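For context, double-failover scenarios like the one described above are usually mitigated by the watchdog consensus settings. A sketch of the relevant pgpool.conf parameters (names from pgpool-II 4.x; the values shown are illustrative assumptions, not a recommendation for this cluster):

```
# Only perform failover when a quorum of watchdog nodes exists
failover_when_quorum_exists = on
# Require more than one pgpool node to agree that a backend is down
failover_require_consensus = on
# With 3 watchdog nodes, keep quorum at a strict majority (2 of 3)
enable_consensus_with_half_votes = off
```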

Thanks & Regards, Pradeep Prajapati

tatsuo-ishii commented 3 months ago

What do you mean by "witness"? I am asking because there is no such concept as a "witness" in pgpool. Can you share your pgpool.conf? (You can hide sensitive data such as passwords or IP addresses.)

pradeepp-ai commented 3 months ago

Hi tatsuo-ishii,

Thank you for your reply.

OK, there is no such witness node in pgpool itself. Below is a brief of the server information, and I am sharing the pgpool configuration files for your reference.

So, we have 3 servers, which we refer to as node 1, node 2, and node 3.

- Node 1: postgresql-15.4 and pgpool-II 4.4.4 installed
- Node 2: postgresql-15.4 and pgpool-II 4.4.4 installed
- Node 3: only pgpool-II 4.4.4 installed

We are using 3 pgpool-II 4.4.4 instances and 2 postgresql-15.4 databases, so that the watchdog can use consensus to elect the leader when two pgpool nodes agree.

We have assigned pgpool_node_id as follows: node 1 = 0, node 2 = 1, node 3 = 2.
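The node-id assignment above can be sketched like this (the path here is hypothetical for illustration; pgpool_node_id normally lives in the pgpool configuration directory, e.g. next to pgpool.conf, and contains nothing but the node number):

```shell
# On node 1 (repeat with 1 on node 2 and 2 on node 3).
# /tmp is used here only as an illustrative, writable location.
echo 0 > /tmp/pgpool_node_id
cat /tmp/pgpool_node_id
```

Each pgpool instance reads this file at startup to learn its own position in the watchdog cluster.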

Kindly find the pgpool.conf files for all 3 nodes attached.

Also, given my configuration, how long will failover take to complete? Or please share the formula to calculate the failover time.

FYI, we are using pgpool for auto-failover with manual recovery instead of auto-recovery; that is why we have disabled auto-failback. We are also not using load balancing for the time being.

Please let me know if you have any questions about the setup and configuration.

FYI, the pgpool.conf file is the same on all nodes except for wd_priority. We have set wd_priority as follows: node 1 = 1, node 2 = 2, node 3 = 0 (because we don't want the VIP to be established on node 3).
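For reference, the per-node difference described above would look roughly like this in each pgpool.conf (a sketch based on the values stated above; a higher wd_priority makes a node more likely to become the watchdog leader and hold the VIP):

```
# node 1:
wd_priority = 1
# node 2:
wd_priority = 2
# node 3 (never holds the VIP):
wd_priority = 0
```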

node3-pgpool-conf.txt node2-pgpool-conf.txt node1-pgpool-conf.txt

tatsuo-ishii commented 2 months ago

There's nothing unexpected in the pgpool.conf files, so I need more info. If the incident above can be reproduced, can you provide the output of pcp_watchdog_info and pcp_node_info on node 1 and node 2 when this happened:

node1 becomes unresponsive then failover is triggered and then node2 becomes the new primary.

Also, I would like to confirm that you did not restart any of node 1, node 2, node 3, or PostgreSQL during steps 1-5.

Please let me know as per my configuration in how much time failover will complete or share me the formula to calculate the failover time.

I guess you are asking how long it will take before failover is triggered when an error occurs. The formula for this is roughly:

(health_check_period + health_check_timeout + health_check_retry_delay + connect_timeout) * health_check_max_retries

Note that the actual time could be shorter if the health_check_timeout or connect_timeout check reports the failure back immediately.
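The formula above can be sketched as a small calculation. The function name and example values below are hypothetical, and all inputs are assumed to be in seconds for simplicity (check the units in your own pgpool.conf, since some parameters such as connect_timeout are configured in milliseconds):

```python
def worst_case_failover_seconds(health_check_period: float,
                                health_check_timeout: float,
                                health_check_retry_delay: float,
                                connect_timeout: float,
                                health_check_max_retries: int) -> float:
    """Rough upper bound on the time before failover is triggered,
    per the formula given above. All inputs assumed in seconds."""
    per_attempt = (health_check_period + health_check_timeout
                   + health_check_retry_delay + connect_timeout)
    return per_attempt * health_check_max_retries

# Illustrative example values (not a recommendation):
print(worst_case_failover_seconds(10, 20, 1, 10, 3))  # → 123
```

As noted above, this is a worst-case estimate; an immediate timeout report shortens the real failover time.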