vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License

VIP issue #45

Closed ani-tvm closed 3 years ago

ani-tvm commented 3 years ago

Hi,

We have a three-node cluster (with etcd/HAProxy/PostgreSQL on every node). We rebooted two nodes and now have the following two issues.

  1. The VIP was automatically assigned to node 2, which is not the leader (node 3 is now the leader).
  2. Node 1 is in a failed state.

How can we solve this? I am new to this Patroni PostgreSQL cluster.

Thanks

vitabaks commented 3 years ago
  1. If with_haproxy_load_balancing: true, the Type A scheme will be used and Keepalived will be installed. If the master node goes down, the VIP will move to one of the available balancer nodes with a working HAProxy process. This does not have to be the Patroni leader.

If with_haproxy_load_balancing: false, the Type B scheme will be used and vip-manager will be installed. If the master node goes down, the VIP will move to the new leader.

  2. Run patronictl list and attach the Patroni log from node 1 (a quick check sketch follows below).
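
For reference, a minimal check sketch under the assumption that the cluster was deployed with this playbook's defaults; the variable file paths, the Patroni config path and the patroni service name below are assumptions and may differ in your inventory:

```bash
# Which scheme was deployed? (Type A = HAProxy + Keepalived, Type B = vip-manager)
grep -R "with_haproxy_load_balancing" vars/ group_vars/ 2>/dev/null   # paths are assumptions

# Current cluster topology as Patroni sees it (run on any node)
patronictl -c /etc/patroni/patroni.yml list          # config path may differ

# Patroni log from node 1, to see why it is in a failed state
sudo journalctl -u patroni --no-pager -n 200         # unit name may differ
```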
vitabaks commented 3 years ago

P. S. If you have a cluster of three nodes, you should not reboot two nodes at the same time. For your cluster to withstand the failure of two nodes, it must have at least five nodes. See: http://thesecretlivesofdata.com/raft/
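
To make the quorum arithmetic concrete: quorum = floor(N/2) + 1, so a 3-node DCS keeps quorum with one node lost but not with two, while a 5-node DCS tolerates two lost nodes. A quick way to check etcd health (a sketch; which etcdctl form applies depends on the API version your deployment uses):

```bash
# List the DCS members and check their health
etcdctl member list
etcdctl cluster-health                    # etcd v2 API
ETCDCTL_API=3 etcdctl endpoint health     # etcd v3 API
```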

ani-tvm commented 3 years ago
> 1. If with_haproxy_load_balancing: true, the Type A scheme will be used and Keepalived will be installed. If the master node goes down, the VIP will move to one of the available balancer nodes with a working HAProxy process. This does not have to be the Patroni leader.
>
> If with_haproxy_load_balancing: false, the Type B scheme will be used and vip-manager will be installed. If the master node goes down, the VIP will move to the new leader.
>
> 2. Run patronictl list and attach the Patroni log from node 1.

Thanks for the help.

  1. Keepalived and HAProxy are running on all servers. How can I switch the VIP from this node to the leader node? Users access the database only through the VIP and they are not able to connect.

  2. The following is the PostgreSQL log from node 1, which is having the issue.

2020-08-09 09:00:26 UTC [3568-1] LOG: started streaming WAL from primary at 3/A0000000 on timeline 19
2020-08-09 09:00:26 UTC [3568-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001300000003000000A0 has already been removed
2020-08-09 09:00:31 UTC [3622-1] LOG: started streaming WAL from primary at 3/A0000000 on timeline 19
2020-08-09 09:00:31 UTC [3622-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001300000003000000A0 has already been removed
2020-08-09 09:00:36 UTC [3690-1] LOG: started streaming WAL from primary at 3/A0000000 on timeline 19
2020-08-09 09:00:36 UTC [3690-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001300000003000000A0 has already been removed
2020-08-09 09:00:41 UTC [3747-1] LOG: started streaming WAL from primary at 3/A0000000 on timeline 19
2020-08-09 09:00:41 UTC [3747-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001300000003000000A0 has already been removed
2020-08-09 09:00:46 UTC [3809-1] LOG: started streaming WAL from primary at 3/A0000000 on timeline 19
2020-08-09 09:00:46 UTC [3809-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000001300000003000000A0 has already been removed

vitabaks commented 3 years ago
  1. The balancer is a separate component of the cluster and is not tied to the master role; the balancer can even run on a separate server. I do not understand why you need to move the VIP to the node with the leader role; this is not necessary.

  2. > requested WAL segment 0000001300000003000000A0 has already been removed

Re-initialize this replica node: patronictl reinit <node_name>. You may also need to increase wal_keep_segments.
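
A hedged sketch of raising WAL retention through Patroni's dynamic configuration; the cluster name placeholder, the config path and the example value are assumptions, and on PostgreSQL 13+ the parameter is wal_keep_size rather than wal_keep_segments:

```bash
# Open the cluster's dynamic configuration in an editor and raise WAL retention, e.g.:
#   postgresql:
#     parameters:
#       wal_keep_segments: 128
patronictl -c /etc/patroni/patroni.yml edit-config <cluster_name>

# Verify the setting took effect on the primary
psql -U postgres -c "SHOW wal_keep_segments;"
```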

ani-tvm commented 3 years ago

Thanks for your help.

  1. Earlier, users were authenticated and connected to the DB through the VIP on HAProxy ports 5000/5001. Now users are not able to authenticate through the VIP. So I suspect that if I switch the VIP to the leader node, authentication through HAProxy will work again. How can I switch the VIP?

  2. If I run the reinit command, will it affect the database on the other nodes? How can I do this safely? /usr/local/bin/patronictl -d etcd://127.0.0.1 reinit

Thanks

vitabaks commented 3 years ago
> Now users are not able to authenticate through the VIP.

Try ping to the cluster_vip and telnet to cluster_vip 5000. Also see the HAProxy stats at http://node_ip:7000
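
For example, a quick connectivity check from a client machine (the ports follow this thread's setup; nc is used here in place of telnet and is an assumption about the tools installed):

```bash
ping -c 3 <cluster_vip>                # is the VIP reachable at all?
nc -vz <cluster_vip> 5000              # HAProxy port for the primary
nc -vz <cluster_vip> 5001              # HAProxy port for the replicas
curl -s http://<node_ip>:7000 | head   # HAProxy stats page
```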

> How can I switch the VIP?

You shouldn't need to do this. But if necessary, you can increase the weight or priority in the keepalived config of one of the balancer nodes and then run systemctl reload keepalived.
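
If you really do need to move the VIP, a minimal sketch under the assumption of the usual keepalived config path; the priority values are only examples:

```bash
# On the balancer node that should take over the VIP:
sudo grep -n "priority" /etc/keepalived/keepalived.conf
# Raise its VRRP priority above the other nodes (the highest priority wins the VIP),
# e.g. change "priority 100" to "priority 150", then reload and verify:
sudo systemctl reload keepalived
ip addr show | grep -B2 "<cluster_vip>"   # the VIP should now appear on this node
```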

> If I run the reinit command, will it affect the database on the other nodes?

By default, Patroni uses pg_basebackup (from the master) to create new replicas.
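
In other words, reinit rewrites only the data directory of the member being re-initialized; the other nodes keep running. A minimal sketch, assuming the same Patroni config path as above and placeholder cluster/member names:

```bash
# Note the name of the failed member
patronictl -c /etc/patroni/patroni.yml list

# Re-initialize only that member; it pulls a fresh pg_basebackup from the master
patronictl -c /etc/patroni/patroni.yml reinit <cluster_name> <member_name>

# Watch it return to the "running" state
watch patronictl -c /etc/patroni/patroni.yml list
```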

See patronictl --help for the full list of commands.

See this Tutorial: Management of High-Availability PostgreSQL clusters with Patroni https://pgconf.ru/en/2018/108567

ani-tvm commented 3 years ago

  1. ping to the cluster_vip succeeded. telnet to the cluster_vip: it tries to connect and shows a blank screen; after 2 minutes it disconnected with the following message: E)SERRORC08P01Mclient_login_timeout

haproxy stats http://x.x.x.x:7000 ---> command error. The output was: HA-Proxy version 1.8.20 2019/04/25

  2. Thanks for sharing the tutorial URL; I will watch it and get back to you.

Thanks

vitabaks commented 3 years ago

> haproxy stats http://x.x.x.x:7000 ---> command error.

Open it in a browser :) Example: [screenshot of the HAProxy stats page]

And check systemctl status haproxy on all balancer nodes.
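
On each balancer node, a quick way to confirm HAProxy and Keepalived are actually serving (a sketch; the port numbers follow this thread's setup):

```bash
sudo systemctl status haproxy keepalived --no-pager
sudo ss -ltnp | grep -E ':5000|:5001|:7000'                      # listeners: primary, replicas, stats
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7000   # 200 = stats page is up
```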

ani-tvm commented 3 years ago

Thanks for the help.

On all 3 servers, the HAProxy status is active and running.

[image]

vitabaks commented 3 years ago

Hi @ani-tvm

As a result, is it working for you now? Are you able to connect to the cluster?