Closed orlakwahr closed 1 year ago
Hi @orlakwahr
Run the patronictl utility on a server whose network is healthy.
> and later setup etcd, haproxy and keepalived on it manually now I have 4 nodes in the cluster
I don't understand why you do it manually; the load balancer can also be added automatically. Use the add_balancer.yml playbook for this.
https://github.com/vitabaks/postgresql_cluster#cluster-scaling
Also, note:
if you want the cluster to survive the failure of two servers, then you need an etcd cluster of 5 nodes (quorum is a majority: N/2+1). In a cluster of three nodes it is possible to lose only 1 server; if more nodes are unavailable, the etcd cluster is not healthy because there is no quorum.
See how raft works: http://thesecretlivesofdata.com/raft/
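The quorum arithmetic above can be sketched in a few lines (a minimal illustration, not part of the playbooks):

```python
def fault_tolerance(n: int) -> int:
    """Number of members an n-node etcd cluster can lose while keeping quorum.

    Quorum is a strict majority of the cluster: floor(n/2) + 1.
    """
    quorum = n // 2 + 1
    return n - quorum

# A 3-node cluster tolerates 1 failure; surviving 2 failures needs 5 nodes.
for n in (3, 5, 7):
    print(n, "nodes ->", fault_tolerance(n), "tolerated failures")
```

Note that adding a 4th node does not help: `fault_tolerance(4)` is still 1, because quorum grows to 3.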
Thanks, but the problem is that a cluster of 3 nodes cannot automatically handle the loss of 1 node. I am trying to understand how to make it handle the loss automatically.
Patroni does a great job of handling automatic failover.
Please show examples of your problem and attach the Patroni logs; maybe I didn't understand the question.
Apr 28 13:49:01 localhost patroni[988]: 2022-04-28 13:48:58,543 INFO: Lock owner: pg-db3; I am pg-db1
Apr 28 13:49:01 localhost patroni[988]: 2022-04-28 13:49:01,048 INFO: Selected new etcd server http://10.16.18.135:2379
Apr 28 13:49:01 localhost patroni[988]: 2022-04-28 13:49:01,048 ERROR: Request to server http://10.16.18.133:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'10.16.18.133\', port=2379): Max retries exceeded with url: /v2/keys/service/postgres-cluster/members/pg-db1 (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'10.16.18.133\', port=2379): Read timed out. (read timeout=2.499780117010232)",))',)
Apr 28 13:49:01 localhost patroni[988]: 2022-04-28 13:49:01,048 INFO: Reconnection allowed, looking for another server.
Apr 28 13:49:01 localhost patroni[988]: 2022-04-28 13:49:01,048 ERROR:
Apr 28 13:49:01 localhost patroni[988]: Traceback (most recent call last):
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 449, in _make_request
Apr 28 13:49:01 localhost patroni[988]: six.raise_from(e, None)
Apr 28 13:49:01 localhost patroni[988]: File "<string>", line 3, in raise_from
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 444, in _make_request
Apr 28 13:49:01 localhost patroni[988]: httplib_response = conn.getresponse()
Apr 28 13:49:01 localhost patroni[988]: File "/usr/lib64/python3.6/http/client.py", line 1361, in getresponse
Apr 28 13:49:01 localhost patroni[988]: response.begin()
Apr 28 13:49:01 localhost patroni[988]: File "/usr/lib64/python3.6/http/client.py", line 311, in begin
Apr 28 13:49:01 localhost patroni[988]: version, status, reason = self._read_status()
Apr 28 13:49:01 localhost patroni[988]: File "/usr/lib64/python3.6/http/client.py", line 272, in _read_status
Apr 28 13:49:01 localhost patroni[988]: line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
Apr 28 13:49:01 localhost patroni[988]: File "/usr/lib64/python3.6/socket.py", line 586, in readinto
Apr 28 13:49:01 localhost patroni[988]: return self._sock.recv_into(b)
Apr 28 13:49:01 localhost patroni[988]: socket.timeout: timed out
Apr 28 13:49:01 localhost patroni[988]: During handling of the above exception, another exception occurred:
Apr 28 13:49:01 localhost patroni[988]: Traceback (most recent call last):
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
Apr 28 13:49:01 localhost patroni[988]: chunked=chunked,
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 451, in _make_request
Apr 28 13:49:01 localhost patroni[988]: self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 341, in _raise_timeout
Apr 28 13:49:01 localhost patroni[988]: self, url, "Read timed out. (read timeout=%s)" % timeout_value
Apr 28 13:49:01 localhost patroni[988]: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='10.16.18.133', port=2379): Read timed out. (read timeout=2.499780117010232)
Apr 28 13:49:01 localhost patroni[988]: During handling of the above exception, another exception occurred:
Apr 28 13:49:01 localhost patroni[988]: Traceback (most recent call last):
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 211, in _do_http_request
Apr 28 13:49:01 localhost patroni[988]: response = request_executor(method, base_uri + path, **kwargs)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/request.py", line 79, in request
Apr 28 13:49:01 localhost patroni[988]: method, url, fields=fields, headers=headers, **urlopen_kw
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/request.py", line 170, in request_encode_body
Apr 28 13:49:01 localhost patroni[988]: return self.urlopen(method, url, **extra_kw)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/poolmanager.py", line 376, in urlopen
Apr 28 13:49:01 localhost patroni[988]: response = conn.urlopen(method, u.request_uri, **kw)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 786, in urlopen
Apr 28 13:49:01 localhost patroni[988]: method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 592, in increment
Apr 28 13:49:01 localhost patroni[988]: raise MaxRetryError(_pool, url, error or ResponseError(cause))
Apr 28 13:49:01 localhost patroni[988]: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.16.18.133', port=2379): Max retries exceeded with url: /v2/keys/service/postgres-cluster/members/pg-db1 (Caused by ReadTimeoutError("HTTPConnectionPool(host='10.16.18.133', port=2379): Read timed out. (read timeout=2.499780117010232)",))
Apr 28 13:49:01 localhost patroni[988]: During handling of the above exception, another exception occurred:
Apr 28 13:49:01 localhost patroni[988]: Traceback (most recent call last):
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 566, in wrapper
Apr 28 13:49:01 localhost patroni[988]: retval = func(self, *args, **kwargs) is not None
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 659, in touch_member
Apr 28 13:49:01 localhost patroni[988]: return self._client.set(self.member_path, data, None if permanent else self._ttl)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/etcd/client.py", line 721, in set
Apr 28 13:49:01 localhost patroni[988]: return self.write(key, value, ttl=ttl)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/etcd/client.py", line 500, in write
Apr 28 13:49:01 localhost patroni[988]: response = self.api_execute(path, method, params=params)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 256, in api_execute
Apr 28 13:49:01 localhost patroni[988]: response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
Apr 28 13:49:01 localhost patroni[988]: File "/usr/local/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 230, in _do_http_request
Apr 28 13:49:01 localhost patroni[988]: raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
Apr 28 13:49:01 localhost patroni[988]: etcd.EtcdException: PUT /v2/keys/service/postgres-cluster/members/pg-db1 request failed
Apr 28 13:49:01 localhost patroni[988]: 2022-04-28 13:49:01,050 INFO: no action. I am (pg-db1), a secondary, and following a leader (pg-db3)
Apr 28 13:49:01 localhost etcd[1078]: got unexpected response error (etcdserver: request timed out)
Apr 28 13:49:02 localhost etcd[1078]: 33070dbf2451ad42 [term: 261] received a MsgVote message with higher term from 99de4181ac8b022 [term: 262]
Apr 28 13:49:02 localhost etcd[1078]: 33070dbf2451ad42 became follower at term 262
Apr 28 13:49:02 localhost etcd[1078]: 33070dbf2451ad42 [logterm: 203, index: 16576203, vote: 0] cast MsgVote for 99de4181ac8b022 [logterm: 203, index: 16576203] at term 262
Apr 28 13:49:03 localhost etcd[1078]: health check for peer a5b9a4993cb72fad could not connect: dial tcp 10.16.18.134:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Apr 28 13:49:03 localhost etcd[1078]: health check for peer f12d636a65b65d2f could not connect: dial tcp 10.16.18.141:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Apr 28 13:49:03 localhost etcd[1078]: got unexpected response error (etcdserver: request timed out)
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 is starting a new election at term 262
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 became candidate at term 263
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 received MsgVoteResp from 33070dbf2451ad42 at term 263
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 [logterm: 203, index: 16576203] sent MsgVote request to a5b9a4993cb72fad at term 263
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 [logterm: 203, index: 16576203] sent MsgVote request to f12d636a65b65d2f at term 263
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 [logterm: 203, index: 16576203] sent MsgVote request to 99de4181ac8b022 at term 263
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 received MsgVoteResp from 99de4181ac8b022 at term 263
Apr 28 13:49:07 localhost etcd[1078]: 33070dbf2451ad42 [quorum:3] has received 2 MsgVoteResp votes and 0 vote rejections
Apr 28 13:49:07 localhost etcd[1078]: got unexpected response error (etcdserver: request timed out)
Apr 28 13:49:08 localhost etcd[1078]: health check for peer a5b9a4993cb72fad could not connect: dial tcp 10.16.18.134:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Apr 28 13:49:08 localhost etcd[1078]: health check for peer f12d636a65b65d2f could not connect: dial tcp 10.16.18.141:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Apr 28 13:49:08 localhost patroni[988]: 2022-04-28 13:49:08,539 INFO: Selected new etcd server http://10.16.18.133:2379
Apr 28 13:49:11 localhost patroni[988]: 2022-04-28 13:49:08,542 INFO: Lock owner: pg-db3; I am pg-db1
Apr 28 13:49:11 localhost patroni[988]: 2022-04-28 13:49:11,047 INFO: Selected new etcd server http://10.16.18.135:2379
Apr 28 13:49:11 localhost patroni[988]: 2022-04-28 13:49:11,047 ERROR: Request to server http://10.16.18.133:2379 fail
root@pg-db1:keepalived# /usr/local/bin/patronictl list postgres-cluster
2022-04-28 14:42:52,915 - ERROR - Request to server http://10.16.18.141:2379 failed: MaxRetryError("HTTPConnectionPool(host='10.16.18.141', port=2379): Max retries exceeded with url: /v2/keys/service/postgres-cluster/?recursive=true (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe4ce701d68>, 'Connection to 10.16.18.141 timed out. (connect timeout=1.25)'))",)
+ Cluster: postgres-cluster (7068655480869002509) --+----+-----------+
| Member | Host         | Role    | State   | TL | Lag in MB |
+--------+--------------+---------+---------+----+-----------+
| pg-db1 | 10.16.18.133 | Replica | running | 13 |         0 |
| pg-db2 | 10.16.18.134 | Replica | running | 13 |         0 |
| pg-db3 | 10.16.18.135 | Leader  | running | 13 |           |
+--------+--------------+---------+---------+----+-----------+
So when these 3 nodes are up and running, the cluster is OK. But if I disconnect the network on any one of them, it still says running, though obviously with a lot of errors:
INFO: no action. I am (pg-db1), a secondary, and following a leader (pg-db3)
1) If there is no network on a replica, nothing happens: the replica continues to work, and replication lag simply accumulates since there is no access to the primary.
When the network is restored, the replica will try to catch up with the primary by replaying all the WALs (if they are still available or are in the archive). In a load-balancing scheme (haproxy), such a lagging replica is excluded from balancing if its replication lag exceeds maximum_lag_on_failover.
2) If the network disappears on the primary (leader), it can no longer update the leader key in DCS (etcd), and after a while, usually 30 seconds (ttl), a new leader (primary) will be elected. The former leader is then restarted as a replica.
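The leader-key mechanics described above can be sketched as a toy model (the class and method names here are illustrative assumptions, not Patroni's actual implementation):

```python
class LeaderKey:
    """Toy model of the DCS leader key with a TTL, to illustrate failover timing."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.owner = None
        self.expires_at = 0.0

    def touch(self, node: str, now: float) -> bool:
        # The current owner may refresh the key; any node may take it once it expires.
        if self.owner in (None, node) or now >= self.expires_at:
            self.owner = node
            self.expires_at = now + self.ttl
            return True
        return False

key = LeaderKey(ttl=30)
key.touch("pg-db3", now=0)    # pg-db3 holds the leader key
key.touch("pg-db1", now=10)   # denied: the key is still held by pg-db3
# pg-db3 loses its network and stops refreshing; after ttl the key expires:
key.touch("pg-db1", now=35)   # succeeds: pg-db1 can now become the leader
print(key.owner)              # pg-db1
```

The point of the sketch is only the timing: as long as the leader keeps refreshing within ttl, no failover happens; once refreshes stop, a replica can acquire the key after expiry.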
I have made one more cluster with Patroni 2.1.1 and there indeed everything works as expected. But this one is 2.1.3 and basically it does not work as expected. I guess the Python traceback hints at the cause of the problem - I do not get such a traceback on 2.1.1.
Try opening an issue on the Patroni project repository.
Hello, I have created a cluster of 3 nodes: 3 etcd members, 3 PostgreSQL 14.5 instances, and 3 HAProxy nodes. I'm getting the following error. I set everything up again from scratch because I thought the problem was related to my setup, but I get the same error again. How can I troubleshoot this etcd synchronization problem?
etcd service log:
Nov 11 12:50:18 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:22 etcd[1234]: f067dab16206035b [logterm: 44, index: 11008, vote: fff1b3af9b1bfc49] ignored MsgVote from be385cae4113fc0b [logterm: 44, index>
Nov 11 12:50:22 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:25 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:29 etcd[1234]: f067dab16206035b [logterm: 44, index: 11008, vote: fff1b3af9b1bfc49] ignored MsgVote from be385cae4113fc0b [logterm: 44, index>
Nov 11 12:50:32 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:35 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:38 etcd[1234]: f067dab16206035b [logterm: 44, index: 11008, vote: fff1b3af9b1bfc49] ignored MsgVote from be385cae4113fc0b [logterm: 44, index>
Nov 11 12:50:42 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:45 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:47 etcd[1234]: f067dab16206035b [logterm: 44, index: 11008, vote: fff1b3af9b1bfc49] ignored MsgVote from be385cae4113fc0b [logterm: 44, index>
Nov 11 12:50:52 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:50:53 etcd[1234]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.88s]
Nov 11 12:50:55 etcd[1234]: f067dab16206035b [logterm: 44, index: 11008, vote: fff1b3af9b1bfc49] ignored MsgVote from be385cae4113fc0b [logterm: 44, index>
Nov 11 12:51:02 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:51:03 etcd[1234]: f067dab16206035b [logterm: 44, index: 11414, vote: fff1b3af9b1bfc49] ignored MsgVote from be385cae4113fc0b [logterm: 44, index>
Nov 11 12:51:04 etcd[1234]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.93s]
Nov 11 12:51:09 etcd[1234]: f067dab16206035b [logterm: 44, index: 11414, vote: fff1b3af9b1bfc49] ignored MsgVote from fff1b3af9b1bfc49 [logterm: 44, index>
Nov 11 12:51:09 etcd[1234]: f067dab16206035b [term: 44] received a MsgApp message with higher term from fff1b3af9b1bfc49 [term: 64]
Nov 11 12:51:09 etcd[1234]: f067dab16206035b became follower at term 64
Nov 11 12:51:12 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:51:13 etcd[1234]: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.83s]
Nov 11 12:51:18 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:51:22 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:51:28 etcd[1234]: f067dab16206035b [logterm: 64, index: 11415, vote: 0] ignored MsgVote from be385cae4113fc0b [logterm: 64, index: 11458] at ter>
Nov 11 12:51:36 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Nov 11 12:51:37 etcd[1234]: f067dab16206035b [logterm: 64, index: 11415, vote: 0] ignored MsgVote from be385cae4113fc0b [logterm: 64, index: 11458] at ter>
Nov 11 12:51:42 etcd[1234]: got unexpected response error (etcdserver: request timed out)
Patroni.yml:
bootstrap:
  method: initdb
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
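For timeouts like these, the Patroni documentation recommends keeping loop_wait + 2 * retry_timeout <= ttl. A quick sanity check of the values above (a sketch, not Patroni code):

```python
def timeouts_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the rule recommended by Patroni: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl

# Values from the patroni.yml above: 10 + 2*10 = 30 <= 30, so they are consistent.
print(timeouts_consistent(ttl=30, loop_wait=10, retry_timeout=10))
```

With these defaults the configuration sits exactly at the boundary; raising retry_timeout without raising ttl would violate the rule.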
hello
Please see a similar issue https://github.com/etcd-io/etcd/issues/11809
And my recommendations - https://github.com/vitabaks/postgresql_cluster#recommendations
Hi.
I set up a cluster of 3 nodes with haproxy, keepalived and etcd - that works. Then I added a node per the documentation with add_pgnode.yml and later set up etcd, haproxy and keepalived on it manually, so now I have 4 nodes in the cluster - that works too. Now I start sending load to the cluster and get very good RPS all around.
Now I disconnect the network on node 4 - within 10 seconds or so the node is removed from the /usr/local/bin/patronictl list postgres-cluster output. Even before those 10 seconds, haproxy stops balancing traffic to the node, I get 0 errors, and the node is cleanly removed from the cluster. If I later turn the network on node 4 back on, it gets in sync and reappears in /usr/local/bin/patronictl list postgres-cluster.
Now one more test: I disconnect the network on node 4 and it is removed from the cluster automatically. Then I disconnect the network on any other read replica - and now things get funny. The node is not removed from the /usr/local/bin/patronictl list postgres-cluster output.
It just sits there with state running even though it is not even on the network.
Things get even funnier if I disconnect the network on the master - it also never gets removed from the cluster, and a new master is not elected.
In both of the latter scenarios there are a lot of errors, and RPS degrades dramatically - to an unusable extent in the master-disconnection case.
Any pointers on how to fix this?
Thanks