vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.

How does the failover really work? #82

Closed valentin-fischer closed 3 years ago

valentin-fischer commented 3 years ago

Hello guys,

I'm doing some testing with my Postgres cluster in which I power off the leader, the replica, or the standby node, one at a time.

During this testing I have seen that there could be some problems with either my config or my cluster...

More exactly, if I am left with only the leader and the standby node, when I power off the leader no failover to the standby node happens. Shouldn't this node become the leader / take over?

In the following case the leader is powered off but is somehow still listed in patronictl as running?

root@db2:~# patronictl list
2020-12-27 21:04:02,289 - ERROR - Failed to get list of machines from http://10.10.11.6:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.10.11.6', port=2379): Max retries exceeded with url: /v2/machines (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd0aa8139a0>, 'Connection to 10.10.11.6 timed out. (connect timeout=1.6666666666666667)'))")
2020-12-27 21:04:03,961 - ERROR - Failed to get list of machines from http://10.10.11.4:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.10.11.4', port=2379): Max retries exceeded with url: /v2/machines (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd0aa813bb0>, 'Connection to 10.10.11.4 timed out. (connect timeout=1.6666666666666667)'))")
+ Cluster: test (6883197882048089956) ---+----+-----------+--------------+
| Member | Host       | Role         | State   | TL | Lag in MB | Tags         |
+--------+------------+--------------+---------+----+-----------+--------------+
| db1    | 10.10.11.4 | Leader       | running |  5 |           | nosync: true |
| db2    | 10.10.11.5 | Sync Standby | running |  5 |         0 |              |
+--------+------------+--------------+---------+----+-----------+--------------+

After I power on the old leader, the roles are switched and it rejoins as a replica...

root@db2:~# patronictl list
2020-12-27 21:11:25,254 - ERROR - Failed to get list of machines from http://10.10.11.6:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.10.11.6', port=2379): Max retries exceeded with url: /v2/machines (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb4c47ab9a0>, 'Connection to 10.10.11.6 timed out. (connect timeout=1.6666666666666667)'))")
+ Cluster: test (6883197882048089956) ---+-----------+
| Member | Host       | Role    | State   | TL | Lag in MB |
+--------+------------+---------+---------+----+-----------+
| db1    | 10.10.11.4 | Replica | running |  5 |         0 |
| db2    | 10.10.11.5 | Leader  | running |  6 |           |
+--------+------------+---------+---------+----+-----------+

Another issue: after I power on the other 2 servers and they join the cluster, both of them are replicas. Shouldn't there still be a sync standby node?

+ Cluster: test (6883197882048089956) ---+-----------+
| Member | Host       | Role    | State   | TL | Lag in MB |
+--------+------------+---------+---------+----+-----------+
| db0    | 10.10.11.6 | Leader  | running |  4 |           |
| db1    | 10.10.11.4 | Replica | running |  4 |         0 |
| db2    | 10.10.11.5 | Replica | running |  4 |         0 |
+--------+------------+---------+---------+----+-----------+

After a reboot of all 3 nodes at the same time, the cluster is back to its initial state:

root@db0:~# patronictl list
+ Cluster: test (6883197882048089956) ---+----+-----------+
| Member | Host       | Role         | State   | TL | Lag in MB |
+--------+------------+--------------+---------+----+-----------+
| db0    | 10.10.11.6 | Replica      | running |  5 |         0 |
| db1    | 10.10.11.4 | Sync Standby | running |  5 |         0 |
| db2    | 10.10.11.5 | Leader       | running |  5 |           |
+--------+------------+--------------+---------+----+-----------+

I would appreciate it very much if someone could explain how the replica and sync standby nodes should behave when the leader is gone.

Thank you!

vitabaks commented 3 years ago

More exactly, if I am left with only the leader and the standby node, when I power off the leader no failover to the standby node happens. Shouldn't this node become the leader / take over?

In the following case the leader is powered off but is somehow still listed in patronictl as running?

Based on the information you provided, I can assume that you have deployed the etcd cluster on the same servers as the Patroni cluster.

2020-12-27 21:04:02,289 - ERROR - Failed to get list of machines from http://10.10.11.6:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.10.11.6', port=2379): Max retries exceeded with url: /v2/machines (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd0aa8139a0>, 'Connection to 10.10.11.6 timed out. (connect timeout=1.6666666666666667)'))")
2020-12-27 21:04:03,961 - ERROR - Failed to get list of machines from http://10.10.11.4:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.10.11.4', port=2379): Max retries exceeded with url: /v2/machines (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fd0aa813bb0>, 'Connection to 10.10.11.4 timed out. (connect timeout=1.6666666666666667)'))")

At this stage, your etcd cluster is unhealthy, because the 10.10.11.4 and 10.10.11.6 nodes are not available.

The Raft consensus algorithm requires a quorum of N/2 + 1 members, so a cluster of 3 nodes survives the failure of only 1 node. With two of the three etcd nodes down there is no quorum, so Patroni cannot update the leader key in the DCS and no failover can happen.
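
As a rough sketch of the arithmetic (not from the original reply), the quorum and fault tolerance for common cluster sizes can be computed like this:

# Raft quorum for an etcd cluster of N members is floor(N/2) + 1;
# the cluster tolerates N minus quorum failures.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
# members=3 quorum=2 tolerated_failures=1  -> losing 2 of 3 etcd nodes loses quorum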

vitabaks commented 3 years ago

(image attachment)

vitabaks commented 3 years ago

Please don't forget to provide the patroni logs.

patroni log:

sudo journalctl -u patroni -n 100 -f

etcd cluster state:

etcdctl cluster-health

valentin-fischer commented 3 years ago

Thank you for the detailed and helpful reply, @vitabaks. Now I understand a bit more about how the election process works and how many nodes are needed to achieve fault tolerance.

I am still unclear about replicas versus sync standbys:

  1. What is the difference between a replica and a sync standby?
  2. Why do 2 nodes become replicas, and why doesn't one of them revert to being a sync standby?

I am recreating the cluster and will add the patroni logs.

valentin-fischer commented 3 years ago

Added logs and info. Any idea why there is no sync standby anymore in the cluster?

root@db2:~# patronictl list
+ Cluster: paperstack (6883197882048089956) ---+-----------+
| Member | Host       | Role    | State   | TL | Lag in MB |
+--------+------------+---------+---------+----+-----------+
| db0    | 10.10.11.4 | Replica | running |  4 |         0 |
| db1    | 10.10.11.6 | Leader  | running |  5 |           |
| db2    | 10.10.11.5 | Replica | running |  4 |         0 |
+--------+------------+---------+---------+----+-----------+
root@db2:~# etcdctl cluster-health
member 5e43a58f8a98b8d3 is healthy: got healthy result from http://10.10.11.4:2379
member a854d21dd4241ccf is healthy: got healthy result from http://10.10.11.5:2379
member d1396512de4977f1 is healthy: got healthy result from http://10.10.11.6:2379
cluster is healthy

root@db2:~# journalctl -u patroni -n 100 -f
-- Logs begin at Mon 2020-12-28 00:13:18 CET. --
Dec 28 00:31:07 db2 patroni[522]: 2020-12-28 00:31:07,569 ERROR: Request to server http://10.10.11.4:2379 failed: MaxRetryError("HTTPConnectionPool(host='10.10.11.4', port=2379): Max retries exceeded with url: /v2/keys/service/paperstack/members/db2 (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))")
Dec 28 00:31:07 db2 patroni[522]: 2020-12-28 00:31:07,570 INFO: Reconnection allowed, looking for another server.
Dec 28 00:31:07 db2 patroni[522]: 2020-12-28 00:31:07,570 ERROR:
Dec 28 00:31:07 db2 patroni[522]: Traceback (most recent call last):
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 597, in urlopen
Dec 28 00:31:07 db2 patroni[522]:     httplib_response = self._make_request(conn, method, url,
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
Dec 28 00:31:07 db2 patroni[522]:     six.raise_from(e, None)
Dec 28 00:31:07 db2 patroni[522]:   File "<string>", line 2, in raise_from
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
Dec 28 00:31:07 db2 patroni[522]:     httplib_response = conn.getresponse()
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/lib/python3.8/http/client.py", line 1347, in getresponse
Dec 28 00:31:07 db2 patroni[522]:     response.begin()
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/lib/python3.8/http/client.py", line 307, in begin
Dec 28 00:31:07 db2 patroni[522]:     version, status, reason = self._read_status()
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/lib/python3.8/http/client.py", line 268, in _read_status
Dec 28 00:31:07 db2 patroni[522]:     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/lib/python3.8/socket.py", line 669, in readinto
Dec 28 00:31:07 db2 patroni[522]:     return self._sock.recv_into(b)
Dec 28 00:31:07 db2 patroni[522]: ConnectionResetError: [Errno 104] Connection reset by peer
Dec 28 00:31:07 db2 patroni[522]: During handling of the above exception, another exception occurred:
Dec 28 00:31:07 db2 patroni[522]: Traceback (most recent call last):
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/etcd.py", line 210, in _do_http_request
Dec 28 00:31:07 db2 patroni[522]:     response = request_executor(method, base_uri + path, **kwargs)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 70, in request
Dec 28 00:31:07 db2 patroni[522]:     return self.request_encode_body(method, url, fields=fields,
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 150, in request_encode_body
Dec 28 00:31:07 db2 patroni[522]:     return self.urlopen(method, url, **extra_kw)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/poolmanager.py", line 324, in urlopen
Dec 28 00:31:07 db2 patroni[522]:     response = conn.urlopen(method, u.request_uri, **kw)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 637, in urlopen
Dec 28 00:31:07 db2 patroni[522]:     retries = retries.increment(method, url, error=e, _pool=self,
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 399, in increment
Dec 28 00:31:07 db2 patroni[522]:     raise MaxRetryError(_pool, url, error or ResponseError(cause))
Dec 28 00:31:07 db2 patroni[522]: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.11.4', port=2379): Max retries exceeded with url: /v2/keys/service/paperstack/members/db2 (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
Dec 28 00:31:07 db2 patroni[522]: During handling of the above exception, another exception occurred:
Dec 28 00:31:07 db2 patroni[522]: Traceback (most recent call last):
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/etcd.py", line 562, in wrapper
Dec 28 00:31:07 db2 patroni[522]:     retval = func(self, *args, **kwargs) is not None
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/etcd.py", line 640, in touch_member
Dec 28 00:31:07 db2 patroni[522]:     return self._client.set(self.member_path, data, None if permanent else self._ttl)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/etcd/client.py", line 721, in set
Dec 28 00:31:07 db2 patroni[522]:     return self.write(key, value, ttl=ttl)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/etcd/client.py", line 500, in write
Dec 28 00:31:07 db2 patroni[522]:     response = self.api_execute(path, method, params=params)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/etcd.py", line 255, in api_execute
Dec 28 00:31:07 db2 patroni[522]:     response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
Dec 28 00:31:07 db2 patroni[522]:   File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/etcd.py", line 229, in _do_http_request
Dec 28 00:31:07 db2 patroni[522]:     raise etcd.EtcdException('{0} {1} request failed'.format(method, path))
Dec 28 00:31:07 db2 patroni[522]: etcd.EtcdException: PUT /v2/keys/service/paperstack/members/db2 request failed
Dec 28 00:31:07 db2 patroni[522]: 2020-12-28 00:31:07,577 INFO: following a different leader because i am not the healthiest node
Dec 28 00:31:17 db2 patroni[522]: 2020-12-28 00:31:17,545 INFO: Lock owner: db1; I am db2
Dec 28 00:31:17 db2 patroni[522]: 2020-12-28 00:31:17,545 INFO: does not have lock
Dec 28 00:31:17 db2 patroni[522]: 2020-12-28 00:31:17,551 INFO: Local timeline=3 lsn=0/A000000
Dec 28 00:31:17 db2 patroni[522]: 2020-12-28 00:31:17,683 INFO: master_timeline=5
Dec 28 00:31:17 db2 patroni[522]: 2020-12-28 00:31:17,709 INFO: master: history=1        0/9000000        no recovery target specified
Dec 28 00:31:17 db2 patroni[522]: 2        0/9000150        no recovery target specified
Dec 28 00:31:17 db2 patroni[522]: 3        0/A0000A0        no recovery target specified
Dec 28 00:31:17 db2 patroni[522]: 4        0/B0000A0        no recovery target specified
Dec 28 00:31:17 db2 patroni[1103]: server signaled
Dec 28 00:31:17 db2 patroni[522]: 2020-12-28 00:31:17,798 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:31:23 db2 patroni[522]: 2020-12-28 00:31:23,050 INFO: Lock owner: db1; I am db2
Dec 28 00:31:23 db2 patroni[522]: 2020-12-28 00:31:23,050 INFO: does not have lock
Dec 28 00:31:23 db2 patroni[522]: 2020-12-28 00:31:23,056 INFO: Local timeline=4 lsn=0/B000000
Dec 28 00:31:23 db2 patroni[522]: 2020-12-28 00:31:23,189 INFO: master_timeline=5
Dec 28 00:31:23 db2 patroni[522]: 2020-12-28 00:31:23,214 INFO: master: history=1        0/9000000        no recovery target specified
Dec 28 00:31:23 db2 patroni[522]: 2        0/9000150        no recovery target specified
Dec 28 00:31:23 db2 patroni[522]: 3        0/A0000A0        no recovery target specified
Dec 28 00:31:23 db2 patroni[522]: 4        0/B0000A0        no recovery target specified
Dec 28 00:31:23 db2 patroni[522]: 2020-12-28 00:31:23,279 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:31:33 db2 patroni[522]: 2020-12-28 00:31:33,004 INFO: Lock owner: db1; I am db2
Dec 28 00:31:33 db2 patroni[522]: 2020-12-28 00:31:33,004 INFO: does not have lock
Dec 28 00:31:33 db2 patroni[522]: 2020-12-28 00:31:33,017 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:31:43 db2 patroni[522]: 2020-12-28 00:31:43,006 INFO: Lock owner: db1; I am db2
Dec 28 00:31:43 db2 patroni[522]: 2020-12-28 00:31:43,006 INFO: does not have lock
Dec 28 00:31:43 db2 patroni[522]: 2020-12-28 00:31:43,018 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:31:53 db2 patroni[522]: 2020-12-28 00:31:53,003 INFO: Lock owner: db1; I am db2
Dec 28 00:31:53 db2 patroni[522]: 2020-12-28 00:31:53,003 INFO: does not have lock
Dec 28 00:31:53 db2 patroni[522]: 2020-12-28 00:31:53,017 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:32:03 db2 patroni[522]: 2020-12-28 00:32:03,003 INFO: Lock owner: db1; I am db2
Dec 28 00:32:03 db2 patroni[522]: 2020-12-28 00:32:03,003 INFO: does not have lock
Dec 28 00:32:03 db2 patroni[522]: 2020-12-28 00:32:03,017 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:32:13 db2 patroni[522]: 2020-12-28 00:32:13,008 INFO: Lock owner: db1; I am db2
Dec 28 00:32:13 db2 patroni[522]: 2020-12-28 00:32:13,008 INFO: does not have lock
Dec 28 00:32:13 db2 patroni[522]: 2020-12-28 00:32:13,029 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:32:23 db2 patroni[522]: 2020-12-28 00:32:23,006 INFO: Lock owner: db1; I am db2
Dec 28 00:32:23 db2 patroni[522]: 2020-12-28 00:32:23,006 INFO: does not have lock
Dec 28 00:32:23 db2 patroni[522]: 2020-12-28 00:32:23,022 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:32:33 db2 patroni[522]: 2020-12-28 00:32:33,005 INFO: Lock owner: db1; I am db2
Dec 28 00:32:33 db2 patroni[522]: 2020-12-28 00:32:33,005 INFO: does not have lock
Dec 28 00:32:33 db2 patroni[522]: 2020-12-28 00:32:33,019 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:32:43 db2 patroni[522]: 2020-12-28 00:32:43,004 INFO: Lock owner: db1; I am db2
Dec 28 00:32:43 db2 patroni[522]: 2020-12-28 00:32:43,004 INFO: does not have lock
Dec 28 00:32:43 db2 patroni[522]: 2020-12-28 00:32:43,018 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:32:53 db2 patroni[522]: 2020-12-28 00:32:53,005 INFO: Lock owner: db1; I am db2
Dec 28 00:32:53 db2 patroni[522]: 2020-12-28 00:32:53,005 INFO: does not have lock
Dec 28 00:32:53 db2 patroni[522]: 2020-12-28 00:32:53,018 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:33:03 db2 patroni[522]: 2020-12-28 00:33:03,015 INFO: Lock owner: db1; I am db2
Dec 28 00:33:03 db2 patroni[522]: 2020-12-28 00:33:03,015 INFO: does not have lock
Dec 28 00:33:03 db2 patroni[522]: 2020-12-28 00:33:03,029 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:33:13 db2 patroni[522]: 2020-12-28 00:33:13,005 INFO: Lock owner: db1; I am db2
Dec 28 00:33:13 db2 patroni[522]: 2020-12-28 00:33:13,006 INFO: does not have lock
Dec 28 00:33:13 db2 patroni[522]: 2020-12-28 00:33:13,019 INFO: no action.  i am a secondary and i am following a leader
Dec 28 00:33:23 db2 patroni[522]: 2020-12-28 00:33:23,006 INFO: Lock owner: db1; I am db2
Dec 28 00:33:23 db2 patroni[522]: 2020-12-28 00:33:23,006 INFO: does not have lock
Dec 28 00:33:23 db2 patroni[522]: 2020-12-28 00:33:23,020 INFO: no action.  i am a secondary and i am following a leader

vitabaks commented 3 years ago

What is the difference between a replica and a sync standby ?

replica is a standby server in asynchronous replication mode; sync standby is a standby server in synchronous replication mode (when synchronous_mode: true).

See here for more details.

Why do 2 nodes become replicas, and why doesn't one of them revert to being a sync standby?

If you specified synchronous_node_count: 1, one of the standby servers should automatically be added to synchronous_standby_names. That is, the standby server must switch to synchronous replication mode after it receives all the latest WAL changes from the Primary (leader) server.

If you look at the postgres logs, you will see the following entries:

2020-12-28 12:01:56 MSK [96968-95]  LOG:  parameter "synchronous_standby_names" removed from configuration file, reset to default
2020-12-28 12:02:26 MSK [96968-96]  LOG:  received SIGHUP, reloading configuration files
2020-12-28 12:02:26 MSK [96968-97]  LOG:  parameter "synchronous_standby_names" changed to "pgnode01"
2020-12-28 12:02:34 MSK [100961-1] 10.128.64.140(2800) replicator@[unknown] LOG:  standby "pgnode01" is now a synchronous standby with priority 1

If this does not happen, then the standby server is not yet up to date, there is a replication lag.
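
As a generic cross-check (not part of the original reply), you can look at the replication and sync state directly on the leader; this assumes local access as the postgres OS user:

# Connected standbys, their sync_state and replay lag in bytes (PostgreSQL 10+)
sudo -u postgres psql -x -c "SELECT application_name, state, sync_state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"

# Which standby Patroni currently lists as synchronous
sudo -u postgres psql -c "SHOW synchronous_standby_names;"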

Any idea why there is no sync standby anymore in the cluster?

patronictl displays the state of the cluster at a given moment. At this point in time, the leader is on TL 5 (timeline), while the standby servers are still on TL 4.

To monitor the cluster state dynamically, try patronictl with the -W option (automatic screen refresh every 2 seconds):

patronictl list -t -W

P.S. Don't forget to check the postgres logs as well.
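
For reference, the log location follows from the cluster configuration shown further down in this thread (log_directory: /var/log/postgresql, log_filename: postgresql-%a.log), so something like:

# List the per-weekday log files and follow the newest one
ls -lt /var/log/postgresql/
tail -n 100 -f /var/log/postgresql/postgresql-Mon.log   # adjust the weekday suffix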

valentin-fischer commented 3 years ago

Hi @vitabaks

Thank you once again for the detailed explanation.

You are correct; it seems the standbys somehow think they are on another timeline, and the leader is telling them that the WAL files they request have already been removed.

Any ideas on how to recover from this?

This is on the leader...

2020-12-28 11:01:35 CET [1077-1] 10.10.11.5(29310) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:39 CET [1085-1] 10.10.11.4(39488) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:39 CET [1086-1] 10.10.11.5(29322) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:44 CET [1092-1] 10.10.11.4(39502) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:44 CET [1093-1] 10.10.11.5(29340) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:49 CET [1098-1] 10.10.11.4(39514) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:49 CET [1100-1] 10.10.11.5(29352) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:54 CET [1109-1] 10.10.11.4(39528) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:54 CET [1111-1] 10.10.11.5(29366) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:59 CET [1118-1] 10.10.11.4(39544) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:01:59 CET [1119-1] 10.10.11.5(29378) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:04 CET [1124-1] 10.10.11.4(39558) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:04 CET [1126-1] 10.10.11.5(29396) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:09 CET [1132-1] 10.10.11.4(39570) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:09 CET [1133-1] 10.10.11.5(29408) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:14 CET [1139-1] 10.10.11.4(39584) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:14 CET [1140-1] 10.10.11.5(29422) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:19 CET [1145-1] 10.10.11.4(39600) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:19 CET [1147-1] 10.10.11.5(29434) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:24 CET [1152-1] 10.10.11.4(39614) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:24 CET [1154-1] 10.10.11.5(29452) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:29 CET [1160-1] 10.10.11.4(39626) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:29 CET [1161-1] 10.10.11.5(29464) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:34 CET [1167-1] 10.10.11.4(39640) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:34 CET [1168-1] 10.10.11.5(29478) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:39 CET [1174-1] 10.10.11.4(39656) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:02:39 CET [1175-1] 10.10.11.5(29490) replicator@[unknown] ERROR:  requested WAL segment 00000004000000000000000B has already been removed

This is on one of the replicas....

2020-12-28 11:03:29 CET [1499-1]  LOG:  started streaming WAL from primary at 0/B000000 on timeline 4
2020-12-28 11:03:29 CET [1499-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000004000000000000000B has already been removed
2020-12-28 11:03:29.928 P00   INFO: archive-get command begin 2.31: [00000007.history, pg_wal/RECOVERYHISTORY] --exec-id=1501-ca3f13d8 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/appl/postgres/13/main --process-max=2 --repo1-path=/ --repo1-s3-bucket=postgres-backup --repo1-s3-endpoint=s3.nl-ams.scw.cloud --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=nl-ams --no-repo1-s3-verify-tls --repo1-type=s3 --stanza=s3_stanza
2020-12-28 11:03:30.095 P00   INFO: unable to find 00000007.history in the archive
2020-12-28 11:03:30.095 P00   INFO: archive-get command end: completed successfully (168ms)

vitabaks commented 3 years ago

Any ideas on how to recover from this?

wal_keep_segments

Specifies the minimum number of past log file segments kept in the pg_wal directory, in case a standby server needs to fetch them for streaming replication. Each segment is normally 16 megabytes. If a standby server connected to the sending server falls behind by more than wal_keep_segments segments, the sending server might remove a WAL segment still needed by the standby, in which case the replication connection will be terminated.

requested WAL segment 00000004000000000000000B has already been removed

By default we set wal_keep_segments=130. On servers with a low write load this should be sufficient. I would recommend that you increase the value of wal_keep_segments.

In PostgreSQL 13, wal_keep_segments is replaced by wal_keep_size (specified in megabytes; with the default 16 MB segment size, 130 segments correspond to roughly 2 GB).

However, the standby server can recover by fetching the segment from archive, if WAL archiving is in use.

I see you are using pgbackrest ("archive-get" in restore_command), but apparently your S3 bucket does not have the necessary WAL archives, or your stanza contains archives of another cluster.
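
A quick way to see which WAL segments the repository actually holds (same stanza name as used elsewhere in this thread):

# "wal archive min/max" in the output shows the oldest and newest archived segments
sudo -u postgres pgbackrest --stanza=s3_stanza info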

valentin-fischer commented 3 years ago

I've edited the value of wal_keep_segments and now I notice that I have some inconsistency...

In my Ansible files the value is correct -> vars/main.yml: - {option: "wal_keep_segments", value: "230"}. When I do patronictl show-config I see a totally different value.

Another one is archive_command: it is currently set to "cd ." whereas the one that I configured is "pgbackrest --stanza=s3_stanza archive-push %p".

root@db2:~# patronictl show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_command: cd .
    archive_mode: true
    archive_timeout: 1800s
    auto_explain.log_analyze: true
    auto_explain.log_buffers: true
    auto_explain.log_min_duration: 10s
    auto_explain.log_nested_statements: true
    auto_explain.log_timing: false
    auto_explain.log_triggers: true
    auto_explain.log_verbose: true
    autovacuum: true
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_naptime: 1s
    autovacuum_vacuum_cost_delay: 20
    autovacuum_vacuum_cost_limit: 200
    autovacuum_vacuum_scale_factor: 0.01
    checkpoint_completion_target: 0.9
    checkpoint_timeout: 15min
    default_statistics_target: 1000
    effective_cache_size: 4GB
    effective_io_concurrency: 200
    hot_standby: true
    hot_standby_feedback: true
    huge_pages: try
    idle_in_transaction_session_timeout: 600000
    log_checkpoints: true
    log_directory: /var/log/postgresql
    log_filename: postgresql-%a.log
    log_line_prefix: '%t [%p-%l] %r %q%u@%d '
    log_lock_waits: true
    log_rotation_age: 1d
    log_rotation_size: 0
    log_temp_files: 0
    log_truncate_on_rotation: true
    logging_collector: true
    maintenance_work_mem: 256MB
    max_connections: 300
    max_files_per_process: 4096
    max_locks_per_transaction: 128
    max_parallel_maintenance_workers: 2
    max_parallel_workers: 24
    max_parallel_workers_per_gather: 4
    max_prepared_transactions: 0
    max_replication_slots: 10
    max_standby_streaming_delay: 30s
    max_wal_senders: 10
    max_wal_size: 4GB
    max_worker_processes: 24
    min_wal_size: 2GB
    old_snapshot_threshold: 60min
    pg_stat_statements.max: 10000
    pg_stat_statements.save: false
    pg_stat_statements.track: all
    random_page_cost: 1.1
    seq_page_cost: 1
    shared_buffers: 512MB
    shared_preload_libraries: pg_stat_statements,auto_explain
    superuser_reserved_connections: 10
    synchronous_commit: true
    track_activities: true
    track_counts: true
    track_functions: all
    track_io_timing: true
    wal_buffers: 32MB
    wal_keep_segments: 130
    wal_level: replica
    wal_log_hints: true
    wal_receiver_status_interval: 10s
    work_mem: 512MB
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: true
synchronous_mode_strict: true
synchronous_node_count: 1
ttl: 30

The backup is new and full, and it should contain the correct files with the correct config...

postgres@db2:~$ pgbackrest info
stanza: s3_stanza
    status: ok
    cipher: none

    db (current)
        wal archive min/max (13-1): 000000010000000000000001/00000003000000000000000E

        full backup: 20201228-184418F
            timestamp start/stop: 2020-12-28 18:44:18 / 2020-12-28 18:45:59
            wal start/stop: 00000003000000000000000E / 00000003000000000000000E
            database size: 30.8MB, backup size: 30.8MB
            repository size: 3.8MB, repository backup size: 3.8MB

vitabaks commented 3 years ago

In my Ansible files the value is correct -> vars/main.yml: - {option: "wal_keep_segments", value: "230"}. When I do patronictl show-config I see a totally different value.

The postgresql parameters in the playbook variable files (vars/main.yml, ...) are applied only during the initial deployment of the cluster. You can change the parameters later using patronictl edit-config.
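
For example (a sketch; the non-interactive flags are assumed to be available in your patronictl version):

# Change a single PostgreSQL parameter in the dynamic (DCS) configuration
patronictl edit-config -p 'wal_keep_segments=230' --force

# Verify the change
patronictl show-config | grep wal_keep_segments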

I am recreating the cluster and will add the patroni logs. ... The backup is new and full, and it should contain the correct files with the correct config...

When recreating the cluster, did you restore from a backup? That is, which value did you specify for the patroni_cluster_bootstrap_method variable: "initdb" or "pgbackrest"?

valentin-fischer commented 3 years ago

The backup has been done after I did an edit-config and changed the needed parameters.

The cluster is deployed each time via restore from backup.

vars/main.yml:patroni_cluster_bootstrap_method: "pgbackrest" # or "wal-g", "pgbackrest"

vitabaks commented 3 years ago

vars/main.yml:patroni_cluster_bootstrap_method: "pgbackrest"

Ok. Please run the following commands on one of the postgres servers:

su postgres
mkdir -p /tmp/rst

pgbackrest --stanza=s3_stanza --db-include=postgres --pg1-path=/tmp/rst --delta restore --log-level-console=info

If the patroni.dynamic.json file is present:

cat /tmp/rst/patroni.dynamic.json | jq .

valentin-fischer commented 3 years ago

It seems that nothing is being restored.

root@db2:~# su - postgres
postgres@db2:~$ mkdir -p /tmp/rst
postgres@db2:~$ pgbackrest --stanza=s3_stanza --db-include=postgres --pg1-path=/tmp/rst --delta restore --log-level-console=info
2020-12-28 20:54:02.244 P00   INFO: restore command begin 2.31: --db-include=postgres --delta --exec-id=20730-d384833e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/tmp/rst --process-max=2 --recovery-option=recovery_target_action=promote --repo1-path=/ --repo1-s3-bucket=postgres-backup --repo1-s3-endpoint=s3.nl-ams.scw.cloud --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=nl-ams --no-repo1-s3-verify-tls --repo1-type=s3 --stanza=s3_stanza
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/tmp/rst' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2020-12-28 20:54:02.680 P00   INFO: restore backup set 20201228-184418F
2020-12-28 20:54:02.680 P00   INFO: remap data directory to '/tmp/rst'
ERROR: [081]: system databases (template0, postgres, etc.) are included by default
2020-12-28 20:54:02.682 P00   INFO: restore command end: aborted with exception [081]
postgres@db2:~$ cat /tmp/rst/patroni.dynamic.json | jq .
cat: /tmp/rst/patroni.dynamic.json: No such file or directory
postgres@db2:~$ ls /tmp/rst

vitabaks commented 3 years ago

ERROR: [081]: system databases (template0, postgres, etc.) are included by default

Try without --db-include:

pgbackrest --stanza=s3_stanza --pg1-path=/tmp/rst --delta restore --log-level-console=info

valentin-fischer commented 3 years ago

Now it worked.

2020-12-28 21:06:55.055 P02   INFO: restore file /tmp/rst/base/1/13266 (0B, 100%)
2020-12-28 21:06:55.056 P00   INFO: write updated /tmp/rst/postgresql.auto.conf
2020-12-28 21:06:55.075 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2020-12-28 21:06:55.078 P00   INFO: restore command end: completed successfully (344605ms)
postgres@db2:~$ cat /tmp/rst/patroni.dynamic.json | jq .
{
  "loop_wait": 10,
  "master_start_timeout": 300,
  "maximum_lag_on_failover": 1048576,
  "postgresql": {
    "parameters": {
      "archive_command": "pgbackrest --stanza=s3_stanza archive-push %p",
      "archive_mode": true,
      "archive_timeout": "1800s",
      "auto_explain.log_analyze": true,
      "auto_explain.log_buffers": true,
      "auto_explain.log_min_duration": "10s",
      "auto_explain.log_nested_statements": true,
      "auto_explain.log_timing": false,
      "auto_explain.log_triggers": true,
      "auto_explain.log_verbose": true,
      "autovacuum": true,
      "autovacuum_analyze_scale_factor": 0.02,
      "autovacuum_max_workers": 5,
      "autovacuum_naptime": "1s",
      "autovacuum_vacuum_cost_delay": 20,
      "autovacuum_vacuum_cost_limit": 200,
      "autovacuum_vacuum_scale_factor": 0.01,
      "checkpoint_completion_target": 0.9,
      "checkpoint_timeout": "15min",
      "default_statistics_target": 1000,
      "effective_cache_size": "4GB",
      "effective_io_concurrency": 200,
      "hot_standby": true,
      "hot_standby_feedback": true,
      "huge_pages": "try",
      "idle_in_transaction_session_timeout": 600000,
      "log_checkpoints": true,
      "log_directory": "/var/log/postgresql",
      "log_filename": "postgresql-%a.log",
      "log_line_prefix": "%t [%p-%l] %r %q%u@%d ",
      "log_lock_waits": true,
      "log_rotation_age": "1d",
      "log_rotation_size": 0,
      "log_temp_files": 0,
      "log_truncate_on_rotation": true,
      "logging_collector": true,
      "maintenance_work_mem": "256MB",
      "max_connections": 300,
      "max_files_per_process": 4096,
      "max_locks_per_transaction": 128,
      "max_parallel_maintenance_workers": 2,
      "max_parallel_workers": 24,
      "max_parallel_workers_per_gather": 4,
      "max_prepared_transactions": 0,
      "max_replication_slots": 10,
      "max_standby_streaming_delay": "30s",
      "max_wal_senders": 10,
      "max_wal_size": "4GB",
      "max_worker_processes": 24,
      "min_wal_size": "2GB",
      "old_snapshot_threshold": "60min",
      "pg_stat_statements.max": 10000,
      "pg_stat_statements.save": false,
      "pg_stat_statements.track": "all",
      "random_page_cost": 1.1,
      "seq_page_cost": 1,
      "shared_buffers": "512MB",
      "shared_preload_libraries": "pg_stat_statements,auto_explain",
      "superuser_reserved_connections": 10,
      "synchronous_commit": true,
      "track_activities": true,
      "track_counts": true,
      "track_functions": "all",
      "track_io_timing": true,
      "wal_buffers": "32MB",
      "wal_keep_segments": 230,
      "wal_level": "replica",
      "wal_log_hints": true,
      "wal_receiver_status_interval": "10s",
      "work_mem": "512MB"
    },
    "use_pg_rewind": true,
    "use_slots": true
  },
  "retry_timeout": 10,
  "synchronous_mode": true,
  "synchronous_mode_strict": true,
  "synchronous_node_count": 1,
  "ttl": 30
}

vitabaks commented 3 years ago

Another one is archive_command: it is currently set to "cd ." whereas the one that I configured is "pgbackrest --stanza=s3_stanza archive-push %p".

By default, we disable archive_command after restoring from a backup.

This is necessary to avoid conflicts in the WAL archive storage when multiple clusters try to send WALs to the same storage, for example when you make multiple clones of a cluster from one backup.

See "Steps to Restore Using pgBackRest:" https://github.com/vitabaks/postgresql_cluster#restore-and-cloning

example:

TASK [Check the patroni.dynamic.json exists] ************************************************************************************************************************************************************
ok: [10.128.64.140]

TASK [Edit patroni.dynamic.json | disable archive_command (if enabled)] *********************************************************************************************************************************
changed: [10.128.64.140]

It is worth adding the ability to control this behavior. Thank you!
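
If WAL archiving is needed again on an already-restored cluster, one option (a sketch, reusing the stanza name from this thread) is to put the command back through the dynamic configuration:

# Opens the DCS configuration in $EDITOR; under postgresql.parameters set, for example:
#   archive_command: pgbackrest --stanza=s3_stanza archive-push %p
# Patroni reloads the setting on all nodes after you save.
patronictl edit-config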

vitabaks commented 3 years ago

It is worth adding the ability to control this behavior. Thank you!

Done in 61bf839777987d999594fb0bf30ab9cc53f5a5c6. Please clone this playbook again.

@servergeeks

I've edited the value of wal_keep_segments and now I notice that I have some inconsistency...

In my Ansible files the value is correct -> vars/main.yml: - {option: "wal_keep_segments", value: "230"}. When I do patronictl show-config I see a totally different value.

Another one is archive_command: it is currently set to "cd ." whereas the one that I configured is "pgbackrest --stanza=s3_stanza archive-push %p".

If you need to apply the parameters specified in vars/main.yml and keep archive_command enabled after restoring from a backup, specify these options:

disable_archive_command: false
keep_patroni_dynamic_json: false

Re-run the playbook to deploy a new cluster, or run the playbook with these tags to just restore the cluster and update the patroni.yml configuration file:

ansible-playbook deploy_pgcluster.yml --tags point_in_time_recovery,patroni_conf

valentin-fischer commented 3 years ago

Hi @vitabaks ,

Thank you for the fix & support. I was able to update & redeploy a cluster with the updated playbook and everything is configured as expected.

Thank you & Happy New Year!