vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License
1.29k stars 352 forks source link

How to configure backup properly with pgbackrest #67

Closed valentin-fischer closed 3 years ago

valentin-fischer commented 3 years ago

Hi there,

I have configured pgbackrest backup with s3 as storage and trying to deploy the cluster from zero again. For some reason after enabling pgbackrest the cluster deployment fails.

It seems to try and restore from backup, although there was no backup yet as this is a new cluster.

Is there an issue in the playbook or am I doing something wrong?

Thanks!

...
TASK [patroni : Prepare | Make sure the ansible required python library is exist] *****************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.7] => (item=pexpect)
ok: [10.10.10.9] => (item=pexpect)
ok: [10.10.10.10] => (item=pexpect)
ok: [10.10.10.7] => (item=ruamel.yaml)
ok: [10.10.10.9] => (item=ruamel.yaml)
ok: [10.10.10.10] => (item=ruamel.yaml)

TASK [patroni : Stop patroni service on the Replica servers (if running)] *************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.7]
ok: [10.10.10.9]

TASK [patroni : Stop patroni service on the Master server (if running)] ***************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.10]

TASK [patroni : Remove patroni cluster "test" from DCS (if exist)] ***************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.10]

TASK [patroni : Run "/usr/bin/pgbackrest --stanza=s3_stanza --delta restore" on Master] ***********************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.10]

TASK [patroni : Run "/usr/bin/pgbackrest --stanza=s3_stanza --delta restore" on Replica] **********************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.7]
changed: [10.10.10.9]
FAILED - RETRYING: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: Waiting for restore from backup (2879 retries left).
FAILED - RETRYING: Waiting for restore from backup (2879 retries left).
FAILED - RETRYING: Waiting for restore from backup (2879 retries left).
vitabaks commented 3 years ago

It seems to try and restore from backup, although there was no backup yet as this is a new cluster.

You answered your own question. You don't have a backup.

You need to create a backup copy of your database:

pgbackrest --stanza=<stanza-name> --type=full backup

Then you can use it to deploy a new cluster.

See pgBackRest User Guide https://pgbackrest.org/user-guide-index.html

The playbook does not currently configure a backup (although it can prepare the configuration), the task of this playbook is to deploy a сluster, including deploying from a (pre-existing) backup.

valentin-fischer commented 3 years ago

Hi @vitabaks

Thank you for the fast reply. How can I take a backup as I don't have a cluster/database yet ?

The pgbackrest tries to connect to postgres , and as there is no cluster up, there is no way for the tasks/playbook to continue.

br

vitabaks commented 3 years ago

How can I take a backup as I don't have a cluster/database yet ?

You selected bootstrap method from the pgbackrest backup, but you don't have it. It is not right. Use patroni_cluster_bootstrap_method: "initdb"

valentin-fischer commented 3 years ago

Ok, it seems to have progressed. Now the next failure is related to patroni it seems.

TASK [patroni : Start patroni service on Replica servers] *****************************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.9]
changed: [10.10.10.6]

TASK [patroni : Wait for port 8008 to become open on the host] ************************************************************************************************************************************************************************************************************************************************************************************************************
fatal: [10.10.10.9]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 10.10.10.9:8008"}
fatal: [10.10.10.6]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 10.10.10.6:8008"}

NO MORE HOSTS LEFT ********************************************************************************************************************************************************************************************************************************************************************************************************************************************************

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
10.10.10.6                 : ok=98   changed=57   unreachable=0    failed=1    skipped=160  rescued=0    ignored=0
10.10.10.8                 : ok=104  changed=60   unreachable=0    failed=0    skipped=160  rescued=0    ignored=0
10.10.10.9                 : ok=98   changed=58   unreachable=0    failed=1    skipped=160  rescued=0    ignored=0
localhost                  : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0`

When I look in the systemd related service the error is the following:

Oct 12 15:14:10 db1 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - patroni.
Oct 12 15:14:10 db1 patroni[14961]: 2020-10-12 15:14:10,905 INFO: Selected new etcd server http://10.10.10.9:2379
Oct 12 15:14:10 db1 patroni[14961]: 2020-10-12 15:14:10,910 INFO: No PostgreSQL configuration items changed, nothing to reload.
Oct 12 15:14:10 db1 patroni[14961]: 2020-10-12 15:14:10,949 CRITICAL: system ID mismatch, node db1 belongs to a different cluster: 6882721315206687972 != 6882721284166605269
Oct 12 15:14:11 db1 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 15:14:11 db1 systemd[1]: patroni.service: Failed with result 'exit-code'.

It seems to think this node belongs to some other cluster (no idea why).

vitabaks commented 3 years ago

If you use patroni_cluster_bootstrap_method: "initdb" Make sure you don't have pgbackrest specified in the patroni_create_replica_methods variable.

This is because automation does not clear the PGDATA directory before starting Patroni if ​​pgbackrest is specified in patroni_create_replica_methods.

This is implemented to support --delta restore.

valentin-fischer commented 3 years ago

Thank you for fast replies.

After changing to the settings you mentioned, it started in the end but after a looooooooooooooot of retries. More exactly 31 of retries.

Is this normal ?

If I look into the patroni log of one of the servers it's full with something along the lines

Oct 12 17:28:38 db1 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - patroni.
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,463 INFO: Selected new etcd server http://10.10.10.8:2379
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,470 INFO: No PostgreSQL configuration items changed, nothing to reload.
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,496 INFO: Lock owner: db0; I am db1
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,514 INFO: trying to bootstrap from leader 'db0'
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,558 INFO: replica has been created using basebackup
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,560 INFO: bootstrapped from leader 'db0'
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,746 INFO: postmaster pid=15065
Oct 12 17:28:39 db1 patroni[15066]: /var/run/postgresql:5432 - no response
Oct 12 17:28:39 db1 patroni[15065]: 2020-10-12 17:28:39 CEST [15065-1]  LOG:  redirecting log output to logging collector process
Oct 12 17:28:39 db1 patroni[15065]: 2020-10-12 17:28:39 CEST [15065-2]  HINT:  Future log output will appear in directory "/var/log/postgresql".
Oct 12 17:28:40 db1 patroni[15071]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:40 db1 patroni[15073]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:41 db1 patroni[15075]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:42 db1 patroni[15077]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:43 db1 patroni[15079]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:44 db1 patroni[15081]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:45 db1 patroni[15083]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:46 db1 patroni[15085]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:47 db1 patroni[15087]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:48 db1 patroni[15030]: 2020-10-12 17:28:48,487 INFO: Lock owner: db0; I am db1
Oct 12 17:28:48 db1 patroni[15030]: 2020-10-12 17:28:48,504 INFO: bootstrap from leader 'db0' in progress
Oct 12 17:28:48 db1 patroni[15089]: /var/run/postgresql:5432 - rejecting connections
..............................
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,076 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,230 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,234 WARNING: Retry got exception: 'connection problems'
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,254 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:29:42 db1 patroni[15235]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,670 INFO: Lock owner: db0; I am db1
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,670 INFO: Still starting up as a standby.
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,671 INFO: Lock owner: db0; I am db1
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,671 INFO: does not have lock
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,671 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:43 db1 patroni[15030]: 2020-10-12 17:29:43,467 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:43 db1 patroni[15030]: 2020-10-12 17:29:43,472 WARNING: Retry got exception: 'connection problems'
Oct 12 17:29:43 db1 patroni[15030]: 2020-10-12 17:29:43,501 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:29:52 db1 patroni[15256]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,680 INFO: Lock owner: db0; I am db1
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: Still starting up as a standby.
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: Lock owner: db0; I am db1
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: does not have lock
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,998 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:53 db1 patroni[15030]: 2020-10-12 17:29:53,002 WARNING: Retry got exception: 'connection problems'
Oct 12 17:29:53 db1 patroni[15030]: 2020-10-12 17:29:53,021 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:30:02 db1 patroni[15260]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,677 INFO: Lock owner: db0; I am db1
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: Still starting up as a standby.
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: Lock owner: db0; I am db1
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: does not have lock
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:30:03 db1 patroni[15030]: 2020-10-12 17:30:03,083 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:30:03 db1 patroni[15030]: 2020-10-12 17:30:03,087 WARNING: Retry got exception: 'connection problems'
Oct 12 17:30:03 db1 patroni[15030]: 2020-10-12 17:30:03,105 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:30:12 db1 patroni[15264]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,679 INFO: Lock owner: db0; I am db1
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,680 INFO: Still starting up as a standby.
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,681 INFO: Lock owner: db0; I am db1
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,681 INFO: does not have lock
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,681 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:30:13 db1 patroni[15030]: 2020-10-12 17:30:13,526 INFO: establishing a new patroni connection to the postgres cluster
..............................
Oct 12 17:49:52 db1 patroni[15030]: 2020-10-12 17:49:52,662 INFO: Lock owner: db0; I am db1
Oct 12 17:49:52 db1 patroni[15030]: 2020-10-12 17:49:52,663 INFO: does not have lock
Oct 12 17:49:52 db1 patroni[15030]: 2020-10-12 17:49:52,685 INFO: no action.  i am a secondary and i am following a leader
Oct 12 17:50:02 db1 patroni[15030]: 2020-10-12 17:50:02,655 INFO: Lock owner: db0; I am db1
Oct 12 17:50:02 db1 patroni[15030]: 2020-10-12 17:50:02,655 INFO: does not have lock
Oct 12 17:50:02 db1 patroni[15030]: 2020-10-12 17:50:02,680 INFO: no action.  i am a secondary and i am following a leader
Oct 12 17:50:12 db1 patroni[15030]: 2020-10-12 17:50:12,656 INFO: Lock owner: db0; I am db1
Oct 12 17:50:12 db1 patroni[15030]: 2020-10-12 17:50:12,656 INFO: does not have lock

If manages to recover and finish the deployment after 30 retries

FAILED - RETRYING: Check that the patroni is healthy on the replica server (1170 retries left).
FAILED - RETRYING: Check that the patroni is healthy on the replica server (1169 retries left).
FAILED - RETRYING: Check that the patroni is healthy on the replica server (1169 retries left).

TASK [patroni : Check that the patroni is healthy on the replica server] **************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.6]
ok: [10.10.10.9]

TASK [patroni : Turning off postgresql autostart from config "start.conf" (will be managed by patroni)] *******************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.6]
changed: [10.10.10.8]
changed: [10.10.10.9]

TASK [patroni : Disable "postgresql@13-main" service] *********************************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8]
changed: [10.10.10.6]
changed: [10.10.10.9]

TASK [patroni : Add PATRONICTL_CONFIG_FILE environment variable into /etc/environment] ************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8]
changed: [10.10.10.6]
changed: [10.10.10.9]

TASK [postgresql-users : Make sure the PostgreSQL users are present] ******************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8] => (item=app)
[WARNING]: Module did not set no_log for no_password_changes

TASK [postgresql-databases : Make sure the PostgreSQL databases are present] **********************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8] => (item={'db': 'application', 'encoding': 'UTF8', 'lc_collate': 'en_US.UTF-8', 'lc_ctype': 'en_US.UTF-8', 'owner': 'app'})

TASK [pgbouncer/userlist : Get users and password md5 from pg_shadow] *****************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8 -> 10.10.10.8]

TASK [pgbouncer/userlist : Generate /etc/pgbouncer/userlist.txt] **********************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8]
changed: [10.10.10.6]
changed: [10.10.10.9]

RUNNING HANDLER [pgbouncer/userlist : Reload pgbouncer service] ***********************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.6]
changed: [10.10.10.8]
changed: [10.10.10.9]

TASK [deploy-finish : Check postgresql cluster health] ********************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8]

TASK [deploy-finish : PostgreSQL Cluster health] **************************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8] => {
    "patronictl_result.stdout_lines": [
        "+ Cluster: test (6882755957868354174) ----+-----------+",
        "| Member | Host       | Role    | State   | TL | Lag in MB |",
        "+--------+------------+---------+---------+----+-----------+",
        "| db0    | 10.10.10.8 | Leader  | running |  1 |           |",
        "| db1    | 10.10.10.6 | Replica | running |  1 |         0 |",
        "| db2    | 10.10.10.9 | Replica | running |  1 |         0 |",
        "+--------+------------+---------+---------+----+-----------+"
    ]
}

TASK [deploy-finish : Get postgresql database list] ***********************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8 -> 10.10.10.8]

TASK [deploy-finish : PostgreSQL list of databases] ***********************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8] => {
    "dbs_result.stdout_lines": [
        "    name     |  owner   | encoding |   collate   |    ctype    |  size   | tablespace ",
        "-------------+----------+----------+-------------+-------------+---------+------------",
        " application | app      | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 7729 kB | pg_default",
        " postgres    | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 7877 kB | pg_default",
        "(2 rows)"
    ]
}

TASK [deploy-finish : Create list of nodes] *******************************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8]

TASK [deploy-finish : PostgreSQL Cluster connection info] *****************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8] => {
    "msg": [
        "+------------------------------------------------+",
        "address 10.10.10.8,10.10.10.6,10.10.10.9",
        "port 6432 (pgbouncer)",
        "+------------------------------------------------+"
    ]
}

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
10.10.10.6                 : ok=106  changed=63   unreachable=0    failed=0    skipped=178  rescued=0    ignored=0
10.10.10.8                 : ok=118  changed=66   unreachable=0    failed=0    skipped=186  rescued=0    ignored=0
10.10.10.9                 : ok=106  changed=63   unreachable=0    failed=0    skipped=178  rescued=0    ignored=0
localhost                  : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0

PS: Should I enabled pgbackrest again now ?

vitabaks commented 3 years ago

Is this normal ?

I think it's ok :) During this time period, the replica was copying PGDATA from the master using basebackup.

Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,514 INFO: trying to bootstrap from leader 'db0'
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,558 INFO: replica has been created using basebackup

Once the database was up and running, patroni was able to establish a new connection to postgres. Next, we see that the replica has been successfully created:

Oct 12 17:49:52 db1 patroni [15030]: 2020-10-12 17: 49: 52,685 INFO: no action. i am a secondary and i am following a leader
vitabaks commented 3 years ago

Should I enabled pgbackrest again now ?

Make a backup of your data, then you can use this backup to clone/deploy new clusters. Or for the PITR of the current cluster.

https://github.com/vitabaks/postgresql_cluster#restore-and-cloning

valentin-fischer commented 3 years ago

Thank you for all the help and info!

Should there me any other port in here?

image

vitabaks commented 3 years ago

Should there me any other port in here?

How does this relate to pgbackrest?

I was giving an answer here: #65

valentin-fischer commented 3 years ago

Aaa, sorry I missed it.

Thank you for your help!

valentin-fischer commented 3 years ago

Closing this issue as now I understand the workflow of pgbackrest more or less.

Thanks!