Closed valentin-fischer closed 3 years ago
It seems to try and restore from backup, although there was no backup yet as this is a new cluster.
You answered your own question. You don't have a backup.
You need to create a backup copy of your database:
pgbackrest --stanza=<stanza-name> --type=full backup
Then you can use it to deploy a new cluster.
See pgBackRest User Guide https://pgbackrest.org/user-guide-index.html
The playbook does not currently configure a backup (although it can prepare the configuration), the task of this playbook is to deploy a сluster, including deploying from a (pre-existing) backup.
Hi @vitabaks
Thank you for the fast reply. How can I take a backup as I don't have a cluster/database yet ?
The pgbackrest tries to connect to postgres , and as there is no cluster up, there is no way for the tasks/playbook to continue.
br
How can I take a backup as I don't have a cluster/database yet ?
You selected bootstrap method from the pgbackrest backup, but you don't have it. It is not right.
Use patroni_cluster_bootstrap_method: "initdb"
Ok, it seems to have progressed. Now the next failure is related to patroni it seems.
TASK [patroni : Start patroni service on Replica servers] *****************************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.9]
changed: [10.10.10.6]
TASK [patroni : Wait for port 8008 to become open on the host] ************************************************************************************************************************************************************************************************************************************************************************************************************
fatal: [10.10.10.9]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 10.10.10.9:8008"}
fatal: [10.10.10.6]: FAILED! => {"changed": false, "elapsed": 120, "msg": "Timeout when waiting for 10.10.10.6:8008"}
NO MORE HOSTS LEFT ********************************************************************************************************************************************************************************************************************************************************************************************************************************************************
PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
10.10.10.6 : ok=98 changed=57 unreachable=0 failed=1 skipped=160 rescued=0 ignored=0
10.10.10.8 : ok=104 changed=60 unreachable=0 failed=0 skipped=160 rescued=0 ignored=0
10.10.10.9 : ok=98 changed=58 unreachable=0 failed=1 skipped=160 rescued=0 ignored=0
localhost : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0`
When I look in the systemd related service the error is the following:
Oct 12 15:14:10 db1 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - patroni.
Oct 12 15:14:10 db1 patroni[14961]: 2020-10-12 15:14:10,905 INFO: Selected new etcd server http://10.10.10.9:2379
Oct 12 15:14:10 db1 patroni[14961]: 2020-10-12 15:14:10,910 INFO: No PostgreSQL configuration items changed, nothing to reload.
Oct 12 15:14:10 db1 patroni[14961]: 2020-10-12 15:14:10,949 CRITICAL: system ID mismatch, node db1 belongs to a different cluster: 6882721315206687972 != 6882721284166605269
Oct 12 15:14:11 db1 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 15:14:11 db1 systemd[1]: patroni.service: Failed with result 'exit-code'.
It seems to think this node belongs to some other cluster (no idea why).
If you use patroni_cluster_bootstrap_method: "initdb" Make sure you don't have pgbackrest specified in the patroni_create_replica_methods variable.
This is because automation does not clear the PGDATA directory before starting Patroni if pgbackrest is specified in patroni_create_replica_methods.
This is implemented to support --delta restore.
Thank you for fast replies.
After changing to the settings you mentioned, it started in the end but after a looooooooooooooot of retries. More exactly 31 of retries.
Is this normal ?
If I look into the patroni log of one of the servers it's full with something along the lines
Oct 12 17:28:38 db1 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL - patroni.
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,463 INFO: Selected new etcd server http://10.10.10.8:2379
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,470 INFO: No PostgreSQL configuration items changed, nothing to reload.
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,496 INFO: Lock owner: db0; I am db1
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,514 INFO: trying to bootstrap from leader 'db0'
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,558 INFO: replica has been created using basebackup
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,560 INFO: bootstrapped from leader 'db0'
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,746 INFO: postmaster pid=15065
Oct 12 17:28:39 db1 patroni[15066]: /var/run/postgresql:5432 - no response
Oct 12 17:28:39 db1 patroni[15065]: 2020-10-12 17:28:39 CEST [15065-1] LOG: redirecting log output to logging collector process
Oct 12 17:28:39 db1 patroni[15065]: 2020-10-12 17:28:39 CEST [15065-2] HINT: Future log output will appear in directory "/var/log/postgresql".
Oct 12 17:28:40 db1 patroni[15071]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:40 db1 patroni[15073]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:41 db1 patroni[15075]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:42 db1 patroni[15077]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:43 db1 patroni[15079]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:44 db1 patroni[15081]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:45 db1 patroni[15083]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:46 db1 patroni[15085]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:47 db1 patroni[15087]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:28:48 db1 patroni[15030]: 2020-10-12 17:28:48,487 INFO: Lock owner: db0; I am db1
Oct 12 17:28:48 db1 patroni[15030]: 2020-10-12 17:28:48,504 INFO: bootstrap from leader 'db0' in progress
Oct 12 17:28:48 db1 patroni[15089]: /var/run/postgresql:5432 - rejecting connections
..............................
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,076 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,230 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,234 WARNING: Retry got exception: 'connection problems'
Oct 12 17:29:40 db1 patroni[15030]: 2020-10-12 17:29:40,254 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:29:42 db1 patroni[15235]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,670 INFO: Lock owner: db0; I am db1
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,670 INFO: Still starting up as a standby.
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,671 INFO: Lock owner: db0; I am db1
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,671 INFO: does not have lock
Oct 12 17:29:42 db1 patroni[15030]: 2020-10-12 17:29:42,671 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:43 db1 patroni[15030]: 2020-10-12 17:29:43,467 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:43 db1 patroni[15030]: 2020-10-12 17:29:43,472 WARNING: Retry got exception: 'connection problems'
Oct 12 17:29:43 db1 patroni[15030]: 2020-10-12 17:29:43,501 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:29:52 db1 patroni[15256]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,680 INFO: Lock owner: db0; I am db1
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: Still starting up as a standby.
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: Lock owner: db0; I am db1
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: does not have lock
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,681 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:52 db1 patroni[15030]: 2020-10-12 17:29:52,998 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:29:53 db1 patroni[15030]: 2020-10-12 17:29:53,002 WARNING: Retry got exception: 'connection problems'
Oct 12 17:29:53 db1 patroni[15030]: 2020-10-12 17:29:53,021 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:30:02 db1 patroni[15260]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,677 INFO: Lock owner: db0; I am db1
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: Still starting up as a standby.
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: Lock owner: db0; I am db1
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: does not have lock
Oct 12 17:30:02 db1 patroni[15030]: 2020-10-12 17:30:02,678 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:30:03 db1 patroni[15030]: 2020-10-12 17:30:03,083 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:30:03 db1 patroni[15030]: 2020-10-12 17:30:03,087 WARNING: Retry got exception: 'connection problems'
Oct 12 17:30:03 db1 patroni[15030]: 2020-10-12 17:30:03,105 INFO: Error communicating with PostgreSQL. Will try again later
Oct 12 17:30:12 db1 patroni[15264]: /var/run/postgresql:5432 - rejecting connections
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,679 INFO: Lock owner: db0; I am db1
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,680 INFO: Still starting up as a standby.
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,681 INFO: Lock owner: db0; I am db1
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,681 INFO: does not have lock
Oct 12 17:30:12 db1 patroni[15030]: 2020-10-12 17:30:12,681 INFO: establishing a new patroni connection to the postgres cluster
Oct 12 17:30:13 db1 patroni[15030]: 2020-10-12 17:30:13,526 INFO: establishing a new patroni connection to the postgres cluster
..............................
Oct 12 17:49:52 db1 patroni[15030]: 2020-10-12 17:49:52,662 INFO: Lock owner: db0; I am db1
Oct 12 17:49:52 db1 patroni[15030]: 2020-10-12 17:49:52,663 INFO: does not have lock
Oct 12 17:49:52 db1 patroni[15030]: 2020-10-12 17:49:52,685 INFO: no action. i am a secondary and i am following a leader
Oct 12 17:50:02 db1 patroni[15030]: 2020-10-12 17:50:02,655 INFO: Lock owner: db0; I am db1
Oct 12 17:50:02 db1 patroni[15030]: 2020-10-12 17:50:02,655 INFO: does not have lock
Oct 12 17:50:02 db1 patroni[15030]: 2020-10-12 17:50:02,680 INFO: no action. i am a secondary and i am following a leader
Oct 12 17:50:12 db1 patroni[15030]: 2020-10-12 17:50:12,656 INFO: Lock owner: db0; I am db1
Oct 12 17:50:12 db1 patroni[15030]: 2020-10-12 17:50:12,656 INFO: does not have lock
If manages to recover and finish the deployment after 30 retries
FAILED - RETRYING: Check that the patroni is healthy on the replica server (1170 retries left).
FAILED - RETRYING: Check that the patroni is healthy on the replica server (1169 retries left).
FAILED - RETRYING: Check that the patroni is healthy on the replica server (1169 retries left).
TASK [patroni : Check that the patroni is healthy on the replica server] **************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.6]
ok: [10.10.10.9]
TASK [patroni : Turning off postgresql autostart from config "start.conf" (will be managed by patroni)] *******************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.6]
changed: [10.10.10.8]
changed: [10.10.10.9]
TASK [patroni : Disable "postgresql@13-main" service] *********************************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8]
changed: [10.10.10.6]
changed: [10.10.10.9]
TASK [patroni : Add PATRONICTL_CONFIG_FILE environment variable into /etc/environment] ************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8]
changed: [10.10.10.6]
changed: [10.10.10.9]
TASK [postgresql-users : Make sure the PostgreSQL users are present] ******************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8] => (item=app)
[WARNING]: Module did not set no_log for no_password_changes
TASK [postgresql-databases : Make sure the PostgreSQL databases are present] **********************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8] => (item={'db': 'application', 'encoding': 'UTF8', 'lc_collate': 'en_US.UTF-8', 'lc_ctype': 'en_US.UTF-8', 'owner': 'app'})
TASK [pgbouncer/userlist : Get users and password md5 from pg_shadow] *****************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8 -> 10.10.10.8]
TASK [pgbouncer/userlist : Generate /etc/pgbouncer/userlist.txt] **********************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.8]
changed: [10.10.10.6]
changed: [10.10.10.9]
RUNNING HANDLER [pgbouncer/userlist : Reload pgbouncer service] ***********************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.10.10.6]
changed: [10.10.10.8]
changed: [10.10.10.9]
TASK [deploy-finish : Check postgresql cluster health] ********************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8]
TASK [deploy-finish : PostgreSQL Cluster health] **************************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8] => {
"patronictl_result.stdout_lines": [
"+ Cluster: test (6882755957868354174) ----+-----------+",
"| Member | Host | Role | State | TL | Lag in MB |",
"+--------+------------+---------+---------+----+-----------+",
"| db0 | 10.10.10.8 | Leader | running | 1 | |",
"| db1 | 10.10.10.6 | Replica | running | 1 | 0 |",
"| db2 | 10.10.10.9 | Replica | running | 1 | 0 |",
"+--------+------------+---------+---------+----+-----------+"
]
}
TASK [deploy-finish : Get postgresql database list] ***********************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8 -> 10.10.10.8]
TASK [deploy-finish : PostgreSQL list of databases] ***********************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8] => {
"dbs_result.stdout_lines": [
" name | owner | encoding | collate | ctype | size | tablespace ",
"-------------+----------+----------+-------------+-------------+---------+------------",
" application | app | UTF8 | en_US.UTF-8 | en_US.UTF-8 | 7729 kB | pg_default",
" postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | 7877 kB | pg_default",
"(2 rows)"
]
}
TASK [deploy-finish : Create list of nodes] *******************************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8]
TASK [deploy-finish : PostgreSQL Cluster connection info] *****************************************************************************************************************************************************************************************************************************************************************************************************************
ok: [10.10.10.8] => {
"msg": [
"+------------------------------------------------+",
"address 10.10.10.8,10.10.10.6,10.10.10.9",
"port 6432 (pgbouncer)",
"+------------------------------------------------+"
]
}
PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
10.10.10.6 : ok=106 changed=63 unreachable=0 failed=0 skipped=178 rescued=0 ignored=0
10.10.10.8 : ok=118 changed=66 unreachable=0 failed=0 skipped=186 rescued=0 ignored=0
10.10.10.9 : ok=106 changed=63 unreachable=0 failed=0 skipped=178 rescued=0 ignored=0
localhost : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
PS: Should I enabled pgbackrest again now ?
Is this normal ?
I think it's ok :) During this time period, the replica was copying PGDATA from the master using basebackup.
Oct 12 17:28:38 db1 patroni[15030]: 2020-10-12 17:28:38,514 INFO: trying to bootstrap from leader 'db0'
Oct 12 17:28:39 db1 patroni[15030]: 2020-10-12 17:28:39,558 INFO: replica has been created using basebackup
Once the database was up and running, patroni was able to establish a new connection to postgres. Next, we see that the replica has been successfully created:
Oct 12 17:49:52 db1 patroni [15030]: 2020-10-12 17: 49: 52,685 INFO: no action. i am a secondary and i am following a leader
Should I enabled pgbackrest again now ?
Make a backup of your data, then you can use this backup to clone/deploy new clusters. Or for the PITR of the current cluster.
https://github.com/vitabaks/postgresql_cluster#restore-and-cloning
Thank you for all the help and info!
Should there me any other port in here?
Should there me any other port in here?
How does this relate to pgbackrest?
I was giving an answer here: #65
Aaa, sorry I missed it.
Thank you for your help!
Closing this issue as now I understand the workflow of pgbackrest more or less.
Thanks!
Hi there,
I have configured pgbackrest backup with s3 as storage and trying to deploy the cluster from zero again. For some reason after enabling pgbackrest the cluster deployment fails.
It seems to try and restore from backup, although there was no backup yet as this is a new cluster.
Is there an issue in the playbook or am I doing something wrong?
Thanks!