vitabaks / postgresql_cluster

Automated database platform for PostgreSQL® A modern, open-source alternative to cloud-managed databases.
https://postgresql-cluster.org
MIT License
1.83k stars 418 forks source link

pgBackRest: Point-In-Time Recovery (PITR) Improvements #765

Closed vitabaks closed 1 week ago

vitabaks commented 2 months ago

This PR introduces a configurable cluster_restore_timeout variable to replace hardcoded values and provide flexibility. The timeout has been increased to 24 hours, making it more suitable for restoring large databases, including both backup and WAL recovery.

Additionally, we've added the task "Wait for the Standby cluster initialization to complete". This task allows tracking the progress of Standby cluster initialization in the Ansible logs. The same 24-hour timeout is applied to this task to ensure it has ample time to complete for large databases.

Variables:

Additionally

vitabaks commented 2 weeks ago

Test

Result:


TASK [patroni : Stop patroni service on the Replica servers (if running)] ***************************************************************************************************************************************
changed: [172.31.33.111]
changed: [172.31.42.255]

TASK [patroni : Stop patroni service on the Master server (if running)] *****************************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Check that PostgreSQL is stopped] ***************************************************************************************************************************************************************
ok: [172.31.33.111]
ok: [172.31.39.47]
ok: [172.31.42.255]

TASK [patroni : Remove patroni cluster "vitabaks-pgcluster" from DCS (if exist)] ********************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Run "/usr/bin/pgbackrest --stanza=vitabaks-pgcluster --type=time "--target=2024-11-01 12:47:47+00" --delta restore" on Master] ******************************************************************
changed: [172.31.39.47]

TASK [patroni : Run "/usr/bin/pgbackrest --stanza=vitabaks-pgcluster --type=time "--target=2024-11-01 12:47:47+00" --delta restore" on Replica] *****************************************************************
changed: [172.31.33.111]
changed: [172.31.42.255]
FAILED - RETRYING: [172.31.42.255]: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: [172.31.39.47]: Waiting for restore from backup (2880 retries left).
FAILED - RETRYING: [172.31.33.111]: Waiting for restore from backup (2880 retries left).

TASK [patroni : Waiting for restore from backup] ****************************************************************************************************************************************************************
changed: [172.31.39.47] => (item=True)
changed: [172.31.42.255] => (item=True)
changed: [172.31.33.111] => (item=True)

TASK [patroni : Start PostgreSQL for Recovery] ******************************************************************************************************************************************************************
changed: [172.31.33.111]
changed: [172.31.39.47]
changed: [172.31.42.255]
FAILED - RETRYING: [172.31.39.47]: Wait for the PostgreSQL start command to complete (360 retries left).
FAILED - RETRYING: [172.31.33.111]: Wait for the PostgreSQL start command to complete (360 retries left).
FAILED - RETRYING: [172.31.42.255]: Wait for the PostgreSQL start command to complete (360 retries left).

TASK [patroni : Wait for the PostgreSQL start command to complete] **********************************************************************************************************************************************
changed: [172.31.39.47]
changed: [172.31.33.111]
changed: [172.31.42.255]
FAILED - RETRYING: [172.31.39.47]: Wait for PostgreSQL recovery to complete (WAL apply) (2880 retries left).

TASK [patroni : Wait for PostgreSQL recovery to complete (WAL apply)] *******************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Get PostgreSQL recovery log] ********************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Print PostgreSQL recovery log] ******************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "2024-11-01 12:37:52.272 UTC [11910] LOG:  recovery stopping before commit of transaction 248047, time 2024-10-21 11:57:58.000145+00",
        "2024-11-01 12:37:52.272 UTC [11910] LOG:  redo done at 0/6591C308 system usage: CPU: user: 8.84 s, system: 5.19 s, elapsed: 45.81 s",
        "2024-11-01 12:37:52.272 UTC [11910] LOG:  last completed transaction was at log time 2024-10-21 11:57:57.999688+00"
    ]
}

TASK [patroni : Check that PostgreSQL is stopped] ***************************************************************************************************************************************************************
ok: [172.31.33.111]
ok: [172.31.39.47]
ok: [172.31.42.255]

TASK [patroni : Stop PostgreSQL] ********************************************************************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Check the patroni.dynamic.json exists] **********************************************************************************************************************************************************
ok: [172.31.39.47]
ok: [172.31.33.111]
ok: [172.31.42.255]

TASK [patroni : Start patroni service on the Master server] *****************************************************************************************************************************************************
changed: [172.31.39.47]

TASK [patroni : Wait for port 8008 to become open on the host] **************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Check PostgreSQL is started and accepting connections] ******************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Wait for the cluster to initialize (master is the leader with the lock)] ************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Make sure the Master is not in recovery mode] ***************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [patroni : Make sure the superuser and replication users are present, and password does not differ from the specified] *************************************************************************************
changed: [172.31.39.47] => (item=postgres)
changed: [172.31.39.47] => (item=replicator)

TASK [patroni : Make sure the postgresql users are present, and password does not differ from the specified] ****************************************************************************************************
ok: [172.31.39.47] => (item=pgbouncer)

TASK [patroni : Update postgresql authentication in patroni.yml] ************************************************************************************************************************************************
changed: [172.31.39.47]
changed: [172.31.33.111]
changed: [172.31.42.255]

TASK [patroni : Start patroni service on Replica servers] *******************************************************************************************************************************************************
changed: [172.31.33.111]
changed: [172.31.42.255]

TASK [patroni : Wait for port 8008 to become open on the host] **************************************************************************************************************************************************
ok: [172.31.42.255]
ok: [172.31.33.111]

TASK [patroni : Check that the patroni is healthy on the replica server] ****************************************************************************************************************************************
ok: [172.31.42.255]
ok: [172.31.33.111]

TASK [deploy-finish : Get Postgres users] ***********************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [deploy-finish : Get Postgres databases] *******************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [deploy-finish : Get Postgres cluster info] ****************************************************************************************************************************************************************
ok: [172.31.39.47]

TASK [deploy-finish : Postgres list of users] *******************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "                              List of roles",
        " Role name  |                         Attributes                         ",
        "------------+------------------------------------------------------------",
        " pgbouncer  | ",
        " postgres   | Superuser, Create role, Create DB, Replication, Bypass RLS",
        " replicator | Replication"
    ]
}

TASK [deploy-finish : Postgres list of databases] ***************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "                                                       List of databases",
        "   Name    |  Owner   | Encoding | Locale Provider |   Collate   |    Ctype    | ICU Locale | ICU Rules |   Access privileges   ",
        "-----------+----------+----------+-----------------+-------------+-------------+------------+-----------+-----------------------",
        " postgres  | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | ",
        " template0 | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | =c/postgres          +",
        "           |          |          |                 |             |             |            |           | postgres=CTc/postgres",
        " template1 | postgres | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8 |            |           | =c/postgres          +",
        "           |          |          |                 |             |             |            |           | postgres=CTc/postgres",
        "(3 rows)"
    ]
}

TASK [deploy-finish : Postgres Cluster info] ********************************************************************************************************************************************************************
ok: [172.31.39.47] => {
    "msg": [
        "+ Cluster: vitabaks-pgcluster (7428194193594549507) ---------------+----+-----------+",
        "| Member           | Host          | Role    | State               | TL | Lag in MB |",
        "+------------------+---------------+---------+---------------------+----+-----------+",
        "| ip-172-31-33-111 | 172.31.33.111 | Replica | in archive recovery |  3 |         0 |",
        "| ip-172-31-39-47  | 172.31.39.47  | Leader  | running             |  5 |           |",
        "| ip-172-31-42-255 | 172.31.42.255 | Replica | in archive recovery |  3 |         0 |",
        "+------------------+---------------+---------+---------------------+----+-----------+"
    ]
}

PLAY RECAP ******************************************************************************************************************************************************************************************************
172.31.33.111              : ok=38   changed=7    unreachable=0    failed=0    skipped=65   rescued=0    ignored=0   
172.31.39.47               : ok=58   changed=10   unreachable=0    failed=0    skipped=72   rescued=0    ignored=0   
172.31.42.255              : ok=38   changed=7    unreachable=0    failed=0    skipped=65   rescued=0    ignored=0   
localhost                  : ok=19   changed=0    unreachable=0    failed=0    skipped=206  rescued=0    ignored=0