vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.

update_pgcluster.yml: Improve error handling #578

Closed by vitabaks 2 months ago

vitabaks commented 2 months ago
  1. Ignore errors when updating packages
    • This avoids the situation where the database is stopped and the playbook then aborts with an error while updating packages (for example, because of dependency problems), leaving the database stopped on one of the cluster servers.

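Fixed the following failure, where the playbook aborted on a dependency error after the Patroni service had been stopped on 10.172.0.21: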


  PLAY [update_pgcluster.yml | Update PostgreSQL HA Cluster (based on "Patroni")] ***

  TASK [Gathering Facts] *********************************************************
  ok: [10.172.0.22]
  ok: [10.172.0.21]
  ok: [10.172.0.20]

  TASK [Include main variables] **************************************************
  ok: [10.172.0.20]
  ok: [10.172.0.21]
  ok: [10.172.0.22]

  TASK [[Prepare] Get Patroni Cluster Leader Node] *******************************
  ok: [10.172.0.21]
  ok: [10.172.0.20]
  ok: [10.172.0.22]

  TASK [[Prepare] Add host to group "primary" (in-memory inventory)] *************
  ok: [10.172.0.20] => (item=10.172.0.20)

  TASK [[Prepare] Add hosts to group "secondary" (in-memory inventory)] **********
  ok: [10.172.0.20] => (item=10.172.0.21)
  ok: [10.172.0.20] => (item=10.172.0.22)

  TASK [Print Patroni Cluster info] **********************************************
  ok: [10.172.0.20] => {
      "msg": [
          "Cluster Name: postgres-cluster",
          "Cluster Leader: pgnode01"
      ]
  }

  PLAY [(1/4) PRE-UPDATE: Perform Pre-Checks] ************************************

  TASK [Include main variables] **************************************************
  ok: [10.172.0.20]
  ok: [10.172.0.21]
  ok: [10.172.0.22]

  TASK [Running Pre-Checks] ******************************************************

  TASK [update : [Pre-Check] (ALL) Test PostgreSQL DB Access] ********************
  ok: [10.172.0.20]
  ok: [10.172.0.22]
  ok: [10.172.0.21]

  TASK [update : [Pre-Check] Make sure that physical replication is active] ******
  ok: [10.172.0.20]

  TASK [update : [Pre-Check] Make sure there is no high replication lag (more than 10.00 MB)] ***
  ok: [10.172.0.20]

  TASK [update : [Pre-Check] Make sure there are no long-running transactions (more than 15 seconds)] ***
  ok: [10.172.0.21]
  ok: [10.172.0.20]
  ok: [10.172.0.22]

  PLAY [(2/4) UPDATE: Secondary] *************************************************

  TASK [Include main variables] **************************************************
  ok: [10.172.0.21]

  TASK [Include OS-specific variables] *******************************************
  ok: [10.172.0.21]

  TASK [Stop read-only traffic] **************************************************

  TASK [update : Edit patroni.yml | enable noloadbalance, nosync, nofailover] ****
  changed: [10.172.0.21] => (item=noloadbalance: true)
  changed: [10.172.0.21] => (item=nosync: true)
  changed: [10.172.0.21] => (item=nofailover: true)

  TASK [update : Reload patroni service] *****************************************
  changed: [10.172.0.21]
  FAILED - RETRYING: [10.172.0.21]: Make sure replica endpoint is unavailable (30 retries left).
  FAILED - RETRYING: [10.172.0.21]: Make sure replica endpoint is unavailable (29 retries left).

  TASK [update : Make sure replica endpoint is unavailable] **********************
  ok: [10.172.0.21]

  TASK [update : Wait for active transactions to complete] ***********************
  ok: [10.172.0.21]

  TASK [Stop Services] ***********************************************************

  TASK [update : Check PostgreSQL is started and accepting connections] **********
  ok: [10.172.0.21]

  TASK [update : Execute CHECKPOINT before stopping PostgreSQL] ******************
  changed: [10.172.0.21]

  TASK [update : Stop Patroni service on the Cluster Replica (pgnode02)] *********
  changed: [10.172.0.21]

  TASK [Update PostgreSQL] *******************************************************

  TASK [update : Update dnf cache] ***********************************************
  changed: [10.172.0.21]

  TASK [update : Install the latest version of PostgreSQL packages] **************
  ok: [10.172.0.21] => (item=postgresql16)
  ok: [10.172.0.21] => (item=postgresql16-server)
  ok: [10.172.0.21] => (item=postgresql16-contrib)

  TASK [Update Patroni] **********************************************************

  TASK [update : Install the latest version of Patroni] **************************
  ok: [10.172.0.21]

  TASK [Update all system packages] **********************************************

  TASK [update : Update dnf cache] ***********************************************
  changed: [10.172.0.21]
  fatal: [10.172.0.21]: FAILED! => {"attempts": 3, "changed": false, "failures": [], "msg": "Depsolve Error occurred: \n Problem: package iptables-legacy-1.8.8-6.el9.2.x86_64 from @System requires (iptables-libs(x86-64) = 1.8.8-6.el9 or iptables-libs(x86-64) = 1.8.8-6.el9_1), but none of the providers can be installed\n  - cannot install both iptables-libs-1.8.10-2.el9.x86_64 from baseos and iptables-libs-1.8.8-6.el9.x86_64 from @System\n  - cannot install both iptables-libs-1.8.8-6.el9.x86_64 from baseos and iptables-libs-1.8.10-2.el9.x86_64 from baseos\n  - cannot install the best update candidate for package iptables-libs-1.8.8-6.el9.x86_64\n  - cannot install the best update candidate for package iptables-legacy-1.8.8-6.el9.2.x86_64", "rc": 1, "results": []}
  FAILED - RETRYING: [10.172.0.21]: Update all system packages (3 retries left).
  FAILED - RETRYING: [10.172.0.21]: Update all system packages (2 retries left).
  FAILED - RETRYING: [10.172.0.21]: Update all system packages (1 retries left).

  TASK [update : Update all system packages] *************************************

  NO MORE HOSTS LEFT *************************************************************

  PLAY RECAP *********************************************************************
  10.172.0.20                : ok=241  changed=88   unreachable=0    failed=0    skipped=706  rescued=0    ignored=0
  10.172.0.21                : ok=208  changed=89   unreachable=0    failed=1    skipped=679  rescued=0    ignored=0
  10.172.0.22                : ok=195  changed=83   unreachable=0    failed=0    skipped=665  rescued=0    ignored=0
  2. Improve the error handling
    • so that update errors are reported after the playbook completes.
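
The two points above can be illustrated with a minimal sketch. It is not the code from the actual `update` role. The variable name `update_system_result`, the retry and delay values, and the `patroni` service name are assumptions made for illustration. The idea is to let the package update fail without aborting the play, start the database again, and only then report the error.

```yaml
# Hypothetical sketch; not the tasks from the actual "update" role.
- name: Update all system packages
  ansible.builtin.dnf:
    name: "*"
    state: latest
    update_cache: true
  register: update_system_result      # hypothetical variable name
  until: update_system_result is success
  retries: 3                          # assumed value
  delay: 5                            # assumed value
  ignore_errors: true                 # do not abort while Patroni is stopped

- name: Start Patroni service
  ansible.builtin.service:
    name: patroni                     # assumed service name
    state: started

# Placed near the end of the play, once the services are running again.
- name: Report package update errors
  ansible.builtin.debug:
    msg: >-
      Package update failed on {{ inventory_hostname }}:
      {{ update_system_result.msg | default('unknown error') }}
  when: update_system_result is failed
```

With `ignore_errors: true`, a failed update is counted under `ignored` in the play recap rather than `failed`, so the host stays in the play and the remaining tasks, including the service start, still run.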