1. Perform rollback if the upgrade failed
Previously, if pg_upgrade failed, the pg_upgrade.yml playbook stopped with an error, and the pg_upgrade_rollback.yml playbook had to be run manually. Now the rollback is performed automatically if the pg_upgrade execution fails.
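One way to express this behaviour in Ansible is a block with a rescue section. The following is a minimal sketch under stated assumptions, not the playbook's actual code: the task names and the rollback.yml task file are hypothetical, and the pg_upgrade arguments are a shortened form of the command that appears in the log for this item.

```yaml
# Hypothetical sketch: task names and the rollback task file are assumptions.
- name: Upgrade PostgreSQL and roll back automatically on failure
  block:
    - name: Run pg_upgrade on the Primary
      ansible.builtin.command: >-
        /usr/lib/postgresql/15/bin/pg_upgrade
        --username=postgres
        --old-bindir /usr/lib/postgresql/14/bin
        --new-bindir /usr/lib/postgresql/15/bin
        --old-datadir /pgdata/14/main
        --new-datadir /pgdata/15/main
        --link
  rescue:
    # Runs only if a task inside the block failed.
    - name: Perform rollback
      ansible.builtin.include_tasks: rollback.yml  # hypothetical task file

    # Keep the play marked as failed so the job still reports the problem.
    - name: Stop the play after the rollback
      ansible.builtin.fail:
        msg: "pg_upgrade failed; the rollback was performed automatically."
```

The final fail task is a design choice in this sketch: without it, a successful rescue would let the play continue as if the upgrade had succeeded.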
Fixed:
TASK [Upgrade Primary] *********************************************************
TASK [upgrade : Upgrade the PostgreSQL to version 15 on the Primary (using pg_upgrade --link)] ***
fatal: [172.30.58.35]: FAILED! => {"changed": true, "cmd": ["/usr/lib/postgresql/15/bin/pg_upgrade", "--username=postgres", "--old-bindir", "/usr/lib/postgresql/14/bin", "--new-bindir", "/usr/lib/postgresql/15/bin", "--old-datadir", "/pgdata/14/main", "--new-datadir", "/pgdata/15/main", "--old-options", "-c config_file=/etc/postgresql/14/main/postgresql.conf", "--new-options", "-c config_file=/etc/postgresql/15/main/postgresql.conf -c shared_preload_libraries='pg_stat_statements,auto_explain,timescaledb,pg_stat_kcache,pg_wait_sampling' -c timescaledb.restoring='on'", "--jobs=24", "--link"], "delta": "0:00:00.513224", "end": "2024-03-21 07:01:40.452925", "msg": "non-zero return code", "rc": 1, "start": "2024-03-21 07:01:39.939701", "stderr": "", "stderr_lines": [], "stdout": "Performing Consistency Checks\n-----------------------------\nChecking cluster versions ok\n\n*failure*\nConsult the last few lines of \"/pgdata/15/main/pg_upgrade_output.d/20240321T070139.948/log/pg_upgrade_server.log\" for\nthe probable cause of the failure.\n\nconnection to server on socket \"/pgdata/.s.PGSQL.50432\" failed: No such file or directory\n\tIs the server running locally and accepting connections on that socket?\n\ncould not connect to source postmaster started with the command:\n\"/usr/lib/postgresql/14/bin/pg_ctl\" -w -l \"/pgdata/15/main/pg_upgrade_output.d/20240321T070139.948/log/pg_upgrade_server.log\" -D \"/pgdata/14/main\" -o \"-p 50432 -b -c config_file=/etc/postgresql/14/main/postgresql.conf -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/pgdata'\" start\nFailure, exiting", "stdout_lines": ["Performing Consistency Checks", "-----------------------------", "Checking cluster versions ok", "", "*failure*", "Consult the last few lines of \"/pgdata/15/main/pg_upgrade_output.d/20240321T070139.948/log/pg_upgrade_server.log\" for", "the probable cause of the failure.", "", "connection to server on socket \"/pgdata/.s.PGSQL.50432\" failed: No 
such file or directory", "\tIs the server running locally and accepting connections on that socket?", "", "could not connect to source postmaster started with the command:", "\"/usr/lib/postgresql/14/bin/pg_ctl\" -w -l \"/pgdata/15/main/pg_upgrade_output.d/20240321T070139.948/log/pg_upgrade_server.log\" -D \"/pgdata/14/main\" -o \"-p 50432 -b -c config_file=/etc/postgresql/14/main/postgresql.conf -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/pgdata'\" start", "Failure, exiting"]}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
172.30.58.31 : ok=53 changed=11 unreachable=0 failed=0 skipped=93 rescued=0 ignored=0
172.30.58.32 : ok=53 changed=11 unreachable=0 failed=0 skipped=93 rescued=0 ignored=0
172.30.58.35 : ok=89 changed=22 unreachable=0 failed=1 skipped=75 rescued=0 ignored=0
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 2
2. Backup/Restore the patroni.yml configuration file
During the rollback, the previously created backup copy of the patroni.yml file is restored. This ensures that all parameters that were set for the previous version of Postgres are present.
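The backup and restore steps could look roughly like the two tasks below. This is an illustrative sketch only: the /etc/patroni/patroni.yml path and the .bkp suffix are assumptions, not values taken from this change.

```yaml
# Before the upgrade: keep a copy of the current Patroni configuration.
- name: Back up the patroni.yml configuration file
  ansible.builtin.copy:
    src: /etc/patroni/patroni.yml       # assumed location
    dest: /etc/patroni/patroni.yml.bkp  # assumed backup name
    remote_src: true                    # copy on the managed host itself

# During the rollback: put the saved configuration back in place.
- name: Restore the patroni.yml configuration file
  ansible.builtin.copy:
    src: /etc/patroni/patroni.yml.bkp
    dest: /etc/patroni/patroni.yml
    remote_src: true
```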
Fixed:
'2024-03-19 11:29:36 MSK [3063-370] LOG: configuration file "/etc/postgresql/14/main/postgresql.conf" contains errors; unaffected changes were applied
'2024-03-19 11:29:36 MSK [4071-1] LOG: could not open temporary statistics file "/var/run/postgresql/14-main.pg_stat_tmp/global.tmp": No such file or directory
'2024-03-19 11:29:36 MSK [4071-2] LOG: could not open temporary statistics file "/var/run/postgresql/14-main.pg_stat_tmp/global.tmp": No such file or directory
'2024-03-19 11:29:36 MSK [4071-3] LOG: could not open temporary statistics file "/var/run/postgresql/14-main.pg_stat_tmp/global.tmp": No such file or directory
3. Add retry and ignore errors for "Update extensions" and "Post-Checks"
After the upgrade is completed and the pools are resumed, statistics are collected, and the database may be under high load for some time (see https://github.com/vitabaks/postgresql_cluster/pull/601). These changes add retries to the extension update and post-check tasks and ignore errors in them, so that the playbook continues to run (the errors remain visible in the Ansible log).
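A retry combined with ignored errors can be sketched with Ansible's until, retries, delay and ignore_errors keywords. The example below reuses the psql command from the log for this item; the registered variable name, the number of retries and the delay are assumed values chosen for illustration.

```yaml
- name: Get list of old PostgreSQL extensions
  ansible.builtin.command: >-
    /usr/lib/postgresql/15/bin/psql -p 5432 -U postgres -d shieldems -tAXc
    "select extname from pg_catalog.pg_extension e
    join pg_catalog.pg_available_extensions ae on extname = ae.name
    where installed_version <> default_version"
  register: pg_old_extensions         # hypothetical variable name
  until: pg_old_extensions is succeeded
  retries: 3                          # assumed value
  delay: 10                           # assumed value, in seconds
  ignore_errors: true                 # continue the playbook if all attempts fail
  changed_when: false                 # a read-only query never changes the host
```

With ignore_errors set, a failure after the last attempt is still printed in the Ansible output, but the play moves on to the next task.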
Fixed:
TASK [upgrade : Get list of old PostgreSQL extensions (database: shieldems)] ***
fatal: [172.30.58.34]: FAILED! => {"changed": false, "cmd": ["/usr/lib/postgresql/15/bin/psql", "-p", "5432", "-U", "postgres", "-d", "shieldems", "-tAXc", "select extname from pg_catalog.pg_extension e join pg_catalog.pg_available_extensions ae on extname = ae.name where installed_version <> default_version"], "delta": "0:00:00.037253", "end": "2024-03-20 05:27:55.299617", "msg": "non-zero return code", "rc": 2, "start": "2024-03-20 05:27:55.262364", "stderr": "psql: error: connection to server on socket \"/var/run/postgresql/.s.PGSQL.5432\" failed: FATAL: sorry, too many clients already", "stderr_lines": ["psql: error: connection to server on socket \"/var/run/postgresql/.s.PGSQL.5432\" failed: FATAL: sorry, too many clients already"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
172.30.58.30 : ok=74 changed=21 unreachable=0 failed=0 skipped=114 rescued=0 ignored=0
172.30.58.33 : ok=74 changed=21 unreachable=0 failed=0 skipped=114 rescued=0 ignored=0
172.30.58.34 : ok=136 changed=38 unreachable=0 failed=1 skipped=89 rescued=0 ignored=0
172.30.58.36 : ok=74 changed=21 unreachable=0 failed=0 skipped=114 rescued=0 ignored=0
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 2
4. Make sure that the sshpass package is installed
if the pg_new_wal_dir variable is defined (it is needed to synchronize the WAL directory)
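A conditional package installation of this kind might be written as follows. This is a hedged sketch: the task name is illustrative, and the use of become assumes that the connecting user needs privilege escalation to install packages.

```yaml
- name: Make sure that the sshpass package is installed
  ansible.builtin.package:
    name: sshpass
    state: present
  become: true                     # assumption: package installation needs root
  when: pg_new_wal_dir is defined  # only needed when a custom WAL directory is used
```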
Fixed:
TASK [upgrade : Make sure the custom WAL directory "/pgwal/15/pg_wal" exists and is empty] ***
ok: [172.30.59.17] => (item=absent)
ok: [172.30.59.18] => (item=absent)
ok: [172.30.59.19] => (item=absent)
changed: [172.30.59.17] => (item=directory)
changed: [172.30.59.18] => (item=directory)
changed: [172.30.59.19] => (item=directory)
TASK [upgrade : Synchronize /pgdata1/15/main/pg_wal to /pgwal/15/pg_wal] *******
fatal: [172.30.59.19]: FAILED! => {"changed": false, "cmd": "sshpass", "msg": "[Errno 2] No such file or directory: b'sshpass'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
fatal: [172.30.59.18]: FAILED! => {"changed": false, "cmd": "sshpass", "msg": "[Errno 2] No such file or directory: b'sshpass'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
changed: [172.30.59.17]
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
172.30.59.17 : ok=94 changed=26 unreachable=0 failed=0 skipped=83 rescued=0 ignored=0
172.30.59.18 : ok=54 changed=12 unreachable=0 failed=1 skipped=103 rescued=0 ignored=0
172.30.59.19 : ok=54 changed=12 unreachable=0 failed=1 skipped=103 rescued=0 ignored=0
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 2