vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on Patroni). Automating with Ansible.
https://postgresql-cluster.org
MIT License

ansible-playbook update_pgcluster.yml -e target=system fails #644

Closed. chuegel closed this issue 6 months ago.

chuegel commented 6 months ago

During our scheduled update of the postgresql cluster, I noticed that the playbook failed:

{
  "redirected": false,
  "url": "http://10.83.200.12:8008/replica",
  "status": 503,
  "server": "BaseHTTP/0.6 Python/3.10.12",
  "date": "Tue, 30 Apr 2024 06:40:38 GMT",
  "content_type": "application/json",
  "elapsed": 0,
  "changed": false,
  "json": {
    "state": "running",
    "postmaster_start_time": "2024-03-05 15:25:17.225621+01:00",
    "role": "master",
    "server_version": 150006,
    "xlog": {
      "location": 191966088696
    },
    "timeline": 15,
    "replication": [
      {
        "usename": "replicator",
        "application_name": "postgresql03",
        "client_addr": "10.83.200.14",
        "state": "streaming",
        "sync_state": "async",
        "sync_priority": 0
      }
    ],
    "dcs_last_seen": 1714459232,
    "database_system_identifier": "7253014758852064969",
    "patroni": {
      "version": "3.2.2",
      "scope": "postgres-cluster",
      "name": "postgresql01"
    }
  },
  "msg": "Status code was 503 and not [200]: HTTP Error 503: Service Unavailable",
  "invocation": {
    "module_args": {
      "url": "http://10.83.200.12:8008/replica",
      "status_code": [
        200
      ],
      "force": false,
      "http_agent": "ansible-httpget",
      "use_proxy": true,
      "validate_certs": true,
      "force_basic_auth": false,
      "use_gssapi": false,
      "body_format": "raw",
      "method": "GET",
      "return_content": false,
      "follow_redirects": "safe",
      "timeout": 30,
      "headers": {},
      "remote_src": false,
      "unredirected_headers": [],
      "decompress": true,
      "use_netrc": true,
      "unsafe_writes": false,
      "url_username": null,
      "url_password": null,
      "client_cert": null,
      "client_key": null,
      "dest": null,
      "body": null,
      "src": null,
      "creates": null,
      "removes": null,
      "unix_socket": null,
      "ca_path": null,
      "ciphers": null,
      "mode": null,
      "owner": null,
      "group": null,
      "seuser": null,
      "serole": null,
      "selevel": null,
      "setype": null,
      "attributes": null
    }
  },
  "_ansible_no_log": false,
  "attempts": 300
}
{
  "cmd": "patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster",
  "stdout": "Current cluster topology\r\n+ Cluster: postgres-cluster (7253014758852064969) ------------+----+-----------+\r\n| Member       | Host         | Role    | State               | TL | Lag in MB |\r\n+--------------+--------------+---------+---------------------+----+-----------+\r\n| postgresql01 | 10.83.200.12 | Leader  | running             | 15 |           |\r\n| postgresql02 | 10.83.200.13 | Replica | in archive recovery | 15 |     20497 |\r\n| postgresql03 | 10.83.200.14 | Replica | streaming           | 15 |         0 |\r\n+--------------+--------------+---------+---------------------+----+-----------+\r\nPrimary [postgresql01]: Candidate ['postgresql02', 'postgresql03'] []: When should the switchover take place (e.g. 2024-04-30T09:28 )  [now]: Are you sure you want to switchover cluster postgres-cluster, demoting current leader postgresql01? [y/N]: Switchover failed, details: 503, Switchover failed",
  "rc": 0,
  "start": "2024-04-30 08:28:09.489857",
  "end": "2024-04-30 08:28:13.165841",
  "delta": "0:00:03.675984",
  "changed": true,
  "invocation": {
    "module_args": {
      "command": "patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster",
      "responses": {
        "(.*)Primary(.*)": "postgresql01",
        "(.*)Candidate(.*)": "postgresql02",
        "(.*)When should the switchover take place(.*)": "now",
        "(.*)Are you sure you want to switchover cluster(.*)": "y"
      },
      "timeout": 30,
      "echo": false,
      "chdir": null,
      "creates": null,
      "removes": null
    }
  },
  "stdout_lines": [
    "Current cluster topology",
    "+ Cluster: postgres-cluster (7253014758852064969) ------------+----+-----------+",
    "| Member       | Host         | Role    | State               | TL | Lag in MB |",
    "+--------------+--------------+---------+---------------------+----+-----------+",
    "| postgresql01 | 10.83.200.12 | Leader  | running             | 15 |           |",
    "| postgresql02 | 10.83.200.13 | Replica | in archive recovery | 15 |     20497 |",
    "| postgresql03 | 10.83.200.14 | Replica | streaming           | 15 |         0 |",
    "+--------------+--------------+---------+---------------------+----+-----------+",
    "Primary [postgresql01]: Candidate ['postgresql02', 'postgresql03'] []: When should the switchover take place (e.g. 2024-04-30T09:28 )  [now]: Are you sure you want to switchover cluster postgres-cluster, demoting current leader postgresql01? [y/N]: Switchover failed, details: 503, Switchover failed"
  ],
  "_ansible_no_log": false
}

However, the upgrade of the two replica nodes went well.

Any hints on how to recover the cluster from this state? Thanks

vitabaks commented 6 months ago

Please attach the Ansible log.

chuegel commented 6 months ago

Thanks for your reply. Digging deeper into the logs, it turned out that the version on the pgbackrest server was lagging behind:

2024-04-30 00:00:06.440 P00   INFO: archive-get command end: aborted with exception [103]
2024-04-30 00:00:06 CEST [442637-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 00:00:06 CEST [442637-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 00:00:06.490 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=442639-759ac39e --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
WARN: repo1: [ProtocolError] expected value '2.51' for greeting key 'version' but got '2.50'
      HINT: is the same version of pgBackRest installed on the local and remote host?
ERROR: [103]: unable to find a valid repository
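
The WARN line shows the repository host answering with pgBackRest 2.50 while this node runs 2.51, so every archive-get aborts. A quick way to confirm such a mismatch is to compare the versions on both sides; a minimal check, using the repo host and user from the archive-get command above:

pgbackrest version
ssh postgres@10.83.43.119 pgbackrest version

Both should print the same version before archiving is retried.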

After upgrading the pgbackrest server, I now get this error:

2024-04-30 09:25:12 CEST [29612-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 09:25:12 CEST [29612-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 09:25:12.412 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=29614-ee35ac41 --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 09:25:12.670 P00   INFO: unable to find 00000010.history in the archive
2024-04-30 09:25:12.771 P00   INFO: archive-get command end: completed successfully (363ms)
2024-04-30 09:25:12 CEST [859-806]  LOG:  waiting for WAL to become available at 27/B0002000
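
The replica is asking the primary for WAL segment 0000000F00000027000000B0, which the primary has already recycled, so recovery now depends entirely on the archive; the 00000010.history probe that "completed successfully" is just the replica checking for a newer timeline. Whether the repository still holds the missing segment can be checked with the info command, run from any host with repository access; a sketch, assuming the stanza name from the logs:

pgbackrest --stanza=postgres-cluster info

The "wal archive min/max" line in the output lists the oldest and newest segments available for the timeline.
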
chuegel commented 6 months ago

The other replica seems to be fine:

2024-04-30 09:57:30 CEST [700-23]  LOG:  recovery restart point at 2C/B4010EA8
2024-04-30 09:57:30 CEST [700-24]  DETAIL:  Last completed transaction was at log time 2024-04-30 09:57:22.977464+02.
2024-04-30 10:12:28 CEST [700-25]  LOG:  restartpoint starting: time
2024-04-30 10:12:30 CEST [700-26]  LOG:  restartpoint complete: wrote 21 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=2.018 s, sync=0.003 s, total=2.037 s; sync files=16, longest=0.002 s, average=0.001 s; distance=46 kB, estimate=14701 kB
vitabaks commented 6 months ago

It is strange that the pgbackrest package was not updated with target=system.

vitabaks commented 6 months ago

Now I understand: you are using a dedicated pgbackrest server, and that is where the old package is used. So yes, it's worth updating the pgbackrest server first.

P.S. I switched to MinIO (S3) and no longer have similar problems with pgbackrest versions.
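
With an S3-compatible repository there is no pgBackRest process on the remote side, so there are no client/server versions to keep in sync. A minimal sketch of the repository options in pgbackrest.conf for a MinIO backend; the bucket, endpoint, region, and keys below are placeholders:

[global]
repo1-type=s3
repo1-path=/pgbackrest
repo1-s3-bucket=pgbackrest
repo1-s3-endpoint=minio.example.com
repo1-s3-region=us-east-1
repo1-s3-key=<access-key>
repo1-s3-key-secret=<secret-key>
repo1-s3-uri-style=path

repo1-s3-uri-style=path is usually required for MinIO, which does not serve virtual-host-style bucket URLs by default.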

chuegel commented 6 months ago

> Now I understand: you are using a dedicated pgbackrest server, and that is where the old package is used. So yes, it's worth updating the pgbackrest server first.
>
> P.S. I switched to MinIO (S3) and no longer have similar problems with pgbackrest versions.

Yes, I use a dedicated pgbackrest server. After aligning the versions, one replica still complains with:

2024-04-30 10:29:30 CEST [56835-1]  LOG:  started streaming WAL from primary at 27/B0000000 on timeline 15
2024-04-30 10:29:30 CEST [56835-2]  FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000F00000027000000B0 has already been removed
2024-04-30 10:29:30.877 P00   INFO: archive-get command begin 2.51: [00000010.history, pg_wal/RECOVERYHISTORY] --exec-id=56837-4247d19a --log-level-console=info --log-level-file=detail --log-path=/var/log/pgbackrest --pg1-path=/var/lib/postgresql/15/main --process-max=4 --repo1-host=10.83.43.119 --repo1-host-user=postgres --repo1-path=/var/lib/pgbackrest --repo1-type=posix --stanza=postgres-cluster
2024-04-30 10:29:31.129 P00   INFO: unable to find 00000010.history in the archive

But there is no 00000010.history on the pgbackrest server:

ls -la /var/lib/pgbackrest/archive/postgres-cluster/15-1/
total 192
drwxr-x--- 7 postgres postgres  4096 Apr 27 00:01 .
drwxr-x--- 3 postgres postgres  4096 Apr 30 00:01 ..
-rw-r----- 1 postgres postgres   610 Mar  5 15:35 0000000F.history
drwxr-x--- 2 postgres postgres 36864 Apr 27 00:01 0000000F00000028
drwxr-x--- 2 postgres postgres 32768 Apr 16 07:01 0000000F00000029
drwxr-x--- 2 postgres postgres 36864 Apr 21 12:31 0000000F0000002A
drwxr-x--- 2 postgres postgres 36864 Apr 26 17:31 0000000F0000002B
drwxr-x--- 2 postgres postgres 24576 Apr 30 09:01 0000000F0000002C
chuegel commented 6 months ago

> It is strange that the pgbackrest package was not updated with target=system.

That's because the playbook didn't run against the pgbackrest host. Not sure why.

vitabaks commented 6 months ago

The playbook is designed to update the postgres cluster, not the backup server.

chuegel commented 6 months ago

I understand. The pgbackrest package was updated on the replicas successfully. On the leader, the playbook performs a switchover before upgrading packages; since the switchover failed, the leader was also left with an older version of pgbackrest. I manually upgraded the pgbackrest package on the leader and the pgbackrest server: apt install --only-upgrade pgbackrest
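
The same manual step can also be pushed from the Ansible control node as an ad-hoc task; a sketch, assuming the backup host is reachable under an inventory group named pgbackrest (the group name is an assumption, adjust to your inventory):

ansible pgbackrest -b -m ansible.builtin.apt -a "name=pgbackrest only_upgrade=true update_cache=true"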

I'm not quite sure which steps to take to recover that one replica.

vitabaks commented 6 months ago

Try reinitializing the replica: patronictl reinit postgres-cluster <problem replica name>

After you have fixed the problem with the different versions of the pgbackrest packages between the database servers and the backup servers, try running the update_pgcluster.yml playbook again to complete the cluster update.
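
With the names from this thread, that could look like the following; postgresql02 is the replica that was stuck in archive recovery, and patronictl asks for confirmation unless --force is passed:

patronictl -c /etc/patroni/patroni.yml reinit postgres-cluster postgresql02
ansible-playbook update_pgcluster.yml -e target=system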

vitabaks commented 6 months ago

I will also add updating the pgbackrest package on the backup server to the automation.

UPD: https://github.com/vitabaks/postgresql_cluster/pull/648

chuegel commented 6 months ago

> Try reinitializing the replica: patronictl reinit postgres-cluster <problem replica name>
>
> After you have fixed the problem with the different versions of the pgbackrest packages between the database servers and the backup servers, try running the update_pgcluster.yml playbook again to complete the cluster update.

That worked!!!

patronictl list postgres-cluster
+ Cluster: postgres-cluster (7253014758852064969) --+----+-----------+
| Member       | Host         | Role    | State     | TL | Lag in MB |
+--------------+--------------+---------+-----------+----+-----------+
| postgresql01 | 10.83.200.12 | Leader  | running   | 15 |           |
| postgresql02 | 10.83.200.13 | Replica | streaming | 15 |         0 |
| postgresql03 | 10.83.200.14 | Replica | streaming | 15 |         0 |
+--------------+--------------+---------+-----------+----+-----------+

Thank you, Sir!