pgbackrest / pgbackrest

Reliable PostgreSQL Backup & Restore
https://pgbackrest.org

Report: timeout before standby replayed to F/56000028 - only reached E/CE0000A0 for incremental backup from the standby #2483

Open 1053531172 opened 5 days ago

1053531172 commented 5 days ago

Please provide the following information when submitting an issue (feature requests or general comments can skip this):

  1. pgBackRest version: 2.47

  2. PostgreSQL version: 13.16

  3. Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each: CentOS Linux release 7.2.1511

  4. Did you install pgBackRest from source or from a package? source

  5. Please attach the following as applicable:

    • pgbackrest.conf file(s)
    • postgresql.conf settings applicable to pgBackRest (archive_command, archive_mode, listen_addresses, max_wal_senders, wal_level, port)
    • errors in the postgresql log file before or during the time you experienced the issue
    • log file in /var/log/pgbackrest for the commands run (e.g. /var/log/pgbackrest/mystanza_backup.log)
-------------------PROCESS START-------------------
2024-11-09 07:42:55.611 P00   INFO: backup command begin 2.47: --archive-check --archive-timeout=300 --backup-standby --compress-level=1 --compress-type=zst --config-include-path=/app/pg/pgbackrest/conf/conf.d --config-path=/app/pg/pgbackrest/conf --delta --exec-id=29470-20e72ad7 --log-level-console=detail --log-level-file=info --log-path=/app/pg/pgbackrest/logs --pg1-host=33.2.9.152 --pg2-host=33.2.9.154 --pg3-host=33.2.9.157 --pg1-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --pg2-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --pg3-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --pg1-host-cert-file=/app/pg/pgbackrest/certs/backup-srv.crt --pg2-host-cert-file=/app/pg/pgbackrest/certs/backup-srv.crt --pg3-host-cert-file=/app/pg/pgbackrest/certs/backup-srv.crt --pg1-host-key-file=/app/pg/pgbackrest/certs/backup-srv.key --pg2-host-key-file=/app/pg/pgbackrest/certs/backup-srv.key --pg3-host-key-file=/app/pg/pgbackrest/certs/backup-srv.key --pg1-host-port=6581 --pg2-host-port=6581 --pg3-host-port=6581 --pg1-host-type=tls --pg2-host-type=tls --pg3-host-type=tls --pg1-path=/data/pg_6543/data --pg2-path=/data/pg_6543/data --pg3-path=/data/pg_6543/data --pg1-port=6543 --pg2-port=6543 --pg3-port=6543 --pg1-user=ctgbackup --pg2-user=ctgbackup --pg3-user=ctgbackup --process-max=2 --repo1-path=/data/pg/backup --repo1-retention-full=9999 --repo1-retention-full-type=time --resume --stanza=061cc4c37ca642be8325161a19524e1d_1 --start-fast --type=incr
2024-11-09 07:42:57.663 P00   INFO: last backup label = 20241107-010025F_20241107-204458I, version = 2.47
2024-11-09 07:42:57.663 P00   INFO: execute non-exclusive backup start: backup begins after the requested immediate checkpoint completes
2024-11-09 07:42:58.274 P00   INFO: backup start archive = 000000020000000F00000056, lsn = F/56000028
2024-11-09 07:42:58.274 P00   INFO: wait for replay on the standby to reach F/56000028
2024-11-09 07:47:58.581 P00  ERROR: [082]: timeout before standby replayed to F/56000028 - only reached E/CE0000A0
                                    HINT: is replication running and current on the standby?
                                    HINT: disable the 'backup-standby' option to backup directly from the primary.
2024-11-09 07:47:58.581 P00   INFO: backup command end: aborted with exception [082]
  1. Describe the issue: I deployed an instance with one primary and two standbys. On one standby, streaming replication was broken but its status still appeared normal; the other standby had both normal status and working streaming replication. When I back up from a standby, the following error is reported:
    2024-11-09 07:47:58.581 P00   ERROR: [082]: timeout before standby replayed to F/56000028 - only reached E/CE0000A0
    HINT: is replication running and current on the standby?
    HINT: disable the 'backup-standby' option to backup directly from the primary.
    2024-11-09 07:47:58.581 P00    INFO: backup command end: aborted with exception [082]
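For reference, the two LSNs in the error message can be converted to absolute byte positions in the WAL stream to see how far behind the standby was. A minimal sketch (not part of pgBackRest; a PostgreSQL LSN is "hi/lo" with 32 hex bits in each half):

```python
# Convert a PostgreSQL LSN ("hi/lo", 32 bits each half) to an absolute
# byte position in the WAL stream, then compute the standby's replay lag.
def lsn_to_bytes(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

start_lsn = "F/56000028"   # backup start LSN from the error message
replayed  = "E/CE0000A0"   # LSN the standby had actually replayed

lag = lsn_to_bytes(start_lsn) - lsn_to_bytes(replayed)
print(f"standby lag: {lag} bytes (~{lag / 2**30:.2f} GiB)")
# → standby lag: 2281701256 bytes (~2.12 GiB)
```

So the selected standby was roughly 2 GiB of WAL behind the backup start point, far more than could be replayed within the 300-second archive-timeout.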

    From the error it looks as though pgBackRest detected that the WAL had not been archived, but in fact archiving on the primary has always been normal.

    2024-11-09 07:27:51.367 P00   INFO: pushed WAL file '000000020000000F00000054' to the archive asynchronously
    2024-11-09 07:27:51.367 P00   INFO: archive-push command end: completed successfully (403ms)
    2024-11-09 07:42:57.938 P00   INFO: archive-push command begin 2.47: [pg_wal/000000020000000F00000055] --archive-async --archive-header-check --archive-push-queue-max=100GiB --buffer-size=16MiB --config-path=/app/pg/pgbackrest/conf --exec-id=16750-0320e6da --log-level-console=detail --log-level-file=info --log-path=/app/pg/pgbackrest/logs --pg1-path=/data/pg_6543/data --process-max=2 --repo1-host=33.2.9.165 --repo1-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --repo1-host-cert-file=/app/pg/pgbackrest/certs/pg-srv.crt --repo1-host-config-path=/app/pg/pgbackrest/conf --repo1-host-key-file=/app/pg/pgbackrest/certs/pg-srv.key --repo1-host-port=6582 --repo1-host-type=tls --repo1-host-user=ctgbackup --spool-path=/app/pg/pgbackrest/spool --stanza=061cc4c37ca642be8325161a19524e1d_1
    2024-11-09 07:42:58.341 P00   INFO: pushed WAL file '000000020000000F00000055' to the archive asynchronously
    2024-11-09 07:42:58.341 P00   INFO: archive-push command end: completed successfully (404ms)
    2024-11-09 07:58:05.287 P00   INFO: archive-push command begin 2.47: [pg_wal/000000020000000F00000056] --archive-async --archive-header-check --archive-push-queue-max=100GiB --buffer-size=16MiB --config-path=/app/pg/pgbackrest/conf --exec-id=22124-6f1b2514 --log-level-console=detail --log-level-file=info --log-path=/app/pg/pgbackrest/logs --pg1-path=/data/pg_6543/data --process-max=2 --repo1-host=33.2.9.165 --repo1-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --repo1-host-cert-file=/app/pg/pgbackrest/certs/pg-srv.crt --repo1-host-config-path=/app/pg/pgbackrest/conf --repo1-host-key-file=/app/pg/pgbackrest/certs/pg-srv.key --repo1-host-port=6582 --repo1-host-type=tls --repo1-host-user=ctgbackup --spool-path=/app/pg/pgbackrest/spool --stanza=061cc4c37ca642be8325161a19524e1d_1
    2024-11-09 07:58:05.689 P00   INFO: pushed WAL file '000000020000000F00000056' to the archive asynchronously
    2024-11-09 07:58:05.689 P00   INFO: archive-push command end: completed successfully (403ms)

    It appears that pgBackRest chose the standby whose streaming replication was broken, so the latest WAL segment it needed was not present on that node. The timing supports this: the backup was performed at 2024-11-09 07:42, while 000000020000000F00000056 was pushed at 07:58. I checked the primary and found that this segment's last modification time was also 07:58, indicating that the primary's WAL pushing was normal. After streaming replication on the standby recovered, incremental backups returned to normal.

    -------------------PROCESS START-------------------
    2024-11-09 09:18:16.809 P00   INFO: backup command begin 2.47: --archive-check --archive-timeout=300 --backup-standby --compress-level=1 --compress-type=zst --config-include-path=/app/pg/pgbackrest/conf/conf.d --config-path=/app/pg/pgbackrest/conf --delta --exec-id=11040-c73db60b --log-level-console=detail --log-level-file=info --log-path=/app/pg/pgbackrest/logs --pg1-host=33.2.9.152 --pg2-host=33.2.9.154 --pg3-host=33.2.9.157 --pg1-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --pg2-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --pg3-host-ca-file=/app/pg/pgbackrest/certs/ca.crt --pg1-host-cert-file=/app/pg/pgbackrest/certs/backup-srv.crt --pg2-host-cert-file=/app/pg/pgbackrest/certs/backup-srv.crt --pg3-host-cert-file=/app/pg/pgbackrest/certs/backup-srv.crt --pg1-host-key-file=/app/pg/pgbackrest/certs/backup-srv.key --pg2-host-key-file=/app/pg/pgbackrest/certs/backup-srv.key --pg3-host-key-file=/app/pg/pgbackrest/certs/backup-srv.key --pg1-host-port=6581 --pg2-host-port=6581 --pg3-host-port=6581 --pg1-host-type=tls --pg2-host-type=tls --pg3-host-type=tls --pg1-path=/data/pg_6543/data --pg2-path=/data/pg_6543/data --pg3-path=/data/pg_6543/data --pg1-port=6543 --pg2-port=6543 --pg3-port=6543 --pg1-user=ctgbackup --pg2-user=ctgbackup --pg3-user=ctgbackup --process-max=2 --repo1-path=/data/pg/backup --repo1-retention-full=9999 --repo1-retention-full-type=time --resume --stanza=061cc4c37ca642be8325161a19524e1d_1 --start-fast --type=incr
    2024-11-09 09:18:18.861 P00   INFO: last backup label = 20241107-010025F_20241107-204458I, version = 2.47
    2024-11-09 09:18:18.861 P00   INFO: execute non-exclusive backup start: backup begins after the requested immediate checkpoint completes
    2024-11-09 09:18:19.471 P00   INFO: backup start archive = 000000020000000F0000005B, lsn = F/5B000028
    2024-11-09 09:18:19.471 P00   INFO: wait for replay on the standby to reach F/5B000028
    2024-11-09 09:18:19.777 P00   INFO: replay on the standby reached F/5B000028
    2024-11-09 09:18:19.778 P00   INFO: check archive for prior segment 000000020000000F0000005A
    2024-11-09 09:18:20.201 P00   WARN: a timeline switch has occurred since the 20241107-010025F_20241107-204458I backup, enabling delta checksum
                                    HINT: this is normal after restoring from backup or promoting a standby.
    2024-11-09 09:20:01.767 P00   INFO: execute non-exclusive backup stop and wait for all WAL segments to archive
    2024-11-09 09:20:01.970 P00   INFO: backup stop archive = 000000020000000F0000005B, lsn = F/5B004230
    2024-11-09 09:20:01.975 P00   INFO: check archive for segment(s) 000000020000000F0000005B:000000020000000F0000005B
    2024-11-09 09:20:02.140 P00   INFO: new backup label = 20241107-010025F_20241109-091818I
    2024-11-09 09:20:02.386 P00   INFO: incr backup size = 166.7MB, file total = 7089
    2024-11-09 09:20:02.391 P00   INFO: backup command end: completed successfully (105583ms)

    What I want to know is: how does pgBackRest choose which standby to back up from? Can I specify a standby with normal status and working streaming replication for the backup, to avoid backup failures caused by a standby with broken streaming replication? Thank you very much for your answer.

dwsteele commented 2 days ago

From the error it looks as though pgBackRest detected that the WAL had not been archived, but in fact archiving on the primary has always been normal.

No, it is detecting that the standby selected for the backup is not receiving WAL data in a timely fashion via streaming replication.
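As a sanity check before running a standby backup, the replication state described above can be inspected directly in PostgreSQL (these views and functions exist in PostgreSQL 13; treat the queries as an illustration, not something pgBackRest runs):

```sql
-- On the primary: per-standby replay position and lag in bytes
SELECT client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;

-- On each standby: the last LSN replayed locally
SELECT pg_last_wal_replay_lsn();
```

A standby whose `replay_lag_bytes` keeps growing (or which is missing from `pg_stat_replication` entirely) is the one that will cause the [082] timeout if pgBackRest selects it.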

What I want to know is: how does pgBackRest choose which standby to back up from? Can I specify a standby with normal status and working streaming replication for the backup, to avoid backup failures caused by a standby with broken streaming replication?

You can select which standby is used for a backup by configuring only that standby. If multiple standby nodes are configured, the first one that accepts a connection will be selected.
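For example, a repository-host pgbackrest.conf that pins the backup to a single healthy standby might look like the sketch below. The hosts are taken from the logs above, but which address is the primary and which is the healthy standby is an assumption here; only the primary and the one good standby are listed, so `backup-standby` can only select that node:

```ini
[061cc4c37ca642be8325161a19524e1d_1]
# pg1: the primary (assumed to be 33.2.9.152 for this example)
pg1-host=33.2.9.152
pg1-path=/data/pg_6543/data
pg1-port=6543
# pg2: the healthy standby (assumed to be 33.2.9.154); the standby with
# broken replication is simply omitted, so it can never be selected
pg2-host=33.2.9.154
pg2-path=/data/pg_6543/data
pg2-port=6543
backup-standby=y
```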

We do have plans to be smarter about which standby node we select, but for the most part the current method works well as long as your standbys are up to date. If they are not, you may want to prioritize fixing the degraded system rather than applying the additional load of a backup.