ossc-db / pg_rman

Backup and restore management tool for PostgreSQL
http://ossc-db.github.io/pg_rman/index.html

ERROR: switched WAL could not be archived in 10 seconds - postmaster.pid copied 100 #267

Closed TheOriginalGraLargeShrimpakaReaper closed 5 months ago

TheOriginalGraLargeShrimpakaReaper commented 5 months ago

The Architecture / Idea

We use the vitabaks/postgresql_cluster repository (https://github.com/vitabaks/postgresql_cluster) to deploy a Patroni cluster with etcd and HAProxy on Debian 12 machines. We deployed an Oracle Linux 9 machine as the pg_rman host. pg_rman itself basically runs.

Our goal is to use pg_rman with a remote PostgreSQL RMAN catalog database (attached diagram: pg_rman_architecture).

We use PostgreSQL 16 on both sides.

The idea is to run a Veeam script that calls pg_rman against the virtual IP and the read-only ports, so that the replica nodes can be used for the backups.

The Issue

We can initialize the backup catalog on the pg_rman host. But when we start a full backup, the job runs until it reaches "postmaster.pid copied 100". Then pg_rman fails with "ERROR: switched WAL could not be archived in 10 seconds", regardless of whether we back up from the primary or the standby host. Here are the commands; the output is attached as files.

Primary:
/usr/pgsql-16/bin/pg_rman backup --backup-mode=full --with-serverlog --progress --host=<vip> --port=5000 -B /tmp/rman_catalog --pgdata=/var/lib/pgsql/16/data/ --username=<username> --dbname=<dbname> --arclog-path=/var/lib/postgresql/16/main/pg_wal/ --verbose --smooth-checkpoint --debug

Standby:
/usr/pgsql-16/bin/pg_rman backup --backup-mode=full --progress --standby-host=<vip> --standby-port=5001 --pgdata=/var/lib/pgsql/16/data --backup-path=/tmp/rman_catalog_5 --arclog-path=/var/lib/postgresql/16/main/pg_wal --username=<username> --verbose --smooth-checkpoint --password --debug

Attachments: pg_rman_standby.txt, pg_rman_primary.txt

Questions

  1. Can this architecture work at all?
  2. What is the problem? Is there a way to increase the timeout? On the Patroni cluster, archive_timeout is set to 1800s.
TheOriginalGraLargeShrimpakaReaper commented 5 months ago

Here is the debug output:

DEBUG: executing pg_backup_stop()
DEBUG: (query) SET client_min_messages = warning;
DEBUG: (query) SELECT * FROM pg_backup_stop($1)
DEBUG: (param:0) = true
DEBUG: (query) SELECT * FROM pg_walfile_name_offset($1)
DEBUG: (param:0) = 0/E000100
DEBUG: backup end point is (WAL file: 00000001000000000000000E, xrecoff: 256)
DEBUG: waiting for 00000001000000000000000E is archived
DEBUG: (query) SELECT txid_current();
DEBUG: current XID is 747
ERROR: switched WAL could not be archived in 10 seconds
DEBUG: update backup status from RUNNING to ERROR
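The log shows that, after pg_backup_stop(), pg_rman waits for segment 00000001000000000000000E to show up as archived and gives up after 10 seconds. One rough way to check from the pg_rman host whether that segment ever becomes visible in the directory passed as --arclog-path is a small polling loop like the one below. This is a hypothetical diagnostic helper (wait_for_wal is not part of pg_rman), and it assumes archived segments appear in that directory as plain files named after the segment.

```shell
#!/bin/sh
# Hypothetical diagnostic helper, NOT part of pg_rman:
# poll ARCHIVE_DIR until WAL_FILE appears or TIMEOUT_SECONDS expire.
# Assumes archived segments are stored as plain files named after the segment.
# Usage: wait_for_wal ARCHIVE_DIR WAL_FILE TIMEOUT_SECONDS
wait_for_wal() {
    archive_dir=$1
    wal_file=$2
    timeout=$3
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Succeed as soon as the segment is visible from this host
        if [ -f "$archive_dir/$wal_file" ]; then
            echo "found $wal_file after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$wal_file not found in $archive_dir after ${timeout}s" >&2
    return 1
}
```

For example, on the pg_rman host: wait_for_wal /var/lib/postgresql/16/main/pg_wal 00000001000000000000000E 10. If this times out there while the file does exist on the database server, that would suggest the archive directory is simply not shared between the two machines.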

TheOriginalGraLargeShrimpakaReaper commented 5 months ago

Problem

We had not mounted NFS shares from the remote database servers on the RMAN catalog server.

Solution

Mount NFS shares for $PGDATA, the archive log directory, and the log directory.

There are further problems, but those are a topic for another issue.