hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Primary stuck in wait_primary #924

Closed olcrazypete closed 1 year ago

olcrazypete commented 1 year ago

Migrating HA to a new pg_auto_failover setup in an established cluster after a PostgreSQL 14 upgrade.
Setting up with password auth.

I've worked through issues with pg_hba and setting up the replication login user.

The primary node is configured and reporting to the monitor fine as a single node. No issues.

Running the node install on the replica, pg_basebackup runs and pulls the database copy fine. Starting the pg_autoctl service then brings Postgres up post-basebackup. The replica seems to connect to the primary and then begins repeatedly triggering a reload of the primary's configuration. The only setting I see changing at startup is ALTER SYSTEM SET synchronous_standby_names TO 'ANY 1 (...)', as expected.
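For anyone tracing the same thing, the primary's side of this can be checked with a couple of read-only queries (standard Postgres views; connection details are whatever fits this cluster):

# what pg_auto_failover last wrote for synchronous replication on the primary
psql -c "SHOW synchronous_standby_names;"

# whether the standby's walsender is attached, and what sync_state it reports
psql -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"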

The monitor shows the replica go into the read-only secondary state, but the primary node never switches from wait_primary to primary. This of course ends up blocking writes on the database.
I've changed synchronous_commit to local on the primary to stop this, but that still leaves the replica unavailable to fail over to. The timeline and LSN numbers are reported and match, so I can tell replication is working.
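That synchronous_commit workaround on the primary amounts to something like this (a sketch, assuming it is done with ALTER SYSTEM rather than by editing postgresql.conf):

# run on the primary only; unblocks commits but does not fix the wait_primary state
psql -c "ALTER SYSTEM SET synchronous_commit TO 'local';"
psql -c "SELECT pg_reload_conf();"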

[root@vms05 log]# pg_autoctl --version
pg_autoctl version 1.6.4
pg_autoctl extension version 1.6
compiled with PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
compatible with Postgres 10, 11, 12, 13, and 14

Primary node - repeated log lines:

2022-08-19 07:26:31.503 PDT [31515] LOG: connection received: host=[local]
2022-08-19 07:26:31.504 PDT [31515] LOG: connection authorized: user=ha-admin database=appdb
2022-08-19 07:26:31.511 PDT [31515] LOG: disconnection: session time: 0:00:00.007 user=ha-admin database=appdb host=[local]
2022-08-19 07:26:31.560 PDT [31516] LOG: connection received: host=[local]
2022-08-19 07:26:31.561 PDT [31516] LOG: connection authorized: user=ha-admin database=appdb
2022-08-19 07:26:31.566 PDT [31516] LOG: disconnection: session time: 0:00:00.006 user=ha-admin database=appdb host=[local]
2022-08-19 07:26:31.583 PDT [31517] LOG: connection received: host=[local]
2022-08-19 07:26:31.583 PDT [31517] LOG: connection authorized: user=ha-admin database=appdb
2022-08-19 07:26:31.585 PDT [31517] LOG: statement: ALTER SYSTEM SET synchronous_standby_names TO 'ANY 1 (pgautofailover_standby_9)'
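Since the keeper applies that setting with ALTER SYSTEM, the value it keeps rewriting can also be confirmed straight from postgresql.auto.conf on the primary (PGDATA path as used in this setup):

grep synchronous_standby_names /var/lib/pgsql/14/data/postgresql.auto.conf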

pg_autoctl output to journald on the primary:

Aug 19 11:06:55 dev6.3sisac.local pg_autoctl[12133]: 11:06:55 12137 ERROR Failed to transition to state "primary", retrying...
Aug 19 11:06:56 dev6.3sisac.local pg_autoctl[12133]: 11:06:56 12137 INFO Monitor assigned new state "primary"
Aug 19 11:06:56 dev6.3sisac.local pg_autoctl[12133]: 11:06:56 12137 INFO FSM transition from "wait_primary" to "primary": A healthy secondary appeared
Aug 19 11:06:56 dev6.3sisac.local pg_autoctl[12133]: 11:06:56 12137 INFO Setting synchronous_standby_names to 'ANY 1 (pgautofailover_standby_9)'
Aug 19 11:06:56 dev6.3sisac.local pg_autoctl[12133]: 11:06:56 12137 INFO Reloading Postgres configuration and HBA rules
Aug 19 11:06:56 dev6.3sisac.local pg_autoctl[12133]: 11:06:56 12137 WARN Failed to set the standby Target LSN because we don't have a quorum candidate yet
Aug 19 11:06:56 dev6.3sisac.local pg_autoctl[12133]: 11:06:56 12137 ERROR Failed to transition from state "wait_primary" to state "primary", see above.
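The "quorum candidate" warning refers to the node-level replication settings stored on the monitor. They can be read back, and changed if needed, from the keeper node with something like the following (the values shown are simply what node 9 registered with, replication quorum true and candidate priority 50):

pg_autoctl get node replication-quorum --pgdata /var/lib/pgsql/14/data
pg_autoctl get node candidate-priority --pgdata /var/lib/pgsql/14/data

# both already look correct here, but for completeness:
pg_autoctl set node replication-quorum true --pgdata /var/lib/pgsql/14/data
pg_autoctl set node candidate-priority 50 --pgdata /var/lib/pgsql/14/data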

Replica node install:

-bash-4.2$ pg_autoctl create postgres --pgdata /var/lib/pgsql/14/data/ --auth scram-sha-256 --no-ssl --username ha-admin --dbname appdb --hostname dev5.xxx.local --pgctl /usr/pgsql-14/bin/pg_ctl --monitor postgres://autoctl_node:xxx@vms05.xxx.local/pg_auto_failover
07:03:59 8498 WARN No encryption is used for network traffic! This allows an attacker on the network to read all replication data.
07:03:59 8498 WARN Using --ssl-self-signed instead of --no-ssl is recommend to achieve more security with the same ease of deployment.
07:03:59 8498 WARN See https://www.postgresql.org/docs/current/libpq-ssl.html for details on how to improve
07:03:59 8498 INFO Using default --ssl-mode "prefer"
07:03:59 8498 INFO Started pg_autoctl postgres service with pid 8501
07:03:59 8498 INFO Started pg_autoctl node-init service with pid 8502
07:03:59 8501 INFO /usr/pgsql-14/bin/pg_autoctl do service postgres --pgdata /var/lib/pgsql/14/data/ -v
07:03:59 8502 INFO This node had been dropped previously, now trying to register it again
07:03:59 8502 INFO Registering Postgres system 7098424926999959600 found at "/var/lib/pgsql/14/data"
07:03:59 8502 INFO Registered node 9 "node_9" (dev5.xxx.local:5432) in formation "default", group 0, state "wait_standby"
07:03:59 8502 INFO Writing keeper state file at "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/14/data/pg_autoctl.state"
07:03:59 8502 INFO Making sure postgres was stopped, when it was previously dropped
07:03:59 8502 INFO Writing keeper init state file at "/var/lib/pgsql/.local/share/pg_autoctl/var/lib/pgsql/14/data/pg_autoctl.init"
07:03:59 8502 INFO Successfully registered as "wait_standby" to the monitor.
07:03:59 8502 INFO FSM transition from "dropped" to "wait_standby": Start following a primary
07:03:59 8502 INFO Transition complete: current state is now "wait_standby"
07:03:59 8502 INFO New state for node 5 "node_5" (dev6.3sisac.local:5432): single ➜ wait_primary
07:03:59 8502 INFO New state for node 5 "node_5" (dev6.3sisac.local:5432): wait_primary ➜ wait_primary
07:03:59 8502 INFO FSM transition from "wait_standby" to "catchingup": The primary is now ready to accept a standby
07:03:59 8502 INFO Initialising PostgreSQL as a hot standby
07:03:59 8502 INFO Skipping base backup to use pre-existing PGDATA at "/var/lib/pgsql/14/data"
07:03:59 8507 INFO /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data -p 5432 -h *
07:03:59 8502 INFO PostgreSQL started on port 5432
07:03:59 8502 INFO Transition complete: current state is now "catchingup"
07:03:59 8501 INFO Postgres is now serving PGDATA "/var/lib/pgsql/14/data" on port 5432 with pid 8507
07:03:59 8502 INFO keeper has been successfully initialized.
07:03:59 8498 WARN pg_autoctl service node-init exited with exit status 0
07:03:59 8501 INFO Postgres controller service received signal SIGTERM, terminating
07:03:59 8501 INFO Stopping pg_autoctl postgres service
07:03:59 8501 INFO /usr/pgsql-14/bin/pg_ctl --pgdata /var/lib/pgsql/14/data --wait stop --mode fast
07:04:00 8498 INFO Stop pg_autoctl

Monitor node:

Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO Registering node 9 "node_9" (dev5.xxx.local:5432) to formation "default" with replication quorum true and candidate priority 50 [50]
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state is reported by node 9 "node_9" (dev5.xxx.local:5432): "wait_standby"
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): wait_standby ➜ wait_standby
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO Setting goal state of node 5 "node_5" (dev6.xxx.local:5432) to wait_primary after node 9 "node_9" (dev5.xxx.local:5432) joined.
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state for node 5 "node_5" (dev6.xxx.local:5432): single ➜ wait_primary
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state is reported by node 5 "node_5" (dev6.xxx.local:5432): "wait_primary"
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state for node 5 "node_5" (dev6.xxx.local:5432): wait_primary ➜ wait_primary
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO Setting goal state of node 9 "node_9" (dev5.xxx.local:5432) to catchingup after node 5 "node_5" (dev6.xxx.local:5432) converged to wait_primary.
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): wait_standby ➜ catchingup
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state is reported by node 9 "node_9" (dev5.xxx.local:5432): "catchingup"
Aug 19 07:03:59 vms05.xxx.local pg_autoctl[13559]: 07:03:59 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): catchingup ➜ catchingup
Aug 19 07:04:09 vms05.xxx.local pg_autoctl[13559]: 07:04:09 13563 INFO Node node 9 "node_9" (dev5.xxx.local:5432) is marked as unhealthy by the monitor
Aug 19 07:04:09 vms05.xxx.local pg_autoctl[13559]: 07:04:09 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): catchingup ➜ catchingup
Aug 19 07:04:13 vms05.xxx.local pg_autoctl[13559]: 07:04:13 13563 INFO Node node 9 "node_9" (dev5.xxx.local:5432) is marked as healthy by the monitor
Aug 19 07:04:13 vms05.xxx.local pg_autoctl[13559]: 07:04:13 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): catchingup ➜ catchingup
Aug 19 07:04:14 vms05.xxx.local pg_autoctl[13559]: 07:04:14 13563 INFO Setting goal state of node 9 "node_9" (dev5.xxx.local:5432) to secondary after it caught up.
Aug 19 07:04:14 vms05.xxx.local pg_autoctl[13559]: 07:04:14 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): catchingup ➜ secondary
Aug 19 07:04:14 vms05.xxx.local pg_autoctl[13559]: 07:04:14 13563 INFO New state is reported by node 9 "node_9" (dev5.xxx.local:5432): "secondary"
Aug 19 07:04:14 vms05.xxx.local pg_autoctl[13559]: 07:04:14 13563 INFO New state for node 9 "node_9" (dev5.xxx.local:5432): secondary ➜ secondary
Aug 19 07:04:14 vms05.xxx.local pg_autoctl[13559]: 07:04:14 13563 INFO Setting goal state of node 5 "node_5" (dev6.xxx.local:5432) to primary now that we have 1 healthy secondary nodes in the quorum.
Aug 19 07:04:14 vms05.xxx.local pg_autoctl[13559]: 07:04:14 13563 INFO New state for node 5 "node_5" (dev6.xxx.local:5432): wait_primary ➜ primary

Formation: default - Sync Standbys: 0 07:33:54

Name    Node  Quorum  Priority  TLI: LSN       Check  Connection  Report  Reported State  Assigned State
node_5     5  yes           50  3: 4B/99A3690     3s  read-write      1s  wait_primary    primary
node_9     9  yes           50  3: 4B/99A1F68     3s  read-only       1s  secondary       secondary

Id   Event Time           Name    Description
130  2022-08-19 07:04:14  node_5  Setting goal state of node 5 "node_5" (dev6.xxx.local:5432) to primary now that we
129  2022-08-19 07:04:14  node_9  New state is reported by node 9 "node_9" (dev5.xxx.local:5432): "secondary"
128  2022-08-19 07:04:14  node_9  Setting goal state of node 9 "node_9" (dev5.xxx.local:5432) to secondary after it c
127  2022-08-19 07:04:13  node_9  Node node 9 "node_9" (dev5.xxx.local:5432) is marked as healthy by the monitor
126  2022-08-19 07:04:09  node_9  Node node 9 "node_9" (dev5.xxx.local:5432) is marked as unhealthy by the monitor
125  2022-08-19 07:03:59  node_9  New state is reported by node 9 "node_9" (dev5.xxx.local:5432): "catchingup"
124  2022-08-19 07:03:59  node_9  Setting goal state of node 9 "node_9" (dev5.xxx.local:5432) to catchingup after nod
123  2022-08-19 07:03:59  node_5  New state is reported by node 5 "node_5" (dev6.xxx.local:5432): "wait_primary"
122  2022-08-19 07:03:59  node_5  Setting goal state of node 5 "node_5" (dev6.xxx.local:5432) to wait_primary after n
121  2022-08-19 07:03:59  node_9  New state is reported by node 9 "node_9" (dev5.xxx.local:5432): "wait_standby"
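For reference, the two listings above are the monitor's state and events views; something like the following reproduces them from a data node:

pg_autoctl show state --pgdata /var/lib/pgsql/14/data
pg_autoctl show events --count 10 --pgdata /var/lib/pgsql/14/data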

olcrazypete commented 1 year ago

I ended up manually running a base backup to the secondary node and promoting it, removing the former primary node, adding the new secondary node as a single primary, and then re-adding the former primary node. Whatever was stuck fixed itself at that point, and the servers went into primary and secondary mode, with failovers happening on demand without issues.
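For anyone needing to do the same, the recovery boils down to something like the sketch below. Hostnames, database user, and PGDATA are the ones from this thread, and the replication user is pg_auto_failover's default, so adjust as needed.

# 1. fresh copy onto the node that will become the new primary, then promote it by hand
#    (PGDATA must be empty before pg_basebackup; -R writes standby.signal and primary_conninfo)
pg_basebackup -h dev6.3sisac.local -U pgautofailover_replicator -D /var/lib/pgsql/14/data -R -P
/usr/pgsql-14/bin/pg_ctl -D /var/lib/pgsql/14/data start
/usr/pgsql-14/bin/pg_ctl -D /var/lib/pgsql/14/data promote

# 2. drop both nodes from the monitor so the formation is empty again (run on each node)
pg_autoctl drop node --pgdata /var/lib/pgsql/14/data

# 3. re-register the promoted node first; with no peers it comes up as "single"
pg_autoctl create postgres --pgdata /var/lib/pgsql/14/data --auth scram-sha-256 --no-ssl \
    --username ha-admin --dbname appdb --hostname dev5.xxx.local \
    --pgctl /usr/pgsql-14/bin/pg_ctl \
    --monitor postgres://autoctl_node:xxx@vms05.xxx.local/pg_auto_failover

# 4. then re-add the former primary the same way with its own --hostname; it rejoins as a
#    secondary and the pair settles into primary/secondary with failover working again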