Tweaking repmgr failover settings making the cluster unstable

Attached are logs for the master pod after being restarted by statefulset controller:

Setting up STOP handlers... STARTING SSH (if required)... cp: cannot stat '/home/postgres/.ssh/keys/*': No such file or directory No pre-populated ssh keys! SSH is not enabled! STARTING POSTGRES... SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)... TUNING UP POSTGRES... Configuring /var/lib/postgresql/data/postgresql.conf

Will add configs to the exists file Adding config 'wal_keep_segments'='250' Adding config 'shared_buffers'='300MB' Adding config 'archive_command'=''/bin/true'' Adding config 'shared_preload_libraries'=''repmgr_funcs'' 2019-03-07 11:03:02.689191+00 Ex-Master restarted as slave to sleep for 15 seconds to enable cluster to elect new master Check all partner nodes for common upstream node... Checking NODE=mysystem-db-node-0.mysystem-db... psql: could not connect to server: Connection refused Is the server running on host "mysystem-db-node-0.mysystem-db" (10.32.0.7) and accepting TCP/IP connections on port 5432? Skipping: failed to get master from the node! Checking NODE=mysystem-db-node-1.mysystem-db...

Pretending master role node - mysystem-db-node-1.mysystem-db Checking NODE=mysystem-db-node-2.mysystem-db... Pretending master role node - mysystem-db-node-1.mysystem-db Auto-detected master name: 'mysystem-db-node-1.mysystem-db' The node was acting as a master before restart! Current master is mysystem-db-node-1.mysystem-db. Will clone/rewind it and act as a standby node... Setting up repmgr... Setting up repmgr config file '/etc/repmgr.conf'... Setting up upstream node... cat: /var/lib/postgresql/data/standby.lock: No such file or directory Previously Locked standby upstream node LOCKED_STANDBY='' Waiting for upstream postgres server... Wait schema replica_db.repmgr_mysystem_cluster on mysystem-db-node-1.mysystem-db:5432(user: replica_user,password: ), will try 30 times with delay 10 seconds (TIMEOUT=300) Schema replica_db.repmgr_mysystem_cluster exists on host mysystem-db-node-1.mysystem-db:5432! Waiting for upstream node id replica_db on mysystem-db-node-1.mysystem-db:5432(user: replica_user,password: ), will try 30 times with delay 10 seconds (TIMEOUT=300) Upstream node id 1001 received from host mysystem-db-node-1.mysystem-db:5432! REPLICATION_UPSTREAM_NODE_ID=1001 Sending in background postgres start... Waiting for upstream postgres server... Wait schema replica_db.repmgr_mysystem_cluster on mysystem-db-node-1.mysystem-db:5432(user: replica_user,password: ***), will try 30 times with delay 10 seconds (TIMEOUT=300) Schema replica_db.repmgr_mysystem_cluster exists on host mysystem-db-node-1.mysystem-db:5432! Data folder is not empty /var/lib/postgresql/data: total 128 drwx------ 19 postgres root 4096 Mar 7 11:08 . drwxr-xr-x 1 postgres postgres 4096 Feb 15 01:24 .. drwx------ 7 postgres postgres 4096 Mar 7 11:02 base drwx------ 2 postgres postgres 4096 Mar 7 11:03 global drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_commit_ts drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_dynshmem -rwx------ 1 postgres postgres 4579 Mar 7 11:02 pg_hba.conf -rwx------ 1 postgres postgres 1636 Mar 7 11:02 pg_ident.conf drwx------ 4 postgres postgres 4096 Mar 7 11:03 pg_logical drwx------ 4 postgres postgres 4096 Mar 7 11:02 pg_multixact drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_notify drwx------ 4 postgres postgres 4096 Mar 7 11:03 pg_replslot drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_serial drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_snapshots drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_stat drwx------ 2 postgres postgres 4096 Mar 7 11:04 pg_stat_tmp drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_subtrans drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_tblspc drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_twophase -rwx------ 1 postgres postgres 3 Mar 7 11:02 PG_VERSION drwx------ 3 postgres postgres 4096 Mar 7 11:03 pg_wal drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_xact -rwx------ 1 postgres postgres 88 Mar 7 11:02 postgresql.auto.conf -rwx------ 1 postgres postgres 21636 Mar 7 11:08 postgresql.conf -rwx------ 1 postgres postgres 36 Mar 7 11:02 postmaster.opts -rwx------ 1 postgres postgres 95 Mar 7 11:04 postmaster.pid Starting standby node... Instance has been set up already. Rewinding to the latest state Archiving configs ERROR: connection to database failed: could not connect to server: Connection refused Is the server running on host "mysystem-db-node-0.mysystem-db" (10.32.0.7) and accepting TCP/IP connections on port 5432?

Start server to be able to rewind (weird hack to avoid dirty shutdown issue) waiting for server to start....2019-03-07 11:08:23.233 UTC [166] LOG: listening on IPv4 address "127.0.0.1", port 5432 2019-03-07 11:08:23.233 UTC [166] LOG: could not bind IPv6 address "::1": Cannot assign requested address 2019-03-07 11:08:23.233 UTC [166] HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry. 2019-03-07 11:08:23.253 UTC [166] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2019-03-07 11:08:23.291 UTC [167] LOG: database system was interrupted; last known up at 2019-03-07 11:03:08 UTC 2019-03-07 11:08:23.313 UTC [167] LOG: database system was not properly shut down; automatic recovery in progress 2019-03-07 11:08:23.324 UTC [167] LOG: redo starts at 0/4000028 2019-03-07 11:08:23.325 UTC [167] LOG: invalid record length at 0/5005010: wanted 24, got 0 2019-03-07 11:08:23.325 UTC [167] LOG: redo done at 0/5004FD8 2019-03-07 11:08:23.325 UTC [167] LOG: last completed transaction was at log time 2019-03-07 11:03:39.55614+00 2019-03-07 11:08:23.395 UTC [166] LOG: database system is ready to accept connections done server started Removing unactive replication slots of partners Getting slots from master node repmgr_slot_1002 Closing slots which exist on master node Closing slot repmgr_slot_1002

Closing slot which current master used on the node

Closing slot repmgr_slot_1001

Stop server 2019-03-07 11:08:23.618 UTC [166] LOG: received fast shutdown request waiting for server to shut down....2019-03-07 11:08:23.625 UTC [166] LOG: aborting any active transactions 2019-03-07 11:08:23.627 UTC [166] LOG: worker process: logical replication launcher (PID 173) exited with exit code 1 2019-03-07 11:08:23.627 UTC [168] LOG: shutting down 2019-03-07 11:08:23.684 UTC [166] LOG: database system is shut down done server stopped Creating replication slot for the standby in case it's not there cat: /var/lib/postgresql/data/recovery.conf: No such file or directory Rewind to the host mysystem-db-node-1.mysystem-db servers diverged at WAL location 0/5005010 on timeline 1 rewinding from last common checkpoint at 0/4000060 on timeline 1 Done! Restoring configs NOTICE: 0 files copied to /var/lib/postgresql/data Tell repmgr to follow upstream for the node NOTICE: looking for configuration file in current directory NOTICE: looking for configuration file in /etc NOTICE: configuration file found at: /etc/repmgr.conf DEBUG: slot name initialised as: repmgr_slot_1000 DEBUG: do_standby_follow() ERROR: connection to database failed: fe_sendauth: no password supplied

Stop server pg_ctl: PID file "/var/lib/postgresql/data/postmaster.pid" does not exist Is server running? Configuring /var/lib/postgresql/data/postgresql.conf Will add configs to the exists file Adding config 'wal_keep_segments'='250' Adding config 'shared_buffers'='300MB' Adding config 'archive_command'=''/bin/true'' Adding config 'shared_preload_libraries'=''repmgr_funcs'' Starting postgres... Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30 Recovery is in progress: 2019-03-07 11:08:25.972 UTC [264] LOG: listening on IPv4 address "0.0.0.0", port 5432 2019-03-07 11:08:25.972 UTC [264] LOG: listening on IPv6 address "::", port 5432 2019-03-07 11:08:25.996 UTC [264] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2019-03-07 11:08:26.035 UTC [273] LOG: database system was interrupted while in recovery at log time 2019-03-07 11:04:49 UTC 2019-03-07 11:08:26.035 UTC [273] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2019-03-07 11:08:26.069 UTC [273] LOG: redo starts at 0/4000028 2019-03-07 11:08:26.088 UTC [273] LOG: consistent recovery state reached at 0/5022770 2019-03-07 11:08:26.088 UTC [273] LOG: invalid record length at 0/5022770: wanted 24, got 0 2019-03-07 11:08:26.088 UTC [273] LOG: redo done at 0/5022738 2019-03-07 11:08:26.088 UTC [273] LOG: last completed transaction was at log time 2019-03-07 11:08:23.794413+00 2019-03-07 11:08:26.158 UTC [264] LOG: database system is ready to accept connections RECOVERY_WAL_ID is empty! Not in recovery state (anymore) Waiting for local postgres server start... Wait schema replica_db.public on mysystem-db-node-0.mysystem-db:5432(user: replica_user,password: ***), will try 60 times with delay 10 seconds (TIMEOUT=600) Schema replica_db.public exists on host mysystem-db-node-0.mysystem-db:5432! Unregister the node if it was done before DELETE 1 Registering node with role standby INFO: connecting to standby database ERROR: this node should be a standby (user=replica_user password=replica_pass host=mysystem-db-node-0.mysystem-db dbname=replica_db port=5432 connect_timeout=2) Starting repmgr daemon... [2019-03-07 11:08:56] [NOTICE] looking for configuration file in current directory [2019-03-07 11:08:56] [NOTICE] looking for configuration file in /etc [2019-03-07 11:08:56] [NOTICE] configuration file found at: /etc/repmgr.conf [2019-03-07 11:08:56] [INFO] connecting to database 'user=replica_user password=replica_pass host=mysystem-db-node-0.mysystem-db dbname=replica_db port=5432 connect_timeout=2' [2019-03-07 11:08:56] [INFO] connected to database, checking its state [2019-03-07 11:08:56] [ERROR] This node is marked as inactive and cannot be used for failover [2019-03-07 11:08:56] [HINT] Check that 'repmgr (master|standby) register' was executed for this node [2019-03-07 11:08:56] [INFO] repmgrd terminating...

We resolved the issue by updating the postgres/entrypoint.sh to first sleep to allow cluster to elect a new master, and then delete contents of PGDATA to erase its historical data so as to allow it to come back as standby cat $MASTER_ROLE_LOCK_FILE_NAME if [ -f "$MASTER_ROLE_LOCK_FILE_NAME" ]; then SLEEP_TIME= 15 echo ">>> Ex-Master restarted as slave to sleep for $SLEEP_TIME seconds to enable cluster to elect new master" sleep $SLEEP_TIME echo ">>> Deleting contents of PGDATA" rm -rf $PGDATA/* fi

paunin / PostDock

Tweaking repmgr failover settings making the cluster unstable #217