paunin / PostDock

PostDock - Postgres & Docker - Postgres streaming replication cluster for any docker environment
MIT License
1.11k stars 339 forks source link

Tweaking repmgr failover settings making the cluster unstable #217

Open asawaribhagat opened 5 years ago

asawaribhagat commented 5 years ago

Hello!

While testing failover management provided by PostDock [PostgreSQL 10 repmgr 3.3] using k8s/example2-single-statefulset, I wanted to reduce the time taken to promote replica to master from 15 seconds to 1 second. For this I reduced updated following parameters for repmgrd: reconnect_attempts=1 reconnect_interval=1 Then we killed master using "kubectl delete pod"

I have observed that the master comes back again and joins the cluster as "master" causing the cluster to become unstable. This happens only if the above two configurations for repmgr are tweaked, and works fine with the default configuration. Are there any other dependent parameters that also need to be updated?

asawaribhagat commented 5 years ago

Attached are logs for the master pod after being restarted by statefulset controller:

Setting up STOP handlers... STARTING SSH (if required)... cp: cannot stat '/home/postgres/.ssh/keys/*': No such file or directory No pre-populated ssh keys! SSH is not enabled! STARTING POSTGRES... SETTING UP POLYMORPHIC VARIABLES (repmgr=3+postgres=9 | repmgr=4, postgres=10)... TUNING UP POSTGRES... Configuring /var/lib/postgresql/data/postgresql.conf

Will add configs to the exists file Adding config 'wal_keep_segments'='250' Adding config 'shared_buffers'='300MB' Adding config 'archive_command'=''/bin/true'' Adding config 'shared_preload_libraries'=''repmgr_funcs'' 2019-03-07 11:03:02.689191+00 Ex-Master restarted as slave to sleep for 15 seconds to enable cluster to elect new master Check all partner nodes for common upstream node... Checking NODE=mysystem-db-node-0.mysystem-db... psql: could not connect to server: Connection refused Is the server running on host "mysystem-db-node-0.mysystem-db" (10.32.0.7) and accepting TCP/IP connections on port 5432? Skipping: failed to get master from the node! Checking NODE=mysystem-db-node-1.mysystem-db...

Pretending master role node - mysystem-db-node-1.mysystem-db Checking NODE=mysystem-db-node-2.mysystem-db... Pretending master role node - mysystem-db-node-1.mysystem-db Auto-detected master name: 'mysystem-db-node-1.mysystem-db' The node was acting as a master before restart! Current master is mysystem-db-node-1.mysystem-db. Will clone/rewind it and act as a standby node... Setting up repmgr... Setting up repmgr config file '/etc/repmgr.conf'... Setting up upstream node... cat: /var/lib/postgresql/data/standby.lock: No such file or directory Previously Locked standby upstream node LOCKED_STANDBY='' Waiting for upstream postgres server... Wait schema replica_db.repmgr_mysystem_cluster on mysystem-db-node-1.mysystem-db:5432(user: replica_user,password: ), will try 30 times with delay 10 seconds (TIMEOUT=300) Schema replica_db.repmgr_mysystem_cluster exists on host mysystem-db-node-1.mysystem-db:5432! Waiting for upstream node id replica_db on mysystem-db-node-1.mysystem-db:5432(user: replica_user,password: ), will try 30 times with delay 10 seconds (TIMEOUT=300) Upstream node id 1001 received from host mysystem-db-node-1.mysystem-db:5432! REPLICATION_UPSTREAM_NODE_ID=1001 Sending in background postgres start... Waiting for upstream postgres server... Wait schema replica_db.repmgr_mysystem_cluster on mysystem-db-node-1.mysystem-db:5432(user: replica_user,password: ***), will try 30 times with delay 10 seconds (TIMEOUT=300) Schema replica_db.repmgr_mysystem_cluster exists on host mysystem-db-node-1.mysystem-db:5432! Data folder is not empty /var/lib/postgresql/data: total 128 drwx------ 19 postgres root 4096 Mar 7 11:08 . drwxr-xr-x 1 postgres postgres 4096 Feb 15 01:24 .. drwx------ 7 postgres postgres 4096 Mar 7 11:02 base drwx------ 2 postgres postgres 4096 Mar 7 11:03 global drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_commit_ts drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_dynshmem -rwx------ 1 postgres postgres 4579 Mar 7 11:02 pg_hba.conf -rwx------ 1 postgres postgres 1636 Mar 7 11:02 pg_ident.conf drwx------ 4 postgres postgres 4096 Mar 7 11:03 pg_logical drwx------ 4 postgres postgres 4096 Mar 7 11:02 pg_multixact drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_notify drwx------ 4 postgres postgres 4096 Mar 7 11:03 pg_replslot drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_serial drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_snapshots drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_stat drwx------ 2 postgres postgres 4096 Mar 7 11:04 pg_stat_tmp drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_subtrans drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_tblspc drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_twophase -rwx------ 1 postgres postgres 3 Mar 7 11:02 PG_VERSION drwx------ 3 postgres postgres 4096 Mar 7 11:03 pg_wal drwx------ 2 postgres postgres 4096 Mar 7 11:02 pg_xact -rwx------ 1 postgres postgres 88 Mar 7 11:02 postgresql.auto.conf -rwx------ 1 postgres postgres 21636 Mar 7 11:08 postgresql.conf -rwx------ 1 postgres postgres 36 Mar 7 11:02 postmaster.opts -rwx------ 1 postgres postgres 95 Mar 7 11:04 postmaster.pid Starting standby node... Instance has been set up already. Rewinding to the latest state Archiving configs ERROR: connection to database failed: could not connect to server: Connection refused Is the server running on host "mysystem-db-node-0.mysystem-db" (10.32.0.7) and accepting TCP/IP connections on port 5432?

Start server to be able to rewind (weird hack to avoid dirty shutdown issue) waiting for server to start....2019-03-07 11:08:23.233 UTC [166] LOG: listening on IPv4 address "127.0.0.1", port 5432 2019-03-07 11:08:23.233 UTC [166] LOG: could not bind IPv6 address "::1": Cannot assign requested address 2019-03-07 11:08:23.233 UTC [166] HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry. 2019-03-07 11:08:23.253 UTC [166] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2019-03-07 11:08:23.291 UTC [167] LOG: database system was interrupted; last known up at 2019-03-07 11:03:08 UTC 2019-03-07 11:08:23.313 UTC [167] LOG: database system was not properly shut down; automatic recovery in progress 2019-03-07 11:08:23.324 UTC [167] LOG: redo starts at 0/4000028 2019-03-07 11:08:23.325 UTC [167] LOG: invalid record length at 0/5005010: wanted 24, got 0 2019-03-07 11:08:23.325 UTC [167] LOG: redo done at 0/5004FD8 2019-03-07 11:08:23.325 UTC [167] LOG: last completed transaction was at log time 2019-03-07 11:03:39.55614+00 2019-03-07 11:08:23.395 UTC [166] LOG: database system is ready to accept connections done server started Removing unactive replication slots of partners Getting slots from master node repmgr_slot_1002 Closing slots which exist on master node Closing slot repmgr_slot_1002

Closing slot which current master used on the node

Closing slot repmgr_slot_1001

Stop server 2019-03-07 11:08:23.618 UTC [166] LOG: received fast shutdown request waiting for server to shut down....2019-03-07 11:08:23.625 UTC [166] LOG: aborting any active transactions 2019-03-07 11:08:23.627 UTC [166] LOG: worker process: logical replication launcher (PID 173) exited with exit code 1 2019-03-07 11:08:23.627 UTC [168] LOG: shutting down 2019-03-07 11:08:23.684 UTC [166] LOG: database system is shut down done server stopped Creating replication slot for the standby in case it's not there cat: /var/lib/postgresql/data/recovery.conf: No such file or directory Rewind to the host mysystem-db-node-1.mysystem-db servers diverged at WAL location 0/5005010 on timeline 1 rewinding from last common checkpoint at 0/4000060 on timeline 1 Done! Restoring configs NOTICE: 0 files copied to /var/lib/postgresql/data Tell repmgr to follow upstream for the node NOTICE: looking for configuration file in current directory NOTICE: looking for configuration file in /etc NOTICE: configuration file found at: /etc/repmgr.conf DEBUG: slot name initialised as: repmgr_slot_1000 DEBUG: do_standby_follow() ERROR: connection to database failed: fe_sendauth: no password supplied

Stop server pg_ctl: PID file "/var/lib/postgresql/data/postmaster.pid" does not exist Is server running? Configuring /var/lib/postgresql/data/postgresql.conf Will add configs to the exists file Adding config 'wal_keep_segments'='250' Adding config 'shared_buffers'='300MB' Adding config 'archive_command'=''/bin/true'' Adding config 'shared_preload_libraries'=''repmgr_funcs'' Starting postgres... Waiting for local postgres server recovery if any in progress:LAUNCH_RECOVERY_CHECK_INTERVAL=30 Recovery is in progress: 2019-03-07 11:08:25.972 UTC [264] LOG: listening on IPv4 address "0.0.0.0", port 5432 2019-03-07 11:08:25.972 UTC [264] LOG: listening on IPv6 address "::", port 5432 2019-03-07 11:08:25.996 UTC [264] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2019-03-07 11:08:26.035 UTC [273] LOG: database system was interrupted while in recovery at log time 2019-03-07 11:04:49 UTC 2019-03-07 11:08:26.035 UTC [273] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. 2019-03-07 11:08:26.069 UTC [273] LOG: redo starts at 0/4000028 2019-03-07 11:08:26.088 UTC [273] LOG: consistent recovery state reached at 0/5022770 2019-03-07 11:08:26.088 UTC [273] LOG: invalid record length at 0/5022770: wanted 24, got 0 2019-03-07 11:08:26.088 UTC [273] LOG: redo done at 0/5022738 2019-03-07 11:08:26.088 UTC [273] LOG: last completed transaction was at log time 2019-03-07 11:08:23.794413+00 2019-03-07 11:08:26.158 UTC [264] LOG: database system is ready to accept connections RECOVERY_WAL_ID is empty! Not in recovery state (anymore) Waiting for local postgres server start... Wait schema replica_db.public on mysystem-db-node-0.mysystem-db:5432(user: replica_user,password: ***), will try 60 times with delay 10 seconds (TIMEOUT=600) Schema replica_db.public exists on host mysystem-db-node-0.mysystem-db:5432! Unregister the node if it was done before DELETE 1 Registering node with role standby INFO: connecting to standby database ERROR: this node should be a standby (user=replica_user password=replica_pass host=mysystem-db-node-0.mysystem-db dbname=replica_db port=5432 connect_timeout=2) Starting repmgr daemon... [2019-03-07 11:08:56] [NOTICE] looking for configuration file in current directory [2019-03-07 11:08:56] [NOTICE] looking for configuration file in /etc [2019-03-07 11:08:56] [NOTICE] configuration file found at: /etc/repmgr.conf [2019-03-07 11:08:56] [INFO] connecting to database 'user=replica_user password=replica_pass host=mysystem-db-node-0.mysystem-db dbname=replica_db port=5432 connect_timeout=2' [2019-03-07 11:08:56] [INFO] connected to database, checking its state [2019-03-07 11:08:56] [ERROR] This node is marked as inactive and cannot be used for failover [2019-03-07 11:08:56] [HINT] Check that 'repmgr (master|standby) register' was executed for this node [2019-03-07 11:08:56] [INFO] repmgrd terminating...

We resolved the issue by updating the postgres/entrypoint.sh to first sleep to allow cluster to elect a new master, and then delete contents of PGDATA to erase its historical data so as to allow it to come back as standby cat $MASTER_ROLE_LOCK_FILE_NAME if [ -f "$MASTER_ROLE_LOCK_FILE_NAME" ]; then SLEEP_TIME= 15 echo ">>> Ex-Master restarted as slave to sleep for $SLEEP_TIME seconds to enable cluster to elect new master" sleep $SLEEP_TIME echo ">>> Deleting contents of PGDATA" rm -rf $PGDATA/* fi