hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability
Other
1.07k stars 112 forks source link

When a secondary node is dropped, the FSM is promoted from secondary to single. #1023

Open dbamu opened 6 months ago

dbamu commented 6 months ago

version

test

### secondary server log
$ pg_autoctl drop node --destroy
17:57:43 66464 INFO  Removing node with name "test2" in formation "test" from the monitor
-17:57:43 66464 INFO  An instance of pg_autoctl is running with PID 11199, waiting for it to stop.
-17:58:13 66464 INFO  Sending signal SIGQUIT to pg_autoctl process 11199
-17:58:13 66464 INFO  An instance of pg_autoctl is running with PID 11199, waiting for it to stop.
17:58:14 66464 INFO  The pg_autoctl instance with pid 11199 has now terminated.
17:58:14 66464 WARN  PG_REGRESS_SOCK_DIR is set to "~/run/postgresql", and our setup is using "pgsql-002"
-17:58:14 66464 FATAL Failed to ensure that the local node with id 5229 in formation "test" and group 0 has been removed from the monitor

#### pg_autoctl log
+17:57:43 11204 INFO  Monitor assigned new state "single"
+17:57:43 11204 INFO  FSM transition from "secondary" to "single": Primary was forcibly removed
-17:57:44 11204 INFO  Promoting postgres
17:57:44 11204 INFO  Waiting for postgres to promote
17:57:45 11204 INFO  Cleaning-up Postgres replication settings
17:57:45 11204 INFO  Disabling synchronous replication
17:57:45 11204 INFO  Reloading Postgres configuration and HBA rules
17:57:45 11204 INFO  Transition complete: current state is now "single"
17:58:13 11199 INFO  pg_autoctl received signal SIGQUIT, terminating
17:58:13 11203 INFO  Postgres controller service received signal SIGQUIT, terminating
17:58:13 11203 INFO  Stopping pg_autoctl postgres service
17:58:13 11203 INFO  /bin/pg_ctl --pgdata /data/db/pgcluster --wait stop --mode fast
17:58:13 11199 INFO  Waiting for subprocesses to terminate.
+17:58:14 11199 INFO  Stop pg_autoctl

pg_autoctl show state
     Name |  Node |              Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
----------+-------+------------------------+----------------+--------------+---------------------+--------------------
test1 |  5228 | pgsql-001:5432 |   1: 0/5002270 |   read-write |              single |              single
test2 |  5229 | pgsql-002:5432 |   2: 0/6001D60 |   read-write |              single |              single

### monitor server log
-- vi pg_autoctl.log
17:57:43 157088 INFO  Setting goal state of node 5229 "test2" (pgsql-002:5432) from formation "test" and group 0 to "dropped" to implement node removal.
17:57:43 157088 INFO  New state for node 5229 "test2" (pgsql-002:5432): secondary ➜ dropped
17:57:43 157088 INFO  Setting number_sync_standbys to 0 for formation "test" now that we have 0 standby nodes set with replication-quorum.
17:57:43 157088 INFO  Setting goal state of node 5228 "test1" (pgsql-001:5432) to single as there is no other node.
17:57:43 157088 INFO  New state for node 5228 "test1" (pgsql-001:5432): primary ➜ single
17:57:43 157088 INFO  Setting goal state of node 5229 "test2" (pgsql-002:5432) to single as there is no other node.
17:57:43 157088 INFO  New state for node 5229 "test2" (pgsql-002:5432): secondary ➜ single
17:57:44 157088 INFO  New state is reported by node 5228 "test1" (pgsql-001:5432): "single"
17:57:44 157088 INFO  New state for node 5228 "test1" (pgsql-001:5432): single ➜ single
17:57:45 157088 INFO  New state is reported by node 5229 "test2" (pgsql-002:5432): "single"
17:57:45 157088 INFO  New state for node 5229 "test2" (pgsql-002:5432): single ➜ single

-- vi postgresql-2023-12-24_000000.log
[2023-12-24 17:57:43 KST]-54992 -autoctl_node@pgsql-002(58480):pg_auto_failover-[unknown] LOG:  Setting goal state of node 5229 "test2" (pgsql-002:5432) from formation "test" and group 0 to "dropped" to implement node removal.
[2023-12-24 17:57:43 KST]-54992 -autoctl_node@pgsql-002(58480):pg_auto_failover-[unknown] STATEMENT:  SELECT nodeid, groupid, pgautofailover.remove_node(nodeid::int, $3)   FROM pgautofailover.node WHERE formationid = $1 and nodename = $2
[2023-12-24 17:57:43 KST]-54992 -autoctl_node@pgsql-002(58480):pg_auto_failover-[unknown] LOG:  Setting number_sync_standbys to 0 for formation "test" now that we have 0 standby nodes set with replication-quorum.
[2023-12-24 17:57:43 KST]-54992 -autoctl_node@pgsql-002(58480):pg_auto_failover-[unknown] STATEMENT:  SELECT nodeid, groupid, pgautofailover.remove_node(nodeid::int, $3)   FROM pgautofailover.node WHERE formationid = $1 and nodename = $2
[2023-12-24 17:57:43 KST]-54992 -autoctl_node@pgsql-002(58480):pg_auto_failover-[unknown] LOG:  Setting goal state of node 5228 "test1" (pgsql-001:5432) to single as there is no other node.
[2023-12-24 17:57:43 KST]-54992 -autoctl_node@pgsql-002-(58480):pg_auto_failover-[unknown] STATEMENT:  SELECT nodeid, groupid, pgautofailover.remove_node(nodeid::int, $3)   FROM pgautofailover.node WHERE formationid = $1 and nodename = $2
[2023-12-24 17:57:43 KST]-54988 -autoctl_node@pgsql-002(58478):pg_auto_failover-[unknown] LOG:  Setting goal state of node 5229 "test2" (pgsql-002:5432) to single as there is no other node.
[2023-12-24 17:57:43 KST]-54988 -autoctl_node@pgsql-002(58478):pg_auto_failover-[unknown] STATEMENT:  SELECT * FROM pgautofailover.node_active($1, $2, $3, $4::pgautofailover.replication_state, $5, $6, $7, $8)
[2023-12-24 17:57:44 KST]-56408 -autoctl_node@pgsql-001(51602):pg_auto_failover-[unknown] LOG:  New state is reported by node 5228 "test1" (pgsql-001:5432): "single"
[2023-12-24 17:57:44 KST]-56408 -autoctl_node@pgsql-001(51602):pg_auto_failover-[unknown] STATEMENT:  SELECT * FROM pgautofailover.node_active($1, $2, $3, $4::pgautofailover.replication_state, $5, $6, $7, $8)
[2023-12-24 17:57:45 KST]-57217 -autoctl_node@pgsql-002(58490):pg_auto_failover-[unknown] LOG:  New state is reported by node 5229 "test2" (pgsql-002:5432): "single"
[2023-12-24 17:57:45 KST]-57217 -autoctl_node@pgsql-002(58490):pg_auto_failover-[unknown] STATEMENT:  SELECT * FROM pgautofailover.node_active($1, $2, $3, $4::pgautofailover.replication_state, $5, $6, $7, $8)

pg_auotctl drop node --destroy --force


#### pg_autoctl drop node --destroy --force (secondary)
#### secondary DB log
[2023-12-26 12:11:45 KST]-72668 LOG:  received fast shutdown request
[2023-12-26 12:11:45 KST]-72668 LOG:  aborting any active transactions
[2023-12-26 12:11:45 KST]-72674 FATAL:  terminating walreceiver process due to administrator command
[2023-12-26 12:11:45 KST]-72671 LOG:  shutting down
[2023-12-26 12:11:45 KST]-72671 LOG:  restartpoint starting: shutdown immediate
[2023-12-26 12:11:45 KST]-72671 LOG:  restartpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.008 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16384 kB, estimate=16384 kB
[2023-12-26 12:11:45 KST]-72671 LOG:  recovery restart point at 0/19000060
[2023-12-26 12:11:45 KST]-72668 LOG:  database system is shut down

#### pg_autoctl.log(secondary)
+12:11:45 72656 INFO  pg_autoctl received signal SIGQUIT, terminating
+12:11:45 72659 INFO  Postgres controller service received signal SIGQUIT, terminating
+12:11:45 72659 INFO  Stopping pg_autoctl postgres service
12:11:45 72659 INFO  pg_ctl --pgdata $PGDATA --wait stop --mode fast
12:11:45 72656 INFO  Stop pg_autoctl

#### monitor server log(secondary)
+12:11:45 35223 INFO  Removing node 15 "dev-pgf200003" (dev-pgf200003:6432) from formation "sangmudb" and group 0
+12:11:46 35223 INFO  Setting goal state of node 10 "dev-pgf200002" (dev-pgf200002:6432) to single as there is no other node.
12:11:46 35223 INFO  New state for node 10 "dev-pgf200002" (dev-pgf200002:6432): primary ➜ single
12:11:46 35223 INFO  New state is reported by node 10 "dev-pgf200002" (dev-pgf200002:6432): "single"
12:11:46 35223 INFO  New state for node 10 "dev-pgf200002" (dev-pgf200002:6432): single ➜ single