hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Fix maintenance state related transitions. #786

Closed · DimCitus closed this 3 years ago

DimCitus commented 3 years ago

We used to disallow starting maintenance on a node in some cases, but users should be able to decide for themselves when to operate maintenance on their own nodes. After all, we don't stop Postgres when going to maintenance, so users may change their mind without impacting their service. A WARNING message is now displayed in some of the cases that were previously prevented.

Also, the transition from WAIT_MAINTENANCE to MAINTENANCE had been failing ever since we improved the Group State Machine for the primary node: the primary would go from JOIN_PRIMARY to PRIMARY without waiting for the other nodes to reach their assigned state of WAIT_MAINTENANCE.
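With a three-node cluster such as the one described in the next comment, the fixed behavior can be checked from the command line. A minimal sketch, using the same pg_autoctl subcommands that appear later in this thread:

$ pg_autoctl enable maintenance --pgdata node2
# the monitor assigns WAIT_MAINTENANCE to node2; before this fix the primary
# could go JOIN_PRIMARY -> PRIMARY without waiting for node2, which then
# never got assigned MAINTENANCE
$ pg_autoctl show state --pgdata node2
# with the fix, node2 is expected to reach the "maintenance" state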

DimCitus commented 3 years ago

For interactive testing and QA, use the following setup:

$ make NODES=3 NODES_PRIOS=50,50,0 cluster

Then you can play around with putting node2 and node3 into maintenance, and then both of them together.
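For example, a QA session might look like this (a sketch; the node names come from the make target above, and the commands are the same pg_autoctl subcommands used in the comments below):

$ pg_autoctl enable maintenance --pgdata node2
$ pg_autoctl show state --pgdata node2
$ pg_autoctl enable maintenance --pgdata node3
$ pg_autoctl show state --pgdata node3
$ pg_autoctl disable maintenance --pgdata node3
$ pg_autoctl disable maintenance --pgdata node2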

JelteF commented 3 years ago

I got into an unrecoverable state with these steps:

make cluster -j20 TMUX_LAYOUT=tiled NODES=3
pg_autoctl enable maintenance --pgdata node2
pg_autoctl enable maintenance --pgdata node3
pg_autoctl enable maintenance --pgdata node1 --allow-failover

The cluster will then be stuck in this state: (screenshot)

Disabling maintenance on node1 does not work: (screenshot)

Disabling maintenance on node2 (or node3) does not work either, because it gets into this loop when you try: (screenshot)

The state then stays like this: (screenshot)

DimCitus commented 3 years ago

> I got into an unrecoverable state with these steps:

Now fixed; the command fails with the following error instead of leaving the cluster stuck. We might want to avoid the first WARNING, what do you think?

$ pg_autoctl enable maintenance --pgdata node1 --allow-failover
12:27:29 65490 WARN  WARNING:  Starting maintenance on node 1 "node1" (localhost:5501) will block writes on the primary node 1 "node1" (localhost:5501)
12:27:29 65490 WARN  DETAIL:  we now have 0 healthy node(s) left in the "secondary" state and formation "default" number-sync-standbys requires 1 sync standbys
12:27:29 65490 ERROR Monitor ERROR:  Starting maintenance on node 1 "node1" (localhost:5501) in state "primary" is not currently possible
12:27:29 65490 ERROR Monitor DETAIL:  there is currently 0 candidate nodes available
12:27:29 65490 ERROR Failed to start_maintenance of node 1 from the monitor
12:27:29 65490 FATAL Failed to enable maintenance of node 1 on the monitor, see above for details

JelteF commented 3 years ago

Again I found a way to reach an unrecoverable state:

make cluster -j20 TMUX_LAYOUT=tiled NODES=3
pg_autoctl enable maintenance --pgdata node3
pg_autoctl enable maintenance --pgdata node1 --allow-failover

The cluster will then be stuck in this state: (screenshot)

Disabling maintenance on node1 does not work: (screenshot)

Disabling maintenance on node3 does not work either, because it gets into this loop when you try: (screenshot)

The state then stays like this: (screenshot)

DimCitus commented 3 years ago

> Again I found a way to reach an unrecoverable state:
>
> make cluster -j20 TMUX_LAYOUT=tiled NODES=3
> pg_autoctl enable maintenance --pgdata node3
> pg_autoctl enable maintenance --pgdata node1 --allow-failover

This can now be unblocked by running pg_autoctl disable maintenance --pgdata node3; disabling maintenance on node1 still fails, though. I am looking at adding a transition from prepare_maintenance back to primary, if that makes sense.
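In other words, the recovery sequence currently looks like this (a sketch of the steps just described; the second command keeps failing until such a transition is added):

$ pg_autoctl disable maintenance --pgdata node3   # now succeeds, unblocks the formation
$ pg_autoctl disable maintenance --pgdata node1   # still fails for now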

JelteF commented 3 years ago

I think this is good to merge. I ran into some more issues with wait_maintenance and opened a PR to fix some of those: #794. I don't want to block this PR on that, though.