hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Possible FAILURE STATE in State Machine #1025

Nhyalgotphores opened this issue 5 months ago (status: Open)

Nhyalgotphores commented 5 months ago

Description

While unlikely, it is possible to trigger a failure state that cannot be resolved with the provided tools; the assigned state has to be set manually.

In a cluster with two nodes and one quorum (monitor) node, the following sequence of events can trigger it.

Beginning state

NODE1: PRIMARY, NODE2: SECONDARY
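
For context, a minimal formation that reaches this starting state can be built roughly as follows. Hostnames, PGDATA paths and the monitor port are illustrative assumptions; port 5433 for the data nodes matches the event log further down.

    # monitor (quorum) node; run each command in its own terminal or service unit
    pg_autoctl create monitor --pgdata /data/monitor --hostname monitor.local \
        --pgport 5432 --auth trust --ssl-self-signed --run

    # NODE1, registers as primary since it is the first data node
    pg_autoctl create postgres --pgdata /data/node1 --hostname node1.local --pgport 5433 \
        --monitor 'postgres://autoctl_node@monitor.local:5432/pg_auto_failover' \
        --auth trust --ssl-self-signed --run

    # NODE2, joins the formation as secondary
    pg_autoctl create postgres --pgdata /data/node2 --hostname node2.local --pgport 5433 \
        --monitor 'postgres://autoctl_node@monitor.local:5432/pg_auto_failover' \
        --auth trust --ssl-self-signed --run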

Sequence of events

  1. NODE1 has an error in its .pgpass file that prevents it from communicating with the rest of the cluster; on its own this is not enough to cause an automatic switchover (an example of such an entry follows this list).
  2. NODE2 tries to enable maintenance (a user-initiated start_maintenance call); its state goes SECONDARY > wait_maintenance.
  3. NODE1 is assigned the state PRIMARY > wait_primary, but the .pgpass error makes it fall into demote_timeout, which creates an impasse.
  4. Because NODE1 cannot reach its target state, it fails with: FATAL pg_autoctl does not know how to reach state "wait_primary" from "demote_timeout"
  5. NODE2 cannot leave maintenance because it is stuck in wait_maintenance.
  6. Neither node will start, and the whole cluster gets stuck.
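
The precise .pgpass mistake does not matter much; any entry that no longer matches the replication or monitor connection breaks the keeper's outgoing connections. For reference, the libpq .pgpass format is hostname:port:database:username:password, so a hypothetical broken entry could be as small as a stale password or a wrong port. The values below are made up; pgautofailover_replicator is the default replication user.

    # ~/.pgpass (must have 0600 permissions), format: hostname:port:database:username:password
    # hypothetical broken entry: port 5432 instead of the cluster's 5433
    node2.local:5432:replication:pgautofailover_replicator:stale-password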

Workaround:

  1. Shut down pg_autoctl on BOTH NODES and on the QUORUM (monitor)
  2. Start the quorum's Postgres directly with pg_ctl
  3. On the monitor database, run: update node set goalstate='primary' where nodeid=1;
  4. Stop the quorum's Postgres with pg_ctl
  5. Start pg_autoctl again in the order QUORUM, NODE1, NODE2 (see the command sketch after this list)
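
A command-level sketch of the same steps, under a few assumptions: the PGDATA paths and the monitor port are placeholders, the monitor's database is the default pg_auto_failover, and its node table is pgautofailover.node; adjust to your own setup.

    # 1. stop pg_autoctl on both data nodes and on the quorum (paths are placeholders)
    pg_autoctl stop --pgdata /data/node1
    pg_autoctl stop --pgdata /data/node2
    pg_autoctl stop --pgdata /data/monitor

    # 2. start the monitor's Postgres directly, bypassing pg_autoctl
    pg_ctl start -D /data/monitor

    # 3. set the assigned (goal) state of node 1 back to primary by hand
    psql -p 5432 -d pg_auto_failover \
         -c "update pgautofailover.node set goalstate='primary' where nodeid=1;"

    # 4. stop the monitor's Postgres again
    pg_ctl stop -D /data/monitor

    # 5. restart pg_autoctl in order: quorum first, then NODE1, then NODE2
    pg_autoctl run --pgdata /data/monitor &
    pg_autoctl run --pgdata /data/node1 &
    pg_autoctl run --pgdata /data/node2 &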

This allows NODE1 to start and transition to wait_primary, which in turn allows NODE2 to reach maintenance.

Expected solution:

A transition from demote_timeout to wait_primary should be implemented.

Nhyalgotphores commented 5 months ago

Workaround EVENTS

                    Event Time |   Node |       Current State |      Assigned State | Comment
-------------------------------+--------+---------------------+---------------------+-----------
 2024-01-16 11:50:44.677014+01 |    0/1 |             primary |        wait_primary | Setting goal state of node 1 "<NODE1>" (<NODE1>:5433) to wait_primary and node 2 "<NODE2>" (<NODE2>:5433) to wait_maintenance after a user-initiated start_maintenance call.
 2024-01-16 11:50:44.677014+01 |    0/2 |           secondary |    wait_maintenance | Setting goal state of node 1 "<NODE1>" (<NODE1>:5433) to wait_primary and node 2 "<NODE2>" (<NODE2>:5433) to wait_maintenance after a user-initiated start_maintenance call.
 2024-01-16 11:50:48.080015+01 |    0/2 |    wait_maintenance |    wait_maintenance | New state is reported by node 2 "<NODE2>" (<NODE2>:5433): "wait_maintenance"
 2024-01-16 11:50:57.412881+01 |    0/2 |    wait_maintenance |    wait_maintenance | Node node 2 "<NODE2>" (<NODE2>:5433) is marked as unhealthy by the monitor
 2024-01-16 11:51:17.442579+01 |    0/1 |             primary |        wait_primary | Node node 1 "<NODE1>" (<NODE1>:5433) is marked as unhealthy by the monitor
 2024-01-16 11:54:01.597231+01 |    0/1 |      demote_timeout |        wait_primary | New state is reported by node 1 "<NODE1>" (<NODE1>:5433): "demote_timeout"
 2024-01-16 12:44:25.280301+01 |    0/1 |      demote_timeout |        wait_primary | Node node 1 "<NODE1>" (<NODE1>:5433) is marked as healthy by the monitor
 2024-01-16 14:05:19.371168+01 |    0/1 |      demote_timeout |        wait_primary | Node node 1 "<NODE1>" (<NODE1>:5433) is marked as unhealthy by the monitor
 2024-01-16 14:06:53.994032+01 |    0/1 |      demote_timeout |        wait_primary | Node node 1 "<NODE1>" (<NODE1>:5433) is marked as healthy by the monitor
    2024-01-16 14:40:53.268+01 |    0/1 |      demote_timeout |        wait_primary | Node node 1 "<NODE1>" (<NODE1>:5433) is marked as unhealthy by the monitor
 2024-01-16 14:43:40.262183+01 |    0/1 |             primary |             primary | New state is reported by node 1 "<NODE1>" (<NODE1>:5433): "primary"
 2024-01-16 14:43:40.262183+01 |    0/1 |             primary |        wait_primary | Setting goal state of node 1 "<NODE1>" (<NODE1>:5433) to wait_primary because none of the standby nodes in the quorum are healthy at the moment.
 2024-01-16 14:43:40.355978+01 |    0/1 |        wait_primary |        wait_primary | New state is reported by node 1 "<NODE1>" (<NODE1>:5433): "wait_primary"
 2024-01-16 14:43:40.659967+01 |    0/1 |        wait_primary |        wait_primary | Node node 1 "<NODE1>" (<NODE1>:5433) is marked as healthy by the monitor
 2024-01-16 14:44:03.793994+01 |    0/2 |    wait_maintenance |         maintenance | Setting goal state of node 2 "<NODE2>" (<NODE2>:5433) to maintenance after node 1 "<NODE1>" (<NODE1>:5433) converged to wait_primary.
 2024-01-16 14:44:03.952567+01 |    0/2 |         maintenance |         maintenance | New state is reported by node 2 "<NODE2>" (<NODE2>:5433): "maintenance"
 2024-01-16 14:51:48.666013+01 |    0/2 |         maintenance |          catchingup | Setting goal state of node 2 "<NODE2>" (<NODE2>:5433) to catchingup  after a user-initiated stop_maintenance call.
  2024-01-16 14:51:50.05806+01 |    0/2 |          catchingup |          catchingup | New state is reported by node 2 "<NODE2>" (<NODE2>:5433): "catchingup"
  2024-01-16 14:51:51.44403+01 |    0/2 |          catchingup |          catchingup | Node node 2 "<NODE2>" (<NODE2>:5433) is marked as healthy by the monitor
 2024-01-16 14:51:51.475895+01 |    0/2 |          catchingup |           secondary | Setting goal state of node 2 "<NODE2>" (<NODE2>:5433) to secondary after it caught up.
 2024-01-16 14:51:51.594911+01 |    0/2 |           secondary |           secondary | New state is reported by node 2 "<NODE2>" (<NODE2>:5433): "secondary"
 2024-01-16 14:51:51.636226+01 |    0/1 |        wait_primary |             primary | Setting goal state of node 1 "<NODE1>" (<NODE1>:5433) to primary now that we have 1 healthy  secondary nodes in the quorum.
  2024-01-16 14:51:51.88083+01 |    0/1 |             primary |             primary | New state is reported by node 1 "<NODE1>" (<NODE1>:5433): "primary"