hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Switchover and failover get stuck in the report_lsn status in certain configurations #1014

Open xinferum opened 10 months ago

xinferum commented 10 months ago

Good afternoon.

Good afternoon.

The other day, during testing, I ran into a problem where switchover/failover hangs with the standby nodes stuck in the report_lsn state. The switchover/failover cannot complete because one of the datanodes was already unavailable before the switchover/failover started. The problem shows up when the cluster has more than 2 datanodes.

Software versions used: pg_auto_failover 2.0, PostgreSQL 15.4 (vanilla).

The cluster consists of 3 datanodes in one data center and one subnet:

postgres-db05.local - monitor
postgres-db01.local - datanode 1
postgres-db02.local - datanode 2
postgres-db03.local - datanode 3
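
For context, a cluster along these lines can be built with pg_autoctl; the following is only a hedged sketch, where the PGDATA paths, --auth choice and SSL options are assumptions rather than the exact commands used here:

# on postgres-db05.local (the monitor)
pg_autoctl create monitor --hostname postgres-db05.local \
    --pgdata /var/lib/postgresql/monitor --auth trust --ssl-self-signed --run

# on each datanode, for example postgres-db01.local
pg_autoctl create postgres --hostname postgres-db01.local \
    --pgdata /var/lib/postgresql/data --formation vagrant \
    --monitor 'postgres://autoctl_node@postgres-db05.local/pg_auto_failover?sslmode=require' \
    --auth trust --ssl-self-signed --run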

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   4: 0/D0011D8 |   read-write |             primary |             primary
postgres-db02.local |     2 | postgres-db02.local:5432 |   4: 0/D0011D8 |    read-only |           secondary |           secondary
postgres-db03.local |     3 | postgres-db03.local:5432 |   4: 0/D0011D8 |    read-only |           secondary |           secondary

Cluster settings (one synchronous standby required; all datanodes are equal and participate in elections):

postgres@postgres-db05 ~$ pg_autoctl show settings --formation vagrant
  Context |                Name |                   Setting | Value                                                       
----------+---------------------+---------------------------+-------------------------------------------------------------
formation |             vagrant |      number_sync_standbys | 1                                                           
  primary | postgres-db01.local | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)'
     node | postgres-db01.local |        candidate priority | 50                                                          
     node | postgres-db02.local |        candidate priority | 50                                                          
     node | postgres-db03.local |        candidate priority | 50                                                          
     node | postgres-db01.local |        replication quorum | true                                                        
     node | postgres-db02.local |        replication quorum | true                                                        
     node | postgres-db03.local |        replication quorum | true  
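
For reference, these values can be changed from any node with pg_autoctl set; a hedged sketch follows (the values simply restate the configuration above, and exact option spellings can differ between pg_auto_failover versions):

pg_autoctl set formation number-sync-standbys 1 --formation vagrant
pg_autoctl set node candidate-priority 50 --formation vagrant --name postgres-db02.local
pg_autoctl set node replication-quorum true --formation vagrant --name postgres-db02.local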

When all 3 datanodes are available, switchover and failover work normally. But if one of the datanodes is unavailable and we then try to perform a switchover, it hangs and the remaining datanodes get stuck in the report_lsn state.

I'll demonstrate. Let's say the 2nd datanode goes down: I shut down the VM postgres-db02.local. We then see the following cluster state:

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   4: 0/D0020C0 |   read-write |             primary |             primary
postgres-db02.local |     2 | postgres-db02.local:5432 |   4: 0/D0020C0 |  read-only ! |           secondary |          catchingup
postgres-db03.local |     3 | postgres-db03.local:5432 |   4: 0/D0020C0 |    read-only |           secondary |           secondary

The 2nd datanode is unavailable. It was a secondary, so apart from its status change there was no other activity in the cluster. Now let's try to perform a switchover: two out of three datanodes are available, so we expect to be able to move the primary role from postgres-db01.local to the secondary node postgres-db03.local:

postgres@postgres-db05 ~$ pg_autoctl perform switchover --formation vagrant
13:59:40 42529 INFO  Waiting 60 secs for a notification with state "primary" in formation "vagrant" and group 0
13:59:40 42529 INFO  Listening monitor notifications about state changes in formation "vagrant" and group 0
13:59:40 42529 INFO  Following table displays times when notifications are received
    Time |                Name |  Node |                Host:Port |       Current State |      Assigned State
---------+---------------------+-------+--------------------------+---------------------+--------------------
13:59:41 | postgres-db01.local |     1 | postgres-db01.local:5432 |             primary |            draining
13:59:41 | postgres-db01.local |     1 | postgres-db01.local:5432 |            draining |            draining
13:59:41 | postgres-db01.local |     1 | postgres-db01.local:5432 |            draining |          report_lsn
13:59:41 | postgres-db03.local |     3 | postgres-db03.local:5432 |           secondary |          report_lsn
13:59:41 | postgres-db03.local |     3 | postgres-db03.local:5432 |          report_lsn |          report_lsn
13:59:42 | postgres-db01.local |     1 | postgres-db01.local:5432 |          report_lsn |          report_lsn
14:00:42 42529 ERROR Failed to receive monitor's notifications
14:00:42 42529 ERROR Failed to wait until a new primary has been notified

So the switchover command timed out and failed with an error instead of completing.

Looking at the status of the cluster:

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   1: 0/E0000A0 |    read-only |          report_lsn |          report_lsn
postgres-db02.local |     2 | postgres-db02.local:5432 |   4: 0/D0020C0 |  read-only ! |           secondary |          catchingup
postgres-db03.local |     3 | postgres-db03.local:5432 |   4: 0/E0000A0 |    read-only |          report_lsn |          report_lsn

We can see that there is now no primary at all, and the two datanodes that were available are stuck in the report_lsn state. They never leave this state on their own, no matter how long we wait. On the datanodes themselves, Postgres is running in recovery mode (the former primary no longer accepts writes, since the primary role has already been taken away from it).

It seems that pg_auto_failover is waiting for the previously unavailable 2nd datanode postgres-db02.local to also report its last available LSN; in effect, the monitor treats the 2nd datanode as if it were in the report_lsn state too.
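
One way to check what the monitor is waiting for is its event stream; a hedged sketch (the --count value is arbitrary, and the last_events() call assumes the default monitor database name and the pgautofailover extension schema):

# from any node, show the most recent state-change events for the formation
pg_autoctl show events --formation vagrant --count 20

# or, on the monitor host, query the extension directly
psql -d pg_auto_failover -c "select * from pgautofailover.last_events(20);"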

There are two options for how to get the cluster out of this state:

  1. bring the 2nd datanode postgres-db02.local back online - unlikely to be possible quickly if the datanode has a serious problem - after which pg_auto_failover completes the hung switchover
  2. temporarily (while the 2nd datanode is unavailable) remove the 2nd datanode postgres-db02.local from the cluster on the monitor - after which pg_auto_failover completes the hung switchover

The first option is self-explanatory; for the second, we drop datanode 2 from the cluster on the monitor and see that the switchover from the 1st to the 3rd datanode then completes successfully:

postgres@postgres-db05 ~$ pg_autoctl drop node --name postgres-db02.local --formation vagrant --force
14:15:32 53420 INFO  Removing node with name "postgres-db02.local" in formation "vagrant" from the monitor
14:15:32 53420 INFO  Waiting until the node with id 2 in group 0 has been dropped from the monitor, or for 60s, whichever comes first
14:15:32 53420 INFO  New state for node 1 "postgres-db01.local" (postgres-db01.local:5432): report_lsn ➜ report_lsn
14:15:32 53420 INFO  New state for node 3 "postgres-db03.local" (postgres-db03.local:5432): report_lsn ➜ prepare_promotion
14:15:32 53420 INFO  Node with id 2 in group 0 has been successfully dropped from the monitor

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   5: 0/E000798 |    read-only |           secondary |           secondary
postgres-db03.local |     3 | postgres-db03.local:5432 |   5: 0/E000798 |   read-write |             primary |             primary

After fixing the 2nd datanode, we add it back to the cluster.
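
Re-adding the repaired node is not shown above; a hedged sketch of what that could look like on postgres-db02.local, assuming the same PGDATA path and options as in the earlier setup sketch, and clearing the old data directory so the node re-clones from the new primary:

# on postgres-db02.local, once the VM is healthy again
rm -rf /var/lib/postgresql/data        # assumed PGDATA path
pg_autoctl create postgres --hostname postgres-db02.local \
    --pgdata /var/lib/postgresql/data --formation vagrant \
    --monitor 'postgres://autoctl_node@postgres-db05.local/pg_auto_failover?sslmode=require' \
    --auth trust --ssl-self-signed --run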

Now consider the failover case. All three datanodes are in the cluster again, and the 2nd one has been unavailable for some time:

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   6: 0/F000798 |   read-write |             primary |             primary
postgres-db02.local |     2 | postgres-db02.local:5432 |   6: 0/F000798 |  read-only ! |           secondary |          catchingup
postgres-db03.local |     3 | postgres-db03.local:5432 |   6: 0/F000798 |    read-only |           secondary |           secondary

The 1st datanode is the primary. It is clear that if we lose another datanode, the primary will not commit transactions, since (in our case) one synchronous replica is required:

  Context |                Name |                   Setting | Value                                                       
----------+---------------------+---------------------------+-------------------------------------------------------------
formation |             vagrant |      number_sync_standbys | 1  
...
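
If write availability with a single surviving datanode matters more than the synchronous guarantee, that requirement can be relaxed; a hedged sketch (this trades durability for availability and is not the configuration tested above):

# allow the primary to keep committing even when no sync standby is attached
pg_autoctl set formation number-sync-standbys 0 --formation vagrant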

In this configuration, with one of the datanodes already unavailable, a failover has no quorum (all three nodes participate in the quorum), so the failover cannot take place and elect a new primary. But then why does the failover again hang in the report_lsn state? Let's simulate the failure of the 1st datanode postgres-db01.local, the current primary, and look at the cluster state:

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   6: 0/F0014A8 | read-write ! |             primary |            draining
postgres-db02.local |     2 | postgres-db02.local:5432 |   6: 0/F000798 |  read-only ! |           secondary |          catchingup
postgres-db03.local |     3 | postgres-db03.local:5432 |   6: 0/F0014A8 |    read-only |          report_lsn |          report_lsn

As you can see, the only remaining datanode, the 3rd one, hangs in the report_lsn state. It would be clearer to see something like no_quorum, which would signal more explicitly why the cluster cannot elect this node as the new primary.

If we bring one of the datanodes back into operation, say the same 1st one that was the primary, the cluster still remains stuck in the report_lsn state. The ways out are the same as before: either bring back every datanode that was unavailable, or remove the unreachable node(s) from the cluster so that a new primary can be elected.

postgres@postgres-db05 ~$ pg_autoctl show state --formation vagrant
               Name |  Node |                Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
--------------------+-------+--------------------------+----------------+--------------+---------------------+--------------------
postgres-db01.local |     1 | postgres-db01.local:5432 |   6: 0/F0014A8 |    read-only |          report_lsn |          report_lsn
postgres-db02.local |     2 | postgres-db02.local:5432 |   6: 0/F000798 |  read-only ! |           secondary |          catchingup
postgres-db03.local |     3 | postgres-db03.local:5432 |   6: 0/F0014A8 |    read-only |          report_lsn |          report_lsn

As a result, because of the unavailability of the 2nd datanode, we end up in the same situation as with the switchover.

In the failover situation it is clear, in principle, that things are bad: we have neither a quorum nor a synchronous replica, so it cannot work. What is not clear is why the failover still does not happen once 2 datanodes are available again.

In the switchover case, however, we would expect to be able to swap the roles of the two datanodes that are available and fully functional.

Perhaps, to fix the switchover problem, it makes sense to add some kind of timeout for receiving report_lsn, so that the unavailable node is excluded from the election of a new primary and the role change between the two available datanodes can proceed?

If one of the three datanodes is unavailable, a promotion freezes in exactly the same way as a switchover.

Perhaps you can point me to some parameters that can be configured to avoid such situations. Thank you.

xinferum commented 10 months ago

If, in an architecture with 3 datanodes where one of them is lost, performing a switchover or promotion is considered incorrect, then it would be good to have a mechanism that blocks such actions: report an error to the user instead of attempting the role change, so as to avoid the situations described above.

Akkowicz commented 10 months ago

The documentation regarding fault tolerance, failover behaviour and the settings you can tweak is pretty extensive and available here: https://pg-auto-failover.readthedocs.io/en/main/fault-tolerance.html https://pg-auto-failover.readthedocs.io/en/main/failover-state-machine.html https://pg-auto-failover.readthedocs.io/en/main/ref/configuration.html

You probably shouldn't be switching over a healthy node manually while you're losing replicas. If you're removing replicas from the cluster, or it's expected that they'll be out for a long while, you should probably either drop them from the cluster on the monitor side or enable maintenance for them.
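
A hedged sketch of the two approaches mentioned; note that maintenance can only be enabled while the node's keeper still talks to the monitor, so a node that is already dead has to be dropped instead (the PGDATA path is an assumption):

# planned outage: put the standby into maintenance, run on that node itself
pg_autoctl enable maintenance --pgdata /var/lib/postgresql/data
# ... do the work, then bring it back
pg_autoctl disable maintenance --pgdata /var/lib/postgresql/data

# node is gone for good (or for a long while): drop it from the monitor
pg_autoctl drop node --name postgres-db02.local --formation vagrant --force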

It is clear that if we lose another datanode, the primary will not commit transactions, since (in our case) one synchronous replica is required:

If you drop all non-functional nodes from the cluster except the working primary, it will transition into the "single" state and become available for writes.

This is an open source project; if you require extensive support, it may be a good idea to look for a company or individual that offers a paid support option. Other than that, adding safeguards to the CLI commands seems like a good first issue if you want to have a crack at it and think there's actually a problem.

Akkowicz commented 10 months ago

You may also want to have a look at a Patroni-based solution such as: https://github.com/vitabaks/postgresql_cluster

sgrinko commented 10 months ago

Firstly, I want to thank Dimitri Fontaine and other developers for their contribution to the development of this solution!

If one of the nodes becomes unavailable, then pg_auto_failover (the witness) must of course be prepared for such a problem and have in its arsenal a way out of this situation. Hanging at the report_lsn stage is not good. I really hope the developers of this solution will find a good way to solve this problem.

Yes, this is open source software, but not everyone can write code as good as this.

We really hope that the necessary cases will be implemented in the witness code to solve this problem.

Akkowicz commented 10 months ago

I may be a little rusty, since it has been a long time since I've experienced a multiple-node failure in pg_auto_failover clusters, but in the last 4 years of running both production and testing clusters with a high volume of transactions, I haven't noticed this behaviour.

One thing, though, is that I let the cluster "run itself". The only times I do manual switchovers or set maintenance mode are when I update and reboot servers.