xinferum opened 10 months ago
If, in an architecture with 3 datanodes, one of them is lost and executing a switchover or promotion would be incorrect, it would be good to have a mechanism that blocks such actions, reports an error to the user, and does not attempt the role change, in order to avoid the situations described above.
The documentation on fault tolerance, failover behaviour, and the settings you can tweak is pretty extensive and available here:
- https://pg-auto-failover.readthedocs.io/en/main/fault-tolerance.html
- https://pg-auto-failover.readthedocs.io/en/main/failover-state-machine.html
- https://pg-auto-failover.readthedocs.io/en/main/ref/configuration.html
You probably shouldn't be switching over a healthy node manually at the same time as you're losing replicas. If you're removing replicas from the cluster, or it's expected that they'll be out for a long while, you should probably either drop them from the cluster on the monitor side or put them into maintenance.
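In pg_autoctl terms, putting a replica into maintenance looks roughly like this (the `--pgdata` path here is an assumption, not taken from this thread's setup):

```shell
# On the replica that will be out for a while (keeper side):
pg_autoctl enable maintenance --pgdata /var/lib/postgresql/pgdata

# ...and once the node is healthy again:
pg_autoctl disable maintenance --pgdata /var/lib/postgresql/pgdata
```

While in maintenance, the node no longer counts towards failover decisions, which is exactly what you want for a planned long outage.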
It is clear that if we lose another datanode, the primary will not commit transactions, since (in our case) one synchronous replica is required:
If you drop all non-functional nodes from the cluster except the working primary, it will transition to the "single" state and be available for writes.
This is an open-source project; if you require extensive support, it may be a good idea to look for a company or individual that offers paid support. Other than that, adding safeguards to CLI commands seems like a good first issue if you want to have a crack at it and think there's actually a problem.
You may also want to have a look at a Patroni-based solution such as: https://github.com/vitabaks/postgresql_cluster
First of all, I want to thank Dimitri Fontaine and the other developers for their work on this solution!
If one of the nodes has become unavailable, then pg_auto_failover (the witness) should of course be prepared for such a problem and have a way out of the situation in its arsenal. Hanging at the report_lsn stage is not good. I really hope the developers of this solution will find a good way to solve this problem.
Yes, this is open-source software, but not everyone can write code as good as this.
We really hope the necessary cases will be handled in the witness code to solve this problem.
I may be a little rusty, as it has been a long time since I've experienced multiple node failures in pg_auto_failover clusters, but in the last 4 years of running both production and testing clusters with a high volume of transactions, I haven't noticed this behaviour.
One thing to note, though, is that I let the cluster "run itself". The only times I do manual switchovers or set maintenance mode is when I update and reboot servers.
Good afternoon.
The other day, during testing, I encountered a problem where a switchover/failover hangs with standbys in the report_lsn state. The switchover/failover could not complete because one of the datanodes was unavailable before the switchover/failover process started. The problem manifests itself when more than 2 datanodes are used in the cluster.
Software versions used:
- pg_auto_failover 2.0
- PostgreSQL 15.4 (vanilla)
Setting up a cluster of 3 datanodes in one data center and one subnet:
- postgres-db05.local - monitor
- postgres-db01.local - datanode 1
- postgres-db02.local - datanode 2
- postgres-db03.local - datanode 3
Cluster settings (one synchronous datanode, all datanodes are equal and participate in elections):
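For reference, settings like these are usually controlled with the following pg_autoctl commands (the exact values and the `--pgdata` path here are assumptions on my side, not taken from this cluster):

```shell
# One synchronous standby required for commit (formation-level setting):
pg_autoctl set formation number-sync-standbys 1 --formation default

# All datanodes equal and able to win elections (run on each node):
pg_autoctl set node candidate-priority 50 --pgdata /var/lib/postgresql/pgdata
pg_autoctl set node replication-quorum true --pgdata /var/lib/postgresql/pgdata

# Inspect the resulting replication settings:
pg_autoctl show settings --formation default
```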
When all 3 datanodes are available, switchover and failover work normally. But if one of the datanodes is unavailable and we then try to perform a switchover, it hangs and the datanodes get stuck in the report_lsn state.
I'll demonstrate. Let's say the 2nd datanode goes down: I take down the VM postgres-db02.local. We see the following cluster state:
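(The state listings in this report come from the monitor's state command; the `--pgdata` path below is an assumption:)

```shell
# On the monitor: print the current state of every node in the formation.
pg_autoctl show state --pgdata /var/lib/postgresql/monitor
```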
The 2nd datanode is unavailable; it was a secondary, so apart from its status changing there was no other activity in the cluster. Now let's try to perform a switchover. We have two out of three datanodes available, and it is expected that we can move the primary role from postgres-db01.local to the secondary node postgres-db03.local:
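The switchover itself is triggered on the monitor with something like the following (the formation and group values are the defaults, assumed here):

```shell
pg_autoctl perform switchover --formation default --group 0
```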
The switchover command on the monitor failed with an error:
Looking at the status of the cluster:
We see that we now have no primary at all, and the two datanodes that were available are stuck in the report_lsn state. They do not leave this state on their own, no matter how long we wait. On the datanodes themselves, Postgres is in recovery mode (the former primary is not available for writes, since the primary role has already been taken away from it).
It seems that pg_auto_failover is waiting for the previously unavailable 2nd datanode, postgres-db02.local, to report its last available LSN as well; in effect, as far as the monitor is concerned, the 2nd datanode is also in the report_lsn state.
There are two options for getting the cluster out of this state: bring the unavailable datanode back into operation, or remove it from the cluster on the monitor.
The first solution needs no explanation; for the second, we remove datanode 2 from the cluster on the monitor and see that the switchover from the 1st to the 3rd datanode then completes successfully:
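Removing the dead node on the monitor looks roughly like this (the port is an assumption):

```shell
# On the monitor: drop the unreachable node so it no longer blocks the switchover.
pg_autoctl drop node --hostname postgres-db02.local --pgport 5432 --force
```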
After fixing the 2nd datanode, we add it back to the cluster.
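Re-adding the repaired node means registering it with the monitor again. A sketch, where the monitor URI and paths are assumptions (the real URI can be printed with `pg_autoctl show uri --monitor`):

```shell
# On postgres-db02.local: register the node with the monitor again.
pg_autoctl create postgres \
    --pgdata /var/lib/postgresql/pgdata \
    --hostname postgres-db02.local \
    --monitor 'postgres://autoctl_node@postgres-db05.local:5432/pg_auto_failover' \
    --auth trust --ssl-self-signed
```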
Now consider the failover case. All three datanodes are in the cluster again, and the 2nd one became unavailable some time ago:
The 1st datanode is the primary. It is clear that if we lose another datanode, the primary will not commit transactions, since (in our case) one synchronous replica is required:
In this configuration, if one of the datanodes is unavailable, then in the case of a failover there is no quorum (all three nodes participate in the quorum), so the failover cannot take place and a new primary cannot be elected. But then why does the failover again hang in the report_lsn state? Let's simulate the failure of the 1st datanode, postgres-db01.local, which is currently the primary, and look at the cluster state:
As you can see, the only remaining 3rd datanode hangs in the report_lsn state. It would be clearer if we saw something like no_quorum, which would signal more clearly why the cluster cannot elect this node as the new primary.
If we bring one of the datanodes back into operation, say the same 1st one that was the primary, the cluster still remains stuck in the report_lsn state. The ways out of the situation are the same: either bring back all the datanodes that were unavailable, or remove the inaccessible node(s) from the cluster so that a new primary can be elected.
As a result, due to the unavailability of the 2nd datanode, we end up in the same situation as with the switchover.
In the failover situation it is clear, in principle, that everything is bad: we have neither a quorum nor a synchronous replica, so failover cannot work. What is not clear is why, after 2 datanodes become available again, the failover still does not happen.
In the case of a switchover, however, it is expected that we can swap the roles of the two datanodes, since they are available and fully functional.
Perhaps, to fix the problem with switchover, it makes sense to add some kind of timeout for receiving report_lsn, so that the unavailable node does not participate in the selection of a new primary and the role change between the two available datanodes can proceed?
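To illustrate the idea from the operator's side (purely a hypothetical sketch, not pg_auto_failover code): a helper that waits a bounded time for a node to reach a given state and gives up instead of hanging forever. The function name and the crude JSON grepping are my own invention; real parsing would use jq.

```shell
#!/bin/bash
# wait_until_state NODE STATE TIMEOUT_SECONDS
# Polls `pg_autoctl show state --json` once per second; returns 0 as soon as
# NODE reports STATE, or 1 if TIMEOUT_SECONDS elapse first, so a script can
# fall back to e.g. dropping dead nodes instead of waiting forever.
wait_until_state() {
    local node="$1" state="$2" timeout="$3" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # Crude matching: assumes node name and state appear on the same line.
        if pg_autoctl show state --json 2>/dev/null \
                | grep "\"$node\"" | grep -q "\"$state\""; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Example: wait at most 60s for node_3 to become primary, else bail out.
# wait_until_state node_3 primary 60 || echo "timed out; consider dropping dead nodes"
```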
If one of the three datanodes is unavailable, then a promotion freezes in the same way as a switchover.
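For completeness, as far as I recall the CLI, promotion is the targeted variant of a switchover and is run from the node you want promoted (the `--pgdata` path is an assumption):

```shell
# On the node to be promoted: ask the monitor for a failover targeting this node.
pg_autoctl perform promotion --pgdata /var/lib/postgresql/pgdata
```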
Perhaps you can suggest some parameters that could be configured to avoid such situations. Thank you.
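One knob that changes behaviour in this area (whether it applies to this exact hang is my assumption) is the formation's number-sync-standbys: setting it to 0 lets the primary keep committing when no synchronous standby is reachable, at the cost of weaker durability guarantees:

```shell
# Allow the primary to keep accepting writes with zero reachable sync standbys.
pg_autoctl set formation number-sync-standbys 0 --formation default

# Verify the current replication settings:
pg_autoctl show settings --formation default
```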