hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability
Other
1.07k stars 113 forks source link

Switchover hanging on "report_lsn" state #928

Closed legrandlegrand closed 1 year ago

legrandlegrand commented 1 year ago

Hello, I'm testing pg_auto_failover, et found it very interesting ! I had a strange behavior, that I would like to share with you to get a better understanding.

Configuration:

initial state:

-bash-4.2$ pg_autoctl show state
   Name |  Node |      Host:Port |        TLI: LSN |   Connection |      Reported State |      Assigned State
--------+-------+----------------+-----------------+--------------+---------------------+--------------------
 node_9 |     9 | localhost:6003 |  39: 0/25286870 |    read-only |           secondary |           secondary
 node_7 |    16 | localhost:6001 |  39: 0/25286870 |    read-only |           secondary |           secondary
node_50 |    50 | localhost:6002 |  39: 0/25286870 |   read-write |             primary |             primary

then trying to perform a switchover with pg3 node down

-bash-4.2$ echo $PGDATA
/xxx/pg3
-bash-4.2$ pg_autoctl stop

-bash-4.2$ pg_autoctl perform switchover
07:26:46 9171 INFO  Targetting group 0 in formation "default"
07:26:46 9171 INFO  Listening monitor notifications about state changes in formation "default" and group 0
07:26:46 9171 INFO  Following table displays times when notifications are received
    Time |    Name |  Node |      Host:Port |       Current State |      Assigned State
---------+---------+-------+----------------+---------------------+--------------------
07:26:46 | node_50 |    50 | localhost:6002 |             primary |            draining
07:26:46 | node_50 |    50 | localhost:6002 |            draining |            draining
07:26:46 |  node_7 |    16 | localhost:6001 |           secondary |          report_lsn
07:26:46 | node_50 |    50 | localhost:6002 |            draining |          report_lsn
07:26:47 |  node_7 |    16 | localhost:6001 |          report_lsn |          report_lsn
07:26:47 | node_50 |    50 | localhost:6002 |          report_lsn |          report_lsn

and it hangs until node 3 restart (using pg_auto_failover run) and connections to postgres is not possible.

psql 'host=localhost,localhost,localhost port=6001,6002,6003 dbname=postgres user=postgres target_session_attrs=read-write'
psql: error: connection to server at "localhost" (::1), port 6001 failed: session is read-only
connection to server at "localhost" (::1), port 6002 failed: session is read-only
connection to server at "localhost" (::1), port 6003 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 6003 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

My questions: 1/ is that the expected behaviour ? 2/ if expected behaviour, should SWITCHOVER be allowed ? 3/ if node 3 can not be restarted, what is the faster way to restart service and allow postgres connections

Thanks in advance Regards PAscal

legrandlegrand commented 1 year ago

hang on report_lsn.txt

DimCitus commented 1 year ago

Hi @legrandlegrand ; thanks for reporting an issue here. One thing that is strange to me is why would you first stop a node and then perform a switchover? pg_autoctl is meant to take care of doing the failover automatically, you don't have to ask for it when a node is down, just wait for it to happen. It takes about 30s.

It might be a bug that you found here, when you attempt a manual switchover when a node is down but not detected down by the monitor yet. But I am still going to close that issue. Please consider re-opening it if that workflow is important in your case.

legrandlegrand commented 1 year ago

Thank you Dimitri, for this answer. I had to evaluate pg_auto_autofailover for a postgres HA project that is now frozen :o( Then this issue had many goals: