hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Impossible / unresolvable state after crash - How to recover? #883

Closed dfex55 closed 2 years ago

dfex55 commented 2 years ago

Our pg14 cluster is stuck in an unresolvable state.

After a simultaneous crash of the primary, the secondary, and the monitor, all three services were restarted automatically.

Before the crash, db001-2 was the primary and db001-1 was the in-sync secondary.

After the restart, the db001-1 instance came up in the state wait_primary; it is accepting read-write connections and already serving requests. The monitor wants to assign it the state stop_replication, which seems impossible from wait_primary. The node logs: pg_autoctl does not know how to reach state "stop_replication" from "wait_primary".

The db001-2 instance came back in the state demoted and cannot reach demote_timeout, the state the monitor assigns to it.

A restart of db001-2 did not help.

Is there a way to recover this cluster?

And what can be done to prevent this in the future?

Monitor state:

pg_autoctl version

pg_autoctl version 1.6.3
pg_autoctl extension version 1.6
compiled with PostgreSQL 14.0 (Ubuntu 14.0-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
compatible with Postgres 10, 11, 12, 13, and 14

pg_autoctl show state

   Name |  Node |    Host:Port |          TLI: LSN |   Connection |      Reported State |      Assigned State
--------+-------+--------------+-------------------+--------------+---------------------+--------------------
db001-1 |     1 | db001-1:5432 |   3: 17A/E8797EF0 |   read-write |        wait_primary |    stop_replication
db001-2 |   148 | db001-2:5432 |   2: 17A/E2EE9E08 |       none ! |             demoted |      demote_timeout
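For anyone who wants to spot this kind of mismatch automatically, here is a minimal Python sketch. It assumes the plain-text table layout shown above (columns separated by `|`, one dashed separator line); the function names `parse_show_state` and `unconverged` are made up for illustration and are not part of pg_auto_failover.

```python
def parse_show_state(text):
    """Parse the plain-text table printed by `pg_autoctl show state`.

    Assumes the layout shown in this issue: a header line, a dashed
    separator line, then one line per node, with columns split by '|'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [col.strip() for col in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= set("-+"):
            continue  # skip the dashed separator line
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def unconverged(rows):
    """Return (name, reported, assigned) for nodes that have not converged."""
    return [
        (r["Name"], r["Reported State"], r["Assigned State"])
        for r in rows
        if r["Reported State"] != r["Assigned State"]
    ]


# Table copied from the output above.
SAMPLE = """\
   Name |  Node |    Host:Port |          TLI: LSN |   Connection |      Reported State |      Assigned State
--------+-------+--------------+-------------------+--------------+---------------------+--------------------
db001-1 |     1 | db001-1:5432 |   3: 17A/E8797EF0 |   read-write |        wait_primary |    stop_replication
db001-2 |   148 | db001-2:5432 |   2: 17A/E2EE9E08 |       none ! |             demoted |      demote_timeout
"""

if __name__ == "__main__":
    for name, reported, assigned in unconverged(parse_show_state(SAMPLE)):
        print(f"{name}: reported {reported}, assigned {assigned}")
```

A node that is briefly unconverged is normal during a failover; the sketch only becomes useful as an alert when the same mismatch persists across repeated checks, as it does here.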

pg_autoctl show events

                    Event Time |   Node |       Current State |      Assigned State | Comment
-------------------------------+--------+---------------------+---------------------+-----------
  2022-04-22 17:54:28.50251+02 |  0/148 |            draining |            draining | New state is reported by node 148 "db001-2" (db001-2:5432): "draining"
 2022-04-22 17:54:28.507861+02 |    0/1 |   prepare_promotion |   prepare_promotion | New state is reported by node 1 "db001-1" (db001-1:5432): "prepare_promotion"
 2022-04-22 17:54:28.507861+02 |    0/1 |   prepare_promotion |    stop_replication | Setting goal state of node 148 "db001-2" (db001-2:5432) to demote_timeout and node 1 "db001-1" (db001-1:5432) to stop_replication after node 1 "db001-1" (db001-1:5432) converged to prepare_promotion.
 2022-04-22 17:54:28.507861+02 |  0/148 |            draining |      demote_timeout | Setting goal state of node 148 "db001-2" (db001-2:5432) to demote_timeout and node 1 "db001-1" (db001-1:5432) to stop_replication after node 1 "db001-1" (db001-1:5432) converged to prepare_promotion.
 2022-04-22 17:54:30.487307+02 |    0/1 |    stop_replication |    stop_replication | New state is reported by node 1 "db001-1" (db001-1:5432): "stop_replication"
 2022-04-22 17:55:31.633059+02 |  0/148 |             demoted |      demote_timeout | New state is reported by node 148 "db001-2" (db001-2:5432): "demoted"
 2022-04-22 17:55:34.407641+02 |  0/148 |             demoted |      demote_timeout | Node node 148 "db001-2" (db001-2:5432) is marked as unhealthy by the monitor
 2022-04-22 17:55:36.413744+02 |    0/1 |    stop_replication |    stop_replication | Node node 1 "db001-1" (db001-1:5432) is marked as unhealthy by the monitor
 2022-04-22 17:55:46.457623+02 |    0/1 |    stop_replication |    stop_replication | Node node 1 "db001-1" (db001-1:5432) is marked as healthy by the monitor
 2022-04-22 17:55:46.499458+02 |    0/1 |        wait_primary |    stop_replication | New state is reported by node 1 "db001-1" (db001-1:5432): "wait_primary"

Current read-write node (db001-1) logs (repeating):

18:11:08 30 INFO  Monitor assigned new state "stop_replication"
18:11:08 30 FATAL pg_autoctl does not know how to reach state "stop_replication" from "wait_primary"
18:11:08 30 ERROR Failed to transition to state "stop_replication", retrying... 

Other node (db001-2) logs (repeating):

18:12:07 32 INFO  Monitor assigned new state "demote_timeout"
18:12:07 32 FATAL pg_autoctl does not know how to reach state "demote_timeout" from "demoted"
18:12:07 32 ERROR Failed to transition to state "demote_timeout", retrying...
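The two stuck transitions can also be pulled out of the node logs directly. This is a small sketch that assumes the FATAL message wording shown above; `stuck_transitions` is a hypothetical helper name, not a pg_autoctl feature.

```python
import re

# Matches the FATAL message wording seen in the logs above.
UNREACHABLE = re.compile(
    r'does not know how to reach state "([^"]+)" from "([^"]+)"'
)


def stuck_transitions(log_text):
    """Return sorted, de-duplicated (current_state, assigned_state) pairs."""
    pairs = {(m.group(2), m.group(1)) for m in UNREACHABLE.finditer(log_text)}
    return sorted(pairs)


# Log lines copied from both nodes above.
LOGS = """\
18:11:08 30 INFO  Monitor assigned new state "stop_replication"
18:11:08 30 FATAL pg_autoctl does not know how to reach state "stop_replication" from "wait_primary"
18:12:07 32 INFO  Monitor assigned new state "demote_timeout"
18:12:07 32 FATAL pg_autoctl does not know how to reach state "demote_timeout" from "demoted"
"""

if __name__ == "__main__":
    for current, assigned in stuck_transitions(LOGS):
        print(f"no transition from {current} to {assigned}")
```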