We are stuck in an unresolvable state on a Postgres 14 cluster managed by pg_auto_failover.
After a simultaneous crash of the primary, the secondary, and the monitor, all three services were restarted automatically.
Before the crash, db001-2 was the primary and db001-1 the in-sync secondary.
After the restart, the db001-1 instance came up in the state wait_primary; it accepts read-write connections and is already serving requests.
The monitor wants to assign it the state stop_replication, which seems impossible: the node logs "pg_autoctl does not know how to reach state "stop_replication" from "wait_primary"".
The db001-2 instance came back in the state demoted and cannot reach demote_timeout (as the monitor wants it to).
A restart of db001-2 did not help.
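(For reference, the restart was a plain service restart; the unit name below assumes the stock systemd unit generated by "pg_autoctl show systemd", so adjust it if your setup differs.)

# on db001-2: restart the keeper, then watch it retry the transition
sudo systemctl restart pgautofailover.service
sudo journalctl -u pgautofailover.service -f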
Is there a possible solution to recover this cluster?
And what can be done to prevent this in the future?
Monitor state:
pg_autoctl version
pg_autoctl version 1.6.3
pg_autoctl extension version 1.6
compiled with PostgreSQL 14.0 (Ubuntu 14.0-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
compatible with Postgres 10, 11, 12, 13, and 14
pg_autoctl show events
Event Time | Node | Current State | Assigned State | Comment
-------------------------------+--------+---------------------+---------------------+-----------
2022-04-22 17:54:28.50251+02 | 0/148 | draining | draining | New state is reported by node 148 "db001-2" (db001-2:5432): "draining"
2022-04-22 17:54:28.507861+02 | 0/1 | prepare_promotion | prepare_promotion | New state is reported by node 1 "db001-1" (db001-1:5432): "prepare_promotion"
2022-04-22 17:54:28.507861+02 | 0/1 | prepare_promotion | stop_replication | Setting goal state of node 148 "db001-2" (db001-2:5432) to demote_timeout and node 1 "db001-1" (db001-1:5432) to stop_replication after node 1 "db001-1" (db001-1:5432) converged to prepare_promotion.
2022-04-22 17:54:28.507861+02 | 0/148 | draining | demote_timeout | Setting goal state of node 148 "db001-2" (db001-2:5432) to demote_timeout and node 1 "db001-1" (db001-1:5432) to stop_replication after node 1 "db001-1" (db001-1:5432) converged to prepare_promotion.
2022-04-22 17:54:30.487307+02 | 0/1 | stop_replication | stop_replication | New state is reported by node 1 "db001-1" (db001-1:5432): "stop_replication"
2022-04-22 17:55:31.633059+02 | 0/148 | demoted | demote_timeout | New state is reported by node 148 "db001-2" (db001-2:5432): "demoted"
2022-04-22 17:55:34.407641+02 | 0/148 | demoted | demote_timeout | Node node 148 "db001-2" (db001-2:5432) is marked as unhealthy by the monitor
2022-04-22 17:55:36.413744+02 | 0/1 | stop_replication | stop_replication | Node node 1 "db001-1" (db001-1:5432) is marked as unhealthy by the monitor
2022-04-22 17:55:46.457623+02 | 0/1 | stop_replication | stop_replication | Node node 1 "db001-1" (db001-1:5432) is marked as healthy by the monitor
2022-04-22 17:55:46.499458+02 | 0/1 | wait_primary | stop_replication | New state is reported by node 1 "db001-1" (db001-1:5432): "wait_primary"
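(The events above come from the monitor. Assuming the monitor stores them in the pgautofailover.event table, as the bundled monitor extension does, they can also be queried there directly; the column name in the ORDER BY is an assumption and may differ between versions.)

# on the monitor host, as the postgres user; pg_auto_failover is the default monitor database name
psql -d pg_auto_failover -c 'SELECT * FROM pgautofailover.event ORDER BY eventtime DESC LIMIT 10;'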
Current read-write node (db001-1) logs (repeating):
18:11:08 30 INFO Monitor assigned new state "stop_replication"
18:11:08 30 FATAL pg_autoctl does not know how to reach state "stop_replication" from "wait_primary"
18:11:08 30 ERROR Failed to transition to state "stop_replication", retrying...
Other node (db001-2) logs (repeating):
18:12:07 32 INFO Monitor assigned new state "demote_timeout"
18:12:07 32 FATAL pg_autoctl does not know how to reach state "demote_timeout" from "demoted"
18:12:07 32 ERROR Failed to transition to state "demote_timeout", retrying...
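For completeness, the only workaround we can think of so far is to drop db001-2 from the formation and re-register it as a fresh secondary, roughly as sketched below (the data directory path and monitor URI are placeholders for our setup); we would prefer a solution that avoids re-cloning the node:

# on db001-2; --destroy also removes the local data directory
pg_autoctl drop node --pgdata /var/lib/postgresql/14/main --destroy

# re-register the node against the monitor, then start the keeper again
pg_autoctl create postgres \
    --pgdata /var/lib/postgresql/14/main \
    --hostname db001-2 \
    --monitor 'postgres://autoctl_node@monitor-host:5432/pg_auto_failover?sslmode=prefer'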