Open kbr-scylla opened 5 months ago
User case where group0_state_machine is halted due to broken invariants (tablets metadata cannot be loaded): https://github.com/scylladb/scylladb/issues/20039
A manual recovery tool would be useful for such a case, but a simple "copy group 0 state from one node and paste it to the others" would not work. The tool would have to fix the state as well, e.g. by disabling ongoing tablet migrations.
Added "group 0 bugs" to the title
To recover from a majority loss situation, we revert to gossip mode.
We want to get rid of gossip-based topology operations (https://github.com/scylladb/scylladb/issues/15422) -- for that, we need a different way to perform recovery. I propose we design an external tool which would connect to all remaining live nodes, reconcile their topology metadata (including stuff like tablets), decide which nodes to remove, and coordinate recovery of group 0 state.
This will also help in situations where group0_state_machine gets stuck, e.g. because bugs broke its invariants, as in https://github.com/scylladb/scylladb/issues/20039
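To make the proposal above a bit more concrete, here is a minimal sketch of the "reconcile and decide which nodes to remove" step. It is only an illustration, not an existing tool or ScyllaDB API: `NodeReport`, `state_version`, `members` and `plan_recovery` are all hypothetical names, and the sketch assumes each live node can report some comparable version of its group 0 state. It does not attempt to repair the state itself (e.g. disabling ongoing tablet migrations), which a real tool would also need to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeReport:
    """What one live node reports to the tool (all fields are hypothetical)."""
    node_id: str
    state_version: int        # stand-in for how recent the node's group 0 state is
    members: frozenset        # group 0 members this node knows about


def plan_recovery(reports):
    """Pick a source node for the recovered state and the members to remove.

    Returns (source_node_id, nodes_to_remove).
    """
    if not reports:
        raise ValueError("no live nodes to recover from")

    live = {r.node_id for r in reports}

    # Use the node with the most recent state as the source;
    # break ties by node_id so the choice is deterministic.
    source = max(reports, key=lambda r: (r.state_version, r.node_id))

    # Every member known to any live node that did not respond
    # is treated as dead and scheduled for removal.
    known = set().union(*(r.members for r in reports))
    to_remove = known - live

    return source.node_id, to_remove


if __name__ == "__main__":
    reports = [
        NodeReport("a", 5, frozenset({"a", "b", "c"})),
        NodeReport("b", 7, frozenset({"a", "b", "c", "d"})),
    ]
    source, to_remove = plan_recovery(reports)
    print(source, sorted(to_remove))
```

Choosing the source by highest version is an assumption made for the sketch; in the broken-invariants case the most recent state may be exactly the one that needs fixing, so a real tool would likely need validation before picking a source.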
cc @gleb-cloudius @tgrabiec @bhalevy @avikivity