Open kbr-scylla opened 8 months ago
Another idea is to somehow use the group 0 history table to improve observability / debuggability. But there is no way to map group 0 history descriptions to Raft command indexes.
Maybe we could extend the group 0 history table with additional columns and fill them with command indexes or something similar?
BTW, we may want to disable garbage collection of the group 0 history table, or at least make the TTL very large (like 30 days). (Maybe make it configurable?)
If `applier_fiber` crashes due to an exception from `group0_state_machine::apply` (which most likely indicates a bug in the code), we should be able to recover from that state. One way would be to prune the Raft log starting from the faulty command. Unfortunately, the exception doesn't say where the failure happened; it looks like this:
One idea to improve the situation is to report the last successfully applied index. Note that in general it's not possible to report the exact index of the crash-triggering command, because we may have merged some commands inside `apply` and only one of them caused the crash.
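
A minimal sketch of the "report the last successfully applied index" idea, not the actual Scylla code. All names here (`merged_group`, `apply_error`, `apply_groups`) are hypothetical stand-ins. Each `merged_group` models several log entries that were merged into a single application step, so on failure we only know the last index of the previous group, not which entry inside the failing group triggered the crash.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a Raft log index.
using index_t = uint64_t;

// Hypothetical: a run of log entries merged into one application step.
// last_idx is the index of the final entry covered by this group.
struct merged_group {
    index_t last_idx;
    std::function<void()> apply; // throws on failure
};

// Hypothetical exception type that carries the last index known to be applied.
class apply_error : public std::runtime_error {
public:
    apply_error(index_t last_applied, const std::string& cause)
        : std::runtime_error("apply failed; last successfully applied index: "
                             + std::to_string(last_applied) + "; cause: " + cause)
        , last_applied_idx(last_applied) {}

    index_t last_applied_idx;
};

// Applies groups in order and returns the last applied index.
// If a group throws, the error is rethrown annotated with the last index
// that was fully applied before the failing group.
index_t apply_groups(const std::vector<merged_group>& groups, index_t last_applied) {
    for (const auto& g : groups) {
        assert(g.last_idx > last_applied); // indexes must only move forward
        try {
            g.apply();
        } catch (const std::exception& e) {
            throw apply_error(last_applied, e.what());
        }
        last_applied = g.last_idx;
    }
    return last_applied;
}
```

With something like this, whoever catches the error would know that every entry up to `last_applied_idx` is safe and that the faulty command lies somewhere in the next merged group, which narrows down where pruning would have to start.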
cc @gleb-cloudius @xemul