scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

raft: if applier_fiber crashes, there's no way to know which command caused the crash #16049

Open kbr-scylla opened 8 months ago

kbr-scylla commented 8 months ago

If applier_fiber crashes due to an exception thrown from group0_state_machine::apply (which most likely indicates a bug in the code), we should be able to recover from that state. One way would be to prune the Raft log starting from the faulty command.

Unfortunately, the exception doesn't say where the failure happened. It looks like this:

ERROR 2023-11-14 10:37:36,931 [shard 0:main] raft - [6df5e7e8-b11a-4589-bedc-3207c0fb3ba9] applier fiber stopped because of the error: std::_Nested_exception<raft::state_machine_error> (State machine error at raft/server.cc:1218): seastar::internal::backtraced<exceptions::configuration_exception> (<my exception here>)

One idea to improve the situation is to report the last successfully applied index. Note that, in general, it's not possible to report the exact index of the crash-triggering command, because we may have merged several commands inside apply and only one of them caused the crash.
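A minimal sketch of that idea, assuming a simplified applier loop. Every name here (`command`, `apply_error`, `apply_batches`) is hypothetical and none of this is the code in raft/server.cc. It only shows how the last applied index and the index range of the failing batch could be attached to the error:

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the real Raft types.
using index_t = std::uint64_t;

struct command {
    index_t idx;
    std::string payload;
};

// Error that records what was applied and which batch was in flight.
// The exact failing command inside a merged batch remains unknown.
struct apply_error : std::runtime_error {
    index_t last_applied;  // last index known to have been applied
    index_t batch_first;   // first index of the failing batch
    index_t batch_last;    // last index of the failing batch

    apply_error(index_t applied, index_t first, index_t last, const std::string& what)
        : std::runtime_error("apply failed for batch [" + std::to_string(first) + ", "
                             + std::to_string(last) + "], last applied index "
                             + std::to_string(applied) + ": " + what)
        , last_applied(applied), batch_first(first), batch_last(last) {}
};

// Applies batches in order and returns the last applied index.
// On failure, rethrows with the index information attached.
index_t apply_batches(const std::vector<std::vector<command>>& batches,
                      const std::function<void(const std::vector<command>&)>& apply,
                      index_t last_applied = 0) {
    for (const auto& batch : batches) {
        if (batch.empty()) {
            continue;
        }
        try {
            apply(batch);
        } catch (const std::exception& e) {
            throw apply_error(last_applied, batch.front().idx, batch.back().idx, e.what());
        }
        last_applied = batch.back().idx;
    }
    return last_applied;
}
```

With this shape, the log message would narrow the search to the commands between `last_applied + 1` and `batch_last`, which is the best that can be done when commands are merged.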

cc @gleb-cloudius @xemul

kbr-scylla commented 8 months ago

Another idea is to use the group 0 history table to improve observability and debuggability. But there is currently no way to map group 0 history descriptions to command indexes.

Maybe we could extend the group 0 history table with additional columns and fill them with command indexes, or something similar?
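To make the proposal concrete, here is a rough in-memory model of what a history row with index columns could look like. The struct, its field names (`first_idx`, `last_idx`) and the lookup helper are all assumptions for illustration; they are not the actual table schema:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one history row extended with command indexes.
struct history_entry {
    std::string state_id;     // identifier of the group 0 state change
    std::string description;  // human-readable description of the change
    std::uint64_t first_idx;  // assumed new column: first Raft index covered
    std::uint64_t last_idx;   // assumed new column: last Raft index covered
};

// Returns the history entry whose index range contains idx, if any.
std::optional<history_entry> find_by_index(const std::vector<history_entry>& history,
                                           std::uint64_t idx) {
    for (const auto& e : history) {
        if (e.first_idx <= idx && idx <= e.last_idx) {
            return e;
        }
    }
    return std::nullopt;
}
```

A range rather than a single index is used because several commands may be merged into one apply call, so one history entry may correspond to more than one log index.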

BTW, we may want to disable garbage collection of the group 0 history table, or at least make the TTL very large (like 30 days). Maybe make it configurable?