Open Cyrill opened 3 months ago
I agree with you; that situation could definitely happen. I suspect the problem arises from sharing the RpcServer. The RaftGroupService closes before the RpcServer, causing a delay in processing new requests because the raft service is already closed by then.
I managed to hit an AssertionError in AppendEntriesRequestProcessor. Apparently, there is a race. The crash was observed on a custom branch, though the code in master is the same.
First, the code:
AppendEntriesRequestProcessor.PeerExecutorSelector has the following code (unrelated lines intentionally removed):
getOrCreatePeerRequestContext looks as follows:

Execution flow
I don't have specific code to reproduce this issue, but the flow is simple. I observed a slight delay in messaging/threads, which ended in an error.
My assumptions regarding the execution flow are:
1. select is called.
2. NodeManager.getInstance().get(groupId, peer) returns a non-null result, so execution continues to getOrCreatePeerRequestContext.
3. NodeManager.getInstance().remove() is called for this node.
4. In getOrCreatePeerRequestContext, the result of final Node node = NodeManager.getInstance().get(groupId, peer); is null, since the node was removed moments ago.
5. assert (node != null); fails.
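
The flow above is a check-then-act race: two separate lookups with a window between them. Here is a minimal sketch of that interleaving, replayed deterministically on one thread. Everything in it is hypothetical: `Registry` is only a stand-in for the node registry and does not reproduce the real NodeManager API, and `lookupOnce` is just one possible defensive shape, not a proposed patch.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names here are illustrative, not the real JRaft classes.
public class LookupRaceSketch {

    // Stand-in for a registry keyed by "groupId/peer".
    public static final class Registry {
        private final Map<String, Object> nodes = new ConcurrentHashMap<>();

        public void add(String key, Object node) { nodes.put(key, node); }
        public Object get(String key) { return nodes.get(key); }
        public void remove(String key) { nodes.remove(key); }
    }

    // Racy shape: two independent lookups. The Runnable simulates what
    // another thread does in the window between them (steps 2 to 4 above).
    public static boolean secondLookupSeesNode(Registry registry, String key,
                                               Runnable betweenLookups) {
        if (registry.get(key) == null) {
            return false; // first check failed, nothing to do
        }
        betweenLookups.run(); // e.g. a concurrent remove()
        // The code under discussion asserts that this lookup is non-null.
        return registry.get(key) != null;
    }

    // Defensive shape: look up once and let the caller handle absence
    // explicitly instead of asserting on a second lookup.
    public static Optional<Object> lookupOnce(Registry registry, String key) {
        return Optional.ofNullable(registry.get(key));
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.add("group1/peer1", new Object());

        boolean seen = secondLookupSeesNode(registry, "group1/peer1",
                () -> registry.remove("group1/peer1"));
        System.out.println("second lookup saw node: " + seen);
    }
}
```

With the removal landing between the two lookups, the second lookup returns null even though the first succeeded, which is the condition the assert trips on. Doing a single lookup and passing the reference along (or treating a null as "node is shutting down") would avoid relying on the registry staying unchanged between the calls.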