vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability
https://nebula-graph.io
Apache License 2.0

Massive unnecessary Raft log rollback causes latency spikes. #5881

Open UndefinedSy opened 6 months ago

UndefinedSy commented 6 months ago

Please check the FAQ documentation before raising an issue

Describe the bug (required)
I've run into a scenario with latency spikes lasting for seconds: graphd's logs show that requests to a certain storaged host time out. Looking into that storaged's logs, many partitions (maybe all of the parts on which it is a follower) have encountered a RaftLog rollback.

However, there are no logs indicating a leader re-election or leader change, so the rollbacks should not be caused by an actual log inconsistency.

Your Environments (required)
A private branch forked from the master branch a long time ago, but the related code looks the same as on the master branch.

How To Reproduce (required)
No reliable way; it happens occasionally. In my case, it happens when storaged is under heavy load from a write stress test.

Expected behavior
It should not trigger a massive RaftLog rollback that leaves storaged unresponsive for seconds.

Additional context
I've taken a look at the RaftPart implementation and have some thoughts about the issue.

  1. A massive number of follower Raft parts are doing rollbacks, and the leaders of these parts have not changed, so this is probably not caused by a log inconsistency.
  2. From reading the source code, my understanding is that the following sequence can cause the rollback (a minimal sketch of this sequence follows below):
     - A previous AppendLog was sent to the follower with LogEntries [100, 103], commitId 99.
     - Due to the network or something else, the response was lost or did not return to the leader in time. As a result, the leader's last_log_id_sent is still 99.
     - The next time, the leader sends AppendLog to the follower with LogEntries [100, 110], commitId 99.
     - At this point, although [100, 103] in the local WAL file is consistent, rollbackToLog 102 is still triggered.

The corresponding code is in the attached screenshot (WeCom screenshot, image not included here).
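Since the screenshot may not render here, below is a minimal, self-contained model of the sequence I suspect. All names (FollowerWal, onAppendLog) are made up for illustration, and it assumes the follower rolls back to the leader's last_log_id_sent before appending, which is just my reading of the code, not the actual RaftPart / FileBasedWal implementation:

```cpp
// Toy model of the suspected sequence; names are illustrative only and do
// not match the real nebula RaftPart / FileBasedWal API.
#include <cstdint>
#include <iostream>

struct FollowerWal {
  int64_t lastLogId = 99;    // highest log id persisted in the local WAL
  int64_t committedId = 99;  // highest committed log id

  // Naive handling: if the leader's last_log_id_sent is behind the local WAL,
  // roll back to it unconditionally before appending, even when the
  // overlapping entries are identical to what is already on disk.
  void onAppendLog(int64_t lastLogIdSent, int64_t firstId, int64_t lastId) {
    if (lastLogIdSent < lastLogId) {
      std::cout << "rollbackToLog(" << lastLogIdSent << "): truncating "
                << (lastLogId - lastLogIdSent)
                << " entries that were in fact consistent\n";
      lastLogId = lastLogIdSent;  // an expensive WAL truncation in real life
    }
    std::cout << "append [" << firstId << ", " << lastId << "]\n";
    lastLogId = lastId;
  }
};

int main() {
  FollowerWal wal;
  // First request: entries [100, 103]; the response is lost on the way back.
  wal.onAppendLog(/*lastLogIdSent=*/99, /*firstId=*/100, /*lastId=*/103);
  // Retry: the leader still believes last_log_id_sent == 99 and resends
  // [100, 110]; the follower rolls back although [100, 103] already matches.
  wal.onAppendLog(/*lastLogIdSent=*/99, /*firstId=*/100, /*lastId=*/110);
}
```

Running this prints a rollbackToLog(99) on the retry even though nothing in the follower's WAL actually diverged.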

If this is the case, a simple solution might be the one in the attached screenshot (WeCom screenshot, image not included here).
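In case that screenshot is also missing, here is a rough sketch of the kind of check I mean: only roll back when an overlapping entry actually diverges by term; otherwise skip the already-matching prefix and append only the new suffix. Again, Entry, termById, and onAppendLog are illustrative stand-ins rather than the real WAL API, and the prevLogId/prevLogTerm consistency checks are omitted for brevity:

```cpp
// Sketch of the proposed check; names are illustrative, not the real nebula
// WAL API. prevLogId/prevLogTerm checks are omitted for brevity.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct Entry {
  int64_t id;
  int64_t term;
};

struct FollowerWal {
  std::map<int64_t, int64_t> termById;  // in-memory stand-in for the WAL
  int64_t lastLogId = 99;  // entries up to 99 are committed and elided here

  void rollbackToLog(int64_t logId) {  // an expensive truncation in real life
    std::cout << "rollbackToLog(" << logId << ")\n";
    termById.erase(termById.upper_bound(logId), termById.end());
    lastLogId = logId;
  }

  void onAppendLog(const std::vector<Entry>& entries) {
    size_t i = 0;
    // Skip the prefix of incoming entries that the WAL already holds with a
    // matching term; these are exactly the duplicates of a retried request.
    while (i < entries.size() && entries[i].id <= lastLogId &&
           termById.count(entries[i].id) &&
           termById.at(entries[i].id) == entries[i].term) {
      ++i;
    }
    // Only a genuine mismatch on an overlapping id forces a rollback.
    if (i < entries.size() && entries[i].id <= lastLogId) {
      rollbackToLog(entries[i].id - 1);
    }
    for (; i < entries.size(); ++i) {
      termById[entries[i].id] = entries[i].term;
      lastLogId = entries[i].id;
    }
  }
};

int main() {
  FollowerWal wal;
  std::vector<Entry> first, retry;
  for (int64_t id = 100; id <= 103; ++id) first.push_back({id, 5});
  for (int64_t id = 100; id <= 110; ++id) retry.push_back({id, 5});

  wal.onAppendLog(first);  // WAL now holds [100, 103]
  wal.onAppendLog(retry);  // no rollback: only [104, 110] is appended
  std::cout << "lastLogId = " << wal.lastLogId << "\n";  // prints 110
}
```

With a check like this, the retried request in the sequence above appends only [104, 110] and never truncates the WAL.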

UndefinedSy commented 6 months ago

I think this also leads to a parallel rollback issue when restarting a crashed storage instance, an issue we've observed in our production environment. Here's the sequence of events:

  1. The crashed instance has some logs that have been written to the WAL, but have not been committed.
  2. Upon recovery, the crashed instance receives new AppendLog requests containing some overlapping log entries.
  3. It then performs rollback operations, which take a lot of CPU time and eventually cause more problems.