Please check the FAQ documentation before raising an issue
Describe the bug (required)
I've run into a scenario with latency spikes lasting several seconds: graphd's log shows that requests to a certain storaged host time out. Looking into that storaged's log, many partitions (possibly all the parts acting as followers) have encountered a RaftLog rollback.
However, there are no logs indicating leader re-election or leader change, so this should not involve a log inconsistency.
Your Environments (required)
A private branch forked from master a long time ago, but the related code looks the same as the current master branch.
How To Reproduce (required)
No idea, it happens occasionally.
In my case, it happens when the storaged is under heavy load from a write pressure test.
Expected behavior
It should not trigger massive RaftLog rollbacks that leave the storaged unresponsive for seconds.
Additional context
I've taken a look at the RaftPart implementation and have some thoughts about the issue.
Many follower Raft parts are doing rollbacks while their leaders have not changed, so this is probably not caused by a log inconsistency.
From reading the source code, my understanding is that there is a case that can cause the rollback:
1. A previous AppendLog request was sent to the follower with LogEntries [100, 103] and commitId 99.
2. Due to the network or something else, the response was lost or did not reach the leader in time. As a result, the leader's last_log_id_sent is still 99.
3. The next time, the leader sends AppendLog to the follower with LogEntries [100, 110] and commitId 99.
4. At this point, although [100, 103] in the local wal file is consistent with the incoming entries, rollbackToLog(102) will still be triggered.
I think this also leads to a parallel rollback issue when restarting a crashed storage instance, an issue we've observed in our production environment. Here's the sequence of events:
1. The crashed instance has some logs that have been written to the WAL but not yet committed.
2. Upon recovery, it receives new AppendLog requests that contain some overlapping log entries.
3. It then performs rollback operations, which burn a lot of CPU time and eventually cause further problems.
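If this is the case, a simple solution might be to compare the overlapping entries with what is already in the local wal (same log id and term) and only trigger the rollback when they actually diverge; otherwise the follower could skip the duplicates and append only the tail. Below is a rough, self-contained sketch of the idea with toy types; it is not the actual RaftPart / FileBasedWal API:

```cpp
// Toy model of the overlap check; LogEntry and the wal layout here are
// illustrative, not the real nebula-storage types.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct LogEntry {
  int64_t id;
  int64_t term;
};

// Return the id of the first incoming entry whose term conflicts with the
// entry already stored locally, or std::nullopt if the overlapping part
// matches (or there is no overlap at all).
std::optional<int64_t> firstConflict(const std::vector<LogEntry>& wal,
                                     const std::vector<LogEntry>& incoming) {
  for (const auto& e : incoming) {
    auto it = std::find_if(wal.begin(), wal.end(),
                           [&](const LogEntry& w) { return w.id == e.id; });
    if (it == wal.end()) {
      break;  // Past the end of the local wal: nothing left to compare.
    }
    if (it->term != e.term) {
      return e.id;  // Real divergence: rolling back from here is justified.
    }
  }
  return std::nullopt;  // Overlap is identical: no rollback needed.
}

int main() {
  // Follower already holds [100, 103] from the first, unacknowledged AppendLog.
  std::vector<LogEntry> wal = {{100, 5}, {101, 5}, {102, 5}, {103, 5}};
  // Leader re-sends [100, 110] because its last_log_id_sent is still 99.
  std::vector<LogEntry> resend;
  for (int64_t id = 100; id <= 110; ++id) {
    resend.push_back({id, 5});
  }

  if (auto conflict = firstConflict(wal, resend)) {
    std::cout << "conflict at log " << *conflict << ", rollback needed\n";
  } else {
    std::cout << "no conflict: skip [100, 103] and append [104, 110] only\n";
  }
  return 0;
}
```

With the re-sent request described above, this check finds no conflict, so the follower would only append [104, 110] instead of rolling back and rewriting logs it already has.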