vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability
https://nebula-graph.io
Apache License 2.0

Massive unnecessary Raft log rollback causes latency spikes. #5881

Open UndefinedSy opened 6 months ago

UndefinedSy commented 6 months ago

Please check the FAQ documentation before raising an issue

Describe the bug (required)
I've run into a scenario with latency spikes lasting for seconds: graphd's logs show that requests to a certain storaged host time out. Looking into that storaged's logs, many partitions (maybe all of the parts on which it is a follower) have encountered a RaftLog rollback.

However, there are no logs indicating a leader re-election or leader change, so the rollbacks should not be caused by an actual log inconsistency.

Your Environments (required)
A private branch forked from the master branch a long time ago, but the related code looks the same as on the master branch.

How To Reproduce (required)
No reliable way; it happens occasionally. In my case, it happens when storaged is under heavy load from a write stress test.

Expected behavior
It should not trigger a massive RaftLog rollback that leaves storaged unresponsive for seconds.

Additional context
I've taken a look at the RaftPart implementation and have some thoughts about the issue.

  1. A massive number of follower Raft parts are doing rollbacks, and the leaders of these parts have not changed, so this is probably not caused by a log inconsistency.
  2. From reading the source code, my understanding is that the following sequence can cause the rollback (a minimal sketch of this sequence follows below):
     - A previous AppendLog was sent to the follower with LogEntries [100, 103], commitId 99.
     - Due to the network or something else, the response was lost or did not return to the leader in time. As a result, the leader's last_log_id_sent is still 99.
     - The next time, the leader sends AppendLog to the follower with LogEntries [100, 110], commitId 99.
     - At this point, although [100, 103] in the local WAL file is consistent, rollbackToLog 102 is still triggered.

The corresponding code is in the attached screenshot (WeCom screenshot, image not included here).
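Since the screenshot may not render here, below is a minimal, self-contained model of the sequence I suspect. All names (FollowerWal, onAppendLog) are made up for illustration, and it assumes the follower rolls back to the leader's last_log_id_sent before appending, which is just my reading of the code, not the actual RaftPart / FileBasedWal implementation:

```cpp
// Toy model of the suspected sequence; names are illustrative only and do
// not match the real nebula RaftPart / FileBasedWal API.
#include <cstdint>
#include <iostream>

struct FollowerWal {
  int64_t lastLogId = 99;    // highest log id persisted in the local WAL
  int64_t committedId = 99;  // highest committed log id

  // Naive handling: if the leader's last_log_id_sent is behind the local WAL,
  // roll back to it unconditionally before appending, even when the
  // overlapping entries are identical to what is already on disk.
  void onAppendLog(int64_t lastLogIdSent, int64_t firstId, int64_t lastId) {
    if (lastLogIdSent < lastLogId) {
      std::cout << "rollbackToLog(" << lastLogIdSent << "): truncating "
                << (lastLogId - lastLogIdSent)
                << " entries that were in fact consistent\n";
      lastLogId = lastLogIdSent;  // an expensive WAL truncation in real life
    }
    std::cout << "append [" << firstId << ", " << lastId << "]\n";
    lastLogId = lastId;
  }
};

int main() {
  FollowerWal wal;
  // First request: entries [100, 103]; the response is lost on the way back.
  wal.onAppendLog(/*lastLogIdSent=*/99, /*firstId=*/100, /*lastId=*/103);
  // Retry: the leader still believes last_log_id_sent == 99 and resends
  // [100, 110]; the follower rolls back although [100, 103] already matches.
  wal.onAppendLog(/*lastLogIdSent=*/99, /*firstId=*/100, /*lastId=*/110);
}
```

Running this prints a rollbackToLog(99) on the retry even though nothing in the follower's WAL actually diverged.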

If this is the case, a simple solution might be the one in the attached screenshot (WeCom screenshot, image not included here).
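In case that screenshot is also missing, here is a rough sketch of the kind of check I mean: only roll back when an overlapping entry actually diverges by term; otherwise skip the already-matching prefix and append only the new suffix. Again, Entry, termById, and onAppendLog are illustrative stand-ins rather than the real WAL API, and the prevLogId/prevLogTerm consistency checks are omitted for brevity:

```cpp
// Sketch of the proposed check; names are illustrative, not the real nebula
// WAL API. prevLogId/prevLogTerm checks are omitted for brevity.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct Entry {
  int64_t id;
  int64_t term;
};

struct FollowerWal {
  std::map<int64_t, int64_t> termById;  // in-memory stand-in for the WAL
  int64_t lastLogId = 99;  // entries up to 99 are committed and elided here

  void rollbackToLog(int64_t logId) {  // an expensive truncation in real life
    std::cout << "rollbackToLog(" << logId << ")\n";
    termById.erase(termById.upper_bound(logId), termById.end());
    lastLogId = logId;
  }

  void onAppendLog(const std::vector<Entry>& entries) {
    size_t i = 0;
    // Skip the prefix of incoming entries that the WAL already holds with a
    // matching term; these are exactly the duplicates of a retried request.
    while (i < entries.size() && entries[i].id <= lastLogId &&
           termById.count(entries[i].id) &&
           termById.at(entries[i].id) == entries[i].term) {
      ++i;
    }
    // Only a genuine mismatch on an overlapping id forces a rollback.
    if (i < entries.size() && entries[i].id <= lastLogId) {
      rollbackToLog(entries[i].id - 1);
    }
    for (; i < entries.size(); ++i) {
      termById[entries[i].id] = entries[i].term;
      lastLogId = entries[i].id;
    }
  }
};

int main() {
  FollowerWal wal;
  std::vector<Entry> first, retry;
  for (int64_t id = 100; id <= 103; ++id) first.push_back({id, 5});
  for (int64_t id = 100; id <= 110; ++id) retry.push_back({id, 5});

  wal.onAppendLog(first);  // WAL now holds [100, 103]
  wal.onAppendLog(retry);  // no rollback: only [104, 110] is appended
  std::cout << "lastLogId = " << wal.lastLogId << "\n";  // prints 110
}
```

With a check like this, the retried request in the sequence above appends only [104, 110] and never truncates the WAL.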

UndefinedSy commented 6 months ago

I think this also leads to a parallel rollback issue when restarting a crashed storage instance, an issue we've observed in our production environment. Here's the sequence of events:

  1. The crashed instance has some logs that have been written to the WAL, but have not been committed.
  2. Upon recovery, the crashed instance receives new AppendLog requests containing some overlapping log entries.
  3. It then performs rollback operations, which take a lot of CPU time and eventually cause more problems.