openmessaging / dledger

A Raft-based Java library for building highly available, highly durable, strongly consistent commit logs.
Apache License 2.0

Missing detection of disk availability when sending heartbeats. #287

Open cserwen opened 1 year ago

cserwen commented 1 year ago

In RocketMQ DLedger, when a disk that stores data or index files fails, the broker group can end up in a No-Master state. The problem unfolds as follows:

Process

Initial state: term=2, n2 is the Master node

Question

If the Master stops sending heartbeats to a follower, the follower triggers an election; but as long as heartbeats keep arriving normally, the follower will not initiate one.

Disk failures are currently detected through the lock on the memberState object. The lock is held while a message is being written. If the disk fails, the write hangs and the lock is not released in time, so the heartbeat thread cannot acquire it and heartbeats stop; that is how the failure becomes visible to the followers. Writing a message is therefore the only trigger for detecting a disk failure: if clients stop writing, the heartbeat thread can always acquire the lock, and it keeps sending heartbeats.

[Figure: dledger-lock diagram]
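
To make the lock interaction concrete, here is a minimal sketch of the behaviour described above. It is not DLedger code: `HeartbeatLockSketch`, `startWrite` and `trySendHeartbeat` are hypothetical names, and a `ReentrantLock` with a timeout stands in for the memberState lock, so the real implementation may differ in detail.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the writer / heartbeat interaction; not the DLedger classes.
final class HeartbeatLockSketch {
    // Stand-in for the memberState lock (assumption: modelled as a ReentrantLock).
    private final ReentrantLock stateLock = new ReentrantLock();

    /** Starts a write that holds the lock for as long as the (possibly hung) disk takes. */
    Thread startWrite(long diskLatencyMillis) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            stateLock.lock();
            try {
                lockHeld.countDown();
                Thread.sleep(diskLatencyMillis); // simulated disk I/O
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                stateLock.unlock();
            }
        }, "writer");
        writer.setDaemon(true);
        writer.start();
        try {
            lockHeld.await(); // return only once the writer really owns the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return writer;
    }

    /** A heartbeat goes out only if the lock can be acquired within the timeout. */
    boolean trySendHeartbeat(long timeoutMillis) {
        try {
            if (!stateLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // a writer is stuck on the disk: heartbeat is skipped
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            return true; // the real code would send heartbeats to the followers here
        } finally {
            stateLock.unlock();
        }
    }
}
```

With no write in flight, `trySendHeartbeat` always succeeds, no matter what state the disk is in. That is exactly the gap this issue is about.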

TODO

If no data is being written, the node with the faulty disk can still become the Master. I therefore think we need to add a task that periodically checks whether the disks are available, to avoid this situation.
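
A rough sketch of what such a periodic check could look like is below. Everything in it is an assumption for illustration only: the class name `DiskHealthChecker`, the probe file name, and the idea of writing and reading back a small file with a timeout. How an unhealthy result should affect the node (for example, a Master stepping down or a candidate not campaigning) is left open.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of DLedger: probes a store directory on a schedule.
final class DiskHealthChecker {
    private final Path probeFile;
    private final long timeoutMillis;
    // The I/O runs on its own daemon thread so a hung disk cannot block the caller.
    private final ExecutorService ioExecutor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "disk-probe-io");
        t.setDaemon(true);
        return t;
    });
    private volatile boolean healthy = true;

    DiskHealthChecker(Path storeDir, long timeoutMillis) {
        this.probeFile = storeDir.resolve(".disk_probe"); // hypothetical file name
        this.timeoutMillis = timeoutMillis;
    }

    /** Writes a small file synchronously, reads it back, and records the result. */
    boolean probeOnce() {
        byte[] payload = Long.toString(System.nanoTime()).getBytes(StandardCharsets.UTF_8);
        Future<Boolean> result = ioExecutor.submit(() -> {
            Files.write(probeFile, payload, StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.DSYNC);
            byte[] readBack = Files.readAllBytes(probeFile);
            Files.deleteIfExists(probeFile);
            return Arrays.equals(payload, readBack);
        });
        try {
            healthy = result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            result.cancel(true); // I/O error or hung disk
            healthy = false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            healthy = false;
        }
        return healthy;
    }

    boolean isHealthy() {
        return healthy;
    }

    /** Runs the probe periodically on a scheduler supplied by the caller. */
    ScheduledFuture<?> start(ScheduledExecutorService scheduler, long periodMillis) {
        return scheduler.scheduleWithFixedDelay(this::probeOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
    }
}
```

The heartbeat and election logic could then consult `isHealthy()` regardless of whether any client is writing, so detection no longer depends on write traffic.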

humkum commented 1 year ago

I'd like to follow this issue. Plz assign this issue to me, thanks.

TheR1sing3un commented 1 year ago

> I'd like to follow this issue. Plz assign this issue to me, thanks.

Welcome~ You can write a brief improvement proposal~