Bukhtawar opened this issue 3 years ago
Can you add the steps to reproduce this issue, or write a test case for it?
I don't see the FS health check failure fixing this problem, because once the health check fails, onLeaderFailure will block again on the same mutex. We need to somehow add timeouts to the cluster state persistence calls. I believe once that is fixed, this issue will go away, irrespective of the FS health check on stuck IO (essentially your point 1 in the Proposal above). That change may be tricky to achieve. The FS health check on stuck IO is a good change in itself, but not a fix for this issue.
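A minimal sketch of what a timeout on the cluster state persistence call could look like, assuming the write can be handed off to a dedicated thread so the caller can give up and fail the leader instead of waiting indefinitely; TimedPersistence, persistWithTimeout, and writeFullStateAndCommit are illustrative names, not the actual OpenSearch API:

```java
import java.util.concurrent.*;

/**
 * Hypothetical sketch: bound the cluster state persistence call with a timeout
 * so a write stuck on IO cannot hold the publication path hostage forever.
 */
public class TimedPersistence {

    private final ExecutorService persistenceExecutor = Executors.newSingleThreadExecutor();

    void persistWithTimeout(Runnable writeFullStateAndCommit, long timeoutMillis) throws Exception {
        Future<?> write = persistenceExecutor.submit(writeFullStateAndCommit);
        try {
            write.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on the stuck write and surface the failure so the leader
            // can fail itself, instead of blocking onLeaderFailure behind it.
            write.cancel(true);
            throw e;
        }
    }
}
```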
FS health checks do proactive checks to identify a bad node and evict it from the cluster, rather than waiting for a cluster state update to remove the stuck leader. You rightly pointed out that this fix by itself is insufficient, since the mutex for onLeaderFailed will still get BLOCKED, which is what the first proposal tries to fix. Moreover, FS health checks would help with such situations across data nodes in general.
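For context, a rough sketch of such a proactive filesystem check, assuming it amounts to periodically writing and syncing a small probe file on the data path and flagging the node if the write fails or is too slow; FsHealthCheck and its threshold are illustrative names, not the implementation in #1167:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

/**
 * Illustrative proactive FS health probe: write and sync a small file on the
 * data path and report unhealthy if the write fails or exceeds a threshold.
 */
public class FsHealthCheck {

    private final Path dataPath;
    private final long slowWriteThresholdMillis;

    FsHealthCheck(Path dataPath, long slowWriteThresholdMillis) {
        this.dataPath = dataPath;
        this.slowWriteThresholdMillis = slowWriteThresholdMillis;
    }

    boolean isHealthy() {
        Path probe = dataPath.resolve(".fs-health-probe");
        long start = System.nanoTime();
        try {
            Files.write(probe, "probe".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.SYNC); // SYNC pushes the write through to the device
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            return elapsedMillis <= slowWriteThresholdMillis;
        } catch (IOException e) {
            return false; // any IO failure marks the node unhealthy
        }
    }
}
```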
Hi Bukhtawar, are you actively working on this issue? I see that pull request #1167 has already been submitted. Is anything else pending before we can close the issue?
We don't seem to have any metrics on how frequently this issue manifests, but it appears to happen when the filesystem in use is not a local disk (e.g., EBS).
@Bukhtawar is there anything else pending on this issue before we close it?
Describe the bug
The publication of the cluster state is time bound to 30s by the cluster.publish.timeout setting. If this time is reached before the new cluster state is committed, the cluster state change is rejected and the leader considers itself to have failed. It stands down and starts trying to elect a new master.
There is a bug in the leader: when it tries to publish the new cluster state, it first acquires a lock (0x0000000097a2f970) to flush the new state to disk under a mutex. The same lock (0x0000000097a2f970) is used to cancel the publication on timeout. Below is the state of the timeout scheduler meant to cancel the publication. So essentially, if the flushing of the cluster state is stuck on IO, so is the cancellation of the publication, since both share the same mutex. As a result the leader will not step down, effectively blocking the cluster from making progress.
FS Health checks at this point
Leader trying to commit the new cluster state to disk, causing other operations to stall on the same mutex.
Note that other processing, like the Follower Checker's remove-node handling, is stuck on the same mutex.
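To make the locking pattern concrete, here is a minimal, self-contained sketch (not OpenSearch code) of a flush path and a cancellation path synchronizing on the same monitor, so a flush stuck on IO also blocks the timeout cancellation:

```java
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch of the shared-mutex problem described above: the state
 * flush and the publication cancellation use the same lock, so a flush stuck
 * on IO also blocks the timeout path and the leader never steps down.
 */
public class SharedMutexPublication {

    private final Object mutex = new Object(); // stands in for lock 0x0000000097a2f970

    // Publication thread: flushes the new cluster state to disk under the mutex.
    void publish() {
        synchronized (mutex) {
            flushStateToDisk(); // if this hangs on IO, the mutex is never released
        }
    }

    // Timeout scheduler after cluster.publish.timeout (30s): needs the same
    // mutex, so it blocks behind the stuck flush.
    void cancelPublicationOnTimeout() {
        synchronized (mutex) {
            markPublicationFailed();
        }
    }

    private void flushStateToDisk() {
        try {
            // Simulate an fsync stuck on a bad disk or unresponsive remote volume.
            TimeUnit.HOURS.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void markPublicationFailed() {
        // Would normally reject the publication and let the leader stand down.
    }
}
```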
Proposal