stapelberg opened this issue 2 years ago
Hey @stapelberg, thanks for filing this and doing some great detective work so far!
I've not looked at the code to verify all of this, but I think this is an issue as you described: there are no timeouts around disk writes, which happen synchronously in the leader loop, so if the disk IO blocks forever, the leader is currently stuck forever! That's not a disk failure mode I've come across before!
The reason the leader doesn't stand down, I think, is that the heartbeating is done by a replication goroutine per follower, separate from the leader loop. So when the leader loop is stuck on IO, the replication routines are still running. They won't be replicating anything, since no new logs are being written to the leader's LogStore, but because they are still running they keep sending heartbeats, so followers have no reason to think the leader is unhealthy and hold an election.
The question is what the correct fix is. It's a little nuanced because it's impossible to tell the difference between a sudden spike in work that backs up the disk's write buffers and causes a few long writes, and this mode where the disk is jammed and will never recover. If we picked some timeout after which we consider the disk write to have failed, that would become an error here: https://github.com/hashicorp/raft/blob/44124c28758b8cfb675e90c75a204a08a84f8d4f/raft.go#L1190-L1199
Which would cause the leader to step down.
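For illustration, here is a rough sketch of that idea done as a LogStore wrapper outside raft's leader loop (the `timeoutLogStore` name and the approach of racing the write against a timer are illustrative, not an actual patch): a write that doesn't finish within the deadline is surfaced as an error, which the code linked above already answers by stepping down.

```go
package logstore

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// timeoutLogStore wraps any raft.LogStore and treats a StoreLogs call that
// does not finish within maxWrite as a failed write. Since raft treats a
// StoreLogs error as fatal and steps down (see the lines linked above), a
// disk that hangs forever would then cost the node its leadership instead
// of wedging the whole cluster.
type timeoutLogStore struct {
	raft.LogStore // all other LogStore methods pass through unchanged
	maxWrite      time.Duration
}

func (s *timeoutLogStore) StoreLogs(logs []*raft.Log) error {
	errCh := make(chan error, 1)
	go func() { errCh <- s.LogStore.StoreLogs(logs) }()
	select {
	case err := <-errCh:
		return err
	case <-time.After(s.maxWrite):
		// The underlying write is still in flight and may yet succeed,
		// which is exactly the "slow disk vs. dead disk" nuance here.
		return fmt.Errorf("disk write did not complete within %s", s.maxWrite)
	}
}
```

The leaked goroutine means the underlying write can still land after we've reported it as failed, which is part of why this needs more thought than it first appears.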
That would fix your specific failure case here, but I wonder if we need to think carefully about how we'd set that. I could see issues where a spike in writes on an overloaded disk could just cause constant leader flapping and basically take the cluster down rather than just being slow until the spike is over 🤔.
Maybe just setting it long enough (like a few minutes) would be OK, but that would be a long time to recover in your case. Maybe we'd need something more sophisticated. If anyone has thoughts or insights, please share!
Thanks for taking a look!
> I've not looked at the code to verify all of this, but I think this is an issue as you described: there are no timeouts around disk writes, which happen synchronously in the leader loop, so if the disk IO blocks forever, the leader is currently stuck forever! That's not a disk failure mode I've come across before!
Yep, I was surprised, too, but it was happening repeatedly on that machine — probably 5 or 6 times before I had enough and migrated my stuff to a new machine. Only a hardware reset would help, and any I/O would hang indefinitely.
> The reason the leader doesn't stand down, I think, is that the heartbeating is done by a replication goroutine per follower, separate from the leader loop. So when the leader loop is stuck on IO, the replication routines are still running. They won't be replicating anything, since no new logs are being written to the leader's LogStore, but because they are still running they keep sending heartbeats, so followers have no reason to think the leader is unhealthy and hold an election.
Yes, that is consistent with what I’m thinking.
> That would fix your specific failure case here, but I wonder if we need to think carefully about how we'd set that. I could see issues where a spike in writes on an overloaded disk could just cause constant leader flapping and basically take the cluster down rather than just being slow until the spike is over 🤔.
I think my preferred way to fix this would be to couple leader healthiness to disk throughput: if disk throughput drops to 0 while there are failing Apply() calls, make the leader step down. This mechanism could be rate limited to once per 10 minutes if you’re concerned about this possibly causing leader flapping.
I don’t know if this is something you want to have in hashicorp/raft itself. You could also consider making healthiness externally influenceable if you want to leave the specific logic to users. Of course, providing a robust default behavior would still be key.
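For illustration, a rough sketch of what such external logic could look like today, entirely outside hashicorp/raft (all names here are made up; “step down” is approximated by exiting the process, which also stops the heartbeat goroutines so the remaining nodes hold an election):

```go
package diskwatch

import (
	"log"
	"os"
	"sync/atomic"
	"time"
)

// ApplyTimeouts is bumped by the application whenever raft.Apply() fails
// with a timeout (e.g. raft.ErrEnqueueTimeout).
var ApplyTimeouts atomic.Uint64

// probeDisk writes and fsyncs a small file on the raft data disk and reports
// whether that finished within the deadline.
func probeDisk(dir string, deadline time.Duration) bool {
	done := make(chan bool, 1)
	go func() {
		f, err := os.CreateTemp(dir, "diskprobe-*")
		if err != nil {
			done <- false
			return
		}
		defer os.Remove(f.Name())
		defer f.Close()
		_, werr := f.Write([]byte("ping"))
		done <- werr == nil && f.Sync() == nil
	}()
	select {
	case ok := <-done:
		return ok
	case <-time.After(deadline):
		return false // the write or fsync is hanging
	}
}

// Watch exits the process if Apply() calls keep timing out while the data
// disk looks dead. With the process gone, heartbeats stop and the remaining
// nodes elect a new leader, which is the failover we want here.
func Watch(dataDir string) {
	var lastSeen uint64
	for range time.Tick(time.Minute) {
		seen := ApplyTimeouts.Load()
		if seen > lastSeen && !probeDisk(dataDir, 15*time.Second) {
			log.Print("Apply() timeouts and a hanging data disk; exiting so another node can take over")
			os.Exit(1)
		}
		lastSeen = seen
	}
}
```

Exiting is blunt, but with the leader loop itself stuck on IO it is one of the few levers that doesn’t depend on that loop still running.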
We ran into this same behavior recently. The Azure disk backing the Consul leader began to hang (IO utilization shot to 100% but actual throughput and operations dropped to almost nothing), causing updates to the KV store to time out (`rpc error making call: raft apply failed: timed out enqueuing operation`). Restarting a single follower initiated an election that fixed the situation by electing a new leader. I was scratching my head trying to understand why Consul was not holding an election to fail over, but thankfully you guys have explained it thoroughly in this issue. It looks like the trail went cold on this issue about a year ago; I assume that no fixes were attempted, so this behavior is still expected?
I recently encountered an issue in my production deployment of https://robustirc.net/, where the network was not making any Raft progress anymore.
It turns out that one of my servers has an issue with its (local NVMe) storage, which manifests as I/O hanging indefinitely: there are no read or write errors, any disk access just hangs.
When that server happens to be the Raft leader when the issue occurs, the entire Raft network just hangs indefinitely. By this I mean the leader still participates in the Raft protocol (last-contact times do update on the Raft followers), but applying new messages to Raft does not work (it times out) and, crucially, the Raft network never even starts electing a new leader.
This was a surprising failure mode to me, and I wonder if that’s intentional (out of scope for Raft) or an issue with the current implementation?
To reproduce, I cloned the Raft example https://github.com/yongman/leto and modified it so that its disk writes can be made to hang on demand (a sketch of such a change follows below).
Then, after bringing up 3 nodes as described in the README, I send SIGUSR1 to the leader node: all commands now hang, but no re-election happens.
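One way to simulate the hung disk is to wrap the LogStore handed to raft.NewRaft so that, after SIGUSR1, every log write blocks forever. A sketch of such a wrapper (the `hangingLogStore` name and wiring are illustrative, not the exact patch):

```go
package main

import (
	"os"
	"os/signal"
	"syscall"

	"github.com/hashicorp/raft"
)

// hangingLogStore wraps the real LogStore and, once SIGUSR1 arrives, blocks
// every log write forever: no error is returned, the call simply never
// finishes, just like the broken NVMe disk.
type hangingLogStore struct {
	raft.LogStore // FirstIndex, LastIndex, GetLog, DeleteRange pass through
	hung          chan struct{}
}

func newHangingLogStore(inner raft.LogStore) *hangingLogStore {
	s := &hangingLogStore{LogStore: inner, hung: make(chan struct{})}
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGUSR1)
	go func() {
		<-sigCh
		close(s.hung) // from now on, every write hangs
	}()
	return s
}

func (s *hangingLogStore) StoreLog(l *raft.Log) error { return s.StoreLogs([]*raft.Log{l}) }

func (s *hangingLogStore) StoreLogs(logs []*raft.Log) error {
	select {
	case <-s.hung:
		select {} // simulate a disk that hangs without ever erroring
	default:
		return s.LogStore.StoreLogs(logs)
	}
}
```

With the wrapped store passed to raft.NewRaft in place of the real one, SIGUSR1 freezes this node's log writes without ever producing an error.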
The same is reproducible in https://robustirc.net/ with the robustirc-localnet command, but might be a bit more elaborate to test than the more self-contained leto example.
Given that it happens in two different projects built on top of hashicorp/raft, I don’t think it’s a bug with my code itself, but maybe both projects are using hashicorp/raft slightly wrong?
Any recommendations for how to handle this failure mode?