Closed Excpt0r closed 3 years ago
@Excpt0r thank you for the report, we will look into it.
Please attach the metrics log in these two nodes, you can get the metric logs by kill -3 <pid>
.
It seems that these nodes processed RequestVote
request too slow(greater than electionTimeoutMs
), and the candidate node reached vote timedout everytime.
hi, i think i found the root cause.
First, we look into such case:
| 2021-05-07T14:29:39,166 | JRaft-RPC-Processor-0 | com.alipay.sofa.jraft.core.NodeImpl | INFO | Node
<DeviceSubLeaderElection/van-device-sub-2.van-device-sub-headless:8888> start vote and grant vote self, term=1.
| 2021-05-07T14:29:39,166 | JRaft-RPC-Processor-0 | com.alipay.sofa.jraft.core.NodeImpl | DEBUG | Node
<DeviceSubLeaderElection/van-device-sub-2.van-device-sub-headless:8888> stop election timer, term=1.
| 2021-05-07T14:29:39,166 | JRaft-RPC-Processor-0 | com.alipay.sofa.jraft.core.NodeImpl | DEBUG | Node
<DeviceSubLeaderElection/van-device-sub-2.van-device-sub-headless:8888> start vote timer, term=2 .
1. van-device-sub-2.van-device
try to vote self at 14:29:39,166
, but van-device-sub-0.van-device
receive the vote request at 14:29:40,219
:
| 2021-05-07T14:29:40,219 | Bolt-default-executor-6-thread-6 | com.alipay.sofa.jraft.core.NodeImpl | INFO | Node
<DeviceSubLeaderElection/van-device-sub-0.van-device-sub-headless:8888> received RequestVoteRequest from
van-device-sub-2.van-device-sub-headless:8888, term=2, currTerm=2.
2. There was almost one second delay for this request to reach van-device-sub-0.van-device
.
The problem is creating connection to dead peer van-device-sub-1.van-device
timed out in van-device-sub-2.van-device
:
| 2021-05-07T14:29:40,174 | JRaft-RPC-Processor-0 | com.alipay.sofa.jraft.rpc.impl.AbstractClientService | ERROR | Fail to
connect van-device-sub-1.van-device-sub-headless:8888, remoting exception:
com.alipay.remoting.exception.RemotingException: Create connection failed. The address is van-device-sub-1.van-device-
sub-headless:8888.
The default connect timeout is one second (in RpcOptions#rpcConnectTimeoutMs
), and when van-device-sub-2.van-device
tried to connect the dead node, it cost 1 second and reached out the vote timeout which is election timeout plus some random value of RaftOptions#maxElectionDelayMs
, so the candidate van-device-sub-2.van-device
stepped down and started the election again.
I think you can change the default election timeout(in NodeOptioins) or rpc connect timeout(in
RpcOptions`) value to solve this problem. We may change these default values in next release.
The problem is creating connection to dead peer
van-device-sub-1.van-device
timed out invan-device-sub-2.van-device
We can check the connection first and fail-fast (asynchronously to create the connection) on pre-vote and vote
Hi,
@killme2008 this is a follow up of #552 which I was not able to re-open. I have set up an automated test, to reproduce the rare problem - logs are attached.
Scenario as follows:
There was no load on the system, the purpose of the test is only to check the behaviour of a silently gone node/pod (using virsh destroy on KVM). I'm using the default NodeOptions (electionTimeoutMs = 1000, electionHeartbeatFactor = 10) Thanks for your help
Logs: 07.05.2021_van-device-sub-0_jraft.log 07.05.2021_van-device-sub-2_jraft.log