sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

A node in the cluster suddenly keeps stepping down and then loses contact #1020

Closed LinHuiG closed 1 year ago

LinHuiG commented 1 year ago

Describe the bug

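The cluster has three nodes, A, B, and C, plus node D as a learner. The election timeout is set to 3s. In node B's onLeaderStart I added this logic: if node A is alive, call cliService.transferLeader to transfer leadership to node A. After I called isLeader on node A, nodes A and B started cycling rapidly through leaderStop and leaderStart 3 seconds later, and this went on for about 25 minutes. After that, node A lost contact with the cluster, although its process was still alive.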

The logs are as follows:
2023-08-18 17:24:14.414 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=627.
2023-08-18 17:24:14.415 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=627 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:24:16.829 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].
2023-08-18 17:24:20.473 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=628.
2023-08-18 17:24:20.475 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=628 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:24:25.152 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].
2023-08-18 17:24:28.477 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=631.
2023-08-18 17:24:28.479 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=631 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:24:34.212 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].
2023-08-18 17:24:39.024 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=634.
2023-08-18 17:24:39.025 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=634 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:24:44.761 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].
2023-08-18 17:24:48.821 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=637.
2023-08-18 17:24:48.822 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=637 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:24:54.555 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].
2023-08-18 17:24:58.090 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=640.
2023-08-18 17:24:58.092 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=640 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:25:03.778 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].
2023-08-18 17:25:10.522 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=643.
2023-08-18 17:26:01.549 com.eastmoney.jraft.fsm.StateMachineImpl [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart groupId=em_strategy_sz_20230818 coreId=0 term=643 endPoint=10.10.184.32:6011 nodeType=NORMAL
2023-08-18 17:26:01.549 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ERAFTTIMEDOUT<10001>: Majority of the group dies: 2/3].
2023-08-18 17:26:01.549 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ERAFTTIMEDOUT<10001>: Majority of the group dies: 2/3].
2023-08-18 17:26:01.550 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ERAFTTIMEDOUT<10001>: Majority of the group dies: 2/3].

LinHuiG commented 1 year ago

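Node B looks like it is behaving normally. Each time node A steps down, node B wins the election, sees that node A is alive, and calls cliService.transferLeader to hand leadership to node A. A few seconds (about 5) after node A is elected leader, it hits LeaderStop again, then node B wins the election again, and so on.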
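To check the "about 5 seconds" figure against the posted log, here is a rough sketch that measures the gap between each StateMachineAdapter onLeaderStart line and the onLeaderStop line that follows it. The class name LeaderTenure and its helper are made up for illustration and are not part of JRaft. The sample lines are copied from the log above, which appears to come from the node that keeps stepping down.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JRaft: measures how long each leadership lasted.
final class LeaderTenure {
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // First three start/stop pairs from the StateMachineAdapter lines in the issue.
    static final List<String> SAMPLE = List.of(
        "2023-08-18 17:24:14.414 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=627.",
        "2023-08-18 17:24:16.829 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].",
        "2023-08-18 17:24:20.473 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=628.",
        "2023-08-18 17:24:25.152 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].",
        "2023-08-18 17:24:28.477 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStart: term=631.",
        "2023-08-18 17:24:34.212 com.alipay.sofa.jraft.core.StateMachineAdapter [JRaft-FSMCaller-Disruptor-0]-[INFO] onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.]."
    );

    /** Milliseconds from each "onLeaderStart: term=" line to the next onLeaderStop line. */
    static List<Long> tenuresMillis(List<String> logLines) {
        List<Long> result = new ArrayList<>();
        LocalDateTime start = null;
        for (String line : logLines) {
            if (line.length() < 23) {
                continue; // too short to carry a timestamp
            }
            LocalDateTime ts = LocalDateTime.parse(line.substring(0, 23), TS);
            if (line.contains("onLeaderStart: term=")) {
                start = ts;
            } else if (line.contains("onLeaderStop") && start != null) {
                result.add(Duration.between(start, ts).toMillis());
                start = null; // wait for the next start before measuring again
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tenuresMillis(SAMPLE)); // prints [2415, 4679, 5735]
    }
}
```

On these lines the leadership lasts roughly 2.4s, 4.7s, and 5.7s, which is in line with the description above.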

killme2008 commented 1 year ago
Status[ESHUTDOWN<1007>: Raft node is going to quit.

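This status means shutdown was called explicitly and the node exited.

https://github.com/sofastack/sofa-jraft/blob/19ed179e02ee9108adc0bbf66badb47f62c62af8/jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java#L2779

Please go through your own code again. This is clearly a usage problem.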
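For anyone triaging similar logs, the two stop reasons in this thread can be told apart by the status name in the onLeaderStop line. Below is a minimal sketch that pulls the name and numeric code out of a log line with a regular expression. The class StopStatus and its wording are assumptions for illustration only; the readings of ESHUTDOWN and ERAFTTIMEDOUT come from the comment above and from the log message text, not from a JRaft API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JRaft: reads the Status[...] part of a log line.
final class StopStatus {
    // Matches e.g. "Status[ESHUTDOWN<1007>: Raft node is going to quit.]"
    private static final Pattern STATUS =
        Pattern.compile("Status\\[([A-Z_]+)<(\\d+)>: ([^\\]]*)\\]");

    /** Returns the numeric status code, or -1 if the line carries no status. */
    static int code(String logLine) {
        Matcher m = STATUS.matcher(logLine);
        return m.find() ? Integer.parseInt(m.group(2)) : -1;
    }

    /** Returns a short reading of the stop reason, based on this thread only. */
    static String explain(String logLine) {
        Matcher m = STATUS.matcher(logLine);
        if (!m.find()) {
            return "no status found";
        }
        switch (m.group(1)) {
            case "ESHUTDOWN":
                return "shutdown was called on this node";
            case "ERAFTTIMEDOUT":
                return "leader lost contact with a majority";
            default:
                return "other: " + m.group(3);
        }
    }

    public static void main(String[] args) {
        String line = "onLeaderStop: status=Status[ESHUTDOWN<1007>: Raft node is going to quit.].";
        System.out.println(code(line) + " " + explain(line));
    }
}
```

In the posted log, every stop before 17:26:01 carries ESHUTDOWN, so the place to look is whatever application code shuts the node down, rather than the election itself.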

LinHuiG commented 1 year ago
OK, I'll take another look.

LinHuiG commented 1 year ago
Found it. It was a problem with my usage: I perform a cluster switch after data archiving, and the switching step is where things went wrong.