sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

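Nacos-Server jraft initialization fails, leaving instance counts inconsistent across the nodes of a multi-node cluster; restarting the node does not recover it, and in the end the only fix was deleting the data directory #1118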

Open guozongkang opened 1 week ago

guozongkang commented 1 week ago

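Cluster environment: 3 Aliyun ECS instances (16C 32G). Nacos-Server version: 2.1.2

Symptoms: The three Nacos-Server nodes had been running normally for about half a month when we had to restart one of them because of a memory problem. We call it node 1; the other two are nodes 2 and 3. We restarted node 1 by running the shutdown script in the bin directory and then the startup script in the same directory, and that is when we noticed the problem. In the Nacos console, node 1 showed 45 instances for one service, while nodes 2 and 3 showed 65 instances for that service (we later verified that 65 is the correct number). In other words, node 1's data was wrong. We checked the logs and found errors in the alipay-jraft log: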

2024-06-19 00:16:35,087 WARN Node <naming_persistent_service/10.254.16.7:7848> RequestVote to 10.254.18.46:7848 error: Status[EINTERNAL<1004>: RPC exception:UNKNOWN].
2024-06-19 00:16:35,707 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:35,710 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:35,707 WARN Node <naming_persistent_service_v2/10.254.16.7:7848> RequestVote to 10.254.18.46:7848 error: Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception].
2024-06-19 00:16:35,710 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:38,277 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=11, error=Status[ENOENT<1012>: Peer id not found: 10.254.18.46:7848, group: naming_service_metadata]
2024-06-19 00:18:21,216 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service]
2024-06-19 00:18:21,264 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_service_metadata]
2024-06-19 00:18:21,266 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service_v2]
2024-06-19 00:18:26,139 WARN Node <naming_instance_metadata/10.254.16.7:7848> RequestVote to 10.254.17.172:7848 error: Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception].
2024-06-19 00:18:26,326 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:26,328 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:26,336 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:28,668 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:31,188 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,360 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,385 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,388 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:33,710 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,225 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,400 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,424 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,449 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:38,786 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:41,462 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_service_metadata]
2024-06-19 00:18:41,477 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service_v2]
2024-06-19 00:18:41,530 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=51, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_instance_metadata]
2024-06-19 00:19:36,094 WARN ThreadId: Replicator [state=Destroyed, statInfo=, peerId=10.254.18.46:7848, waitId=2, type=Follower] already destroyed, ignore error code: 1001
2024-06-19 00:19:36,143 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499983956s. [remote_addr=10.254.17.172/10.254.17.172:7848]]
2024-06-19 00:19:36,272 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499984812s. [remote_addr=10.254.17.172/10.254.17.172:7848]]
2024-06-19 00:19:36,303 WARN ThreadId: Replicator [state=Destroyed, statInfo=, peerId=10.254.18.46:7848, waitId=270, type=Follower] already destroyed, ignore error code: 1001
2024-06-19 00:19:36,501 WARN ThreadId: Replicator [state=Destroyed, statInfo=, peerId=10.254.18.46:7848, waitId=2, type=Follower] already destroyed, ignore error code: 1001

[admin@b01_nacos_service_test_hk logs]$ cat alipay-jraft.log|grep ERROR
2024-06-19 00:16:35,666 ERROR Fail to connect 10.254.18.46:7848, remoting exception: java.util.concurrent.TimeoutException.
2024-06-19 00:18:26,134 ERROR Fail to connect 10.254.17.172:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2024-06-19 00:18:26,165 ERROR Fail to connect 10.254.17.172:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2024-06-19 00:18:26,165 ERROR Fail to init sending channel to 10.254.17.172:7848.
2024-06-19 00:18:26,165 ERROR Fail to start replicator to peer=10.254.17.172:7848, replicatorType=Follower.
2024-06-19 00:18:26,165 ERROR Fail to add a replicator, peer=10.254.17.172:7848.
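A log like the one above is easier to read once it is tallied. The following is a minimal sketch, not part of jraft or Nacos, that counts the "Fail to issue RPC" warnings per peer and status name and collects which raft groups each peer reported as ENOENT ("Peer id not found"). The function name `summarize` and the regular expressions are assumptions written against the line format shown in this issue only; other jraft versions may log differently.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt in this issue:
#   ... WARN Fail to issue RPC to <ip:port>, consecutiveErrorTimes=<n>, error=Status[<NAME><<code>>: <detail>]
RPC_FAILURE = re.compile(
    r"Fail to issue RPC to (?P<peer>[\d.]+:\d+), "
    r"consecutiveErrorTimes=\d+, "
    r"error=Status\[(?P<name>\w+)<-?\d+>: (?P<detail>.*)\]\s*$"
)
GROUP = re.compile(r"group: (?P<group>\w+)")


def summarize(lines):
    """Count RPC failures per (peer, status name) and collect ENOENT groups per peer."""
    counts = Counter()
    missing_groups = {}
    for line in lines:
        match = RPC_FAILURE.search(line)
        if match is None:
            continue  # not a "Fail to issue RPC" warning
        counts[(match["peer"], match["name"])] += 1
        if match["name"] == "ENOENT":
            group = GROUP.search(match["detail"])
            if group:
                missing_groups.setdefault(match["peer"], set()).add(group["group"])
    return counts, missing_groups


# Four lines copied from the excerpt above.
SAMPLE = [
    "2024-06-19 00:16:35,707 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]",
    "2024-06-19 00:16:38,277 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=11, error=Status[ENOENT<1012>: Peer id not found: 10.254.18.46:7848, group: naming_service_metadata]",
    "2024-06-19 00:18:41,530 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=51, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_instance_metadata]",
    "2024-06-19 00:19:36,143 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499983956s. [remote_addr=10.254.17.172/10.254.17.172:7848]]",
]

counts, missing_groups = summarize(SAMPLE)
for (peer, name), n in sorted(counts.items()):
    print(peer, name, n)
for peer, groups in sorted(missing_groups.items()):
    print(peer, "reported missing in groups:", sorted(groups))
```

To run it against a real file, pass the open file object instead of `SAMPLE`, for example `summarize(open("alipay-jraft.log", encoding="utf-8"))`.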

The protocol-raft log shows these errors:
2024-06-19 00:16:35,175 ERROR Fail to refresh route configuration for group : naming_service_metadata, status is : Status[UNKNOWN<-1>: io.grpc.StatusRuntimeException: UNKNOWN]
2024-06-19 00:18:21,467 ERROR Fail to refresh leader for group : naming_instance_metadata, status is : Status[UNKNOWN<-1>: Unknown leader, No nodes in group naming_instance_metadata, Unknown leader]
2024-06-19 00:18:21,469 ERROR Fail to refresh route configuration for group : naming_instance_metadata, status is : Status[ENOENT<1012>: Fail to find node 10.254.17.172:7848 in group naming_instance_metadata]

We shut node 1 down for 10 minutes and then restarted it again, but the problem was still there. Searching the community issues, we found an earlier report that looks very similar to ours; the fix there was to delete the data directory and restart. We did the same and it did resolve the problem, but how can we prevent this from happening in the first place?

guozongkang commented 1 week ago

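I reproduced the problem again. After the machine was started, the jraft log on the faulty node contained nothing useful. However, when I logged in to the leader node, I found error messages there: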

2024-06-25 00:15:39,043 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.16.7:7848, group: naming_persistent_service]
2024-06-25 00:15:39,078 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.16.7:7848, group: naming_instance_metadata]
2024-06-25 00:15:39,284 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.16.7:7848, group: naming_service_metadata]
2024-06-25 00:15:39,344 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.16.7:7848, group: naming_persistent_service_v2]
2024-06-25 00:15:41,590 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:43,624 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:44,101 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:44,596 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:44,596 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:46,102 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:48,134 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:49,115 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:49,609 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:49,611 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:50,615 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=51, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:51,345 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=61, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:51,624 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=71, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:52,073 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=81, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:52,453 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=91, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:53,648 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=101, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:54,129 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:54,622 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:54,622 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:56,157 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=111, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:58,662 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=121, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:59,250 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:15:59,636 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:15:59,636 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:01,170 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=131, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:03,678 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=141, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:04,265 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=51, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:04,649 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=51, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:04,650 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=51, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:06,183 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=151, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:08,690 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=161, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:09,308 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=61, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:09,662 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=61, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:09,662 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=61, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:11,199 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=171, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:13,704 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=181, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:14,320 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=71, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:14,673 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=71, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:14,673 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=71, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:16,211 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=191, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:18,717 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=201, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:19,332 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=81, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:19,687 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=81, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:19,687 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=81, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:21,222 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=211, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:23,728 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=221, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:24,344 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=91, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:24,700 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=91, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:24,702 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=91, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:26,235 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=231, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:28,739 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=241, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:29,360 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=101, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:29,711 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=101, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:29,712 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=101, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:31,246 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=251, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:33,753 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=261, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:34,370 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=111, error=Status[EINTERNAL<1004>: Check connection[10.254.16.7:7848] fail and try to create new one]
2024-06-25 00:16:34,723 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=111, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-25 00:16:34,723 WARN Fail to issue RPC to 10.254.16.7:7848, consecutiveErrorTimes=111, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]

I suspect the problem is in jraft.

guozongkang commented 1 week ago

Our error looks like the same one as https://github.com/sofastack/sofa-jraft/issues/1096.

killme2008 commented 1 week ago

> Peer id not found: 10.254.16.7:7848, group: naming_persistent_service

This error means that node 10.254.16.7:7848 has been removed from the naming_persistent_service group, i.e. it was shut down deliberately.

guozongkang commented 1 week ago

> Peer id not found: 10.254.16.7:7848, group: naming_persistent_service
>
> This error means that node 10.254.16.7:7848 has been removed from the naming_persistent_service group, i.e. it was shut down deliberately.

Why would node 10.254.16.7 be removed? Here is my startup log.

guozongkang commented 1 week ago

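After node 10.254.16.7 started, the leader node kept reporting the errors above. Before that, the Nacos console showed 65 instances for the service on both of the two running nodes. After I started 10.254.16.7 as a follower (call it node 3), node 1 showed 65 instances, node 2 showed 56, and node 3 showed 40, and the cluster did not converge for a long time. Over time the count on node 2 fluctuated, sometimes 61 and sometimes 50. With no other option left, I shut down node 3 (10.254.16.7), and nodes 1 and 2 recovered within a short time, both showing 65 instances.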
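The per-node comparison described above was done by hand in the console. As a rough sketch of how it could be scripted, the code below asks each node for a service's instance list and checks whether the counts agree. It assumes the Nacos v1 Open API endpoint `/nacos/v1/ns/instance/list` is reachable on each node without authentication, that the service is in the default namespace and group, and that the HTTP port is the Nacos default 8848; none of this is stated in the thread. The helper names and the `node1`..`node3` labels are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_instance_count(node, service_name, timeout=5.0):
    """Return how many instances one Nacos node lists for service_name.

    Assumption: the v1 Open API answers at http://<node>/nacos without auth,
    and the reply is a JSON object whose "hosts" field lists the instances.
    """
    query = urllib.parse.urlencode({"serviceName": service_name})
    url = f"http://{node}/nacos/v1/ns/instance/list?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    return len(body.get("hosts", []))


def group_nodes_by_count(counts):
    """Map each distinct instance count to the nodes that reported it."""
    groups = {}
    for node, count in counts.items():
        groups.setdefault(count, []).append(node)
    return groups


def is_consistent(counts):
    """True when every node reports the same instance count."""
    return len(set(counts.values())) <= 1


# Counts reported in this thread while 10.254.16.7 (node 3) was running.
observed = {"node1": 65, "node2": 56, "node3": 40}
print(is_consistent(observed), group_nodes_by_count(observed))

# Against a live cluster (not executed here; addresses and port are assumptions):
# nodes = ["10.254.16.7:8848", "10.254.17.172:8848", "10.254.18.46:8848"]
# live = {n: fetch_instance_count(n, "your-service-name") for n in nodes}
```

Polling this on a schedule would also capture the fluctuation seen on node 2, which a single console check can miss.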

killme2008 commented 1 week ago

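> > Peer id not found: 10.254.16.7:7848, group: naming_persistent_service
> >
> > This error means that node 10.254.16.7:7848 has been removed from the naming_persistent_service group, i.e. it was shut down deliberately.
>
> Why would node 10.254.16.7 be removed? Here is my startup log.

You may need to ask the Nacos team about that, because jraft does not shut down a node on its own initiative.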