sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

Connection abnormality #299

Closed hanzhihua closed 4 years ago

hanzhihua commented 4 years ago

Your question

When using jraft, I occasionally get a NativeConnectException: syscall:getsockopt error, which causes the leader to step down and triggers a leader election.

Your scenes

Describe your use scenes (why need this feature)

Your advice

Describe the advice or solution you'd like

Environment

jraft: latest version; netty: 4.1.25

sofastack-bot[bot] commented 4 years ago

Hi @hanzhihua, we detect non-English characters in the issue. This comment is an auto translation by @sofastack-robot to help other users to understand this issue.

We encourage you to describe your issue in English which is more friendly to other users.

Your question

When I use jraft, I occasionally get a NativeConnectException: syscall:getsockopt error that causes a leader stepdown and triggers a leader election.

Your scenes

Describe your use scenes (why need this feature)

Your advice

Describe the advice or solution you'd like

Environment

jraft: latest version; netty: 4.1.25

fengjiachun commented 4 years ago

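This looks like an error from the JNI implementation of epoll or kqueue that netty wraps. Do you have more information, such as the stack trace, and whether the OS is Linux, macOS, or something else? Which OS version?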

fengjiachun commented 4 years ago

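I looked at the bolt code. bolt only uses the epoll JNI implementation, not kqueue, so you must be on Linux.

You can work around the problem temporarily by setting the system property "bolt.netty.epoll.switch" to false; bolt will then use the Java NIO implementation.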
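
For readers who want to try this workaround, here is a minimal sketch of setting the property programmatically. Only the property name comes from the comment above; the class and method names are hypothetical, and treating an unset property as "enabled" is an assumption made for illustration. The property has to be set before any bolt RPC client or server is created.

```java
// Sketch only: EpollSwitch and its methods are hypothetical helper names.
final class EpollSwitch {
    // Property name quoted in the comment above.
    static final String EPOLL_SWITCH = "bolt.netty.epoll.switch";

    /** Call early in startup, before any bolt RPC client or server exists. */
    static void disableNativeEpoll() {
        System.setProperty(EPOLL_SWITCH, "false");
    }

    /** Reads the switch back; an unset property is assumed to mean "enabled". */
    static boolean nativeEpollRequested() {
        return Boolean.parseBoolean(System.getProperty(EPOLL_SWITCH, "true"));
    }
}
```

The same effect can be had without code by passing -Dbolt.netty.epoll.switch=false on the JVM command line.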

fengjiachun commented 4 years ago

I still suggest posting the stack trace; the more information the better. This error is most likely not the root cause of the leader stepdown you describe.

hanzhihua commented 4 years ago

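The stepdown happened because checkdeadnode found that more than half of the nodes were dead, so the leader stepped down on its own initiative.

uname -a: Linux nvm-t-berserker-2 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u1 (2018-08-03) x86_64 GNU/Linux

I have two more questions:

  1. Is there a single connection between the leader and a follower, or a connection in each direction?

  2. Can the leader's voluntary stepdown be turned off? braft seems to recommend that.

Thanks, everyone.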

fengjiachun commented 4 years ago

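> Is there a single connection between the leader and a follower, or a connection in each direction?

I'm not sure I understand your question. The leader maintains one replicator per follower for log replication, and each replicator keeps its own dedicated connection for replicating logs (pipelining requires this); you can think of that as one-way. Follower nodes also send requests to the leader, such as vote requests and readIndex requests, so viewed globally the traffic is bidirectional.

> Can the leader's voluntary stepdown be turned off? braft seems to recommend that.

When the leader finds that more than half of the nodes are dead, the cluster may already have elected a new leader that this leader does not know about because of a network partition, so it has to step down.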
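
The rule described above can be illustrated with a simplified sketch. This is not JRaft's actual code: the class name, the method, and the idea of tracking one "last successful RPC" timestamp per follower are assumptions made for illustration.

```java
import java.util.Map;

// Simplified illustration of a leader's dead-node check; not JRaft's implementation.
final class DeadNodeCheck {
    /**
     * Returns true when fewer than a majority of voters (counting the leader
     * itself) have answered within the election timeout.
     */
    static boolean shouldStepDown(Map<String, Long> lastRpcMillisByFollower,
                                  long nowMillis, long electionTimeoutMillis) {
        int voters = lastRpcMillisByFollower.size() + 1; // followers plus the leader
        int alive = 1;                                   // the leader counts itself
        for (long lastRpc : lastRpcMillisByFollower.values()) {
            if (nowMillis - lastRpc < electionTimeoutMillis) {
                alive++;
            }
        }
        int quorum = voters / 2 + 1;
        return alive < quorum;
    }
}
```

In a three-node group, a leader that has heard from neither follower within the timeout counts only itself as alive, which is below the quorum of two, so it steps down.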

hanzhihua commented 4 years ago

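I misunderstood the leader's checkdeadnode; please ignore that. As for the connections between the leader and a follower: are there two, or one? From what I have seen so far, some pairs have two connections and some have one. zookeeper seems to keep only one connection, decided by comparing ids.

One more question: if the StateMachine fails while taking a snapshot (onSnapshotSave), that won't affect the node's state, will it?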

fengjiachun commented 4 years ago

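> As for the connections between the leader and a follower: are there two, or one?

There is normally only one connection that the leader actively opens to a follower. The two connections you see may include one opened by the peer; please check.

> One more question: if the StateMachine fails while taking a snapshot (onSnapshotSave), that won't affect the node's state, will it?

It terminates the state machine. If the snapshot itself failed, the data may be inconsistent, and at that point it has to be resolved manually.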

hanzhihua commented 4 years ago

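Snapshotting runs into a multithreading problem: my state machine consists of several collection classes, and I don't want to add locks. What I'd like is that when the snapshot hits java.util.ConcurrentModificationException, this snapshot is skipped and the next one is attempted later, because the snapshot error only happens occasionally.

Am I using it the wrong way? Please advise.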

fengjiachun commented 4 years ago

In theory that is possible, but note the following:

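  1. Writes must be blocked while the snapshot is being saved; otherwise data beyond the snapshot index can end up in the snapshot and cause inconsistency. There are usually two ways to block writes:
     1) Synchronous snapshot save: the state machine is blocked while the snapshot is taken, so no new writes arrive. Reads can still be served, because linearizable reads implemented with readIndex do not go through the raft log; you only need to guarantee visibility yourself.
     2) Asynchronous snapshot: your state machine must implement CopyOnWrite (also called snapshot reads) so that data written during the snapshot does not enter the snapshot.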

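  2. After the snapshot completes there are the following cases:
     1) Success: status.isOK(). The raft log is truncated.
     2) A RaftError.EIO error (I/O error) is received: the state machine is terminated.
     3) Failure is returned, but with a status other than RaftError.EIO: the state machine is not terminated and the raft log is not truncated. Your use case can take this branch.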
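
The three outcomes can be modelled with a small self-contained sketch. The Outcome enum and Sink interface are hypothetical stand-ins, not JRaft types; in a real state machine the result would presumably be reported through the done closure of onSnapshotSave, with RETRY_LATER mapped to a failed status whose error code is not RaftError.EIO.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Stand-in model of the snapshot outcomes discussed above; not JRaft code.
final class SnapshotSketch {
    enum Outcome { OK, IO_ERROR, RETRY_LATER }

    /** Hypothetical destination for snapshot data, for example a file writer. */
    interface Sink {
        void write(List<String> entries) throws IOException;
    }

    static Outcome save(Iterable<String> state, Sink sink) {
        List<String> copy = new ArrayList<>();
        try {
            // Copy first, so nothing is written if the copy is interrupted.
            for (String entry : state) {
                copy.add(entry);
            }
        } catch (ConcurrentModificationException e) {
            // Skip this snapshot; report failure without an I/O error code.
            return Outcome.RETRY_LATER;
        }
        try {
            sink.write(copy);
            return Outcome.OK;       // the raft log may now be truncated
        } catch (IOException e) {
            return Outcome.IO_ERROR; // corresponds to the fatal EIO case
        }
    }
}
```

One caveat: Java's fail-fast iterators detect concurrent modification only on a best-effort basis, so a copy that completes without ConcurrentModificationException can still be inconsistent. Skipping on the exception therefore does not replace the write-blocking or copy-on-write requirement in point 1.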

hanzhihua commented 4 years ago

Thanks for your answer. The remaining problem is the connection failure. The stack trace is as follows:

[Bolt-default-executor-6-thread-4]-c.a.s.j.r.i.c.BoltRaftClientService.connect - Fail to connect ip:port, remoting exception: Create connection failed. The address is ip:port.
com.alipay.remoting.exception.RemotingException: Create connection failed. The address is ip:port
    at com.alipay.remoting.DefaultConnectionManager.create(DefaultConnectionManager.java:507)
    at com.alipay.remoting.DefaultConnectionManager.doCreate(DefaultConnectionManager.java:801)
    at com.alipay.remoting.DefaultConnectionManager.access$000(DefaultConnectionManager.java:52)
    at com.alipay.remoting.DefaultConnectionManager$ConnectionPoolCall.call(DefaultConnectionManager.java:732)
    at com.alipay.remoting.DefaultConnectionManager$ConnectionPoolCall.call(DefaultConnectionManager.java:706)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.alipay.remoting.util.RunStateRecordedFutureTask.run(RunStateRecordedFutureTask.java:39)
    at com.alipay.remoting.DefaultConnectionManager.getConnectionPoolAndCreateIfAbsent(DefaultConnectionManager.java:599)
    at com.alipay.remoting.DefaultConnectionManager.getAndCreateIfAbsent(DefaultConnectionManager.java:465)
    at com.alipay.remoting.rpc.RpcClientRemoting.getConnectionAndInitInvokeContext(RpcClientRemoting.java:124)
    at com.alipay.remoting.rpc.RpcClientRemoting.invokeSync(RpcClientRemoting.java:62)
    at com.alipay.remoting.rpc.RpcRemoting.invokeSync(RpcRemoting.java:143)
    at com.alipay.remoting.rpc.RpcClient.invokeSync(RpcClient.java:309)
    at com.alipay.sofa.jraft.rpc.impl.AbstractBoltClientService.connect(AbstractBoltClientService.java:138)
    at com.alipay.sofa.jraft.core.NodeImpl.electSelf(NodeImpl.java:914)
    at com.alipay.sofa.jraft.core.NodeImpl.handleTimeoutNowRequest(NodeImpl.java:2745)
    at com.alipay.sofa.jraft.rpc.impl.core.TimeoutNowRequestProcessor.processRequest0(TimeoutNowRequestProcessor.java:51)
    at com.alipay.sofa.jraft.rpc.impl.core.TimeoutNowRequestProcessor.processRequest0(TimeoutNowRequestProcessor.java:33)
    at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:37)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.dispatchToUserProcessor(RpcRequestProcessor.java:224)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.doProcess(RpcRequestProcessor.java:145)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor$ProcessTask.run(RpcRequestProcessor.java:366)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Create connection to ip:port error!
    at com.alipay.remoting.connection.AbstractConnectionFactory.doCreateConnection(AbstractConnectionFactory.java:206)
    at com.alipay.remoting.connection.AbstractConnectionFactory.createConnection(AbstractConnectionFactory.java:131)
    at com.alipay.remoting.DefaultConnectionManager.create(DefaultConnectionManager.java:504)
    ... 26 common frames omitted
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: /ip:port
    at io.netty.channel.unix.Socket.finishConnect(..)(Unknown Source)
Caused by: io.netty.channel.unix.Errors$NativeConnectException: syscall:getsockopt(..) failed: Connection refused
    ... 1 common frames omitted

fengjiachun commented 4 years ago

> [Bolt-default-executor-6-thread-4]-c.a.s.j.r.i.c.BoltRaftClientService.connect - Fail to connect ip:port, remoting exception: Create connection failed. The address is ip:port.

Note this line: you did not fill in the ip and port.

hanzhihua commented 4 years ago

I removed the ip and port on purpose, sorry :)

fengjiachun commented 4 years ago

Something is probably wrong with the node at ip:port; this is only a network-layer error.
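
One way to check the network layer independently of jraft and bolt is a plain TCP probe against the peer's address. The sketch below uses only the JDK; the class and method names are hypothetical, and the host, port, and timeout are whatever the caller supplies.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: tries a plain TCP connect to a peer, outside of bolt.
final class PeerProbe {
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // "Connection refused" (ConnectException) and timeouts both end up here.
            return false;
        }
    }
}
```

If such a probe also fails intermittently from the same host, the cause more likely lies with the target process or the network than with the RPC layer.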

hanzhihua commented 4 years ago

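It just fails to connect occasionally. At first I thought it was a ulimit problem, but changing that did not help. I have now added the setting you mentioned, bolt.netty.epoll.switch, so NioServerSocketChannel is used; I'll see whether the problem shows up again.

Thanks a lot for your answers.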

fengjiachun commented 4 years ago

Closing for now since there are no further questions.