sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_SNAPSHOT #955

Closed: funky-eyes closed this issue 1 year ago

funky-eyes commented 1 year ago

Describe the bug

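I don't think this problem occurred before. While testing today my computer was very sluggish, so I force-killed the leader and restarted it. When the old leader then tried to pull the snapshot from the new leader, it failed. The file really does not seem to have been saved, and I don't know why.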

Leader node

21:52:17.216  INFO --- [    JRaft-RPC-Processor-2] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Node default:192.168.31.181:9093 received InstallSnapshotResponse from 192.168.31.181:9091::100 lastIncludedIndex=27627 lastIncludedTerm=1 error:Status[EIO<1014>: Fail to read from path=sessionStore\raft\9093\default\snapshot\snapshot_27627 filename=data]
21:52:17.216  WARN --- [    JRaft-RPC-Processor-2] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Fail to install snapshot at peer=192.168.31.181:9091::100, error=Status[EIO<1014>: Fail to read from path=sessionStore\raft\9093\default\snapshot\snapshot_27627 filename=data]
21:52:17.320  INFO --- [    JRaft-RPC-Processor-3] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Node default:192.168.31.181:9093 received InstallSnapshotResponse from 192.168.31.181:9091::100 lastIncludedIndex=27627 lastIncludedTerm=1 error:Status[EINVAL<1015>: Node default:192.168.31.181:9091 is not in active state, state STATE_ERROR.]
21:52:17.423  INFO --- [    JRaft-RPC-Processor-4] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Node default:192.168.31.181:9093 received InstallSnapshotResponse from 192.168.31.181:9091::100 lastIncludedIndex=27627 lastIncludedTerm=1 error:Status[EINVAL<1015>: Node default:192.168.31.181:9091 is not in active state, state STATE_ERROR.]
21:52:17.530  INFO --- [    JRaft-RPC-Processor-5] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Node default:192.168.31.181:9093 received InstallSnapshotResponse from 192.168.31.181:9091::100 lastIncludedIndex=27627 lastIncludedTerm=1 error:Status[EINVAL<1015>: Node default:192.168.31.181:9091 is not in active state, state STATE_ERROR.]
21:52:17.633  INFO --- [    JRaft-RPC-Processor-6] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Node default:192.168.31.181:9093 received InstallSnapshotResponse from 192.168.31.181:9091::100 lastIncludedIndex=27627 lastIncludedTerm=1 error:Status[EINVAL<1015>: Node default:192.168.31.181:9091 is not in active state, state STATE_ERROR.]
21:52:17.739  INFO --- [    JRaft-RPC-Processor-7] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Node default:192.168.31.181:9093 received InstallSnapshotResponse from 192.168.31.181:9091::100 lastIncludedIndex=27627 lastIncludedTerm=1 error:Status[EINVAL<1015>: Node default:192.168.31.181:9091 is not in active state, state STATE_ERROR.]
21:52:17.739  WARN --- [    JRaft-RPC-Processor-7] [pay.sofa.jraft.core.Replicator] [                   ?]  [] : Fail to install snapshot at peer=192.168.31.181:9091::100, error=Status[EINVAL<1015>: Node default:192.168.31.181:9091 is not in active state, state STATE_ERROR.]

Follower node: error when it fails to pull the snapshot

21:52:14.978  WARN --- [fault-executor-4-thread-1] [ft.rpc.impl.BoltRaftRpcFactory] [                   ?]  [] : JRaft SET bolt.rpc.dispatch-msg-list-in-default-executor to be false for replicator pipeline optimistic.
21:52:14.990  INFO --- [fault-executor-4-thread-2] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> received InstallSnapshotRequest from 192.168.31.181:9093, lastIncludedLogIndex=27627, lastIncludedLogTerm=1, lastLogId=LogId [index=2008, term=1].
21:52:14.992  INFO --- [aft-FSMCaller-Disruptor-0] [.cluster.raft.RaftStateMachine] [                   ?]  [] : groupId: default, onStartFollowing: LeaderChangeContext [leaderId=192.168.31.181:9093, term=1, status=Status[ENEWLEADER<10011>: Follower receives message from new leader with the same term.]].
21:52:15.023  INFO --- [vent-executor-13-thread-1] [erviceConnectionEventProcessor] [                   ?]  [] : Peer 192.168.31.181:9093 is connected
21:52:15.081  INFO --- [-Group-Default-Executor-0] [ipay.sofa.jraft.util.Recyclers] [                   ?]  [] : -Djraft.recyclers.maxCapacityPerThread: 4096.
21:52:15.183  WARN --- [-Group-Default-Executor-0] [m.alipay.sofa.jraft.util.Utils] [                   ?]  [] : Unable to fsync directory sessionStore\raft\9091\default\snapshot\temp on windows.
21:52:15.547  INFO --- [ttyServerNIOWorker_1_1_16] [rocessor.server.RegTmProcessor] [                   ?]  [] : TM register success,message:RegisterTMRequest{version='2.0.0-SNAPSHOT', applicationId='product-service', transactionServiceGroup='default_tx_group', extraData='ak=null
digest=default_tx_group,192.168.31.181,1680616335497
timestamp=1680616335497
authVersion=V4
vgroup=default_tx_group
ip=192.168.31.181
'},channel:[id: 0x6ef25ebf, L:/192.168.31.181:8091 - R:/192.168.31.181:62246],client version:2.0.0-SNAPSHOT
21:52:17.198 ERROR --- [    JRaft-RPC-Processor-3] [ge.snapshot.remote.CopySession] [                   ?]  [] : Fail to copy data, readerId=940328505881148840 fileName=data offset=0 status=Status[EIO<1014>: Fail to read from path=sessionStore\raft\9093\default\snapshot\snapshot_27627 filename=data]
21:52:17.203  WARN --- [-Group-Default-Executor-0] [hot.local.LocalSnapshotStorage] [                   ?]  [] : Close snapshot writer sessionStore\raft\9091\default\snapshot\temp with exit code: 1014.
21:52:17.203  INFO --- [-Group-Default-Executor-0] [hot.local.LocalSnapshotStorage] [                   ?]  [] : Deleting snapshot sessionStore\raft\9091\default\snapshot\temp.
21:52:17.214 ERROR --- [aft-FSMCaller-Disruptor-0] [jraft.core.StateMachineAdapter] [                   ?]  [] : Encountered an error=Status[EIO<1014>: Fail to read from path=sessionStore\raft\9093\default\snapshot\snapshot_27627 filename=data] on StateMachine io.seata.server.cluster.raft.RaftStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
==>
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_SNAPSHOT
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:727)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.loadDownloadingSnapshot(SnapshotExecutorImpl.java:547)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.installSnapshot(SnapshotExecutorImpl.java:532)
    at com.alipay.sofa.jraft.core.NodeImpl.handleInstallSnapshot(NodeImpl.java:3373)
    at com.alipay.sofa.jraft.rpc.impl.core.InstallSnapshotRequestProcessor.processRequest0(InstallSnapshotRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.impl.core.InstallSnapshotRequestProcessor.processRequest0(InstallSnapshotRequestProcessor.java:34)
    at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35)
    at com.alipay.sofa.jraft.rpc.impl.BoltRpcServer$2.handleRequest(BoltRpcServer.java:124)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.dispatchToUserProcessor(RpcRequestProcessor.java:235)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.doProcess(RpcRequestProcessor.java:146)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor$ProcessTask.run(RpcRequestProcessor.java:393)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
<==

21:52:17.215  WARN --- [aft-FSMCaller-Disruptor-0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> got error: {}.
==>
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_SNAPSHOT
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:727)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.loadDownloadingSnapshot(SnapshotExecutorImpl.java:547)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.installSnapshot(SnapshotExecutorImpl.java:532)
    at com.alipay.sofa.jraft.core.NodeImpl.handleInstallSnapshot(NodeImpl.java:3373)
    at com.alipay.sofa.jraft.rpc.impl.core.InstallSnapshotRequestProcessor.processRequest0(InstallSnapshotRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.impl.core.InstallSnapshotRequestProcessor.processRequest0(InstallSnapshotRequestProcessor.java:34)
    at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35)
    at com.alipay.sofa.jraft.rpc.impl.BoltRpcServer$2.handleRequest(BoltRpcServer.java:124)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.dispatchToUserProcessor(RpcRequestProcessor.java:235)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.doProcess(RpcRequestProcessor.java:146)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor$ProcessTask.run(RpcRequestProcessor.java:393)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
<==

21:52:17.215  WARN --- [aft-FSMCaller-Disruptor-0] [.sofa.jraft.core.FSMCallerImpl] [                   ?]  [] : FSMCaller already in error status, ignore new error.
==>
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_SNAPSHOT
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.reportError(SnapshotExecutorImpl.java:727)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.loadDownloadingSnapshot(SnapshotExecutorImpl.java:547)
    at com.alipay.sofa.jraft.storage.snapshot.SnapshotExecutorImpl.installSnapshot(SnapshotExecutorImpl.java:532)
    at com.alipay.sofa.jraft.core.NodeImpl.handleInstallSnapshot(NodeImpl.java:3373)
    at com.alipay.sofa.jraft.rpc.impl.core.InstallSnapshotRequestProcessor.processRequest0(InstallSnapshotRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.impl.core.InstallSnapshotRequestProcessor.processRequest0(InstallSnapshotRequestProcessor.java:34)
    at com.alipay.sofa.jraft.rpc.impl.core.NodeRequestProcessor.processRequest(NodeRequestProcessor.java:60)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:53)
    at com.alipay.sofa.jraft.rpc.RpcRequestProcessor.handleRequest(RpcRequestProcessor.java:35)
    at com.alipay.sofa.jraft.rpc.impl.BoltRpcServer$2.handleRequest(BoltRpcServer.java:124)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.dispatchToUserProcessor(RpcRequestProcessor.java:235)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor.doProcess(RpcRequestProcessor.java:146)
    at com.alipay.remoting.rpc.protocol.RpcRequestProcessor$ProcessTask.run(RpcRequestProcessor.java:393)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
<==

21:52:17.216  INFO --- [aft-FSMCaller-Disruptor-0] [.cluster.raft.RaftStateMachine] [                   ?]  [] : groupId: default, onStopFollowing: LeaderChangeContext [leaderId=192.168.31.181:9093, term=1, status=Status[EBADNODE<10009>: Raft node(leader or candidate) is in error.]].
21:52:17.261  WARN --- [093]-AppendEntriesThread0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> is not in active state, currTerm=1.
21:52:17.318  WARN --- [fault-executor-4-thread-8] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> ignore InstallSnapshotRequest as it is not in active state STATE_ERROR.
21:52:17.363  WARN --- [093]-AppendEntriesThread0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> is not in active state, currTerm=1.
21:52:17.422  WARN --- [fault-executor-4-thread-9] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> ignore InstallSnapshotRequest as it is not in active state STATE_ERROR.
21:52:17.464  WARN --- [093]-AppendEntriesThread0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> is not in active state, currTerm=1.
21:52:17.526  WARN --- [ault-executor-4-thread-10] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> ignore InstallSnapshotRequest as it is not in active state STATE_ERROR.
21:52:17.566  WARN --- [093]-AppendEntriesThread0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> is not in active state, currTerm=1.
21:52:17.632  WARN --- [ault-executor-4-thread-11] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> ignore InstallSnapshotRequest as it is not in active state STATE_ERROR.
21:52:17.670  WARN --- [093]-AppendEntriesThread0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> is not in active state, currTerm=1.
21:52:17.736  WARN --- [ault-executor-4-thread-12] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> ignore InstallSnapshotRequest as it is not in active state STATE_ERROR.
21:52:17.771  WARN --- [093]-AppendEntriesThread0] [lipay.sofa.jraft.core.NodeImpl] [                   ?]  [] : Node <default/192.168.31.181:9091> is not in active state, currTerm=1.

killme2008 commented 1 year ago
> Unable to fsync directory sessionStore\raft\9091\default\snapshot\temp on windows.

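This is probably related to https://github.com/sofastack/sofa-jraft/pull/745. Windows also does not implement fsync-directory and atomic-move semantics. JRaft has not been thoroughly tested on Windows, so please use it there with caution.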
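
For readers unfamiliar with the two primitives mentioned above, here is a minimal sketch of the usual "write to a temp file, fsync, atomically move, fsync the directory" pattern using only `java.nio`. It is not JRaft's code: the class `AtomicPublish` and its methods are hypothetical names chosen for illustration, and the fallback behavior is an assumption about how one might degrade when the platform refuses an atomic move or a directory fsync.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of JRaft.
final class AtomicPublish {

    /** Writes data to tempFile, forces it to disk, then moves it to finalFile. */
    static void publish(Path tempFile, Path finalFile, byte[] data) throws IOException {
        Files.write(tempFile, data);
        // Force file content and metadata to the storage device before the move.
        try (FileChannel ch = FileChannel.open(tempFile, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
        try {
            Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback is NOT atomic: a crash here may leave finalFile missing or partial.
            Files.move(tempFile, finalFile, StandardCopyOption.REPLACE_EXISTING);
        }
        // Best effort: make the new directory entry durable as well.
        fsyncDirectory(finalFile.toAbsolutePath().getParent());
    }

    /** Tries to fsync a directory; returns false if the platform rejects it. */
    static boolean fsyncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
            return true;
        } catch (IOException e) {
            // Opening a directory as a channel is typically rejected on Windows.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("snapshot-demo");
        publish(dir.resolve("temp"), dir.resolve("data"),
                "state".getBytes(StandardCharsets.UTF_8));
        System.out.println("published: " + Files.exists(dir.resolve("data")));
    }
}
```

When `fsyncDirectory` returns false or the non-atomic fallback is taken, a forced kill at the wrong moment can leave the final file absent or incomplete, which would be consistent with the `Fail to read from path=... filename=data` error in the logs above.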

funky-eyes commented 1 year ago

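As I recall, with some version back in 2020 (1.3.7, or maybe 1.3.6) I didn't seem to run into this problem. Does this mean the related tests and deployments can only be run on non-Windows systems?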

killme2008 commented 1 year ago

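> As I recall, with some version back in 2020 (1.3.7, or maybe 1.3.6) I didn't seem to run into this problem. Does this mean the related tests and deployments can only be run on non-Windows systems?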

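Whether you hit the problem comes down to luck: whether a write was interrupted halfway, whether the temporary file failed to be moved atomically, and so on. The whole library has been developed and tested only on macOS/Linux; it really has not been verified on Windows.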
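
One generic way to detect the "write interrupted halfway" case is to record a file's length and checksum when it is written and verify both before using it. The sketch below assumes exactly that; `SnapshotFileCheck`, `crc32`, and `isComplete` are hypothetical names for illustration and are not JRaft APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical helper, not part of JRaft.
final class SnapshotFileCheck {

    /** Computes the CRC-32 of a byte array. */
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    /**
     * Returns true only if the file exists, has the expected length,
     * and its content matches the expected CRC-32.
     */
    static boolean isComplete(Path file, long expectedLength, long expectedCrc)
            throws IOException {
        if (!Files.isRegularFile(file)) {
            return false; // missing, e.g. the temp file was never moved into place
        }
        if (Files.size(file) != expectedLength) {
            return false; // truncated or otherwise partially written
        }
        // Reads the whole file into memory; fine for a sketch, not for large snapshots.
        return crc32(Files.readAllBytes(file)) == expectedCrc;
    }
}
```

A check like this only reports the damage. It does not prevent it, so it complements durable writes and atomic moves and does not replace them.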

funky-eyes commented 1 year ago

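> > As I recall, with some version back in 2020 (1.3.7, or maybe 1.3.6) I didn't seem to run into this problem. Does this mean the related tests and deployments can only be run on non-Windows systems?
>
> Whether you hit the problem comes down to luck: whether a write was interrupted halfway, whether the temporary file failed to be moved atomically, and so on. The whole library has been developed and tested only on macOS/Linux; it really has not been verified on Windows.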

Understood, thanks.