sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0
3.59k stars 1.15k forks source link

feat:propagate onerror event to statelistener #918

Closed yuyang0423 closed 1 year ago

yuyang0423 commented 1 year ago

Motivation:

We see below error log after a crash case. The node comes back but failed started success.

how about propagate onError event to state listener and let upper user know what happened?

2022-12-26 14:24:11.750 [parallel-kv-executor #2] INFO  TaskHandler:61 - Task handler on start: parallel-kv-executor #2.
2022-12-26 14:24:11.751 [parallel-kv-executor #3] INFO  TaskHandler:61 - Task handler on start: parallel-kv-executor #3.
2022-12-26 14:24:11.757 [main] INFO  NamedThreadFactory:82 - Creates new Thread[get_batching #0,5,main].
2022-12-26 14:24:11.759 [main] INFO  NamedThreadFactory:82 - Creates new Thread[get_batching_only_safe #0,5,main].
2022-12-26 14:24:11.763 [main] INFO  NamedThreadFactory:82 - Creates new Thread[put_batching #0,5,main].
2022-12-26 14:24:11.764 [main] INFO  DefaultRheaKVStore:295 - [DefaultRheaKVStore] start successfully, options: RheaKVStoreOptions{clusterId=0, clusterName=' test_election', placementDriverOptions=PlacementDriverOptions{fake=true, cliOptions=null, pdRpcOptions=null, pdGroupId='null', regionRouteTableOptionsList=null, initialServerList='192.168.80.192:8082', initialPdServerList='null'}, storeEngineOptions=StoreEngineOptions{storageType=RocksDB, rocksDBOptions=RocksDBOptions{sync=false, disableWAL=true, fastSnapshot=false, asyncSnapshot=false, openStatisticsCollector=true, statisticsCallbackIntervalSeconds=0, dbPath='/opt/election/db/db_-1_8082'}, memoryDBOptions=null, raftDataPath='/opt/election', serverAddress=192.168.80.192:8082, commonNodeOptions=NodeOptions{electionTimeoutMs=5000, electionPriority=-1, decayPriorityGap=10, leaderLeaseTimeRatio=90, snapshotIntervalSecs=3600, snapshotLogIndexMargin=0, catchupMargin=1000, initialConf=, fsm=null, logUri='null', raftMetaUri='null', snapshotUri='null', filterBeforeCopyRemote=false, disableCli=false, sharedTimerPool=false, timerPoolSize=12, cliRpcThreadPoolSize=4, raftRpcThreadPoolSize=24, enableMetrics=false, snapshotThrottle=null, sharedElectionTimer=false, sharedVoteTimer=false, sharedStepDownTimer=false, sharedSnapshotTimer=false, serviceFactory=com.alipay.sofa.jraft.core.DefaultJRaftServiceFactory@64021427, applyTaskMode=NonBlocking, raftOptions=RaftOptions{maxByteCountPerRpc=131072, fileCheckHole=false, maxEntriesSize=1024, maxBodySize=524288, maxAppendBufferSize=262144, maxElectionDelayMs=3000, electionHeartbeatFactor=10, applyBatch=32, sync=true, syncMeta=false, openStatistics=true, replicatorPipeline=true, maxReplicatorInflightMsgs=256, disruptorBufferSize=16384, disruptorPublishEventWaitTimeoutSecs=10, enableLogEntryChecksum=false, readOnlyOptions=ReadOnlySafe, maxReadIndexLag=-1, stepDownWhenVoteTimedout=true, startUpOldStorage=false}} RpcOptions{rpcConnectTimeoutMs=1000, rpcDefaultTimeout=5000, rpcInstallSnapshotTimeout=300000, rpcProcessorThreadPoolSize=80, enableRpcChecksum=false, metricRegistry=null}, regionEngineOptionsList=[RegionEngineOptions{regionId=-1, startKey='null', startKeyBytes=null, endKey='null', endKeyBytes=null, raftGroupId=' test_election--1', raftDataPath='/opt/election/raft_data_region_-1_8082', nodeOptions=NodeOptions{electionTimeoutMs=5000, electionPriority=-1, decayPriorityGap=10, leaderLeaseTimeRatio=90, snapshotIntervalSecs=3600, snapshotLogIndexMargin=0, catchupMargin=1000, initialConf=192.168.80.192:8082, fsm=com.alipay.sofa.jraft.rhea.storage.KVStoreStateMachine@fbe70d8, logUri='/opt/election/raft_data_region_-1_8082/log', raftMetaUri='/opt/election/raft_data_region_-1_8082/meta', snapshotUri='/opt/election/raft_data_region_-1_8082/snapshot', filterBeforeCopyRemote=false, disableCli=false, sharedTimerPool=false, timerPoolSize=12, cliRpcThreadPoolSize=4, raftRpcThreadPoolSize=24, enableMetrics=true, snapshotThrottle=null, sharedElectionTimer=false, sharedVoteTimer=false, sharedStepDownTimer=false, sharedSnapshotTimer=false, serviceFactory=com.alipay.sofa.jraft.core.DefaultJRaftServiceFactory@64021427, applyTaskMode=NonBlocking, raftOptions=RaftOptions{maxByteCountPerRpc=131072, fileCheckHole=false, maxEntriesSize=1024, maxBodySize=524288, maxAppendBufferSize=262144, maxElectionDelayMs=3000, electionHeartbeatFactor=10, applyBatch=32, sync=true, syncMeta=false, openStatistics=true, replicatorPipeline=true, maxReplicatorInflightMsgs=256, disruptorBufferSize=16384, disruptorPublishEventWaitTimeoutSecs=10, enableLogEntryChecksum=false, readOnlyOptions=ReadOnlySafe, maxReadIndexLag=-1, stepDownWhenVoteTimedout=true, startUpOldStorage=false}} RpcOptions{rpcConnectTimeoutMs=1000, rpcDefaultTimeout=5000, rpcInstallSnapshotTimeout=300000, rpcProcessorThreadPoolSize=80, enableRpcChecksum=false, metricRegistry=com.codahale.metrics.MetricRegistry@7d2c345d}, serverAddress=192.168.80.192:8082, initialServerList='192.168.80.192:8082', metricsReportPeriod=300}], initialServerList='192.168.80.192:8082', heartbeatOptions=null, useSharedRpcExecutor=false, readIndexCoreThreads=16, leaderStateTriggerCoreThreads=4, snapshotCoreThreads=1, snapshotMaxThreads=32, cliRpcCoreThreads=16, raftRpcCoreThreads=32, kvRpcCoreThreads=32, metricsReportPeriod=300, leastKeysOnSplit=10000}, initialServerList='192.168.80.192:8082', onlyLeaderRead=true, rpcOptions=RpcOptions{callbackExecutorCorePoolSize=16, callbackExecutorMaximumPoolSize=32, callbackExecutorQueueCapacity=512, rpcTimeoutMillis=5000}, failoverRetries=2, futureTimeoutMillis=30000, useParallelKVExecutor=true, batchingOptions=BatchingOptions{allowBatching=true, batchSize=100, bufSize=8192, maxWriteBytes=32768, maxReadBytes=1024}, useParallelCompress=false, compressThreads=4, deCompressThreads=5}.
2022-12-26 14:24:13.246 [Rpc-netty-server-worker-10-thread-1] WARN  BoltRaftRpcFactory:82 - JRaft SET bolt.rpc.dispatch-msg-list-in-default-executor to be false for replicator pipeline optimistic.
2022-12-26 14:24:13.279 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] INFO  LocalRaftMetaStorage:132 - Save raft meta, path=/opt/election/raft_data_region_-1_8082/meta, term=4, votedFor=0.0.0.0:0, cost time=18 ms
2022-12-26 14:24:13.281 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] WARN  NodeImpl:1951 - Node < test_election--1/192.168.80.192:8082> reject term_unmatched AppendEntriesRequest from 192.168.80.136:8082, term=4, prevLogIndex=1437, prevLogTerm=4, localPrevLogTerm=0, lastLogIndex=1234, entriesSize=0.
2022-12-26 14:24:13.281 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:89 - onStartFollowing: LeaderChangeContext [leaderId=192.168.80.136:8082, term=4, status=Status[ENEWLEADER<10011>: Raft node receives message from new leader with higher term.]].
2022-12-26 14:24:13.282 [JRaft-FSMCaller-Disruptor-0] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-state-trigger #0,5,main].
2022-12-26 14:24:13.298 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] WARN  NodeImpl:1951 - Node < test_election--1/192.168.80.192:8082> reject term_unmatched AppendEntriesRequest from 192.168.80.136:8082, term=4, prevLogIndex=1234, prevLogTerm=3, localPrevLogTerm=2, lastLogIndex=1234, entriesSize=0.
2022-12-26 14:24:13.320 [JRaft-LogManager-Disruptor-0] WARN  LogManagerImpl:533 - Truncating storage to lastIndexKept=1233.
2022-12-26 14:24:13.324 [JRaft-LogManager-Disruptor-0] ERROR RocksDBLogStorage:629 - Fail to truncateSuffix 1233 in data path:/opt/election/raft_data_region_-1_8082/log.
org.rocksdb.RocksDBException: end key comes before start key
        at org.rocksdb.RocksDB.deleteRange(Native Method) ~[rocksdbjni-6.22.1.1.jar!/:?]
        at org.rocksdb.RocksDB.deleteRange(RocksDB.java:1539) ~[rocksdbjni-6.22.1.1.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage.truncateSuffix(RocksDBLogStorage.java:624) [jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:534) [jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:488) [jraft-core-1.3.11.jar!/:?]
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137) [disruptor-3.3.7.jar!/:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
2022-12-26 14:24:13.326 [JRaft-FSMCaller-Disruptor-0] ERROR StateMachineAdapter:72 - Encountered an error=Status[EIO<1014>: Failed operation in LogStorage] on StateMachine com.alipay.sofa.jraft.rhea.storage.KVStoreStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:570) ~[jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.access$1200(LogManagerImpl.java:75) ~[jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:555) ~[jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:488) ~[jraft-core-1.3.11.jar!/:?]
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137) [disruptor-3.3.7.jar!/:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
2022-12-26 14:24:13.327 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] INFO  Recyclers:52 - -Djraft.recyclers.maxCapacityPerThread: 4096.
2022-12-26 14:24:13.327 [JRaft-FSMCaller-Disruptor-0] WARN  NodeImpl:2498 - Node < test_election--1/192.168.80.192:8082> got error: Error [type=ERROR_TYPE_LOG, status=Status[EIO<1014>: Failed operation in LogStorage]].
2022-12-26 14:24:13.327 [JRaft-FSMCaller-Disruptor-0] WARN  FSMCallerImpl:348 - FSMCaller already in error status, ignore new error.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:570) ~[jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.access$1200(LogManagerImpl.java:75) ~[jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:555) ~[jraft-core-1.3.11.jar!/:?]
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEventHandler.onEvent(LogManagerImpl.java:488) ~[jraft-core-1.3.11.jar!/:?]
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137) [disruptor-3.3.7.jar!/:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
2022-12-26 14:24:13.332 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:84 - onStopFollowing: LeaderChangeContext [leaderId=192.168.80.136:8082, term=4, status=Status[EBADNODE<10009>: Raft node(leader or candidate) is in error.]].
2022-12-26 14:24:13.332 [JRaft-FSMCaller-Disruptor-0] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-state-trigger #1,5,main].
2022-12-26 14:24:13.660 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] WARN  NodeImpl:1893 - Node < test_election--1/192.168.80.192:8082> is not in active state, currTerm=4.
2022-12-26 14:24:13.838 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] WARN  NodeImpl:1893 - Node < test_election--1/192.168.80.192:8082> is not in active state, currTerm=4.
2022-12-26 14:24:14.164 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] WARN  NodeImpl:1893 - Node < test_election--1/192.168.80.192:8082> is not in active state, currTerm=4.
2022-12-26 14:24:14.341 [ test_election--1/PeerPair[192.168.80.192:8082 -> 192.168.80.136:8082]-AppendEntriesThread0] WARN  NodeImpl:1893 - Node < test_election--1/192.168.80.192:8082> is not in active state, currTerm=4.
2022-12-26 14:24:14.391 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #0,5,main].
2022-12-26 14:24:14.410 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #1,5,main].
2022-12-26 14:24:14.426 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #2,5,main].
2022-12-26 14:24:14.444 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #3,5,main].
2022-12-26 14:24:14.459 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #4,5,main].
2022-12-26 14:24:14.475 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #5,5,main].
2022-12-26 14:24:14.490 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #6,5,main].
2022-12-26 14:24:14.506 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #7,5,main].
2022-12-26 14:24:14.522 [Rpc-netty-server-worker-10-thread-2] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #8,5,main].

Modification:

Add onError to State Listener

Result:

skippyb1 commented 1 year ago

@yuyang0423 麻烦问下 org.rocksdb.RocksDBException: end key comes before start key 这个异常是怎么产生的?