sofastack / sofa-jraft

A production-grade Java implementation of the RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

Jraft Leader step down, Node not active #1021

Closed · googlefan closed this issue 10 months ago

googlefan commented 11 months ago

Your question

I built a distributed file system on top of JRaft, using the default bolt RPC framework and the default hessian codec. Large files are split into 10MB chunks. Small-file uploads perform well and are stable, but under large-file stress tests, even though I tuned various settings to avoid OOM as much as possible (it still happened occasionally), the key problem is this: when the system is under heavy load and the leader steps down, the client never learns that the leader is down; it keeps issuing RPC calls that fail with "not leader", and the system enters an endlessly unavailable state. My questions:

  1. How can I avoid this unavailable state of the JRaft cluster (i.e., prevent the leader from stepping down)?
  2. If a leader re-election does happen, how can I guarantee the fastest possible recovery?

Your scenes

2023-08-23 16:17:42.718 [Append-Entries-Thread-Send0] [] ERROR c.alipay.sofa.jraft.rpc.impl.AbstractClientService - Fail to run RpcResponseClosure, the request is group_id: "group1"
server_id: "192.168.165.32:8261"
peer_id: "192.168.165.30:8261"
term: 3
prev_log_term: 3
prev_log_index: 4221
entries {
  term: 3
  type: ENTRY_TYPE_DATA
  data_len: 10486854
}
committed_index: 4213

invoke err: Not leader

2023-08-23 16:33:30.814 [group1/PeerPair[192.168.165.32:8261 -> 192.168.165.31:8261]-AppendEntriesThread0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> is not in active state, currTerm=4.
2023-08-23 16:33:30.842 [group1/PeerPair[192.168.165.32:8261 -> 192.168.165.31:8261]-AppendEntriesThread0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> is not in active state, currTerm=4.
2023-08-23 16:33:31.043 [group1/PeerPair[192.168.165.32:8261 -> 192.168.165.31:8261]-AppendEntriesThread0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> is not in active state, currTerm=4.
2023-08-23 16:33:31.064 [group1/PeerPair[192.168.165.32:8261 -> 192.168.165.31:8261]-AppendEntriesThread0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> is not in active state, currTerm=4.

Client SDK code:

    response = this.rpcClient.invokeSync(getLeader().getEndpoint(), request, cliOptions.getRpcDefaultTimeout());

    public PeerId getLeader() throws InterruptedException, TimeoutException {
        refreshLeader(groupId);
        leader = RouteTable.getInstance().selectLeader(groupId);
        return leader;
    }

    private com.alipay.sofa.jraft.Status refreshLeader(String groupId) throws InterruptedException, TimeoutException {
        return RouteTable.getInstance().refreshLeader(cliClientService, groupId, cliOptions.getTimeoutMs());
    }

The JRaft configuration is as follows:

jraft:
  server:
    enable: true
    group-id: group1
    rpc-endpoint: 192.168.165.30:8261
    group: 192.168.165.30:8261,192.168.165.31:8261,192.168.165.32:8261
    node:
      log-uri: /data/storage/log
      snapshot-uri: /data/storage/snapshot
      raft-meta-uri: /data/storage/raft_meta
      rpc-connect-timeout-ms: 10000
      election-timeout-ms: 2000
      rpc-default-timeout: 60000
      raft-options:
        max-byte-count-per-rpc: 1048576
        max-entries-size: 128
        max-body-size: 1048576
        max-append-buffer-size: 1048576
        max-election-delay-ms: 5000
        election-heartbeat-factor: 10
        disruptor-buffer-size: 16
        disruptor-publish-event-wait-timeout-secs: 30
        step-down-when-vote-timedout: true
  client:
    enable: true
    split-val: 10MB
    groups:
      - group-id: group1
        group: 192.168.165.30:8261,192.168.165.31:8261,192.168.165.32:8261
      #- group-id: group2
      #  group: 192.168.167.220:8262,192.168.167.221:8262,192.168.167.222:8262
    cliOptions:
      rpc-connect-timeout-ms: 2000
      rpc-default-timeout: 30000
      timeout-ms: 10000
      max-retry: 3
      rpc-processor-thread-pool-size: 2

Environment

fengjiachun commented 10 months ago

> After the leader steps down, the client does not learn that the leader is down; it keeps issuing RPC calls, which then fail with "not leader".

The leader has moved; you need to connect to the new leader (call refreshLeader after receiving the "not leader" error).

googlefan commented 10 months ago

> > After the leader steps down, the client does not learn that the leader is down; it keeps issuing RPC calls, which then fail with "not leader".
>
> The leader has moved; you need to connect to the new leader (call refreshLeader after receiving the "not leader" error).

There is logic to connect to the new leader. Please look at the code I posted: every RPC call goes through refreshLeader. The current situation is that the state machine keeps running onApply tasks: the done closure is null, yet the iter.getData() buffer is empty, and the state machine stays in the running state. Through debugging I found that during the leader refresh, `if (!stampedLock.validate(stamp))` evaluates to false and the branch is skipped, and new RPC calls no longer go through the processor: they directly return a (BooleanCommand) response with success = true, without ever calling the server.

    public PeerId selectLeader(final String groupId) {
        Requires.requireTrue(!StringUtils.isBlank(groupId), "Blank group id");

        final GroupConf gc = this.groupConfTable.get(groupId);
        if (gc == null) {
            return null;
        }
        final StampedLock stampedLock = gc.stampedLock;
        long stamp = stampedLock.tryOptimisticRead();
        PeerId leader = gc.leader;
        if (!stampedLock.validate(stamp)) {
            stamp = stampedLock.readLock();
            try {
                leader = gc.leader;
            } finally {
                stampedLock.unlockRead(stamp);
            }
        }
        return leader;
    }

The state machine task-handling code is as follows:

    @Override
    public void onApply(Iterator iter) {
        while (iter.hasNext()) {
            final Closure done = iter.done();
            CommandType cmdType;
            final ByteBuffer data = iter.getData();
            Object cmd;
            LeaderTaskClosure<T> closure = null;
            if (done != null) {
                // On the leader, take the data directly from the closure
                closure = (LeaderTaskClosure<T>) done;
                cmdType = closure.getCmdType();
                cmd = closure.getCmd();
            } else {
                // On a follower, decode the data received via RPC
                if (data.remaining() <= 0) {
                    log.info("fsm data buffer is empty!");
                    continue;
                }
                try {
                    final byte b = data.get();
                    cmdType = CommandType.parseByte(b);
                } catch (Exception e) {
                    log.error("fsm data get err:{}", e);
                    continue;
                }
                final byte[] cmdBytes = new byte[data.remaining()];
                data.get(cmdBytes);
                cmd = CommandCodec.decodeCommand(cmdBytes, cmdType);
            }
            Object response;
            if (cmd == null) {
                response = null;
            } else {
                CmdTaskHandler<T> handler = registry.getHandler(cmdType);
                try {
                    response = handler.execute((T) cmd);
                } catch (FsException e) {
                    response = e;
                    log.error("fsm run FsException:{}", e.getMessage());
                }
            }
            if (closure != null) {
                closure.setResponse(response);
                closure.run(Status.OK());
            }
            log.info("On apply with term: {} and index: {}. ", iter.getTerm(), iter.getIndex());
            iter.next();
        }
    }
fengjiachun commented 10 months ago

> public PeerId selectLeader(final String groupId) {

This is selectLeader, not refresh. selectLeader only reads the local cache. What I meant is: when the client receives a NotLeader error, it needs to actively refresh the leader (i.e., follow the redirection) and then retry the request.

The answer above applies when the cluster has already switched to a new leader successfully. If JRaft cannot elect a leader for a long time because of high load or some other reason, you need to look at the server-side error logs, not the client.
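For reference, a minimal sketch of the refresh-and-retry pattern being suggested here, reusing the RouteTable/CliClientService fields from the client snippet above (the MAX_RETRY constant and the generic request/response types are illustrative, not jraft API):

    import com.alipay.sofa.jraft.RouteTable;
    import com.alipay.sofa.jraft.Status;
    import com.alipay.sofa.jraft.entity.PeerId;

    // Hypothetical helper: refresh the leader before each attempt, then retry
    // the RPC when an attempt fails (e.g. with a "not leader" error).
    private Object invokeWithLeaderRetry(final Object request) throws Exception {
        Exception last = null;
        for (int i = 0; i < MAX_RETRY; i++) { // MAX_RETRY: illustrative constant
            // Actively refresh the route table and check the returned Status.
            final Status st = RouteTable.getInstance()
                .refreshLeader(cliClientService, groupId, cliOptions.getTimeoutMs());
            if (!st.isOk()) {
                last = new IllegalStateException("refresh leader failed: " + st);
                continue; // no leader yet (e.g. election in progress): retry
            }
            final PeerId leader = RouteTable.getInstance().selectLeader(groupId);
            try {
                return this.rpcClient.invokeSync(leader.getEndpoint(), request,
                    cliOptions.getRpcDefaultTimeout());
            } catch (final Exception e) {
                last = e; // "not leader" etc.: loop back and refresh again
            }
        }
        throw last;
    }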

googlefan commented 10 months ago

> > public PeerId selectLeader(final String groupId) {
>
> This is selectLeader, not refresh. selectLeader only reads the local cache. What I meant is: when the client receives a NotLeader error, it needs to actively refresh the leader (i.e., follow the redirection) and then retry the request.

Er, I do call refreshLeader before selectLeader. You mean `return RouteTable.getInstance().refreshLeader(cliClientService, groupId, cliOptions.getTimeoutMs());`, right?

fengjiachun commented 10 months ago

> refreshLeader

Check its return value, and see my second suggestion.

fengjiachun commented 10 months ago

> > public PeerId selectLeader(final String groupId) {
> >
> > This is selectLeader, not refresh. selectLeader only reads the local cache. What I meant is: when the client receives a NotLeader error, it needs to actively refresh the leader (i.e., follow the redirection) and then retry the request.
>
> Er, I do call refreshLeader before selectLeader. You mean `return RouteTable.getInstance().refreshLeader(cliClientService, groupId, cliOptions.getTimeoutMs());`, right?

Check its return value, and see my second suggestion.

googlefan commented 10 months ago

> > public PeerId selectLeader(final String groupId) {
> >
> > This is selectLeader, not refresh. selectLeader only reads the local cache. What I meant is: when the client receives a NotLeader error, it needs to actively refresh the leader (i.e., follow the redirection) and then retry the request.
>
> Er, I do call refreshLeader before selectLeader. You mean `return RouteTable.getInstance().refreshLeader(cliClientService, groupId, cliOptions.getTimeoutMs());`, right?
>
> Check its return value, and see my second suggestion.

OK, got it. Could you also help me look at the server-side state machine problem: after the errors under stress, the server keeps running task onApply; it looks like it has entered an infinite loop.

googlefan commented 10 months ago

    public void handleRequest(final RpcContext rpcCtx, final T request) {
        if (!raftGroupService.getRaftNode().isLeader(true)) {
            rpcCtx.sendResponse(redirect());
            return;
        }
        final CommandType cmdType = getCmdType();
        final Task task = createTask(rpcCtx, request, cmdType);
        raftGroupService.getRaftNode().apply(task);
    }

It looks like making the leader check blocking can fix this problem @fengjiachun

fengjiachun commented 10 months ago

googlefan commented 10 months ago

I just increased the stress-test load: there is some relief, but the problem is not completely solved. Sigh. It has entered the infinite loop again:

2023-08-23 15:35:38.372 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data get err:{}
java.nio.BufferUnderflowException: null
    at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:643)
    at java.base/java.nio.HeapByteBuffer.get(HeapByteBuffer.java:165)
    at com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine.onApply(StorageStateMachine.java:53)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:597)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:561)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:467)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:150)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-23 15:35:38.372 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data get err:{}
java.nio.BufferUnderflowException: null
    at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:643)
    at java.base/java.nio.HeapByteBuffer.get(HeapByteBuffer.java:165)
    at com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine.onApply(StorageStateMachine.java:53)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:597)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:561)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:467)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:150)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-23 15:35:38.370 [http-nio-8062-exec-3] [] ERROR c.y.h.platform.fs.exceptions.ApiExceptionHandler - 上传文件失败
com.yss.henghe.platform.fs.FsException: Node is busy, has too many tasks, queue is full and bufferSize=16
    at com.yss.henghe.platform.fs.client.FsClient.invoke(FsClient.java:374)
    at com.yss.henghe.platform.fs.client.FsClient.putObject(FsClient.java:240)
    at com.yss.henghe.platform.fs.service.impl.ObjectServiceImpl.putObject(ObjectServiceImpl.java:46)
    at com.yss.henghe.platform.fs.s3.ObjectController.upLoadByFile(ObjectController.java:54)
    at com.yss.henghe.platform.fs.s3.ObjectController$$FastClassBySpringCGLIB$$66c46a2a.invoke(<generated>)

2023-08-23 16:17:42.718 [Append-Entries-Thread-Send0] [] ERROR c.alipay.sofa.jraft.rpc.impl.AbstractClientService - Fail to run RpcResponseClosure, the request is group_id: "group1"
server_id: "192.168.165.32:8261"
peer_id: "192.168.165.30:8261"
term: 3
prev_log_term: 3
prev_log_index: 4221
entries {
  term: 3
  type: ENTRY_TYPE_DATA
  data_len: 10486854
}
committed_index: 4213
data: "\001O\276com.yss.henghe.platform.fs.command.SaveCommand\226\006bucket\003key\bfileType\bfileSize\nsplitIndex\005byteso\220\rtest-bucket16\005dir43\tvideo/mp4w\000\334\373\366\220b\200\000\000\000\000 ftypisom\000\000\002\000isomiso2avc1mp41\000\000\331umoov\000\000\000lmvhd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\003\350\000\001\026k\000\001\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000@\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\003\000\000q\231trak\000\000\000\\tkhd\000\000\000\003\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\001\026H\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000@\000\000\000\005\000\000\000\002\320\000\000\000\000\000$edt

2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
googlefan commented 10 months ago

Found the bug: when the state machine processes the buffer data, if the buffer has already been fully read, the code never advances to the next task, which causes the infinite loop:

    if (data.remaining() <= 0) {
        log.info("fsm data buffer is empty!");
        iter.next(); // this call was missing
        continue;
    }
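An alternative that rules out this class of bug is to advance the iterator in the loop header, so every continue still moves to the next log entry. A sketch of the same onApply skeleton (decode/execute details omitted):

    @Override
    public void onApply(final Iterator iter) {
        // iter.next() in the for-header runs on every path, including early
        // 'continue's, so the loop can never stall on a single log entry.
        for (; iter.hasNext(); iter.next()) {
            final Closure done = iter.done();
            if (done == null && iter.getData().remaining() <= 0) {
                log.info("fsm data buffer is empty!");
                continue; // safe: the for-header still advances the iterator
            }
            // ... decode the command and execute it as in the handler above ...
        }
    }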
googlefan commented 10 months ago

@fengjiachun Hi, the problem above is basically solved, but now a new issue has appeared, again a "not active" problem. Please help take a look:

Leader log:
2023-08-24 11:28:06.610 [Append-Entries-Thread-Send1] [] WARN  com.alipay.sofa.jraft.core.Replicator - Fail to issue RPC to 192.168.165.32:8261, consecutiveErrorTimes=2051, error=Status[EINVAL<1015>: Node <group1/192.168.165.32:8261> is not in active state, state STATE_ERROR.], groupId=group1

Follower log:
2023-08-24 11:29:48.298 [group1/PeerPair[192.168.165.32:8261 -> 192.168.165.31:8261]-AppendEntriesThread0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> is not in active state, currTerm=3.

From the logs, the follower has been in an inactive state for a long time. Will JRaft automatically detect the follower's state and recover by itself, or is manual intervention (restarting the node) required to fix this?

fengjiachun commented 10 months ago

Troubleshoot it yourself first; see section 11 of the user guide: https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/

killme2008 commented 10 months ago

The error message is already very clear: state STATE_ERROR

The state machine is in the error state. Why? Check the logs.

https://github.com/sofastack/sofa-jraft/blob/19ed179e02ee9108adc0bbf66badb47f62c62af8/jraft-core/src/main/java/com/alipay/sofa/jraft/core/StateMachineAdapter.java#L71

googlefan commented 10 months ago

> The error message is already very clear: state STATE_ERROR
>
> The state machine is in the error state. Why? Check the logs.
>
> https://github.com/sofastack/sofa-jraft/blob/19ed179e02ee9108adc0bbf66badb47f62c62af8/jraft-core/src/main/java/com/alipay/sofa/jraft/core/StateMachineAdapter.java#L71

Yes, the error was captured:

2023-08-24 13:07:25.439 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm err: ERROR_TYPE_LOG
2023-08-24 13:07:25.440 [JRaft-FSMCaller-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.StateMachineAdapter - Encountered an error=Status[UNKNOWN<-1>: LogManager handle event error] on StateMachine com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
    at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
    at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
    at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 13:07:25.440 [JRaft-FSMCaller-Disruptor-0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> got error: {}.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
    at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
    at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
    at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 13:07:25.440 [JRaft-FSMCaller-Disruptor-0] [] WARN  com.alipay.sofa.jraft.core.FSMCallerImpl - FSMCaller already in error status, ignore new error.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
    at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
    at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
    at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
    at java.base/java.lang.Thread.run(Thread.java:829)

After the node hit this error, it stayed stuck in an endless loop inside RpcRequestProcessor.handleRequest. I don't quite understand why this happens.

killme2008 commented 10 months ago

Read the development guide. Your state machine code must have thrown an exception, so the whole state machine fell into the error state and cannot continue processing; it can only be resolved by human intervention.

To guarantee consistency, raft stops serving once the state machine hits an exception. The log message says this clearly: you need to find the root cause and fix it.

googlefan commented 10 months ago

> Read the development guide. Your state machine code must have thrown an exception, so the whole state machine fell into the error state and cannot continue processing; it can only be resolved by human intervention.
>
> To guarantee consistency, raft stops serving once the state machine hits an exception. The log message says this clearly: you need to find the root cause and fix it.

Hi, I re-read the guide and plan to make the following changes:

  1. apply-task-mode: Blocking
  2. Set expectedTerm before submitting the task. I hope these changes improve the stability of the JRaft cluster under load. Regarding the second point in the guide, though, I could not find a way to obtain the term:
    public void handleRequest(final RpcContext rpcCtx, final T request) {
        if (!raftGroupService.getRaftNode().isLeader(true)) {
            rpcCtx.sendResponse(redirect());
            return;
        }
        final CommandType cmdType = getCmdType();
        final Task task = createTask(rpcCtx, request, cmdType);
        raftGroupService.getRaftNode().apply(task);
    }

I went through every method available on raftGroupService and found none that returns the Node term. What am I doing wrong? @killme2008

googlefan commented 10 months ago

Moreover, NodeImpl's

    @OnlyForTest
    long getCurrentTerm() {
        this.readLock.lock();
        try {
            return this.currTerm;
        } finally {
            this.readLock.unlock();
        }
    }

is marked as test-only.

fengjiachun commented 10 months ago

Regarding "apply-task-mode: Blocking":

    public void handleRequest(final RpcContext rpcCtx, final T request) {
        if (!raftGroupService.getRaftNode().isLeader(true)) {
            rpcCtx.sendResponse(redirect());
            return;
        }
        final CommandType cmdType = getCmdType();
        final Task task = createTask(rpcCtx, request, cmdType);
        raftGroupService.getRaftNode().apply(task);
    }

This is your own code, right? Whether to make the call synchronous is entirely up to you. If you are not sure how to write it, you can use SynchronizedClosure.
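For reference, a minimal sketch of a synchronous apply built on jraft's SynchronizedClosure, which blocks the submitting thread until the state machine has applied the task (the data argument and the surrounding method are illustrative):

    import java.nio.ByteBuffer;

    import com.alipay.sofa.jraft.Node;
    import com.alipay.sofa.jraft.Status;
    import com.alipay.sofa.jraft.closure.SynchronizedClosure;
    import com.alipay.sofa.jraft.entity.Task;

    // Submit a task and wait until the state machine applies it (or it fails).
    Status applySync(final Node node, final ByteBuffer data) throws InterruptedException {
        final SynchronizedClosure done = new SynchronizedClosure();
        node.apply(new Task(data, done));
        return done.await(); // unblocked when the closure's run(status) fires
    }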

fengjiachun commented 10 months ago

> I want to set the value of expectedTerm before submitting the task.

Getting it beforehand is meaningless; you can get the term after onApply runs.

googlefan commented 10 months ago

> > I want to set the value of expectedTerm before submitting the task.
>
> Getting it beforehand is meaningless; you can get the term after onApply runs.

The guide says: "long expectedTerm = -1: the leader term expected at task submission. If it is not provided (i.e., left at the default -1), the leader is not checked for changes before the task is applied to the state machine; if it is provided (obtained from a state machine callback, see below), the term is checked for a match before the task is applied to the state machine, and the task is rejected on a mismatch." I am wondering whether I can use this term-mismatch mechanism to reject tasks, rather than having exceptions thrown inside the state machine flow.

googlefan commented 10 months ago

> Regarding "apply-task-mode: Blocking":
>
> This is your own code, right? Whether to make the call synchronous is entirely up to you. If you are not sure how to write it, you can use SynchronizedClosure.

Yes, I plan to use synchronous submission to reduce the load on the server. I saw this option among the configuration items and wanted to give it a try.

killme2008 commented 10 months ago

The current term can be recorded by the leader in onLeaderStart, and that value can be used directly afterwards.
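A minimal sketch of that suggestion, combined with the expectedTerm idea above. Two fragments: one in the StateMachineAdapter subclass, one at the submit site from the handleRequest snippet posted earlier (the field name leaderTerm is illustrative):

    // In the state machine: remember the term handed to the leader.
    private volatile long leaderTerm = -1;

    @Override
    public void onLeaderStart(final long term) {
        this.leaderTerm = term;
        super.onLeaderStart(term);
    }

    @Override
    public void onLeaderStop(final Status status) {
        this.leaderTerm = -1; // no longer leader: stop stamping tasks with this term
        super.onLeaderStop(status);
    }

    // At the submit site: stamp the task so jraft rejects it on a term mismatch
    // instead of the state machine having to fail while applying it.
    final Task task = createTask(rpcCtx, request, cmdType);
    task.setExpectedTerm(leaderTerm);
    raftGroupService.getRaftNode().apply(task);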

googlefan commented 10 months ago

2023-08-24 18:16:11.848 [JRaft-FSMCaller-Disruptor-0] [] DEBUG com.yss.henghe.platform.fs.LeaderTaskClosure - task onCommitted
2023-08-24 18:16:11.849 [JRaft-FSMCaller-Disruptor-0] [] INFO  c.y.h.p.fs.storage.jraft.StorageStateMachine - On apply with term: 1 and index: 99.
2023-08-24 18:16:12.677 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.util.LogExceptionHandler - Handle LogManagerImpl disruptor event error, event is
com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEvent@53ddc044
java.lang.OutOfMemoryError: Java heap space
2023-08-24 18:16:12.840 [JRaft-Group-Default-Executor-7] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [0, 0] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:12.977 [JRaft-StepDownTimer-<group1/192.168.165.32:8261>0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> steps down when alive nodes don't satisfy quorum, term=1, deadNodes=192.168.165.30:8261,192.168.165.31:8261, conf=192.168.165.30:8261,192.168.165.31:8261,192.168.165.32:8261.
2023-08-24 18:16:12.986 [JRaft-FSMCaller-Disruptor-0] [] INFO  com.alipay.sofa.jraft.core.StateMachineAdapter - onLeaderStop: status=Status[ERAFTTIMEDOUT<10001>: Majority of the group dies: 2/3].
2023-08-24 18:16:13.088 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [101, 101] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [103, 104] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [105, 105] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [111, 111] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm err: ERROR_TYPE_LOG
2023-08-24 18:16:13.090 [JRaft-FSMCaller-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.StateMachineAdapter - Encountered an error=Status[UNKNOWN<-1>: LogManager handle event error] on StateMachine com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
        at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
        at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 18:16:13.090 [JRaft-FSMCaller-Disruptor-0] [] WARN  com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> got error: {}.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
        at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
        at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 18:16:13.090 [JRaft-FSMCaller-Disruptor-0] [] WARN  com.alipay.sofa.jraft.core.FSMCallerImpl - FSMCaller already in error status, ignore new error.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
        at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
        at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
        at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 18:16:13.097 [http-nio-8062-exec-94] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-73] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-95] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-81] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-99] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down

From the error log: JRaft hit an OutOfMemoryError in LogManagerImpl while handling a StableClosureEvent, the node then got EIO<1014> IO errors (Corrupted LogStorage), the state machine reported ERROR_TYPE_LOG, and the leader stepped down as a result. How should this situation be handled? Can I catch the exception in StateMachine.onError and ignore it (and then restart the node to bring the group member back to life)?

fengjiachun commented 10 months ago

> > I want to set the value of expectedTerm before submitting the task.
>
> Getting it beforehand is meaningless; you can get the term after onApply runs.
>
> The guide says: "long expectedTerm = -1 …" I am wondering whether I can use this term-mismatch mechanism to reject tasks, rather than having exceptions thrown inside the state machine flow.

Oh, I see what you meant. killme2008 has already answered you.

fengjiachun commented 10 months ago

> From the error log: JRaft hit an OutOfMemoryError in LogManagerImpl while handling a StableClosureEvent, the node then got EIO<1014> IO errors (Corrupted LogStorage), the state machine reported ERROR_TYPE_LOG, and the leader stepped down as a result. How should this situation be handled? Can I catch the exception in StateMachine.onError and ignore it?

It already went OOM; what else is there to look at? Avoid the OOM.

googlefan commented 10 months ago

@killme2008 @fengjiachun 👍 👍 👍 Thanks for the help; this issue can be closed now.