Closed. googlefan closed this issue 10 months ago.
After the leader steps down, the client does not detect the leader-down state; it keeps issuing RPC calls and gets "not leader" errors.
The leader has moved; you need to connect to the new leader (call refreshLeader after receiving the "not leader" error).
There is logic to connect to the new leader; please look at the code I posted. Every RPC call goes through refreshLeader first. The current situation is: the state machine keeps running onApply on tasks whose done closure is null while the iter.getData() buffer is empty, so it stays in the running state forever. Through debugging I found that during refresh leader, `if (!stampedLock.validate(stamp))` is false so the locked branch is skipped, and new RPC calls no longer reach the processor: they directly get a BooleanCommand response with success = true, but the server is never actually called.
public PeerId selectLeader(final String groupId) {
    Requires.requireTrue(!StringUtils.isBlank(groupId), "Blank group id");
    final GroupConf gc = this.groupConfTable.get(groupId);
    if (gc == null) {
        return null;
    }
    final StampedLock stampedLock = gc.stampedLock;
    long stamp = stampedLock.tryOptimisticRead();
    PeerId leader = gc.leader;
    if (!stampedLock.validate(stamp)) {
        stamp = stampedLock.readLock();
        try {
            leader = gc.leader;
        } finally {
            stampedLock.unlockRead(stamp);
        }
    }
    return leader;
}
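For context, `validate(stamp)` returning true means no write lock was taken since `tryOptimisticRead()`, so skipping the locked branch is the normal fast path of StampedLock, not a bug. A minimal, self-contained sketch of that pattern (GroupConf and the String leader here are stand-ins, not the real jraft types):

```java
import java.util.concurrent.locks.StampedLock;

// Self-contained sketch of the optimistic-read pattern used by selectLeader.
public class OptimisticReadDemo {

    public static final class GroupConf {
        public final StampedLock stampedLock = new StampedLock();
        public volatile String leader; // stand-in for PeerId
    }

    public static String selectLeader(final GroupConf gc) {
        final StampedLock sl = gc.stampedLock;
        long stamp = sl.tryOptimisticRead(); // lock-free: just take a stamp
        String leader = gc.leader;           // optimistic (possibly racy) read
        if (!sl.validate(stamp)) {           // a writer intervened: fall back
            stamp = sl.readLock();           // to a full read lock and re-read
            try {
                leader = gc.leader;
            } finally {
                sl.unlockRead(stamp);
            }
        }
        return leader; // may be null if no leader is cached yet
    }
}
```

Note that either way this only reads the locally cached leader; it never contacts the cluster, so refreshing the cache has to happen somewhere else.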
The state machine task handling code is as follows:
@Override
public void onApply(Iterator iter) {
    while (iter.hasNext()) {
        final Closure done = iter.done();
        CommandType cmdType;
        final ByteBuffer data = iter.getData();
        Object cmd;
        LeaderTaskClosure<T> closure = null;
        if (done != null) {
            // On the leader, read the data directly from the closure
            closure = (LeaderTaskClosure<T>) done;
            cmdType = closure.getCmdType();
            cmd = closure.getCmd();
        } else {
            // On followers, decode the data received over RPC
            if (data.remaining() <= 0) {
                log.info("fsm data buffer is empty!");
                continue;
            }
            try {
                final byte b = data.get();
                cmdType = CommandType.parseByte(b);
            } catch (Exception e) {
                log.error("fsm data get err:{}", e);
                continue;
            }
            final byte[] cmdBytes = new byte[data.remaining()];
            data.get(cmdBytes);
            cmd = CommandCodec.decodeCommand(cmdBytes, cmdType);
        }
        Object response;
        if (cmd == null) {
            response = null;
        } else {
            CmdTaskHandler<T> handler = registry.getHandler(cmdType);
            try {
                response = handler.execute((T) cmd);
            } catch (FsException e) {
                response = e;
                log.error("fsm run FsException:{}", e.getMessage());
            }
        }
        if (closure != null) {
            closure.setResponse(response);
            closure.run(Status.OK());
        }
        log.info("On apply with term: {} and index: {}. ", iter.getTerm(), iter.getIndex());
        iter.next();
    }
}
public PeerId selectLeader(final String groupId) {
This is selectLeader, not a refresh; select only queries the local cache. What I mean is: when the client receives a NotLeader error, it must actively refresh the leader (i.e., follow the redirect) and then retry the request.
The answer above applies when the cluster has already elected a new leader. If jraft cannot elect a leader for a long time, because of overload or some other reason, then you need to look at the server-side error logs, not the client.
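The redirect-and-retry loop being described can be sketched generically; everything below (NotLeaderException, LeaderRoute) is a hypothetical stand-in, not the real jraft client API:

```java
import java.util.function.Supplier;

// Sketch of a client-side redirect-and-retry loop: on a NotLeader error,
// refresh the leader from the cluster and retry the call.
public class RetryDemo {

    public static class NotLeaderException extends RuntimeException {}

    // Hypothetical stand-in for RouteTable#refreshLeader usage.
    public interface LeaderRoute {
        void refreshLeader(); // ask the cluster who the leader is now
    }

    public static <T> T invokeWithRedirect(Supplier<T> call, LeaderRoute route, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (NotLeaderException e) {
                if (attempt >= maxRetries) {
                    throw e; // give up after maxRetries redirects
                }
                route.refreshLeader(); // redirect to the new leader, then retry
            }
        }
    }
}
```

The retry budget matters: with an unbounded loop and a cluster that cannot elect a leader, the client spins forever, which matches the symptom described in this issue.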
Er, I do call refreshLeader before selectLeader. You mean `return RouteTable.getInstance().refreshLeader(cliClientService, groupId, cliOptions.getTimeoutMs());`, right?
refreshLeader
Check its return value, and see my second suggestion.
OK, thanks. Could you also take a look at the server-side state machine problem: after errors under stress, the server keeps running task onApply; it looks like it has entered an infinite loop.
public void handleRequest(final RpcContext rpcCtx, final T request) {
    if (!raftGroupService.getRaftNode().isLeader(true)) {
        rpcCtx.sendResponse(redirect());
        return;
    }
    final CommandType cmdType = getCmdType();
    final Task task = createTask(rpcCtx, request, cmdType);
    raftGroupService.getRaftNode().apply(task);
}
It seems that adding blocking to the leader check can fix this problem @fengjiachun
?
I just ran a heavier stress test; it helped, but did not fully solve it. Sigh. It went into the infinite loop again:
2023-08-23 15:35:38.372 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data get err:{}
java.nio.BufferUnderflowException: null
at java.base/java.nio.Buffer.nextGetIndex(Buffer.java:643)
at java.base/java.nio.HeapByteBuffer.get(HeapByteBuffer.java:165)
at com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine.onApply(StorageStateMachine.java:53)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:597)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:561)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:467)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:150)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-23 15:35:38.370 [http-nio-8062-exec-3] [] ERROR c.y.h.platform.fs.exceptions.ApiExceptionHandler - Failed to upload file
com.yss.henghe.platform.fs.FsException: Node is busy, has too many tasks, queue is full and bufferSize=16
at com.yss.henghe.platform.fs.client.FsClient.invoke(FsClient.java:374)
at com.yss.henghe.platform.fs.client.FsClient.putObject(FsClient.java:240)
at com.yss.henghe.platform.fs.service.impl.ObjectServiceImpl.putObject(ObjectServiceImpl.java:46)
at com.yss.henghe.platform.fs.s3.ObjectController.upLoadByFile(ObjectController.java:54)
at com.yss.henghe.platform.fs.s3.ObjectController$$FastClassBySpringCGLIB$$66c46a2a.invoke(<generated>)
2023-08-23 16:17:42.718 [Append-Entries-Thread-Send0] [] ERROR c.alipay.sofa.jraft.rpc.impl.AbstractClientService - Fail to run RpcResponseClosure, the request is group_id: "group1"
server_id: "192.168.165.32:8261"
peer_id: "192.168.165.30:8261"
term: 3
prev_log_term: 3
prev_log_index: 4221
entries {
term: 3
type: ENTRY_TYPE_DATA
data_len: 10486854
}
committed_index: 4213
data: "\001O\276com.yss.henghe.platform.fs.command.SaveCommand\226\006bucket\003key\bfileType\bfileSize\nsplitIndex\005byteso\220\rtest-bucket16\005dir43\tvideo/mp4w\000\334\373\366\220b\200\000\000\000\000 ftypisom\000\000\002\000isomiso2avc1mp41\000\000\331umoov\000\000\000lmvhd\000\000\000\000\000\000\000\000\000\000\000\000\000\000\003\350\000\001\026k\000\001\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000@\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\003\000\000q\231trak\000\000\000\\tkhd\000\000\000\003\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\001\026H\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000@\000\000\000\005\000\000\000\002\320\000\000\000\000\000$edt
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
2023-08-23 20:32:56.596 [JRaft-FSMCaller-Disruptor-0] [] INFO c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm data buffer is empty!
Found the bug: when the state machine processes the buffer data, if the buffer has already been fully read it never advances to the next task, which causes the infinite loop:
if (data.remaining() <= 0) {
    log.info("fsm data buffer is empty!");
    iter.next(); // this was missing
    continue;
}
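A way to rule out this whole class of bug is to advance the iterator in the loop header, so every path moves on to the next task (note the `catch` branch's `continue` in onApply above skips the advance in exactly the same way). A self-contained sketch with a minimal stand-in iterator, not the real com.alipay.sofa.jraft.Iterator:

```java
import java.nio.ByteBuffer;
import java.util.List;

// Sketch: with the advance in the for-header, `continue` can no longer
// skip iter.next() and spin forever on the same log entry.
public class ApplyLoopDemo {

    // Hypothetical minimal stand-in for jraft's Iterator.
    public interface TaskIterator {
        boolean hasNext();
        void next();
        ByteBuffer getData();
    }

    public static int apply(TaskIterator iter) {
        int applied = 0;
        for (; iter.hasNext(); iter.next()) { // advances exactly once per pass
            ByteBuffer data = iter.getData();
            if (data == null || data.remaining() <= 0) {
                continue; // safe: the for-header still advances the iterator
            }
            applied++; // decoding and handler execution would go here
        }
        return applied;
    }

    // Helper to drive the sketch from a plain list of buffers.
    public static TaskIterator of(List<ByteBuffer> entries) {
        return new TaskIterator() {
            private int i = 0;
            public boolean hasNext() { return i < entries.size(); }
            public void next() { i++; }
            public ByteBuffer getData() { return entries.get(i); }
        };
    }
}
```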
@fengjiachun Hi, the problem above is basically solved. Now a new problem has appeared, again about "not active". Please assist:
Leader log:
2023-08-24 11:28:06.610 [Append-Entries-Thread-Send1] [] WARN com.alipay.sofa.jraft.core.Replicator - Fail to issue RPC to 192.168.165.32:8261, consecutiveErrorTimes=2051, error=Status[EINVAL<1015>: Node <group1/192.168.165.32:8261> is not in active state, state STATE_ERROR.], groupId=group1
Follower log:
2023-08-24 11:29:48.298 [group1/PeerPair[192.168.165.32:8261 -> 192.168.165.31:8261]-AppendEntriesThread0] [] WARN com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> is not in active state, currTerm=3.
From the logs, the follower has been in an inactive state for a long time. Will jraft automatically detect the follower's state and recover on its own, or is manual intervention (restarting the node) required to fix this?
Troubleshoot it yourself first; see section 11 of this guide: https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/
The error message is already quite clear: state STATE_ERROR. The state machine is in an error state. Why? Check the logs.
Yes, the error was captured:
2023-08-24 13:07:25.439 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm err: ERROR_TYPE_LOG
2023-08-24 13:07:25.440 [JRaft-FSMCaller-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.StateMachineAdapter - Encountered an error=Status[UNKNOWN<-1>: LogManager handle event error] on StateMachine com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 13:07:25.440 [JRaft-FSMCaller-Disruptor-0] [] WARN com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> got error: {}.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 13:07:25.440 [JRaft-FSMCaller-Disruptor-0] [] WARN com.alipay.sofa.jraft.core.FSMCallerImpl - FSMCaller already in error status, ignore new error.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
at java.base/java.lang.Thread.run(Thread.java:829)
After the node errors out, it stays in RpcRequestProcessor.handleRequest in an infinite loop. I don't quite understand why this happens.
Read the development guide. Your state machine code must have thrown an exception, so the whole state machine entered an error state and cannot continue; only manual intervention can fix it.
To guarantee consistency, raft stops serving once the state machine hits an error. The log message says it very clearly: you need to find the root cause and fix it.
Hi, I re-read the guide and wanted to make the following change:
public void handleRequest(final RpcContext rpcCtx, final T request) {
    if (!raftGroupService.getRaftNode().isLeader(true)) {
        rpcCtx.sendResponse(redirect());
        return;
    }
    final CommandType cmdType = getCmdType();
    final Task task = createTask(rpcCtx, request, cmdType);
    raftGroupService.getRaftNode().apply(task);
}
I went through every method available on raftGroupService and could not find a way to get the node's term. Which part am I doing wrong? @killme2008
Moreover, NodeImpl's
@OnlyForTest
long getCurrentTerm() {
    this.readLock.lock();
    try {
        return this.currTerm;
    } finally {
        this.readLock.unlock();
    }
}
is annotated @OnlyForTest, so it is test-only.
This is your own code, right? Whether to call it synchronously is your own decision. If you don't know how to write it, you can use SynchronizedClosure.
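jraft's SynchronizedClosure does exactly this: it parks the submitting thread until the apply pipeline invokes the closure. A minimal self-contained stand-in illustrating the idea (this is not the real class, whose exact API differs):

```java
import java.util.concurrent.CountDownLatch;

// Minimal stand-in for the idea behind jraft's SynchronizedClosure:
// the submitting thread blocks until the apply pipeline calls run(...).
public class SyncClosureDemo {

    public static class BlockingClosure {
        private final CountDownLatch latch = new CountDownLatch(1);
        private volatile boolean ok;
        private volatile Object response;

        // Called by the apply pipeline when the task is done.
        public void run(boolean success, Object resp) {
            this.ok = success;
            this.response = resp;
            latch.countDown();
        }

        // Called by the submitting thread; blocks until run(...) fires.
        public boolean await() throws InterruptedException {
            latch.await();
            return ok;
        }

        public Object getResponse() { return response; }
    }
}
```

Submitting synchronously like this bounds the number of in-flight tasks per client thread, which is the back-pressure effect the user is after.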
I want to set the expectedTerm value before submitting the task.
Getting it before submission is meaningless; you can obtain the term in the onApply callback.
The guide says: "long expectedTerm = -1: the leader term expected at task submission time. If not provided (i.e., the default -1), it does not check whether the leader has changed before applying the task to the state machine. If provided (obtained from the state machine callback, see below), the term is checked before the task is applied, and the task is rejected on a mismatch." I am wondering whether the term-mismatch mechanism could be used to reject tasks, instead of throwing exceptions inside the state machine flow.
Regarding "apply-task-mode: Blocking":
Yes, I planned to use synchronous submission to reduce the pressure on the server. I saw this option in the configuration and wanted to try it.
The current term can be recorded by the leader in onLeaderStart and then used directly afterwards.
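Following that suggestion, a minimal sketch (the class and method names below are mine, not from jraft): record the term in the onLeaderStart callback, clear it in onLeaderStop, and read it when filling in a Task's expectedTerm.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: track the leader term via the state machine callbacks so the
// submitting side can set Task#expectedTerm without touching NodeImpl internals.
public class TermTracker {

    private static final long NO_TERM = -1;
    private final AtomicLong leaderTerm = new AtomicLong(NO_TERM);

    // Call from StateMachine#onLeaderStart(long term).
    public void onLeaderStart(long term) { leaderTerm.set(term); }

    // Call from StateMachine#onLeaderStop(Status status).
    public void onLeaderStop() { leaderTerm.set(NO_TERM); }

    // Use as the expectedTerm when submitting a Task; -1 disables the check.
    public long expectedTerm() { return leaderTerm.get(); }

    public boolean isLeader() { return leaderTerm.get() > 0; }
}
```

With this, a task submitted around a leader change carries a stale term and is rejected by the term check instead of reaching the state machine.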
2023-08-24 18:16:11.848 [JRaft-FSMCaller-Disruptor-0] [] DEBUG com.yss.henghe.platform.fs.LeaderTaskClosure - task onCommitted
2023-08-24 18:16:11.849 [JRaft-FSMCaller-Disruptor-0] [] INFO c.y.h.p.fs.storage.jraft.StorageStateMachine - On apply with term: 1 and index: 99.
2023-08-24 18:16:12.677 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.util.LogExceptionHandler - Handle LogManagerImpl disruptor event error, event is
com.alipay.sofa.jraft.storage.impl.LogManagerImpl$StableClosureEvent@53ddc044
java.lang.OutOfMemoryError: Java heap space
2023-08-24 18:16:12.840 [JRaft-Group-Default-Executor-7] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [0, 0] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:12.977 [JRaft-StepDownTimer-<group1/192.168.165.32:8261>0] [] WARN com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> steps down when alive nodes don't satisfy quorum, term=1, deadNodes=192.168.165.30:8261,192.168.165.31:8261, conf=192.168.165.30:8261,192.168.165.31:8261,192.168.165.32:8261.
2023-08-24 18:16:12.986 [JRaft-FSMCaller-Disruptor-0] [] INFO com.alipay.sofa.jraft.core.StateMachineAdapter - onLeaderStop: status=Status[ERAFTTIMEDOUT<10001>: Majority of the group dies: 2/3].
2023-08-24 18:16:13.088 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [101, 101] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [103, 104] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [105, 105] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-LogManager-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> append [111, 111] failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-08-24 18:16:13.089 [JRaft-FSMCaller-Disruptor-0] [] ERROR c.y.h.p.fs.storage.jraft.StorageStateMachine - fsm err: ERROR_TYPE_LOG
2023-08-24 18:16:13.090 [JRaft-FSMCaller-Disruptor-0] [] ERROR com.alipay.sofa.jraft.core.StateMachineAdapter - Encountered an error=Status[UNKNOWN<-1>: LogManager handle event error] on StateMachine com.yss.henghe.platform.fs.storage.jraft.StorageStateMachine, it's highly recommended to implement this method as raft stops working since some error occurs, you should figure out the cause and repair or remove this node.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 18:16:13.090 [JRaft-FSMCaller-Disruptor-0] [] WARN com.alipay.sofa.jraft.core.NodeImpl - Node <group1/192.168.165.32:8261> got error: {}.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 18:16:13.090 [JRaft-FSMCaller-Disruptor-0] [] WARN com.alipay.sofa.jraft.core.FSMCallerImpl - FSMCaller already in error status, ignore new error.
com.alipay.sofa.jraft.error.RaftException: ERROR_TYPE_LOG
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:573)
at com.alipay.sofa.jraft.storage.impl.LogManagerImpl.lambda$init$0(LogManagerImpl.java:205)
at com.alipay.sofa.jraft.util.LogExceptionHandler.handleEventException(LogExceptionHandler.java:66)
at com.lmax.disruptor.dsl.ExceptionHandlerWrapper.handleEventException(ExceptionHandlerWrapper.java:18)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:156)
at java.base/java.lang.Thread.run(Thread.java:829)
2023-08-24 18:16:13.097 [http-nio-8062-exec-94] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-73] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-95] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-81] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
2023-08-24 18:16:13.098 [http-nio-8062-exec-99] [] DEBUG com.yss.henghe.platform.fs.client.FsClient - invoke err: Leader stepped down
From the error logs: jraft hit an OOM while LogManagerImpl was handling a StableClosureEvent, the node got an EIO<1014> I/O error, the state machine reported ERROR_TYPE_LOG, and the leader stepped down. How should this be handled? Can the exception be caught and ignored in StateMachine.onError? (And then restart the node to bring the group member back to life?)
Oh, I see. killme2008 has already answered you.
It already OOMed; there is nothing more to look at. Avoid the OOM.
@killme2008 @fengjiachun 👍 👍 👍 Thanks for the help, this issue can be closed.
Your question
I implemented a distributed file system with JRaft, using the default bolt RPC framework and the default hessian codec. Large files are split into 10 MB chunks. Uploading small files performs well and is stable. During large-file stress tests I tuned various settings to avoid OOM as much as possible (it still happened occasionally), but the key problem is: when the system is under heavy load and the leader steps down, the client does not detect the leader-down state; it keeps issuing RPC calls, gets "not leader" errors, and the system enters an endlessly unavailable state. My questions are:
Your scenes
ClientSDK Code:
The JRaft configuration is as follows:
Environment
java -version: openJDK 11
uname -a: centos