data and closure of task maybe does not match，out-of-order

hanzhihua commented 4 years ago

Your question

When the application restarts, the following error occasionally appears:

2019-11-18 19:08:06.691 WARN leader [JRaft-NodeImpl-Disruptor-0]-c.a.s.j.storage.impl.LogManagerImpl.checkAndResolveConflict - Received entries of which the lastLog=0 is not greater than appliedIndex=315546172, return immediately with nothing changed.
2019-11-18 19:08:06.691 ERROR leader [JRaft-Closure-Executor-201]-com.alipay.sofa.jraft.core.NodeImpl.run - Node <archer_control/10.69.0.33:6024> append [0, 1] failed.

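After that, the relationship between closure and data seems to get mixed up. Looking at the code, writing to the ClosureQueue and writing the log happen in parallel. The replicator then fetches log entries from the LogManager to synchronize with the followers, after which onCommitted runs, and finally the state machine applies the entries.

It looks to me as if data and closure could get out of order during this process. The business errors we are seeing now also look like a mix-up. Please advise.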

Environment

JDK 1.8, 3-node cluster (jraft version 1.2.6), uname: 4.9.0-0.bpo.6-amd64 #1 SMP Debian 4.9.88-1+deb9u1~bpo8+1 (2018-05-13) x86_64 GNU/Linux

fengjiachun commented 4 years ago

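I was only pointing out a problem on your side here. As for the problem reported in this issue, until the sequence of events has been analyzed you can assume there is no error; we still need you to provide logs before going further.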

hanzhihua commented 4 years ago

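There is just one question now: is it normal for a node in the leader role to reach this logic branch? If it is not normal, I will add more logging and look for the problem. If it is normal, could the handling be improved?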

hanzhihua commented 4 years ago

The task/closure out-of-order problem I ran into most likely happens inside this logic.

fengjiachun commented 4 years ago

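> There is just one question now: is it normal for a node in the leader role to reach this logic branch? If it is not normal, I will add more logging and look for the problem. If it is normal, could the handling be improved?

Yes, the leader can reach this branch. That is completely normal.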

hanzhihua commented 4 years ago

When this branch is taken, the log entry is deleted but the ballot and closure are kept. Could that cause a problem?

fengjiachun commented 4 years ago

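My answer probably misled you, sorry. Both the leader and the followers go through the appendEntries method and reach this branch without any problem, and both also call checkAndResolveConflict. For the leader, however, checkAndResolveConflict never returns false; the leader always gets true.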

killme2008 commented 4 years ago

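The current code logic is fine. From the information you have given so far, it is impossible to tell where the problem is.

Did you actually observe the out-of-order behavior, or do you believe there is a problem from reading the code?

The order in which task closures are called back is not guaranteed to match the order in which the tasks were submitted. You need to be aware of this.

However, your current snapshot implementation will lead to data inconsistency. Make sure the "snapshot" is captured synchronously.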
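One way to read the advice above is: copy the state on the thread that calls the save hook, and only then hand the copy to a background thread for writing. The sketch below models that idea with plain JDK classes. It is not jraft code; `SnapshotCaptureSketch`, `apply`, and `saveSnapshot` are hypothetical names, and the "write" step is a stand-in for real file I/O. It also reports the outcome to the callback on both success and failure, which is an assumption about the desired behavior rather than something stated in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class SnapshotCaptureSketch {
    // Hypothetical in-memory state, mutated only by apply().
    private final Map<String, Long> state = new HashMap<>();

    /** Stand-in for applying a committed entry to the state machine. */
    public void apply(String key, long value) {
        state.put(key, value);
    }

    /**
     * Copies the state on the calling thread, then "writes" the copy in the
     * background. The done callback is invoked once, with true on success and
     * false on failure, before the returned future completes.
     */
    public CompletableFuture<Map<String, Long>> saveSnapshot(Consumer<Boolean> done) {
        // Synchronous capture: later apply() calls cannot leak into this copy.
        final Map<String, Long> frozen = new HashMap<>(state);
        return CompletableFuture
            .supplyAsync(() -> frozen) // stand-in for writing the copy to disk
            .whenComplete((written, error) -> done.accept(error == null));
    }

    public static void main(String[] args) {
        SnapshotCaptureSketch fsm = new SnapshotCaptureSketch();
        fsm.apply("a", 1L);
        AtomicBoolean ok = new AtomicBoolean();
        CompletableFuture<Map<String, Long>> saved = fsm.saveSnapshot(ok::set);
        fsm.apply("a", 2L); // happens after the capture, so it is not in the snapshot
        System.out.println(saved.join().get("a") + " " + ok.get());
    }
}
```

If the background thread read the live `state` map instead of `frozen`, the snapshot could contain entries applied after the snapshot point, which is the kind of inconsistency the comment warns about.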

fengjiachun commented 4 years ago

You can join the DingTalk group (23390449) to reach me, and send me the detailed logs through DingTalk.

fengjiachun commented 4 years ago

Hello,

options:
nodeId:
state: STATE_LEADER
term: 10
conf: ConfigurationEntry [id=LogId [index=330456742, term=10], conf=10.69.0.33:6024,10.69.0.34:6024,10.69.1.11:6024, oldConf=]
electionTimer: RepeatedTimer [timerTask=null, stopped=true, running=false, destroyed=false, invoking=false, timeoutMs=3000]
voteTimer: RepeatedTimer [timerTask=null, stopped=true, running=false, destroyed=false, invoking=false, timeoutMs=3000]
stepDownTimer: RepeatedTimer [timerTask=com.alipay.sofa.jraft.util.RepeatedTimer$1@5b611a2b, stopped=false, running=true, destroyed=false, invoking=false, timeoutMs=1500]
snapshotTimer: RepeatedTimer [timerTask=com.alipay.sofa.jraft.util.RepeatedTimer$1@17986965, stopped=false, running=true, destroyed=false, invoking=false, timeoutMs=3600000]
logManager:
  storage: [329708782, 330710565]
  diskId: LogId [index=330710565, term=10]
  appliedId: LogId [index=330710565, term=10]
  lastSnapshotId: LogId [index=330456742, term=10]
fsmCaller:
  StateMachine [Idle]
ballotBox:
  lastCommittedIndex: 330710565
  pendingIndex: 330710566
  pendingMetaQueueSize: 0
snapshotExecutor:
  lastSnapshotTerm: 10
  lastSnapshotIndex: 330456742
  term: 9
  savingSnapshot: false
  loadingSnapshot: false
  stopped: false
replicatorGroup:
  replicators: [Replicator [state=Replicate, statInfo=,peerId=10.69.0.33:6024], Replicator [state=Replicate, statInfo=,peerId=10.69.1.11:6024]]
  failureReplicators: []

We do use snapshots. The usage is shown below. It is much like the counter example; the only change is that if the snapshot fails, it just logs the error.
‘
public void onSnapshotSave(final SnapshotWriter writer, final Closure done) {
    log.warn("onSnapshotSave...");
    Utils.runInThread(() -> {
        final CNStateSnapshotFile snapshot = new CNStateSnapshotFile(writer.getPath() + File.separator + DATA_FILE_NAME);
        if (snapshot.save(cnStateInner)) {
            if (writer.addFile(DATA_FILE_NAME)) {
                done.run(Status.OK());
                return;
            } else {
                log.error("write addFile:{} occur error,and ignore closure", snapshot.getPath());
            }
        } else {
            log.error("snapshot.save:{} occur error,and ignore closure", snapshot.getPath());
        }
    });
}

@Override
public boolean onSnapshotLoad(final SnapshotReader reader) {
    log.warn("onSnapshotLoad...");
    if (isLeader()) {
        log.warn("Leader is not supposed to load snapshot");
        return false;
    }
    if (reader.getFileMeta(DATA_FILE_NAME) == null) {
        log.error("Fail to find data file in {}", reader.getPath());
        return false;
    }
    final CNStateSnapshotFile snapshot = new CNStateSnapshotFile(reader.getPath() + File.separator + DATA_FILE_NAME);
    try {
        CNState loadState = snapshot.load();
        if (loadState != null) {
            cnStateInner.load(loadState);
        }
        return true;
    } catch (final Exception e) {
        log.error("Fail to load snapshot from {}", snapshot.getPath());
        return false;
    }
}
’

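I also noticed that in the first line of the information you posted, nodeId is empty. What happened there? Did you post-process the output before posting it?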

hanzhihua commented 4 years ago

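@killme2008 "The order in which task closures are called back is not guaranteed to match the order in which the tasks were submitted": a Task contains data and a closure. So when a closure returns, it may have nothing to do with that data. Is that the right way to understand it?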

fengjiachun commented 4 years ago

Closing this for now since no further information can be provided. We can look at it again once there is more information.

killme2008 commented 4 years ago

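@hanzhihua What I mean is that if you submit tasks in the order 1, 2, 3, the order in which the corresponding closures are executed is not deterministic: it may be 1, 2, 3, or it may be 1, 3, 2. But a closure and its data always correspond to each other.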
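The distinction above (callback order may vary, but each closure stays bound to its own data) can be illustrated with a small model. This is a simplified sketch using plain JDK classes, not jraft's `Task` or `Closure`; `ClosureOrderSketch`, `SketchTask`, and `run` are hypothetical names, and the callback order is passed in explicitly instead of being produced by a real executor.

```java
import java.util.ArrayList;
import java.util.List;

public class ClosureOrderSketch {
    /** Hypothetical stand-in for a task: a payload plus the callback bound to it. */
    static final class SketchTask {
        final String data;
        final Runnable done;

        SketchTask(String data, Runnable done) {
            this.data = data;
            this.done = done;
        }
    }

    /**
     * Submits tasks 1, 2, 3 in that order, then runs their callbacks in the
     * given order. Returns what each callback observed, in execution order.
     */
    public static List<String> run(int[] callbackOrder) {
        List<String> observed = new ArrayList<>();
        List<SketchTask> submitted = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int id = i;
            final String data = "data-" + i;
            // Each callback captures its own data when the task is created.
            submitted.add(new SketchTask(data,
                () -> observed.add("closure-" + id + " saw " + data)));
        }
        for (int taskNumber : callbackOrder) {
            submitted.get(taskNumber - 1).done.run();
        }
        return observed;
    }

    public static void main(String[] args) {
        // Callbacks fire as 1, 3, 2 even though submission was 1, 2, 3.
        System.out.println(run(new int[] {1, 3, 2}));
    }
}
```

Application code that needs results in submission order would therefore have to carry its own sequence number in the data or the closure rather than rely on callback order.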

killme2008 commented 4 years ago

@hanzhihua You can add logging around these two lines, build a package from the 1.2.6 branch, and if the error shows up again later, check these two places:

https://github.com/sofastack/sofa-jraft/blob/47b88d72a71b8954b62a2a0c68f1380bf2643c8e/jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java#L1159

https://github.com/sofastack/sofa-jraft/blob/5bbce20b3667f08bfb14c5857dba262c83850e64/jraft-core/src/main/java/com/alipay/sofa/jraft/storage/snapshot/SnapshotExecutorImpl.java#L263

hanzhihua commented 4 years ago

@killme2008 Added, thanks.

sofastack / sofa-jraft

data and closure of task maybe does not match，out-of-order #341
