RheaKV Read-index timeout

yuyang0423 commented 4 years ago

Describe the bug

In a Three nodes env, after previous leader panic, new leader has highly probability got RheaKV Read-index timeout onLeaderStart.

but I print node metric read-index didn't have max over default 2 sec.

Actual behavior

2020-08-05 12:54:48.065 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:62 - onLeaderStart: term=2.
2020-08-05 12:54:48.068 [JRaft-FSMCaller-Disruptor-0] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-raft-state-trigger #2,5,main].
2020-08-05 12:54:48.094 [Rpc-netty-server-worker-10-thread-4] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #0,5,main].
2020-08-05 12:54:48.107 [Rpc-netty-server-worker-10-thread-3] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #1,5,main].
2020-08-05 12:54:48.111 [Rpc-netty-server-worker-10-thread-4] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #1,5,main].
2020-08-05 12:54:48.120 [Rpc-netty-server-worker-10-thread-4] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #2,5,main].
2020-08-05 12:54:48.122 [Rpc-netty-server-worker-10-thread-4] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-cli-rpc-executor #3,5,main].
2020-08-05 12:54:48.153 [Rpc-netty-server-worker-10-thread-3] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #2,5,main].
2020-08-05 12:54:48.154 [Rpc-netty-server-worker-10-thread-3] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-kv-store-rpc-executor #3,5,main].
2020-08-05 12:54:49.074 [read-index.timeout.scanner0] INFO  NamedThreadFactory:82 - Creates new Thread[rheakv-read-index-callback #1,5,main].
2020-08-05 12:54:49.562 [JRaft-StepDownTimer-<$GOUR_NAME--1/192.168.80.168:8082>0] ERROR AbstractClientService:148 - Fail to connect 192.168.80.166:8082, remoting exception: com.alipay.remoting.exception.RemotingException: Create connection failed. The address is 192.168.80.166:8082.
2020-08-05 12:54:49.562 [JRaft-StepDownTimer-<$GOUR_NAME--1/192.168.80.168:8082>0] ERROR Replicator:804 - Fail to init sending channel to 192.168.80.166:8082.
2020-08-05 12:54:49.562 [JRaft-StepDownTimer-<$GOUR_NAME--1/192.168.80.168:8082>0] ERROR ReplicatorGroupImpl:129 - Fail to start replicator to peer=192.168.80.166:8082, replicatorType=Follower.
2020-08-05 12:54:49.563 [rheakv-read-index-callback #1] WARN  RaftRawKVStore:83 - Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine.
2020-08-05 12:54:49.573 [JRaft-FSMCaller-Disruptor-0] ERROR RocksRawKVStore:255 - Fail to [GET], key: [2F66656465726174696F6E2F6E6F6465], java.lang.IllegalStateException: Should not send rpc response repeatedly!
    at com.alipay.remoting.rpc.protocol.RpcAsyncContext.sendResponse(RpcAsyncContext.java:65)
    at com.alipay.sofa.jraft.rpc.impl.BoltRpcServer$2$1.sendResponse(BoltRpcServer.java:100)
    at com.alipay.sofa.jraft.rhea.RequestProcessClosure.run(RequestProcessClosure.java:64)
    at com.alipay.sofa.jraft.rhea.RequestProcessClosure.sendResponse(RequestProcessClosure.java:56)
    at com.alipay.sofa.jraft.rhea.DefaultRegionKVService$10.run(DefaultRegionKVService.java:383)
    at com.alipay.sofa.jraft.rhea.storage.MetricsKVClosureAdapter.run(MetricsKVClosureAdapter.java:93)
    at com.alipay.sofa.jraft.rhea.storage.KVClosureAdapter.run(KVClosureAdapter.java:72)
    at com.alipay.sofa.jraft.rhea.storage.BaseRawKVStore.setSuccess(BaseRawKVStore.java:179)
    at com.alipay.sofa.jraft.rhea.storage.RocksRawKVStore.get(RocksRawKVStore.java:253)
    at com.alipay.sofa.jraft.rhea.storage.BaseRawKVStore.get(BaseRawKVStore.java:42)
    at com.alipay.sofa.jraft.rhea.storage.BatchRawKVStore.batchGet(BatchRawKVStore.java:111)
    at com.alipay.sofa.jraft.rhea.storage.KVStoreStateMachine.batchApply(KVStoreStateMachine.java:189)
    at com.alipay.sofa.jraft.rhea.storage.KVStoreStateMachine.batchApplyAndRecycle(KVStoreStateMachine.java:148)
    at com.alipay.sofa.jraft.rhea.storage.KVStoreStateMachine.onApply(KVStoreStateMachine.java:117)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:539)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:508)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:440)
    at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:148)
    at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
    at java.lang.Thread.run(Thread.java:745

Steps to reproduce

setup a three node jraft federation, using rheaKV;
panic leader node with sysrq;
this problem will occur on new leader;

If not, panic new leader continuously.

Environment

SOFAJRaft version: v1.3.3
JVM version (e.g. java -version): java 1.8
OS version (e.g. uname -a): CentOS 7 1908 (kernel 5.4.22)
Maven version: 3

fengjiachun commented 4 years ago

Fail to connect 192.168.80.166:8082,

read-index 请求需要 leader 向每个节点发送一个心跳，包括 panic 掉的节点，在 1.3.3 版本使用的底层 bolt 通信框架是阻塞建连的，可能导致 read-index 超时，建议升级到 1.3.4，在 1.3.4 中已经解决这个问题，详情见 #489

yuyang0423 commented 4 years ago

Fail to connect 192.168.80.166:8082,

read-index 请求需要 leader 向每个节点发送一个心跳，包括 panic 掉的节点，在 1.3.3 版本使用的底层 bolt 通信框架是阻塞建连的，可能导致 read-index 超时，建议升级到 1.3.4，在 1.3.4 中已经解决这个问题，详情见 #489

Got it, will test new version and let you know, thanks again.

yuyang0423 commented 4 years ago

Problem solved in v1.3.4, thanks.

sofastack / sofa-jraft