sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0
3.57k stars 1.14k forks source link

RheaKV read-index request timeout #529

Closed joer1900 closed 3 years ago

joer1900 commented 3 years ago

问题描述

3节点集群运行一段时间后会出现以下问题:

[20/10/26 19:30:37:582][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:582][ WARN][rheakv-read-index-callback #25][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:583][ WARN][rheakv-read-index-callback #25][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:583][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$2.lambda$run$0(RaftRawKVStore.java:117)] Fail to [multiGet] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:584][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:584][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$2.lambda$run$0(RaftRawKVStore.java:117)] Fail to [multiGet] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:584][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:584][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:585][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:748][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:878][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:37:992][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[ETIMEDOUT<1010>: read-index request timeout], try to applying to the state machine. [20/10/26 19:30:38:086][ WARN][rheakv-read-index-callback #17][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[UNKNOWN<-1>: Fail to run ReadIndex task, maybe the leader stepped down.], try to applying to the state machine. [20/10/26 19:30:38:086][ WARN][rheakv-read-index-callback #3][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$2.lambda$run$0(RaftRawKVStore.java:117)] Fail to [multiGet] with 'ReadIndex': Status[UNKNOWN<-1>: Fail to run ReadIndex task, maybe the leader stepped down.], try to applying to the state machine. [20/10/26 19:30:38:086][ WARN][rheakv-read-index-callback #26][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[UNKNOWN<-1>: Fail to run ReadIndex task, maybe the leader stepped down.], try to applying to the state machine. [20/10/26 19:30:38:086][ WARN][rheakv-read-index-callback #31][com.alipay.sofa.jraft.rhea.storage.RaftRawKVStore$1.lambda$run$0(RaftRawKVStore.java:83)] Fail to [get] with 'ReadIndex': Status[UNKNOWN<-1>: Fail to run ReadIndex task, maybe the leader stepped down.], try to applying to the state machine.

监控日志: INFO | jvm 1 | 2020/10/26 19:30:22 | rhea-rpc-request-timer_-1GET INFO | jvm 1 | 2020/10/26 19:30:22 | count = 2041312 INFO | jvm 1 | 2020/10/26 19:30:22 | mean rate = 57.96 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 1-minute rate = 59.53 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 5-minute rate = 59.82 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 15-minute rate = 59.66 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | min = 0.23 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | max = 47.88 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | mean = 1.31 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | stddev = 2.65 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | median = 0.52 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 75% <= 0.79 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 95% <= 5.30 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 98% <= 9.32 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99% <= 12.20 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99.9% <= 20.22 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | rhea-rpc-request-timer-1_MULTIGET INFO | jvm 1 | 2020/10/26 19:30:22 | count = 77016 INFO | jvm 1 | 2020/10/26 19:30:22 | mean rate = 2.19 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 1-minute rate = 1.79 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 5-minute rate = 1.62 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 15-minute rate = 1.65 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | min = 0.29 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | max = 31.78 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | mean = 1.42 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | stddev = 2.30 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | median = 0.64 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 75% <= 0.86 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 95% <= 6.10 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 98% <= 8.86 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99% <= 10.36 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99.9% <= 16.63 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | rhea-rpc-request-timer-1PUT INFO | jvm 1 | 2020/10/26 19:30:22 | count = 486612 INFO | jvm 1 | 2020/10/26 19:30:22 | mean rate = 13.85 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 1-minute rate = 14.43 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 5-minute rate = 14.48 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 15-minute rate = 14.46 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | min = 2.21 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | max = 64.79 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | mean = 7.00 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | stddev = 5.82 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | median = 5.27 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 75% <= 8.22 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 95% <= 17.14 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 98% <= 24.51 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99% <= 30.35 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99.9% <= 55.83 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | rhea-rpc-request-timer-1_PUT_LIST INFO | jvm 1 | 2020/10/26 19:30:22 | count = 8826 INFO | jvm 1 | 2020/10/26 19:30:22 | mean rate = 0.25 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 1-minute rate = 0.08 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 5-minute rate = 0.05 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | 15-minute rate = 0.04 calls/second INFO | jvm 1 | 2020/10/26 19:30:22 | min = 2.42 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | max = 2022.48 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | mean = 13.43 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | stddev = 12.82 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | median = 10.87 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 75% <= 15.11 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 95% <= 32.95 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 98% <= 61.94 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99% <= 61.94 milliseconds INFO | jvm 1 | 2020/10/26 19:30:22 | 99.9% <= 61.94 milliseconds

INFO | jvm 1 | 2020/10/26 19:30:42 | rhea-db-timer_BACKUP_DB INFO | jvm 1 | 2020/10/26 19:30:42 | count = 10 INFO | jvm 1 | 2020/10/26 19:30:42 | mean rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 1-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 5-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 15-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | min = 151.20 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | max = 7889.61 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | mean = 234.37 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | stddev = 0.00 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | median = 234.37 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 75% <= 234.37 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 95% <= 234.37 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 98% <= 234.37 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 99% <= 234.37 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 99.9% <= 234.37 milliseconds

INFO | jvm 1 | 2020/10/26 19:30:42 | rhea-db-timer_GET INFO | jvm 1 | 2020/10/26 19:30:42 | count = 1648 INFO | jvm 1 | 2020/10/26 19:30:42 | mean rate = 0.05 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 1-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 5-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 15-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | min = 0.00 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | max = 0.76 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | mean = 0.02 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | stddev = 0.03 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | median = 0.01 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 75% <= 0.02 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 95% <= 0.03 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 98% <= 0.05 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 99% <= 0.07 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 99.9% <= 0.76 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | rhea-db-timer_MULTI_GET INFO | jvm 1 | 2020/10/26 19:30:42 | count = 197 INFO | jvm 1 | 2020/10/26 19:30:42 | mean rate = 0.01 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 1-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 5-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | 15-minute rate = 0.00 calls/second INFO | jvm 1 | 2020/10/26 19:30:42 | min = 0.02 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | max = 407.59 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | mean = 0.08 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | stddev = 0.03 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | median = 0.07 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 75% <= 0.09 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 95% <= 0.14 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 98% <= 0.16 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 99% <= 0.16 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | 99.9% <= 0.16 milliseconds INFO | jvm 1 | 2020/10/26 19:30:42 | rhea-db-timer_PUT

每过一段时间耗时也会变长,最长耗时约在4s多,

场景

使用RheaKV做写的一个简单的注册中心,用来存储应用信息。

配置如下:

RheaKVStoreOptions

clusterId: 1 clusterName: rhea_test placementDriverOptions: fake: true storeEngineOptions: memoryDBOptions: keysPerSegment: 4096 commonNodeOptions: electionTimeoutMs: 5000 raftDataPath: rhea_raft/ serverAddress: ip: 192.168.133.69 port: 8200 initialServerList: 192.168.133.69:8400,192.168.133.70:8200,192.168.133.71:8300 onlyLeaderRead: true failoverRetries: 2

Environment

fengjiachun commented 3 years ago

怀疑是 gc 导致,看一下 gc 日志,或者改用 rocksdb 做存储,不要用 memory

bobbylv commented 3 years ago

用rocksdb 也是会频繁出现这问题

fengjiachun commented 3 years ago

信息不够,先关闭了