sofastack / sofa-jraft

A production-grade Java implementation of the RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

jraft performance optimization inquiry #647

Closed tsgmq closed 3 years ago

tsgmq commented 3 years ago

Your question

We are already using the disruptor and batching, and the asynchronous timings are as expected, but the CompletableFuture completion still takes about 2.4 ms. There are three machines A, B and C: ping latency between A and B is 0.03 ms, and between A or B and C it is 0.7 ms; A is currently the leader. For a workload with modest concurrency (under 5,000 per node) where latency has to be squeezed to the minimum, which parameters can be tuned?

Environment info: cpu cores: 8 / 16, Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz

Linux ip-10-121-6-86.ap-southeast-1.compute.internal 4.14.232-176.381.amzn2.x86_64 #1 SMP Wed May 19 00:31:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
MemTotal: 31889560 kB

java version "16.0.1" 2021-04-20 Java(TM) SE Runtime Environment (build 16.0.1+9-24) Java HotSpot(TM) 64-Bit Server VM (build 16.0.1+9-24, mixed mode, sharing)

killme2008 commented 3 years ago

"The CompletableFuture completion still takes about 2.4 ms" -- what does that mean?

killme2008 commented 3 years ago

A concurrency of 5,000 on a single node is already quite high. Is the 5,000 a QPS figure? Is write latency what you are chasing?

Please describe your scenario and supporting information more fully.

tsgmq commented 3 years ago

"The CompletableFuture completion still takes about 2.4 ms" -- what does that mean?

long start = System.nanoTime();
this.tradeServer.createOrder(requestEntry, baseResultCompletableFuture);
baseResultCompletableFuture.thenAccept((baseResult) -> {
    long end = System.nanoTime();
    String content = "desc" + userId + ", real latency with raft = " + (end - start) + " ns";
    System.out.println(content);
});

createOrder submits the request to the disruptor, which then calls applyTask to go through raft. The 2.4 ms is the time from start until thenAccept fires. We only care about getting write latency as low as possible; how should the parameters be configured? 5,000 is the peak write rate; normally writes run at about 1,000 TPS.

tsgmq commented 3 years ago

A single request is about 500 bytes (request.size ≈ 500).

killme2008 commented 3 years ago

You can follow the documentation and enable metrics to look at the various indicators:

https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/

See Section 8, Metrics monitoring, and Section 9, performance optimization suggestions.
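
For reference, turning that on is a NodeOptions switch plus a standard Dropwizard reporter. A minimal sketch, assuming NodeOptions#setEnableMetrics and Node#getNodeMetrics as described in the user guide; the reporter part is plain metrics-core and the helper class name is made up:

import java.util.concurrent.TimeUnit;

import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.option.NodeOptions;
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;

public final class MetricsSetup {

    // Call before the node is created and started.
    public static NodeOptions withMetrics(NodeOptions opts) {
        opts.setEnableMetrics(true); // enables the append-logs / replicate-entries / fsm-* metrics
        return opts;
    }

    // Dump the node's metric registry to stdout once a minute.
    public static ConsoleReporter startConsoleReporter(Node node) {
        MetricRegistry registry = node.getNodeMetrics().getMetricRegistry();
        if (registry == null) {
            throw new IllegalStateException("metrics are not enabled on this node");
        }
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
            .convertRatesTo(TimeUnit.SECONDS)
            .convertDurationsTo(TimeUnit.MILLISECONDS)
            .build();
        reporter.start(60, TimeUnit.SECONDS);
        return reporter;
    }
}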

tsgmq commented 3 years ago

Here is an optimization tip: the done closure obtained on the leader can be wrapped in an extended closure class that carries the not-yet-serialized user request, so the processing logic can read the user request directly from the closure instead of deserializing it from data, which reduces the leader's CPU overhead. See the counter example for details.

I optimized along these lines and the latency is now down to about 1.8 ms.
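
A minimal sketch of that wrapper closure, with the request type left generic; the counter example in the jraft repository is the authoritative version:

import com.alipay.sofa.jraft.Closure;
import com.alipay.sofa.jraft.Status;

// Wrapper closure that carries the not-yet-serialized user request of type T.
public class RequestClosure<T> implements Closure {

    private final T       request;  // the original, un-serialized business request
    private final Closure userDone; // the caller's completion callback

    public RequestClosure(T request, Closure userDone) {
        this.request = request;
        this.userDone = userDone;
    }

    public T getRequest() {
        return this.request;
    }

    @Override
    public void run(Status status) {
        if (this.userDone != null) {
            this.userDone.run(status);
        }
    }
}

On the leader, Iterator#done() inside onApply returns this closure, so the request can be read via getRequest() without touching the serialized payload; on followers done() is null and the data still has to be decoded from getData().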

tsgmq commented 3 years ago

You can follow the documentation and enable metrics to look at the various indicators:

https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/

See Section 8, Metrics monitoring, and Section 9, performance optimization suggestions.

Boss, two more newbie questions. If the goal is optimal write performance, are the following two directions worth considering? 1. In onApply, submit the state-machine mutation to an asynchronous thread for execution. 2. Could leader-follower communication be switched to UDP for fast broadcast?

killme2008 commented 3 years ago
  1. Doing that would introduce data loss and inconsistency problems. If the entries are independent of each other, concurrent apply is possible.
  2. No. Reliable communication is required.

tsgmq commented 3 years ago
  1. Doing that would introduce data loss and inconsistency problems. If the entries are independent of each other, concurrent apply is possible.
  2. No. Reliable communication is required.

The business requires that operations for the same uid be processed serially, while different uids can be processed in parallel. I wrote a ShardThreadPoolExecutor; could you help evaluate whether onApply can hand its work to this pool? Each element of the array is a single-thread executor, so operations with the same uid are guaranteed to queue up and execute in order.

import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import javax.annotation.PreDestroy;

public class ShardThreadPoolExecutor {

    private final int parallelism = Runtime.getRuntime().availableProcessors();

    // One single-thread executor per shard: tasks with the same sharding key
    // run sequentially, different shards run in parallel.
    private final ExecutorService[] executors = new ExecutorService[parallelism];

    {
        for (int i = 0; i < parallelism; i++) {
            executors[i] = new ThreadPoolExecutor(1, 1, 60L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        }
    }

    public <V> Future<V> submitByShardingKey(long shardingKey, Callable<V> callable) {
        // floorMod keeps the index in [0, parallelism) even for negative keys,
        // avoiding the overflow of the original (int) cast before the modulo.
        return executors[(int) Math.floorMod(shardingKey, (long) parallelism)].submit(callable);
    }

    @PreDestroy
    public void shutDown() {
        Arrays.stream(executors).forEach(ExecutorService::shutdown);
    }
}
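
A hypothetical usage sketch (the uid value is a placeholder): tasks sharing a sharding key run in submission order, while different keys can run in parallel.

import java.util.concurrent.Future;

public class ShardExecutorDemo {

    public static void main(String[] args) throws Exception {
        final ShardThreadPoolExecutor shardExecutor = new ShardThreadPoolExecutor();
        final long uid = 10086L; // placeholder sharding key
        // Tasks with the same uid land on the same single-thread executor and run in
        // submission order; tasks with different uids may run on other executors in parallel.
        final Future<Boolean> result = shardExecutor.submitByShardingKey(uid, () -> Boolean.TRUE);
        System.out.println("applied: " + result.get());
        shardExecutor.shutDown();
    }
}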

fengjiachun commented 3 years ago

You can follow the documentation and enable metrics to look at the various indicators https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/ Section 8, Metrics monitoring, and Section 9, performance optimization suggestions.

Boss, two more newbie questions. If the goal is optimal write performance, are the following two directions worth considering? 1. In onApply, submit the state-machine mutation to an asynchronous thread for execution. 2. Could leader-follower communication be switched to UDP for fast broadcast?

  1. Async won't work. The reliable approaches are batching and parallelism (the entries must have no dependencies between them); for something more advanced, see the idea in #614 and specialize it for your business inside your own state machine.
  2. You would have to extend and implement jraft's communication layer yourself. UDP itself is unreliable, so you would have to guarantee reliability at a higher layer; my guess is it would not be worth much in practice.

tsgmq commented 3 years ago

You can follow the documentation and enable metrics to look at the various indicators https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/ Section 8, Metrics monitoring, and Section 9, performance optimization suggestions.

Boss, two more newbie questions. If the goal is optimal write performance, are the following two directions worth considering? 1. In onApply, submit the state-machine mutation to an asynchronous thread for execution. 2. Could leader-follower communication be switched to UDP for fast broadcast?

  1. Async won't work. The reliable approaches are batching and parallelism (the entries must have no dependencies between them); for something more advanced, see the idea in A high-throughput parallelized state machine #614 and specialize it for your business inside your own state machine.
  2. You would have to extend and implement jraft's communication layer yourself. UDP itself is unreliable, so you would have to guarantee reliability at a higher layer; my guess is it would not be worth much in practice.

Batching is already in place, following the KV store's approach: once the size in the disruptor reaches a threshold we switch to batch processing. Overall throughput is fine, but for the business the clock starts when the request enters the service and only stops when real processing finishes, i.e. at closure.setData(data); closure.run(Status.OK()). So we want to further reduce the processing time. Within a single List batch we already use the ShardThreadPoolExecutor above for multi-threaded processing, and a CountDownLatch blocks so that one element of the onApply Iterator is fully processed before the next one, which preserves order. My question: can all of this be handed to an asynchronous thread pool, i.e. each onApply element is submitted to the pool and the pool guarantees that different log entries for the same uid still execute in order?

tsgmq commented 3 years ago

You can follow the documentation and enable metrics to look at the various indicators:

https://www.sofastack.tech/projects/sofa-jraft/jraft-user-guide/

See Section 8, Metrics monitoring, and Section 9, performance optimization suggestions.

Boss, could you take a look and point out where else there is still room to optimize?

append-logs-bytes count = 12027 min = 369 max = 1126 mean = 383.30 stddev = 67.34 median = 373.00 75% <= 377.00 95% <= 378.00 98% <= 378.00 99% <= 749.00 99.9% <= 1126.00
append-logs-count count = 12027 min = 1 max = 3 mean = 1.02 stddev = 0.17 median = 1.00 75% <= 1.00 95% <= 1.00 98% <= 1.00 99% <= 2.00 99.9% <= 3.00
fsm-apply-tasks-count count = 12058 min = 0 max = 2 mean = 0.04 stddev = 0.20 median = 0.00 75% <= 0.00 95% <= 0.00 98% <= 1.00 99% <= 1.00 99.9% <= 1.00
handle-read-index-entries count = 709 min = 1 max = 1 mean = 1.00 stddev = 0.00 median = 1.00 75% <= 1.00 95% <= 1.00 98% <= 1.00 99% <= 1.00 99.9% <= 1.00
replicate-entries-bytes count = 24105 min = 369 max = 754 mean = 376.74 stddev = 31.22 median = 373.00 75% <= 377.00 95% <= 378.00 98% <= 378.00 99% <= 378.00 99.9% <= 754.00
replicate-entries-count count = 24105 min = 1 max = 2 mean = 1.01 stddev = 0.07 median = 1.00 75% <= 1.00 95% <= 1.00 98% <= 1.00 99% <= 1.00 99.9% <= 2.00
replicate-inflights-count count = 24108 min = 1 max = 4 mean = 1.10 stddev = 0.38 median = 1.00 75% <= 1.00 95% <= 2.00 98% <= 2.00 99% <= 3.00 99.9% <= 4.00

-- <trade/...:7788> -- Timers ----------------------------------------------------------------------
append-logs count = 12027 mean rate = 15.16 calls/second 1-minute rate = 7.45 calls/second 5-minute rate = 7.32 calls/second 15-minute rate = 7.13 calls/second min = 0.00 milliseconds max = 12.00 milliseconds mean = 1.26 milliseconds stddev = 0.66 milliseconds median = 1.00 milliseconds 75% <= 1.00 milliseconds 95% <= 2.00 milliseconds 98% <= 2.00 milliseconds 99% <= 3.00 milliseconds 99.9% <= 8.00 milliseconds
fsm-apply-tasks count = 12058 mean rate = 16.61 calls/second 1-minute rate = 8.14 calls/second 5-minute rate = 7.50 calls/second 15-minute rate = 7.29 calls/second min = 0.00 milliseconds max = 9.00 milliseconds mean = 0.23 milliseconds stddev = 0.48 milliseconds median = 0.00 milliseconds 75% <= 0.00 milliseconds 95% <= 1.00 milliseconds 98% <= 1.00 milliseconds 99% <= 1.00 milliseconds 99.9% <= 1.00 milliseconds
fsm-commit count = 12059 mean rate = 15.20 calls/second 1-minute rate = 7.57 calls/second 5-minute rate = 7.36 calls/second 15-minute rate = 7.15 calls/second min = 0.00 milliseconds max = 9.00 milliseconds mean = 0.24 milliseconds stddev = 0.48 milliseconds median = 0.00 milliseconds 75% <= 0.00 milliseconds 95% <= 1.00 milliseconds 98% <= 1.00 milliseconds 99% <= 1.00 milliseconds 99.9% <= 1.00 milliseconds
fsm-snapshot-load count = 1 mean rate = 0.00 calls/second 1-minute rate = 0.00 calls/second 5-minute rate = 0.01 calls/second 15-minute rate = 0.08 calls/second min = 263.00 milliseconds max = 263.00 milliseconds mean = 263.00 milliseconds stddev = 0.00 milliseconds median = 263.00 milliseconds 75% <= 263.00 milliseconds 95% <= 263.00 milliseconds 98% <= 263.00 milliseconds 99% <= 263.00 milliseconds 99.9% <= 263.00 milliseconds
fsm-snapshot-save count = 17 mean rate = 0.02 calls/second 1-minute rate = 0.01 calls/second 5-minute rate = 0.03 calls/second 15-minute rate = 0.10 calls/second min = 0.00 milliseconds max = 1.00 milliseconds mean = 0.00 milliseconds stddev = 0.00 milliseconds median = 0.00 milliseconds 75% <= 0.00 milliseconds 95% <= 0.00 milliseconds 98% <= 0.00 milliseconds 99% <= 0.00 milliseconds 99.9% <= 0.00 milliseconds
handle-read-index count = 709 mean rate = 1.73 calls/second 1-minute rate = 0.06 calls/second 5-minute rate = 3.14 calls/second 15-minute rate = 6.40 calls/second min = 0.00 milliseconds max = 1.00 milliseconds mean = 0.04 milliseconds stddev = 0.20 milliseconds median = 0.00 milliseconds 75% <= 0.00 milliseconds 95% <= 0.00 milliseconds 98% <= 1.00 milliseconds 99% <= 1.00 milliseconds 99.9% <= 1.00 milliseconds
pre-vote count = 1 mean rate = 0.00 calls/second 1-minute rate = 0.00 calls/second 5-minute rate = 0.01 calls/second 15-minute rate = 0.08 calls/second min = 16.00 milliseconds max = 16.00 milliseconds mean = 16.00 milliseconds stddev = 0.00 milliseconds median = 16.00 milliseconds 75% <= 16.00 milliseconds 95% <= 16.00 milliseconds 98% <= 16.00 milliseconds 99% <= 16.00 milliseconds 99.9% <= 16.00 milliseconds
replicate-entries count = 24105 mean rate = 30.39 calls/second 1-minute rate = 15.11 calls/second 5-minute rate = 14.70 calls/second 15-minute rate = 14.29 calls/second min = 1.00 milliseconds max = 21.00 milliseconds mean = 1.96 milliseconds stddev = 1.35 milliseconds median = 2.00 milliseconds 75% <= 2.00 milliseconds 95% <= 3.00 milliseconds 98% <= 3.00 milliseconds 99% <= 9.00 milliseconds 99.9% <= 21.00 milliseconds
request-vote count = 1 mean rate = 0.00 calls/second 1-minute rate = 0.00 calls/second 5-minute rate = 0.01 calls/second 15-minute rate = 0.08 calls/second min = 11.00 milliseconds max = 11.00 milliseconds mean = 11.00 milliseconds stddev = 0.00 milliseconds median = 11.00 milliseconds 75% <= 11.00 milliseconds 95% <= 11.00 milliseconds 98% <= 11.00 milliseconds 99% <= 11.00 milliseconds 99.9% <= 11.00 milliseconds
save-raft-meta count = 1 mean rate = 0.00 calls/second 1-minute rate = 0.00 calls/second 5-minute rate = 0.01 calls/second 15-minute rate = 0.08 calls/second min = 5.00 milliseconds max = 5.00 milliseconds mean = 5.00 milliseconds stddev = 0.00 milliseconds median = 5.00 milliseconds 75% <= 5.00 milliseconds 95% <= 5.00 milliseconds 98% <= 5.00 milliseconds 99% <= 5.00 milliseconds 99.9% <= 5.00 milliseconds
truncate-log-prefix count = 27 mean rate = 0.03 calls/second 1-minute rate = 0.04 calls/second 5-minute rate = 0.05 calls/second 15-minute rate = 0.10 calls/second min = 1.00 milliseconds max = 2.00 milliseconds mean = 1.28 milliseconds stddev = 0.45 milliseconds median = 1.00 milliseconds 75% <= 2.00 milliseconds 95% <= 2.00 milliseconds 98% <= 2.00 milliseconds 99% <= 2.00 milliseconds 99.9% <= 2.00 milliseconds

tsgmq commented 3 years ago

-- Timers ----------------------------------------------------------------------
scheduledThreadPool.JRaft-Node-ScheduleThreadPool count = 15691 mean rate = 19.79 calls/second 1-minute rate = 19.86 calls/second 5-minute rate = 19.20 calls/second 15-minute rate = 16.15 calls/second min = 0.00 milliseconds max = 0.02 milliseconds mean = 0.01 milliseconds stddev = 0.00 milliseconds median = 0.01 milliseconds 75% <= 0.01 milliseconds 95% <= 0.01 milliseconds 98% <= 0.01 milliseconds 99% <= 0.01 milliseconds 99.9% <= 0.02 milliseconds
threadPool.JRAFT_CLOSURE_EXECUTOR count = 24179 mean rate = 30.07 calls/second 1-minute rate = 14.80 calls/second 5-minute rate = 14.68 calls/second 15-minute rate = 14.24 calls/second min = 0.01 milliseconds max = 0.14 milliseconds mean = 0.02 milliseconds stddev = 0.01 milliseconds median = 0.02 milliseconds 75% <= 0.03 milliseconds 95% <= 0.04 milliseconds 98% <= 0.05 milliseconds 99% <= 0.06 milliseconds 99.9% <= 0.13 milliseconds
threadPool.JRAFT_RPC_CLOSURE_EXECUTOR count = 15691 mean rate = 19.79 calls/second 1-minute rate = 19.86 calls/second 5-minute rate = 19.20 calls/second 15-minute rate = 16.15 calls/second min = 0.01 milliseconds max = 0.05 milliseconds mean = 0.01 milliseconds stddev = 0.00 milliseconds median = 0.01 milliseconds 75% <= 0.02 milliseconds 95% <= 0.02 milliseconds 98% <= 0.02 milliseconds 99% <= 0.03 milliseconds 99.9% <= 0.04 milliseconds
threadPool.JRaft-RPC-Processor count = 2 mean rate = 0.00 calls/second 1-minute rate = 0.00 calls/second 5-minute rate = 0.03 calls/second 15-minute rate = 0.17 calls/second min = 12.43 milliseconds max = 15.40 milliseconds mean = 13.91 milliseconds stddev = 1.49 milliseconds median = 15.40 milliseconds 75% <= 15.40 milliseconds 95% <= 15.40 milliseconds 98% <= 15.40 milliseconds 99% <= 15.40 milliseconds 99.9% <= 15.40 milliseconds

tsgmq commented 3 years ago
  1. batch and parallelism

Sir, batch and parallelism are both fully implemented now. A high-throughput parallelized state machine #614 has to complicate the problem to stay general-purpose; since I already know the exact ordering between tasks, I use 10 disruptors to receive tasks, guarantee that tasks with dependencies queue up in the same disruptor, and then combine batching with parallelism. On the currently released version, is there any way to let the contents of the 10 disruptors be applied in parallel in onApply? The current single-threaded onApply does not fully exploit the 16 cores, so I think there is still room for further optimization.

hzh0425 commented 3 years ago

@tsgmq I think the hard part of parallelization is detecting the data dependencies between batches; if you already know the execution order of the tasks, pretty much any implementation will work.

tsgmq commented 3 years ago

@tsgmq I think the hard part of parallelization is detecting the data dependencies between batches; if you already know the execution order of the tasks, pretty much any implementation will work.

That is true. I could modify the open-source code to meet my needs, but I want to stay in step with jraft releases so that merging future features does not become difficult, and our business does know the ordering. I would like to know whether this can be done on the current 1.3.7 release. We can already call apply(Task) from multiple threads; what I am not sure about is whether AppendEntries can also be made multi-threaded (multiple threads for the same Replicator).

killme2008 commented 3 years ago

I don't quite follow: parallel apply can simply be done inside the state machine. Build a fork-join model and you are done; there is no need to change jraft code. AppendEntries must not be reordered, so that is not something you can change either.
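
For illustration, a minimal sketch of such a fork-join onApply, reusing the ShardThreadPoolExecutor and RequestClosure sketched earlier in this thread; OrderRequest, applyToState and decode are placeholders, and the error handling is deliberately simplified:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

import com.alipay.sofa.jraft.Closure;
import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Status;
import com.alipay.sofa.jraft.core.StateMachineAdapter;

public class ParallelOrderStateMachine extends StateMachineAdapter {

    // Placeholder request type carrying the sharding key (uid).
    public static class OrderRequest {
        private final long uid;
        public OrderRequest(long uid) { this.uid = uid; }
        public long getUid() { return uid; }
    }

    private final ShardThreadPoolExecutor shardExecutor = new ShardThreadPoolExecutor();

    @Override
    public void onApply(final Iterator iter) {
        final List<Future<?>> forks = new ArrayList<>();
        final List<Closure> dones = new ArrayList<>();
        int applied = 0;
        while (iter.hasNext()) {
            final Closure done = iter.done();
            // Leader: read the request straight from the wrapping closure (no deserialization).
            // Follower (done == null): decode the log payload.
            final OrderRequest req = done instanceof RequestClosure
                ? ((RequestClosure<OrderRequest>) done).getRequest()
                : decode(iter.getData());
            // Fork: the same uid is applied sequentially on one thread, different uids in parallel.
            forks.add(shardExecutor.submitByShardingKey(req.getUid(), () -> {
                applyToState(req);
                return null;
            }));
            dones.add(done);
            applied++;
            iter.next();
        }
        // Join: do not return from onApply until the whole batch has been applied.
        try {
            for (final Future<?> f : forks) {
                f.get();
            }
        } catch (final Exception e) {
            iter.setErrorAndRollback(applied, new Status(-1, "parallel apply failed: %s", e.getMessage()));
            return;
        }
        // Complete the closures only after the state mutations are done.
        for (final Closure done : dones) {
            if (done != null) {
                done.run(Status.OK());
            }
        }
    }

    private void applyToState(final OrderRequest req) {
        // business mutation, placeholder
    }

    private OrderRequest decode(final ByteBuffer data) {
        // deserialize the log payload, placeholder
        return new OrderRequest(0L);
    }
}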

tsgmq commented 3 years ago

I don't quite follow: parallel apply can simply be done inside the state machine. Build a fork-join model and you are done; there is no need to change jraft code. AppendEntries must not be reordered, so that is not something you can change either.

From the sending side to the onApply consuming side, I feel we have done everything that can be optimized (cache, batch, protobuf serialization, log4j all async, fork-join, JDK 16 ZGC, lightweight object pools, disk priority, network priority, rocksdb parameter tuning). A 5-node single cluster now sustains 10,000 TPS of writes with latency stable around 0.6-0.8 ms, and I am wondering whether there is still room for optimization on the current jraft version (target: 10,000 TPS of writes at 0.5 ms). I want to know whether append entries could drop the global ordering guarantee and only keep partial order, the way Kafka lets the producer decide that messages which need ordering go into the same queue. On the current version, is the only option to use multiple raft groups and route each partially ordered stream into its own group? The uid users previously bucketed into 00-09 would then be split into 00, 01, ..., 09, forming 10 raft groups. That would solve the performance problem, but there would be too many leaders. The business also has special logic tied to the leader, and every leader switch has a certain cost; it would be easy to handle if the leaders of these 10 raft groups (00 ... 09) all lived in the same JVM process.
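
On the multi-group direction: the ten groups can indeed live in one JVM process and share a single RpcServer. A rough sketch under the assumption that the shared-RpcServer constructor of RaftGroupService and start(false) behave as in the rheakv store; group ids, addresses and storage paths are placeholders, and the exact startup order of the shared RpcServer may differ by version:

import java.io.File;
import java.nio.file.Paths;

import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.RaftGroupService;
import com.alipay.sofa.jraft.conf.Configuration;
import com.alipay.sofa.jraft.entity.PeerId;
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.rpc.RaftRpcServerFactory;
import com.alipay.sofa.jraft.rpc.RpcServer;

public class MultiGroupBootstrap {

    public static void main(String[] args) {
        // Placeholder addresses: this node plus its two peers.
        final PeerId serverId = PeerId.parsePeer("10.121.6.86:7788");
        final Configuration initConf = new Configuration();
        initConf.parse("10.121.6.86:7788,10.121.6.87:7788,10.121.6.88:7788");

        // One RPC server shared by every raft group hosted in this process.
        final RpcServer rpcServer = RaftRpcServerFactory.createRaftRpcServer(serverId.getEndpoint());

        for (int i = 0; i < 10; i++) {
            final String groupId = String.format("order_group_%02d", i); // uid bucket 00..09
            for (String sub : new String[] { "log", "meta", "snapshot" }) {
                new File(Paths.get("/data/raft", groupId, sub).toString()).mkdirs();
            }
            final NodeOptions opts = new NodeOptions();
            opts.setInitialConf(initConf);
            opts.setFsm(new ParallelOrderStateMachine()); // the state machine sketched above
            opts.setLogUri(Paths.get("/data/raft", groupId, "log").toString());
            opts.setRaftMetaUri(Paths.get("/data/raft", groupId, "meta").toString());
            opts.setSnapshotUri(Paths.get("/data/raft", groupId, "snapshot").toString());

            // sharedRpcServer = true and start(false): the transport is not started per group.
            final RaftGroupService group = new RaftGroupService(groupId, serverId.copy(), opts, rpcServer, true);
            final Node node = group.start(false);
            System.out.println(groupId + " started, node = " + node.getNodeId());
        }

        // Start the shared transport once every group has registered its node.
        rpcServer.init(null);
    }
}

Note that hosting the groups in one process does not by itself put all ten leaders on the same node; that still has to be arranged separately, for example with Node#transferLeadershipTo or, where the version supports it, election priorities.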

Shy-Chen commented 5 months ago

I don't quite follow: parallel apply can simply be done inside the state machine. Build a fork-join model and you are done; there is no need to change jraft code. AppendEntries must not be reordered, so that is not something you can change either.

From the sending side to the onApply consuming side, I feel we have done everything that can be optimized (cache, batch, protobuf serialization, log4j all async, fork-join, JDK 16 ZGC, lightweight object pools, disk priority, network priority, rocksdb parameter tuning). A 5-node single cluster now sustains 10,000 TPS of writes with latency stable around 0.6-0.8 ms, and I am wondering whether there is still room for optimization on the current jraft version (target: 10,000 TPS of writes at 0.5 ms). I want to know whether append entries could drop the global ordering guarantee and only keep partial order, the way Kafka lets the producer decide that messages which need ordering go into the same queue. On the current version, is the only option to use multiple raft groups and route each partially ordered stream into its own group? The uid users previously bucketed into 00-09 would then be split into 00, 01, ..., 09, forming 10 raft groups. That would solve the performance problem, but there would be too many leaders. The business also has special logic tied to the leader, and every leader switch has a certain cost; it would be easy to handle if the leaders of these 10 raft groups (00 ... 09) all lived in the same JVM process.

I have been looking into optimizing this area recently. Could we exchange contact information?