kvcache-ai / Mooncake

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
https://arxiv.org/abs/2407.00079
Apache License 2.0

benchmark performance #5

Open BaiStone2017 opened 5 days ago

BaiStone2017 commented 5 days ago

In https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md, were the "Non-disaggregated" results measured on 2 A10 GPUs?

ShangmingCai commented 4 days ago

Currently, the non-disaggregated baseline runs on 1 A10. The purpose of that benchmark is to compare TTFT latency and to verify the feasibility of inter-node disaggregated designs. To compare the total throughput of non-disaggregated and disaggregated designs fairly, we need to run experiments under specific prefill/decode workloads that fully utilize the prefill node. However, we have not yet found a good way to fairly compare 2 non-disaggregated instances against 1 prefill + 1 decode instance without running into OOM. According to the author of PR 8498,

"for disagg prefill it will have lower throughput compared to chunked prefill if the prefill workload / decode workload doesn’t match # of prefill GPUs / # of decode GPUs. In my current implementation, the # of prefill GPU / # of decode GPU is 1:1, but the prefill workload / decode workload is typically a really small number (roughly 0.1 IIRC)."

Once we solve the TP (tensor parallelism) problem, we will run a series of experiments with different GPU ratios. If you are interested, you can also join vLLM's Slack channel on prefill disaggregation to get the latest updates.
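
To make the ratio argument quoted above concrete, here is a back-of-the-envelope sketch. It is not taken from the Mooncake or vLLM code. It assumes an idealized model in which each request needs a fixed amount of prefill GPU-time and decode GPU-time, the slower stage bounds the request rate of a disaggregated setup, and the baseline is a hypothetical pool in which every GPU can do either kind of work with no overhead. The function names and the model itself are illustrative assumptions.

```python
def disagg_rates(workload_ratio, prefill_gpus, decode_gpus):
    """Return (disaggregated rate, pooled rate, prefill GPU utilization).

    workload_ratio is prefill GPU-time per request divided by decode
    GPU-time per request (about 0.1 in the quoted comment).
    Idealized model: no transfer cost, no batching effects, no memory limits.
    """
    decode_time = 1.0                      # normalize decode GPU-time per request
    prefill_time = workload_ratio * decode_time

    # In a disaggregated setup the slower stage bounds the request rate.
    disagg_rate = min(prefill_gpus / prefill_time, decode_gpus / decode_time)

    # Hypothetical baseline: all GPUs share all work perfectly.
    pooled_rate = (prefill_gpus + decode_gpus) / (prefill_time + decode_time)

    # Fraction of time the prefill GPUs are busy at the disaggregated rate.
    prefill_util = disagg_rate * prefill_time / prefill_gpus
    return disagg_rate, pooled_rate, prefill_util


def relative_throughput(workload_ratio, prefill_gpus, decode_gpus):
    """Disaggregated throughput as a fraction of the idealized pooled baseline."""
    disagg_rate, pooled_rate, _ = disagg_rates(workload_ratio, prefill_gpus, decode_gpus)
    return disagg_rate / pooled_rate


if __name__ == "__main__":
    # Sweep a few prefill:decode GPU ratios at a workload ratio of 0.1.
    for n_prefill, n_decode in [(1, 1), (1, 4), (1, 10)]:
        rel = relative_throughput(0.1, n_prefill, n_decode)
        _, _, util = disagg_rates(0.1, n_prefill, n_decode)
        print(f"{n_prefill}:{n_decode}  relative throughput={rel:.2f}  prefill util={util:.2f}")
```

Under these assumptions, a 1:1 GPU split with a workload ratio of 0.1 leaves the prefill GPU mostly idle, and the gap to the pooled baseline closes only as the GPU ratio approaches the workload ratio. Real results will differ because of KV cache transfer cost, batching, and memory limits.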

Edenzzzz commented 4 days ago

Are there benchmark comparisons against NCCL?

ShangmingCai commented 4 days ago

We currently cannot obtain inter-node disaggregated results with NCCL based on PR 8498, because its parallel_state initialization of disagg_group conflicts with vLLM's process_group. This could be fixed with the help of PR 10072, which has already been merged. We will provide more results once we finish integrating mooncake_transfer_engine with PR 10502.