benchmark performance - GitHub Issues

BaiStone2017 commented 5 days ago

In https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md:

Does the "Non-disaggregated" result use 2 A10 GPUs?

ShangmingCai commented 4 days ago

> In https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md:
>
> Does the "Non-disaggregated" result use 2 A10 GPUs?

Currently, the non-disaggregated test runs on 1 A10. Its purpose is to compare TTFT latency and to verify the feasibility of inter-node disaggregated designs. To fairly compare the total throughput of non-disaggregated and disaggregated designs, we need to run experiments under specific prefill/decode workloads that fully utilize the prefill node. However, we have not yet found a way to fairly compare 2 non-disaggregated instances against 1 prefill + 1 decode instance without running into OOM. According to the author of PR 8498:

"for disagg prefill it will have lower throughput compared to chunked prefill if the prefill workload / decode workload doesn’t match # of prefill GPUs / # of decode GPUs. In my current implementation, the # of prefill GPU / # of decode GPU is 1:1, but the prefill workload / decode workload is typically a really small number (roughly 0.1 IIRC)."
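To make the ratio argument in the quote concrete, here is a rough back-of-the-envelope sketch. It is not taken from PR 8498 or from Mooncake; the function name `pool_utilization` and the model itself are assumptions for illustration only. It assumes identical GPUs, perfectly divisible work, no KV cache transfer overhead, and a system that is saturated on whichever pool carries more work per GPU.

```python
def pool_utilization(prefill_work, decode_work, prefill_gpus, decode_gpus):
    """Return (prefill_util, decode_util) when the busier pool is saturated.

    Hypothetical toy model: work is in arbitrary GPU-time units, all GPUs
    are identical, and transfer overhead between pools is ignored.
    """
    if prefill_gpus <= 0 or decode_gpus <= 0:
        raise ValueError("each pool needs at least one GPU")
    if prefill_work < 0 or decode_work < 0:
        raise ValueError("work must be non-negative")

    # Work each GPU in a pool has to absorb.
    prefill_load = prefill_work / prefill_gpus
    decode_load = decode_work / decode_gpus

    # The pool with the higher per-GPU load limits end-to-end throughput.
    bottleneck = max(prefill_load, decode_load)
    if bottleneck == 0:
        raise ValueError("no work to schedule")

    # The other pool is busy only for a fraction of the time.
    return prefill_load / bottleneck, decode_load / bottleneck


# 1 prefill GPU : 1 decode GPU, prefill/decode workload ratio of roughly 0.1
p_util, d_util = pool_utilization(0.1, 1.0, 1, 1)
print(f"prefill GPU busy {p_util:.0%}, decode GPU busy {d_util:.0%}")
# prints "prefill GPU busy 10%, decode GPU busy 100%"
```

Under this toy model, a 1:1 GPU split with a 0.1 workload ratio leaves the prefill GPU idle about 90% of the time, which is one way to read the quoted point that throughput drops when the GPU ratio does not match the workload ratio.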

After we solve the TP problem, we will run a series of experiments with different GPU ratios. If you are interested, you can also join vLLM's Slack channel on prefill disaggregation to get the latest updates.

Edenzzzz commented 4 days ago

Are there benchmark comparisons against NCCL?

ShangmingCai commented 4 days ago

> Are there benchmark comparisons against NCCL?

We currently cannot obtain inter-node disaggregated results with NCCL based on PR 8498, because its parallel_state initialization of disagg_group conflicts with vLLM's process_group. This could be fixed with the help of PR 10072, which has already been merged. We will provide more results once we finish integrating mooncake_transfer_engine with PR 10502.

kvcache-ai / Mooncake

benchmark performance #5