Open BaiStone2017 opened 5 days ago
In https://github.com/kvcache-ai/mooncake/blob/main/doc/en/vllm_benchmark_results.md:
Was the "Non-disaggregated" baseline measured on 2 A10 GPUs?
Currently, the non-disaggregated baseline is run on 1 A10 to test and compare TTFT latency and to verify the feasibility of the inter-node disaggregated design. To fairly compare the total throughput of the non-disaggregated and disaggregated designs, we would need to run experiments under specific prefill/decode workloads that fully utilize the prefill node. However, we have not found a good way to fairly compare 2 non-disaggregated instances against 1 prefill + 1 decode instance without running out of memory (OOM). According to the author of PR 8498,
"for disagg prefill it will have lower throughput compared to chunked prefill if the prefill workload / decode workload doesn’t match # of prefill GPUs / # of decode GPUs. In my current implementation, the # of prefill GPU / # of decode GPU is 1:1, but the prefill workload / decode workload is typically a really small number (roughly 0.1 IIRC)."
After we solve the TP problem, we will conduct a series of experiments with different GPU ratios. If you are interested, you can also join vLLM's Slack channel on prefill disaggregation to get the latest updates.
Are there benchmark comparisons against NCCL?
We are currently unable to obtain inter-node disaggregated results with NCCL based on PR 8498, because its parallel_state initialization of the disagg_group conflicts with vLLM's process_group. This could be fixed with the help of PR 10072, which has already been merged. More results will be provided once we finish integrating mooncake_transfer_engine with PR 10502.
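For readers unfamiliar with this kind of clash, here is a minimal sketch using plain torch.distributed; the `init_groups` helper is hypothetical, and the actual parallel_state logic in PR 8498 / PR 10072 is more involved and may differ. The point is that re-initializing the default process group fails, while carving out a sub-group coexists with it:

```python
import torch.distributed as dist

def init_groups(rank: int, world_size: int):
    # vLLM sets up the default process group (NCCL) for tensor parallelism.
    # MASTER_ADDR / MASTER_PORT are assumed to be set in the environment.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Calling dist.init_process_group() a second time (as a naive
    # disagg_group setup might) raises "trying to initialize the default
    # process group twice" -- the kind of conflict described above.
    # The safe pattern is a sub-group alongside the default group:
    disagg_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
    return disagg_group
```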