vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Add Splitwise: prompt and token phase separation #2472

Closed · goiri closed this issue 3 months ago

goiri commented 6 months ago

We have built the system described at http://aka.ms/splitwise. Splitwise splits the prompt and token phases so that they run on different servers, exploiting the differences between the two phases to improve throughput. We have an internal prototype built on top of an internal vLLM branch. This issue tracks the effort to open-source that prototype and make it part of official vLLM.

This includes:
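For readers unfamiliar with the idea, here is a minimal, self-contained sketch of the split described above: a prefill-only server processes the prompt and produces the KV cache, which is then handed to a decode-only server that generates tokens one at a time. All names here (`PromptServer`, `TokenServer`, `KVCache`) are illustrative placeholders, not vLLM APIs or part of this work.

```python
# Hypothetical sketch of prompt/token phase separation (disaggregated
# prefill and decode). Not vLLM code; names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # Stand-in for the attention key/value cache produced during prefill.
    tokens: list[int] = field(default_factory=list)


class PromptServer:
    """Runs only the prompt (prefill) phase, which is compute-bound:
    it processes the whole prompt in one pass and emits the KV cache."""

    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # A real system would run a model forward pass here; we just
        # record the prompt tokens to stand in for the cache contents.
        return KVCache(tokens=list(prompt_tokens))


class TokenServer:
    """Runs only the token (decode) phase, which is memory-bandwidth-bound:
    it generates one token at a time against the transferred KV cache."""

    def decode(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        generated = []
        for _ in range(max_new_tokens):
            # Placeholder "model": next token is a function of cache length.
            next_token = (len(kv.tokens) * 31) % 50_000
            kv.tokens.append(next_token)
            generated.append(next_token)
        return generated


if __name__ == "__main__":
    prompt = [101, 2023, 2003, 1037, 3231, 102]
    # Phase 1 on the prompt machine...
    kv_cache = PromptServer().prefill(prompt)
    # ...then the KV cache is transferred to the token machine for phase 2.
    print(TokenServer().decode(kv_cache, max_new_tokens=4))
```

The point of the split is that the two phases can be batched and provisioned independently, since prefill saturates compute while decode is limited by memory bandwidth; the cost is transferring the KV cache between machines.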

goiri commented 6 months ago

This was asked in https://github.com/vllm-project/vllm/issues/2370.

irasin commented 6 months ago

LGTM. I was wondering when we will be able to use it in vLLM?

goiri commented 6 months ago

@irasin, @aashaka is doing some cleanup and refactoring and will be posting the PRs over the next few weeks. We will update this issue (and link the PRs) as progress is made.

adney11 commented 5 months ago

Hi All,

Just wanted to check in and see whether there is any update on the Splitwise implementation in vLLM, and whether the internal prototype codebase can be released.

Thank you!

aashaka commented 5 months ago

@adney11, @irasin: this has now been released in PR https://github.com/vllm-project/vllm/pull/2809.