huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

DistServe support #2183

Closed icyxp closed 1 week ago

icyxp commented 1 month ago

Feature request

https://github.com/LLMServe/DistServe

Motivation

DistServe improves the serving performance of large language models (LLMs) by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the prefill and decoding computation across all users and requests. We find that this strategy not only leads to strong prefill-decoding interference but also couples the resource allocation and parallelism plans of the two phases. In DistServe, you can simply set the parallelism configs and scheduling strategies for the two phases, and it will work just like a single instance, handling the KV-cache communication and memory management automatically.
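To make the disaggregation idea concrete, here is a minimal toy sketch of the control flow described above: prefill and decode run as separate workers with independent configurations, and the KV cache is the only state handed off between them. All names here (`PrefillWorker`, `DecodeWorker`, etc.) are illustrative assumptions, not DistServe's or TGI's actual API.

```python
from dataclasses import dataclass, field

# Toy illustration of prefill/decode disaggregation.
# Class and field names are hypothetical, not DistServe's real interface.

@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)   # produced by prefill, extended by decode
    output_tokens: list = field(default_factory=list)

class PrefillWorker:
    """Compute-bound phase: processes the whole prompt in one pass."""
    def run(self, req: Request) -> Request:
        # A real system runs one large attention pass over the prompt;
        # here we just record a fake KV entry per prompt token.
        req.kv_cache = [("kv", t) for t in req.prompt_tokens]
        return req

class DecodeWorker:
    """Memory-bound phase: generates one token (and one KV entry) per step."""
    def run(self, req: Request) -> Request:
        for step in range(req.max_new_tokens):
            next_tok = f"tok{step}"                # stand-in for a sampled token
            req.output_tokens.append(next_tok)
            req.kv_cache.append(("kv", next_tok))
        return req

def serve(req: Request) -> Request:
    # The two workers could live on different GPUs with different
    # parallelism plans; the KV cache is the only cross-phase handoff.
    req = PrefillWorker().run(req)   # e.g. high tensor parallelism for prefill
    return DecodeWorker().run(req)   # e.g. different parallelism for decode

r = serve(Request(prompt_tokens=["a", "b", "c"], max_new_tokens=2))
print(r.output_tokens)   # ['tok0', 'tok1']
print(len(r.kv_cache))   # 5 entries: 3 from the prompt + 2 generated
```

The point of the sketch is that because the phases no longer share a batch, each can be scaled and scheduled on its own, at the cost of transferring the KV cache between instances.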

Your contribution

none

LysandreJik commented 1 month ago

Thanks for your request @farzanehnakhaee70! I'm making sure the team sees it.

Usually what drives an addition to our toolkit is the community excitement for it. Getting a lot of 👍 on your message will show us that it's important for a lot of people :)

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.