vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: APC introspection interface #8523

Open lun-4 opened 2 months ago

lun-4 commented 2 months ago

🚀 The feature, motivation and pitch

we're working towards using vllm for a large-scale deployment with Automatic Prefix Caching (APC) enabled. the issue is that, in our use case, requests need to land on the vllm instance that already holds the relevant cached blocks, otherwise APC doesn't provide any throughput benefit.

to do that we need to know which blocks a given instance holds for a request, as well as all the blocks currently cached in that instance. (we can hash parts of the request context ourselves and map internal hash -> block id, so we know when a vllm instance already has the blocks for a given request; a rough sketch of that client-side hashing is below.)
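
to make the client-side part concrete, this is roughly what I mean by hashing the request context ourselves. the block size and the chained-hash scheme here are placeholders for illustration only and would need to match whatever vllm actually does internally:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; would have to match the block size vllm is configured with


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chained hash over full token blocks, so two requests sharing a prefix
    share a prefix of their hash lists. Whether this matches vllm's internal
    block hashing is an assumption, not a fact."""
    hashes: list[str] = []
    prev = b""
    full_blocks = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_blocks, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(prev + b"|" + ",".join(map(str, block)).encode()).hexdigest()
        hashes.append(digest)
        prev = digest.encode()
    return hashes
```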

I was thinking about exposing this via an internal function on LLMEngine that returns the block hashes for a given request id, plus another that returns all currently cached block hashes; our vllm http wrapper could then surface both in response headers, roughly like the sketch below.
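
none of this exists in vllm today; the method names and headers below are just placeholders for the shape of the interface I have in mind:

```python
from typing import Protocol


class APCIntrospection(Protocol):
    """Hypothetical interface we'd like LLMEngine to grow (names are ours, not vllm's):
    the block hashes held for one request, plus every block hash currently cached."""

    def get_block_hashes_for_request(self, request_id: str) -> list[str]: ...
    def get_cached_block_hashes(self) -> list[str]: ...


def apc_response_headers(engine: APCIntrospection, request_id: str) -> dict[str, str]:
    """Headers our HTTP wrapper would attach to a completion response so an
    external router can learn which blocks this instance now holds.
    Header names are placeholders."""
    return {
        "X-APC-Request-Blocks": ",".join(engine.get_block_hashes_for_request(request_id)),
        "X-APC-Cached-Blocks": ",".join(engine.get_cached_block_hashes()),
    }
```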

for that I've been digging through the vllm source code to find possible hooks for adding those functions, but I haven't managed to understand the APC architecture at the level required to do this. if there are some pointers on what an implementation would look like, I could likely put up a PR for it.

Alternatives

No response

Additional context

No response


robertgshaw2-neuralmagic commented 2 months ago

We currently do not have an API for this, but I think it is something that could be exposed quite easily. This could also be useful for sticky sessions in Kubernetes (cc @ywang96)
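
As a rough illustration (not an existing API), a router that received per-instance block hashes through the interface proposed above could pick a target like this; the data structures here are assumptions:

```python
def pick_instance(request_blocks: list[str],
                  instance_blocks: dict[str, set[str]]) -> str:
    """Route to the instance whose cached blocks cover the longest prefix of the
    request's block hashes. `instance_blocks` maps instance name -> set of block
    hashes, e.g. accumulated from the headers sketched earlier; ties and the
    empty-cache case fall back to an arbitrary instance."""

    def prefix_overlap(cached: set[str]) -> int:
        matched = 0
        for block_hash in request_blocks:
            if block_hash not in cached:
                break
            matched += 1
        return matched

    return max(instance_blocks, key=lambda name: prefix_overlap(instance_blocks[name]))
```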

lun-4 commented 2 months ago

if the vllm team is in agreement, I can start working on a PR. but, as I suggested in the main post, an "implementation plan" of sorts (it can be high level) would be welcome on my end, to avoid major re-architecting down the line through PR reviews.

nstogner commented 1 month ago

We are looking to introduce cache-aware routing in the KubeAI project. @lun-4 curious what solution you landed on?

lun-4 commented 1 month ago

we are currently running vllm in production without any routing.