Open lun-4 opened 2 months ago
We currently do not have an API for this, but I think it is something that could be exposed quite easily. This could also be useful for sticky sessions in Kubernetes (cc @ywang96).
If vllm is in agreement, I can start working on this for a PR, but as I suggested in the main post, an "implementation plan" of sorts (it can be high level) would be welcome on my end, to prevent major re-architectures down the line through PR reviews.
We are looking to introduce cache-aware routing in the KubeAI project. @lun-4, curious what solution you landed on?
We are currently running vllm in production without any routing.
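For reference, once instances can report which block hashes they hold, the routing side can be as small as a longest-prefix match. A minimal sketch in Python, assuming each backend periodically publishes its set of cached block hashes (vLLM exposes no such API yet; that is exactly what this issue asks for, and `pick_backend` / `backend_caches` are made-up names):

```python
from typing import Dict, List, Set

def pick_backend(request_hashes: List[str],
                 backend_caches: Dict[str, Set[str]]) -> str:
    """Route to the backend whose cache covers the longest contiguous
    prefix of the request's block hashes. Only the contiguous leading
    match matters: prefix caching can only reuse the head of a prompt."""
    def prefix_overlap(cache: Set[str]) -> int:
        n = 0
        for h in request_hashes:
            if h not in cache:
                break
            n += 1
        return n
    # Ties (e.g. a cold cache everywhere) fall through to the first
    # backend; a real router would break ties on load instead.
    return max(backend_caches, key=lambda b: prefix_overlap(backend_caches[b]))
```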
🚀 The feature, motivation and pitch
We're working towards using vllm for a large-scale deployment with Automatic Prefix Caching (APC) enabled. The issue is that, in our use case, we need the same vllm instance to handle the requests that share a given set of cached blocks; otherwise APC provides no throughput benefit.
To do that, we need to know which blocks a given instance holds for a request, as well as all the blocks currently cached on the instance. (We can hash parts of a request's context ourselves and map our internal hash -> block ID, so we know when a vllm instance already has the blocks for a given request.)
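For context, a rough client-side approximation of the chained block hashing that prefix caching uses to identify blocks; the block size and the exact hash inputs here are assumptions and vary across vLLM versions:

```python
import hashlib
from typing import List

BLOCK_SIZE = 16  # assumed; must match the server's KV-cache block size

def prefix_block_hashes(token_ids: List[int],
                        block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash each *full* prompt block: a block's hash covers the
    previous block's hash plus its own token IDs, so two prompts share
    a hash prefix exactly as far as their token prefixes agree."""
    hashes: List[str] = []
    parent = ""
    n_full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, n_full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(
            (parent + "," + ",".join(map(str, block))).encode()
        ).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes
```

A stable digest like sha256 (rather than Python's salted `hash()`) is used so hashes computed in a router can be compared against hashes reported by another process.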
I was thinking about exposing it via some internal functions on `LLMEngine`: one that gives the block hashes for a request ID, and another that gives all cached block hashes. We could then expose those on our vllm HTTP wrapper in the response headers. For that I've been investigating the vllm source code to find possible hooks to add those functions, but I haven't been successful in understanding the architecture of the APC at the required level to do this. If there are some pointers on what an implementation would look like, I'd likely be able to PR it.
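To make the ask concrete, here is roughly the shape of the API being proposed. Neither method exists in vLLM today; every name below is hypothetical:

```python
from typing import List

class LLMEngine:  # stand-in for vllm.LLMEngine; methods are hypothetical
    def get_block_hashes_for_request(self, request_id: str) -> List[int]:
        """Hashes of the KV blocks currently allocated to this request,
        in prefix order."""
        raise NotImplementedError

    def get_cached_block_hashes(self) -> List[int]:
        """Hashes of every block currently resident in this instance's
        prefix cache."""
        raise NotImplementedError
```

The HTTP wrapper could then surface the per-request hashes as a response header, e.g. `X-Block-Hashes: h1,h2,h3` (header name illustrative), which a router can compare against its view of each instance's cache.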
Alternatives
No response
Additional context
No response