nstogner opened 1 month ago
We could send a UID, for example in the `extra_body` of a given request:
```python
response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "heyyy"}],
    temperature=0.2,
    extra_body={"decode_id": "SOME_HASH"},
)

# A follow-up request in the same session reuses the same decode_id.
response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "heyyy there"}],
    temperature=0.2,
    extra_body={"decode_id": "SOME_HASH"},
)
```
My thinking is we could treat e.g. the X-Session-ID HTTP header as a way to tell us that a request belongs to the same session. You can set custom HTTP headers in the Python OpenAI client as well.
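On the router side, session affinity could look something like the sketch below (the `pick_backend` helper and backend names are hypothetical, not KubeAI's actual implementation; client-side, the OpenAI Python client can send the header per request via `extra_headers` or per client via `default_headers`):

```python
import hashlib

def pick_backend(headers: dict, backends: list) -> str:
    """Hypothetical sketch: pin requests carrying the same X-Session-ID
    to the same backend by hashing the session ID."""
    session_id = headers.get("X-Session-ID")
    if session_id is None:
        # No session header: fall back to existing load balancing
        # (here, trivially, the first backend).
        return backends[0]
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(session_id.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

Note this only gives affinity while the backend set is static; scaling events would reshuffle sessions unless something like consistent hashing is used.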
vLLM issue to watch: https://github.com/vllm-project/vllm/issues/8523
I see 3 main options:
Option 1 is not currently possible, from my understanding (see the issue linked in the previous comment).
Option 2 is probably the simplest to implement in KubeAI, but harder for clients to take full advantage of. The simplest approach would be to use a "user session" as the routing key. This might work well in chat scenarios, but it would not benefit cases where large shared prefixes span multiple user sessions.
Option 3 could be implemented fairly simply as a short-term solution: consider an implementation where KubeAI uses a static prefix length to calculate prefix hashes. We could evolve that technique to be more sophisticated over time. Option 3 might even prove more advantageous than Option 1 in the long term, given the amount and frequency of communication that supporting Option 1 might require.
I think it's important that the user has control over the behavior, so I see a future where we do both Option 2 and Option 3. Option 3 would be nice because everyone gets the benefits out of the box.
I suggest starting with option 2, so only requests that specify a session header will enable this behavior.
Option 3 could be done by hashing the first X (e.g. 100) characters of each prompt, then using a hash table to send requests to different backends while still respecting target concurrent-request counts. The tricky part may be coming up with a default value of X that makes sense.
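A minimal sketch of that idea (hypothetical names, not KubeAI code), assuming a fixed prefix length and a per-backend in-flight counter, with linear probing to the next backend when the preferred one is saturated:

```python
import hashlib

PREFIX_LEN = 100  # the "X" above; the right default is an open question

def route(prompt: str, backends: list, in_flight: dict, max_concurrency: int) -> str:
    """Hash the first PREFIX_LEN characters to pick a preferred backend,
    then walk the list until one has capacity."""
    digest = hashlib.sha256(prompt[:PREFIX_LEN].encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(backends)
    for i in range(len(backends)):
        candidate = backends[(start + i) % len(backends)]
        if in_flight.get(candidate, 0) < max_concurrency:
            return candidate
    # All backends saturated: fall back to the preferred one
    # (a real router might queue or shed the request instead).
    return backends[start]
```

Spilling to a neighbor keeps concurrency targets intact at the cost of a prefix-cache miss on the spilled request.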
One more thought that came to mind for Option 3: we could hash the first 100, 500, and 1000 characters and route based on those.
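One way to read that (a hypothetical sketch, with tier values taken from the comment above) is to key each prompt by the longest tier it fills, so prompts sharing a long prefix land in the same bucket:

```python
import hashlib

TIERS = (1000, 500, 100)  # longest prefix first

def prefix_key(prompt: str) -> str:
    """Hypothetical sketch: hash the longest tier-length prefix the
    prompt can fill; fall back to hashing the whole (short) prompt."""
    for n in TIERS:
        if len(prompt) >= n:
            return hashlib.sha256(prompt[:n].encode()).hexdigest()
    return hashlib.sha256(prompt.encode()).hexdigest()
```

One caveat with this naive version: a 120-character prompt and a 600-character prompt that share their first 100 characters hash at different tiers and therefore get different keys, so matching keys across tiers would need extra logic.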
See: https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html