substratusai / kubeai

AI Inference Operator for Kubernetes
https://www.kubeai.org
Apache License 2.0

Route requests to take advantage of prefix caching #266

Open nstogner opened 1 month ago

nstogner commented 1 month ago

See: https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html

sam-huang1223 commented 1 month ago

We could send a UID, for example in the extra_body of a given request:

from openai import OpenAI

client = OpenAI()

# Both requests carry the same decode_id, signaling to the router that they
# share a prefix and should land on the same backend.
response = client.chat.completions.create(
  model="model",
  messages=[
    {
      "role": "user",
      "content": "heyyy"
    }
  ],
  temperature=0.2,
  extra_body={"decode_id": "SOME_HASH"},
)

response = client.chat.completions.create(
  model="model",
  messages=[
    {
      "role": "user",
      "content": "heyyy there"
    }
  ],
  temperature=0.2,
  extra_body={"decode_id": "SOME_HASH"},
)

samos123 commented 1 month ago

My thinking is that we could treat, e.g., an X-Session-ID HTTP header as a way to tell us that a request belongs to the same session. You can set custom HTTP headers in the Python openai client as well.
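
A minimal sketch of what this could look like from the client side (the X-Session-ID header is the proposal above, not an implemented API; KubeAI would need to key routing off of it):

import uuid

from openai import OpenAI

client = OpenAI()

# Reuse the same session ID across requests that belong to one conversation.
session_id = str(uuid.uuid4())

response = client.chat.completions.create(
  model="model",
  messages=[{"role": "user", "content": "heyyy"}],
  temperature=0.2,
  # The openai client supports per-request custom headers via extra_headers.
  extra_headers={"X-Session-ID": session_id},
)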

nstogner commented 1 month ago

vLLM issue to watch: https://github.com/vllm-project/vllm/issues/8523

nstogner commented 1 month ago

I see 3 main options:

  1. Integrate with vLLM to ask/be-told what the state of the cache is.
  2. Sticky sessions based on request attributes (HTTP headers, etc).
  3. Calculate an approximation of the cache state of the backend engines within KubeAI.

Option 1 is not currently possible, from my understanding (see the vLLM issue linked in the previous comment).

Option 2 is probably the simplest to implement in KubeAI, but harder for clients to take full advantage of. The simplest approach would be to use a "user session" as the routing key. This might work well in chat scenarios but might not take advantage of scenarios where large prefixes transcend user sessions.

Option 3 could be implemented fairly simply as a short-term solution: consider an implementation where we define a static prefix length that KubeAI uses to calculate prefix hashes. We could evolve that technique to become more sophisticated over time. Option 3 might even prove more advantageous than Option 1 in the long term, given the amount/frequency of communication that might be required to support Option 1.
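
A rough sketch of the static prefix-length idea (Python for illustration; PREFIX_LENGTH and backends are hypothetical names, and a real implementation would live in KubeAI's proxy):

import hashlib

PREFIX_LENGTH = 100  # static for now; could become configurable per model

backends = ["pod-a", "pod-b", "pod-c"]

def route(prompt: str) -> str:
  # Hash only the first PREFIX_LENGTH characters so that requests sharing
  # a long common prefix land on the same backend and hit its KV cache.
  prefix = prompt[:PREFIX_LENGTH]
  digest = hashlib.sha256(prefix.encode()).digest()
  return backends[int.from_bytes(digest[:8], "big") % len(backends)]

Note that plain modulo reshuffles almost every key when a backend is added or removed; consistent hashing would limit that churn.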

samos123 commented 1 month ago

I think it's important that the user has control over the behavior, so I see a future where we do both options 2 and 3. Option 3 would be nice because everyone would get the benefits out of the box.

I suggest starting with option 2, so only requests that specify a session header will enable this behavior.

Option 3 could be done by hashing the first X (e.g. 100) characters of each prompt, then using a hash table to send requests to different backends while still respecting target concurrent request counts. The tricky part may be coming up with a default value of X that makes sense.
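
A sketch of how the prefix-hash affinity could coexist with concurrency targets (hypothetical names; the least-loaded fallback policy is an assumption):

import hashlib

MAX_IN_FLIGHT = 8  # target concurrent requests per backend

in_flight = {"pod-a": 0, "pod-b": 0, "pod-c": 0}

def pick_backend(prompt: str, x: int = 100) -> str:
  pods = sorted(in_flight)
  digest = hashlib.sha256(prompt[:x].encode()).digest()
  preferred = pods[int.from_bytes(digest[:8], "big") % len(pods)]
  # Honor prefix affinity only while the preferred backend is below its
  # concurrency target; otherwise fall back to the least-loaded backend.
  if in_flight[preferred] < MAX_IN_FLIGHT:
    return preferred
  return min(pods, key=in_flight.get)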

samos123 commented 1 month ago

One more thought that came to mind for option 3: we could take the first 100 characters, 500 characters, and 1000 characters, and do hashing based on each of those.
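
One way to read this (the longest-match preference is my assumption) is to hash the prompt at each length and prefer the longest prefix already mapped to a backend:

import hashlib

LENGTHS = (1000, 500, 100)  # check the longest prefixes first

prefix_to_backend: dict[str, str] = {}  # hash -> backend, filled in as requests are routed

def lookup(prompt: str):
  for n in LENGTHS:
    if len(prompt) < n:
      continue
    h = hashlib.sha256(prompt[:n].encode()).hexdigest()
    if h in prefix_to_backend:
      return prefix_to_backend[h]  # longest known prefix wins
  return None  # no known prefix; fall back to default load balancing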