Closed: Curiosity007 closed this issue 5 months ago.
Can you elaborate on why?
To clarify: Inference without a K/V cache would be extremely slow, so I'm not sure if that's what you're asking for.
That's because I am using it for a repetitive task, and I need to fine-tune the prompts to find the sweet spot. Now what happens is that if I change my prompt a little bit, it caches all the previous tokens and only 1 or 2 new tokens are being processed, which somehow defeats the purpose of prompt tuning.
I think you're misunderstanding what caching means here. It means two things, really:
First off, each output token you produce when generating from an LLM is a function of all the preceding tokens. So, if you have a prompt of 420 tokens (0-419), generation happens like this:
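(Schematically, something like:)

```
pass 1:  compute K/V for tokens 0-419, attend, sample token 420
pass 2:  recompute K/V for tokens 0-420, attend, sample token 421
pass 3:  recompute K/V for tokens 0-421, attend, sample token 422
...and so on, redoing all of the earlier work on every step.
```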
Key/value caching optimizes this process by simply not discarding all those values:
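(Again schematically:)

```
pass 1:  compute K/V for tokens 0-419, keep them, attend, sample token 420
pass 2:  compute K/V for token 420 only, append to the cache, attend, sample token 421
pass 3:  compute K/V for token 421 only, append to the cache, attend, sample token 422
...
```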
It's important to note that in this case, the keys/values for a sequence are determined entirely by the input token IDs for that sequence, as long as you're starting from position zero. Run the same input sequence again, and you might sample something different and build a different completion, but for the part of the input that doesn't change, the keys/values won't change either.
So once you're done generating a sequence, the cache will have a bunch of keys/values in it, and along the way you'll have recorded which token IDs produced them. Then, if the first 150 tokens of the next prompt precisely match what's already in the cache (starting from position zero), you may as well skip 150 tokens ahead because otherwise you'd just be overwriting the cache with the same values that are already there. This is called "prompt caching" or "cache reuse", and it's why you're seeing "cached tokens" in the logs.
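As a toy illustration of that prefix check (just the idea, not how Tabby actually implements it):

```python
def reusable_prefix_length(cached_ids, new_prompt_ids):
    """How many leading tokens of the new prompt already have valid keys/values in the cache."""
    n = 0
    for cached, new in zip(cached_ids, new_prompt_ids):
        if cached != new:
            break
        n += 1
    return n

old_ids = [101, 7592, 2088, 999, 102]    # token IDs recorded from the previous request
new_ids = [101, 7592, 2088, 1012, 102]   # next prompt, edited near the end

skip = reusable_prefix_length(old_ids, new_ids)
print(skip)  # 3 -> positions 0-2 are reused from the cache, 3+ get recomputed
```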
The most recent versions of Tabby use the new dynamic generator in ExLlamaV2 which takes prompt caching a little bit further using paged attention. This means, among other things, you can remember more than one past sequence and reuse keys/values more often. But either way it's strictly an optimization and you wouldn't get different outputs by disabling it, only slower outputs.
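Very roughly, as a conceptual sketch (the page size, the set-based bookkeeping and the variable names here are all illustrative, not the real ExLlamaV2 data structures):

```python
PAGE_SIZE = 256  # fixed-size pages; treat the number as an example

def full_pages(ids):
    # Only complete pages are reusable; the trailing partial page gets recomputed.
    usable = len(ids) // PAGE_SIZE * PAGE_SIZE
    return [tuple(ids[i:i + PAGE_SIZE]) for i in range(0, usable, PAGE_SIZE)]

# More than one past sequence can stay resident, so pages from all of them count.
remembered_sequences = [
    list(range(600)),    # dummy token IDs from one earlier request
    list(range(300)),    # dummy token IDs from another
]
cached_pages = set()
for seq in remembered_sequences:
    cached_pages.update(full_pages(seq))

def reusable_prefix(prompt_ids):
    """Number of leading tokens whose keys/values can come straight from the cache."""
    n = 0
    for page in full_pages(prompt_ids):
        if page not in cached_pages:
            break
        n += PAGE_SIZE
    return n

print(reusable_prefix(list(range(550))))  # 512 -> two full pages reused, the rest recomputed
```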
Thank you for explaining the concept. So that is why I am not getting a different response: the delta between my prompts is minimal, only 10 to 20 tokens, which is about 0.1% of the total prompt tokens. But then the question remains: is there any way to get at least slightly different answers? With the OpenAI API, if I use the same API key and ask almost the same question with temp 0, the answer varies at least a little. Is that achievable here as well?
It depends very much on the model and what you're trying to make it do (or not do.) Some models can be very stubborn and not very creative at all. If you ask Llama3-instruct anything remotely dangerous, controversial or lewd, its response will almost always start with "I cannot provide," no matter how you try to tweak the prompt.
You could play with sampling parameters, like temperature, which (above 1.0) flattens the output distribution, making the model less stable and predictable (for better or worse). There's skew, which, well, skews the distribution and can bias the model towards its second or third choice for any given token. That can give some interesting generations if nothing else. The banned strings feature suppresses exact strings (like "I cannot provide" or "as an AI language model"), and if you're trying to make the model reliably conform to a format like a JSON schema, you can generate with that schema as a constraint.
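Against Tabby's OpenAI-compatible endpoint that could look something like this. The standard OpenAI fields are straightforward; the extra sampler fields are passed through to the backend, but treat the exact field names, URL and auth header here as assumptions and check the Tabby docs for your version:

```python
import requests

payload = {
    "model": "Llama-3-8B-Instruct",   # whatever model Tabby has loaded
    "messages": [{"role": "user", "content": "Summarize this ticket in two sentences."}],
    "temperature": 1.2,               # >1 flattens the distribution, more varied output
    # Extra sampler fields forwarded to the backend -- names are assumptions:
    "skew": 1.5,                      # bias toward lower-ranked token choices
    "banned_strings": ["I cannot provide"],   # suppress exact strings
}

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",   # default Tabby address, adjust as needed
    json=payload,
    headers={"x-api-key": "your-key-here"},        # or whatever auth your config uses
)
print(resp.json()["choices"][0]["message"]["content"])
```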
Mostly I'd say make sure you've tried some different models. ChatGPT is tough to beat if that's what you're comparing to, but Llama3-instruct is still very good at taking directions, for instance.
Calculating K/V for attention is a deterministic process, so generating from scratch vs. from the cache will give you the same results.
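A trivial way to see that (just a bare projection in PyTorch, nothing model-specific):

```python
import torch

torch.manual_seed(0)
w_k = torch.nn.Linear(64, 64, bias=False)   # stand-in for a key projection
x = torch.randn(10, 64)                     # hidden states for 10 token positions

k_fresh = w_k(x)          # "from scratch"
k_again = w_k(x)          # same inputs, same weights
print(torch.equal(k_fresh, k_again))        # True -- nothing random happens here
```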
Llama 3 8B Instruct is the model I am using. I will try different sampling parameters. I had earlier used the oobabooga API, where, if I kept temp 0 and tweaked the prompt, it would give me slightly different answers. That is why I was looking for this.
Even for a change in the prompt? Shouldn't the prompt be consumed as a whole to see whether the underlying meaning has changed (attention), and the K/V generated based on that?
Decoder-only models (most modern LLMs) use only causal attention, meaning that everything is evaluated left-to-right. Even when processing all the prompt tokens in parallel, this is still done with a causal mask. So changing the token at position 100 has no effect on how tokens 0-99 are interpreted, but it does affect every token after position 100.
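You can check this with a toy causal-attention layer (plain PyTorch, nothing ExLlamaV2-specific; position 5 below stands in for your position 100):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, dim = 8, 16
proj_q = torch.nn.Linear(dim, dim, bias=False)
proj_k = torch.nn.Linear(dim, dim, bias=False)
proj_v = torch.nn.Linear(dim, dim, bias=False)

def causal_attention(x):
    # x: (seq_len, dim); add batch/head dims for scaled_dot_product_attention
    q, k, v = proj_q(x), proj_k(x), proj_v(x)
    out = F.scaled_dot_product_attention(
        q[None, None], k[None, None], v[None, None], is_causal=True
    )
    return out[0, 0]

a = torch.randn(seq_len, dim)
b = a.clone()
b[5] = torch.randn(dim)          # change the "token" at position 5 only

out_a = causal_attention(a)
out_b = causal_attention(b)

print(torch.allclose(out_a[:5], out_b[:5]))   # True: positions before 5 are unaffected
print(torch.allclose(out_a[5:], out_b[5:]))   # False: position 5 and everything after it change
```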
I don't want to use the cache at all. Rather, I would like to generate the response from scratch, so all new tokens. How do I do it?