Open mdy666 opened 3 days ago
Hey! Can you pls elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation afaik
cc @gante
Maybe it should be "value_cache" rather than "key_cache", but I don't know this code well.
Oh right, didn't notice it! Yes, that needs to be fixed and weird we didn't catch any tests failing. Feel free to open a PR if you are willing to 😄 and tag @gante for review. If you don't have bandwidth, we'll make sure to fix it soon.
Thanks for reporting!
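The thread does not show the affected line, so the following is only a toy illustration of the kind of slip being described: a batch-reordering helper that reads from "key_cache" where "value_cache" was intended. `ToyCache`, `select_batch_buggy` and `select_batch_fixed` are hypothetical names made up for this sketch; they are not the transformers classes or methods.

```python
class ToyCache:
    """Hypothetical stand-in for a per-layer key/value cache (not the transformers class)."""

    def __init__(self, key_cache, value_cache):
        # One entry per layer; each entry holds one row per sequence in the batch.
        self.key_cache = key_cache
        self.value_cache = value_cache

    def select_batch_buggy(self, indices):
        for layer in range(len(self.key_cache)):
            self.key_cache[layer] = [self.key_cache[layer][i] for i in indices]
            # Bug pattern: the values are rebuilt from key_cache instead of value_cache.
            self.value_cache[layer] = [self.key_cache[layer][i] for i in indices]

    def select_batch_fixed(self, indices):
        for layer in range(len(self.key_cache)):
            self.key_cache[layer] = [self.key_cache[layer][i] for i in indices]
            self.value_cache[layer] = [self.value_cache[layer][i] for i in indices]


def make_cache():
    # A single layer with a batch of two sequences.
    return ToyCache(key_cache=[["k0", "k1"]], value_cache=[["v0", "v1"]])


buggy = make_cache()
buggy.select_batch_buggy([1, 0])

fixed = make_cache()
fixed.select_batch_fixed([1, 0])

print("buggy values:", buggy.value_cache)
print("fixed values:", fixed.value_cache)
```

In the buggy variant the value cache silently ends up holding keys. A test would only catch this if it inspected the reordered values, which may be why nothing failed.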
@zucchini-nlp Hi, I'd like to use this thread to ask a somewhat related question. I basically want to extend the "Re-use Cache to continue generation" tutorial https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation to the batched case, but the model gives erroneous output. From preliminary debugging, I suspect it's because of the default left padding, so the cache positions are not aligned correctly (I'm not sure whether the bug from this issue contributes as well). Is there any existing code that I can refer to? Thanks!
@mearcstapa-gqz do you mean using a batched cache from the pre-fill stage in batched generation, or using one shared pre-fill prompt and continuing generation with multiple texts at once? Please share your minimal code and I'll see what the error might be, as expanding to batched generation should be straightforward unless I am missing something
@zucchini-nlp Thanks! On second look, I noticed that the example provided at https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation is indeed batched. I got it wrong when I saw "max_batch_size=1" in the arguments to StaticCache. The example code uses a for-loop over the prompts.
My use case is the same as the example code. I have
texts=[f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}{query}<|im_end|>\n<|im_start|>assistant\n" for query in queries]
And I want to cache the f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}" part.
I'll try to debug it myself first; should that fail, I'll provide minimal code and ask for help again.
@zucchini-nlp The example code uses a for-loop over the prompts. I can't figure out how to set up past_key_values so that it works like normal batch inference. Here's a minimal example.
import os
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache, StaticCache
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
# Curiously, setting tokenizer.padding_side = "right" yields coherent-looking results for both get_output(inputs) and get_output(inputs, past_key_values=copy.deepcopy(prompt_cache)).
# (But if I switch the model to "Qwen/Qwen2-VL-2B-Instruct", right padding produces gibberish as well.)
# But there's a warning "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer."
# What are the implications??
# https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side
INITIAL_PROMPT = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n'
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]
inputs = tokenizer([INITIAL_PROMPT + prompt + '<|im_end|>\n<|im_start|>assistant\n' for prompt in prompts], return_tensors="pt", padding=True).to("cuda")
def get_output(inputs, past_key_values=None):
    generated_ids = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_texts = tokenizer.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    for o in output_texts:
        print(o)
get_output(inputs) # normal batch inference
prompt_cache = DynamicCache()
inputs_initial_prompt = tokenizer([INITIAL_PROMPT] * 2, return_tensors="pt").to("cuda")
with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values
get_output(inputs, past_key_values=copy.deepcopy(prompt_cache)) # incoherent output, how to set past_key_values properly?
Hmm you're right, in case we want to do batching the padding will not be set correctly because the initial prompt has no padding while the subsequent calls will be padded on the left. So we'll end up with sequences as follows:
INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT
I don't see an easy way to overcome this unless we start supporting nested tensors. Also cc @gante, if you have any ideas, or maybe we add this to our TODO list
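To make the layout above concrete, here is a small pure-Python sketch that uses word-level stand-ins instead of real token ids. `left_pad` and `effective_layout` are hypothetical helpers written for this illustration, and the token strings are made up; nothing here calls transformers.

```python
PAD = "[PAD]"


def left_pad(rows, pad=PAD):
    """Left-pad every row to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [[pad] * (width - len(row)) + row for row in rows]


def effective_layout(prefix, suffixes):
    """Sequences the model effectively sees when an unpadded cached prefix
    is followed by left-padded continuations."""
    rows = [prefix + suffix for suffix in left_pad(suffixes)]
    masks = [[0 if token == PAD else 1 for token in row] for row in rows]
    return rows, masks


prefix = ["<sys>", "<user>"]  # stands in for the INITIAL_PROMPT tokens
suffixes = [["write", "a", "blogpost"], ["capital", "?"]]

# Ordinary batching: pads sit in front of everything.
normal_rows = left_pad([prefix + suffix for suffix in suffixes])

# Cached prefix plus left-padded continuation: pads land in the middle.
cached_rows, cached_masks = effective_layout(prefix, suffixes)

for row, mask in zip(cached_rows, cached_masks):
    print(row, mask)
```

For the shorter prompt, ordinary batching yields `[PAD] <sys> <user> capital ?`, whereas the cached-prefix route yields `<sys> <user> [PAD] capital ?`, so the same tokens end up at different absolute slots in the two setups.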
@zucchini-nlp May I ask why something like this won't work?
Actually, I tried to make the input look like
INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT
with
texts = [f"{query}<|im_end|>\n<|im_start|>assistant\n" for query in ["Help me to write a blogpost about travelling.", "What is the capital of France?"]]
inputs = processor(
    text=texts, images=None, padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)
inputs = BatchFeature(data={
    'input_ids': torch.concat([inputs_initial_prompt.input_ids, inputs.input_ids], -1),
    'attention_mask': torch.concat([inputs_initial_prompt.attention_mask, inputs.attention_mask], -1)
})
Is it because the attention_mask passed is actually generated inside the model?
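One thing that may be worth checking here is which position ids such a concatenated mask leads to. The sketch below assumes the convention used in several decoder-only code paths, roughly `attention_mask.cumsum(-1) - 1` with padded slots overwritten by a dummy value of 1. Whether this particular model and code path follow that convention is an assumption, and `positions_from_mask` is a hypothetical helper, so this does not answer the question above; it only shows what that convention would compute.

```python
def positions_from_mask(mask):
    """Pure-Python version of cumsum(mask) - 1, with padded slots set to a dummy 1."""
    positions = []
    running = 0
    for m in mask:
        running += m
        positions.append(running - 1 if m else 1)
    return positions


# Pads in the middle: cached prefix followed by a left-padded continuation.
middle_padded = [1, 1, 0, 0, 1, 1]
# Pads in front: ordinary left-padded batch.
left_padded = [0, 0, 1, 1, 1, 1]

print(positions_from_mask(middle_padded))
print(positions_from_mask(left_padded))
```

Under this convention the real tokens get positions 0 to 3 in both layouts, so comparing the position ids and the mask that actually reach the model against these values could help narrow down where the two setups diverge.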
System Info
Although this method isn't very useful, it is slightly wrong.
Who can help?
No response
Reproduction
nan
Expected behavior
fix