huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

cache wrong code #34232

Open mdy666 opened 3 days ago

mdy666 commented 3 days ago

System Info

Although this method is not used much, it has a small mistake (see the attached screenshot).

Who can help?

No response

Information

Tasks

Reproduction

nan

Expected behavior

fix

zucchini-nlp commented 3 days ago

Hey! Can you pls elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation afaik
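For context, a couple of generate calls that go down those paths (arbitrary small model, just as an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # arbitrary small model, just for illustration
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

# Beam search
beam_out = model.generate(**inputs, num_beams=4, max_new_tokens=20)
# Contrastive search (penalty_alpha + top_k)
contrastive_out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=20)

print(tokenizer.decode(beam_out[0], skip_special_tokens=True))
print(tokenizer.decode(contrastive_out[0], skip_special_tokens=True))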

cc @gante

mdy666 commented 3 days ago

> Hey! Can you pls elaborate on what is wrong in this method? It is used when we do beam search and contrastive generation afaik
>
> cc @gante

Maybe it should be "value_cache" rather than "key_cache", but I don't know that part of the code well.
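To illustrate what I mean with toy tensors (this is only a sketch, not the real transformers source, and I'm guessing at the method from the screenshot):

import torch

# Toy single-layer cache, batch size 4; the names mirror key_cache / value_cache
# from the screenshot, but this is just an illustration.
key_cache = [torch.randn(4, 2, 6, 8)]
value_cache = [torch.randn(4, 2, 6, 8)]
indices = torch.tensor([0, 2])  # e.g. the batch rows to keep

for layer_idx in range(len(key_cache)):
    key_cache[layer_idx] = key_cache[layer_idx][indices, ...]
    # The suspected typo is that the next line reads from key_cache on the
    # right-hand side; I think it should read from value_cache, like this:
    value_cache[layer_idx] = value_cache[layer_idx][indices, ...]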

zucchini-nlp commented 3 days ago

Oh right, I didn't notice it! Yes, that needs to be fixed, and it's weird that no tests caught it. Feel free to open a PR if you are willing to 😄 and tag @gante for review. If you don't have the bandwidth, we'll make sure to fix it soon.

Thanks for reporting!

mearcstapa-gqz commented 3 days ago

@zucchini-nlp Hi, I want to use this thread to ask a somewhat related question. I basically want to extend the "Re-use Cache to continue generation" tutorial https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation to the batched case, but the model gives erroneous output. From preliminary debugging, I suspect it's because of the default left padding, so the cache positions are not aligned correctly (not sure whether the bug from this issue contributes as well or not). Is there any existing code I can refer to? Thanks!

zucchini-nlp commented 3 days ago

@mearcstapa-gqz do you mean using a batched cache from the pre-fill stage in batched generation, or using one and the same pre-fill prompt but then continuing generation with multiple texts at once? Please share your minimal code and I'll see what the error might be, as expanding to batched generation should be straightforward unless I am missing something.

mearcstapa-gqz commented 3 days ago

@zucchini-nlp Thanks! On second look, I noticed that the example provided at https://huggingface.co/docs/transformers/en/kv_cache#re-use-cache-to-continue-generation is indeed batched. I got it wrong when I saw "max_batch_size=1" in the StaticCache arguments; the example code uses a for-loop over the prompts.
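For anyone following along, the for-loop pattern I mean looks roughly like this (paraphrased from memory with DynamicCache and the chat-template strings used later in this thread, so it may differ in details from the docs snippet, which uses StaticCache with max_batch_size=1):

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

INITIAL_PROMPT = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n"

# Pre-fill the shared prefix once (batch size 1, no padding involved).
prefix_inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to(model.device)
with torch.no_grad():
    prompt_cache = model(**prefix_inputs, past_key_values=DynamicCache()).past_key_values

# One prompt at a time, each call re-using a deep copy of the prefix cache.
for prompt in ["Help me to write a blogpost about travelling.", "What is the capital of France?"]:
    new_inputs = tokenizer(INITIAL_PROMPT + prompt + "<|im_end|>\n<|im_start|>assistant\n",
                           return_tensors="pt").to(model.device)
    out = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
    print(tokenizer.decode(out[0][new_inputs.input_ids.shape[1]:], skip_special_tokens=True))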

My use case is the same as the example code. I have texts=[f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}{query}<|im_end|>\n<|im_start|>assistant\n" for query in queries], and I want to cache the f"<|im_start|>system\n{SOME_SHARED_SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{SOME_SHARED_USER_PREFIX}" part.

I will try to debug it myself then; should it fail, I will provide a minimal example and ask for help again.

mearcstapa-gqz commented 2 hours ago

@zucchini-nlp The example code uses a for-loop over the prompts, and I can't figure out how to set up past_key_values to make it work like normal batch inference. Here's a minimal example.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
# Curiously, setting tokenizer.padding_side = "right" yields a coherent result both for
# get_output(inputs) and for get_output(inputs, past_key_values=copy.deepcopy(prompt_cache))
# (although if I switch the model to "Qwen/Qwen2-VL-2B-Instruct", right padding produces gibberish too).
# But there's a warning: "A decoder-only architecture is being used, but right-padding was detected!
# For correct generation results, please set `padding_side='left'` when initializing the tokenizer."
# What are the implications?
# https://huggingface.co/docs/transformers/llm_tutorial#wrong-padding-side

INITIAL_PROMPT = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n'
prompts = ["Help me to write a blogpost about travelling.", "What is the capital of France?"]

inputs = tokenizer([INITIAL_PROMPT + prompt + '<|im_end|>\n<|im_start|>assistant\n' for prompt in prompts], return_tensors="pt", padding=True).to("cuda")

def get_output(inputs, past_key_values=None):
    generated_ids = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=20)

    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_texts = tokenizer.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    for o in output_texts:
        print(o)

get_output(inputs) # normal batch inference

prompt_cache = DynamicCache()
inputs_initial_prompt = tokenizer([INITIAL_PROMPT] * 2, return_tensors="pt").to("cuda")

with torch.no_grad():
    prompt_cache = model(**inputs_initial_prompt, past_key_values=prompt_cache).past_key_values

get_output(inputs, past_key_values=copy.deepcopy(prompt_cache)) # incoherent output, how to set past_key_values properly?
zucchini-nlp commented 2 hours ago

Hmm, you're right: in the batched case the padding will not be set up correctly, because the initial prompt has no padding while the subsequent calls are padded on the left. So we end up with sequences like this:

INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT
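Per batch row it looks roughly like this (toy tokens, just to visualise it):

# Toy tokens only; "P*" is the cached initial prompt, "T*" the new user text.
prefix    = ["P1", "P2", "P3"]                      # pre-filled once, no padding, same for every row
row_long  = ["T1", "T2", "T3", "T4", "T5", "T6"]    # continuation of the longer query, no pads needed
row_short = ["PAD", "PAD", "T1", "T2", "T3", "T4"]  # left-padded continuation of the shorter query

print(prefix + row_long)   # ['P1', 'P2', 'P3', 'T1', ...]
print(prefix + row_short)  # pads end up *between* the cached prefix and the new tokens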

I don't see an easy way to overcome this unless we start supporting nested tensors. Also cc @gante, in case you have any ideas; otherwise maybe we add this to our TODO list.

mearcstapa-gqz commented 1 hour ago

@zucchini-nlp May I ask why something like this won't work?

Actually, I tried to make the input look like INITIAL_PROMPT [PAD] [PAD] [PAD] [PAD] INPUT-TEXT with the following:


from transformers import BatchFeature

texts = [f"{query}<|im_end|>\n<|im_start|>assistant\n" for query in ["Help me to write a blogpost about travelling.", "What is the capital of France?"]]

# `processor` is from my Qwen2-VL setup; with the text-only example above this would
# simply be `tokenizer(texts, padding=True, return_tensors="pt")`.
inputs = processor(
    text=texts, images=None, padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Prepend the (unpadded) initial-prompt ids and attention mask to every row, so the
# model sees INITIAL_PROMPT [PAD] ... [PAD] INPUT-TEXT as laid out above.
inputs = BatchFeature(data={
    'input_ids': torch.concat([inputs_initial_prompt.input_ids, inputs.input_ids], -1),
    'attention_mask': torch.concat([inputs_initial_prompt.attention_mask, inputs.attention_mask], -1)
})

Is it because the attention_mask passed is actually generated inside the model?