Open · BBuf opened this issue 2 weeks ago
@ArthurZucker Hi, when you have time, could you please help take a look at this bug? Thank you very much.
Can you try using `input_ids` directly instead of `inputs_embeds`? This may help in avoiding the dimension mismatch.

```python
outputs = model.generate(input_ids=input_ids, past_key_values=prompt_cache)
```

The StaticCache mechanism is designed to work with input tokens rather than embeddings directly. By letting the model handle the embedding process internally, we avoid dimension mismatch issues during attention mask creation.
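Something like this (a minimal sketch; the checkpoint name, cache sizes, and prompt are only illustrative, not taken from the report):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Pre-allocated static cache; sizes chosen arbitrarily for the example.
prompt_cache = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=1024, device="cuda", dtype=torch.bfloat16
)

input_ids = tokenizer("Hello, how are you?", return_tensors="pt").input_ids.to("cuda")

# Pass token ids and let the model embed them internally; the attention mask
# is then inferred from input_ids, which avoids the shape mismatch.
outputs = model.generate(input_ids=input_ids, past_key_values=prompt_cache, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```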
I can use StaticCache with `input_ids`, but unfortunately, in my scenario, I can't provide `input_ids` to the `model.generate` API, so it looks like I'll have to give up on using StaticCache.
Can you explain why you can't pass `input_ids` to `model.generate`?
Because the features I pass to `model.generate` are a combination of encoded audio feature values and text feature values that have gone through the LLM's embedding layer.
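Roughly, such combined inputs are built like this (a self-contained sketch with a faked audio encoder output; the checkpoint, shapes, and projection are assumptions for illustration only):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint, as elsewhere in the thread
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

hidden_size = model.config.hidden_size
audio_features = torch.randn(1, 50, 1280)   # stand-in for an audio encoder's output
audio_proj = nn.Linear(1280, hidden_size)   # stand-in projection into the LLM hidden space
audio_embeds = audio_proj(audio_features)

text_ids = tokenizer("Describe the audio.", return_tensors="pt").input_ids
text_embeds = model.get_input_embeddings()(text_ids)

# The combined sequence only exists as embeddings; there are no token ids
# for the audio part, which is why input_ids cannot be passed to generate().
inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
```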
After going down the rabbit hole, here's what I think: when we use `inputs_embeds` instead of `input_ids`, we must explicitly provide an attention mask, since the model cannot automatically infer it from the embedded inputs.
Can you try the below:

```python
# With inputs_embeds, the mask must match the embedded sequence's shape.
batch_size, sequence_length, _ = inputs_embeds.shape
attention_mask = torch.ones((batch_size, sequence_length), dtype=torch.long, device=inputs_embeds.device)
outputs = model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=attention_mask,
    past_key_values=prompt_cache,
)
```
StaticCache should work with input embeds even without an attention mask, and from the code snippet I see that the first `generate()` call was successful. But the second call fails because we cannot continue generation with embeds as inputs, due to how the model forward kwargs are prepared internally within the generation logic. So in this case, even if we bypass the error with an attention mask, the model will use only the cached inputs and disregard the new concatenated prompt.

I'll check whether we can accommodate continuing generation with embeds later next week; also feel free to open a PR if you have an initial fix :)
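For reference, the failing pattern is roughly the following two-step reuse of the prompt cache (a simplified sketch; the checkpoint, prompts, and cache sizes are illustrative and not the reporter's actual script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt_cache = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=1024, device="cuda", dtype=torch.bfloat16
)

ids = tokenizer("Hello", return_tensors="pt").input_ids.to("cuda")
embeds = model.get_input_embeddings()(ids)

# First call works: the static cache is filled during prefill.
out1 = model.generate(inputs_embeds=embeds, past_key_values=prompt_cache, max_new_tokens=8)

# Second call concatenates new embeddings onto the old prompt and reuses the
# cache; this is where continuing generation from embeds breaks down, since
# the generation loop cannot re-derive the already-cached prefix from embeddings.
new_ids = tokenizer(" How are you?", return_tensors="pt").input_ids.to("cuda")
new_embeds = torch.cat([embeds, model.get_input_embeddings()(new_ids)], dim=1)
out2 = model.generate(inputs_embeds=new_embeds, past_key_values=prompt_cache, max_new_tokens=8)
```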
System Info
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
When I use StaticCache to perform inference on Qwen2.5, a bug occurs. In this example, I pass the tensor produced by the embedding layer to `model.generate` instead of the token IDs from the tokenizer. The reproduction script is as follows:
I used the latest version of Transformers by compiling it from source. The error message is as follows:
Expected behavior
I can successfully run the above script using StaticCache.