AbrahamSanders closed this issue 3 weeks ago
Seems like it is caused by https://github.com/huggingface/transformers/pull/31898, which removed cropping of the attention mask and position ids to the cache's max length when SinkCache is used. It fails for Qwen because Qwen still has an old RoPE implementation, while Llama uses a slightly improved version. Note, however, that not failing on Llama doesn't mean the generation is correct: the applied position embeddings are still not the ones expected by SinkCache.
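To illustrate the kind of logic the PR removed, here is a minimal sketch of cropping the attention mask and position ids to a fixed cache window. The function name and shapes are illustrative, not the actual transformers internals: the point is that once a sink/sliding cache is full, the mask and positions passed to the model must be truncated so they stay aligned with the cached keys/values.

```python
# Hypothetical sketch (not the real transformers code): once the cache
# holds max_cache_length entries, the attention mask and position ids
# must be cropped to that length, otherwise their shapes no longer match
# the cached keys/values and indexing errors like the one reported here
# can occur.

def crop_to_cache_window(attention_mask, position_ids, max_cache_length):
    """Keep only the last `max_cache_length` mask entries / position ids."""
    if len(attention_mask) > max_cache_length:
        attention_mask = attention_mask[-max_cache_length:]
        position_ids = position_ids[-max_cache_length:]
    return attention_mask, position_ids

# Example: a 6-token sequence decoded with a cache window of 4.
mask = [1, 1, 1, 1, 1, 1]
pos = [0, 1, 2, 3, 4, 5]
mask, pos = crop_to_cache_window(mask, pos, 4)
# Both are now length 4, matching the cache contents.
```

Whether the positions should be the last absolute positions (as here) or re-based for the sink tokens is exactly the RoPE subtlety mentioned above, so this sketch only shows the shape alignment, not the full SinkCache position handling.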
cc @gante let's get SinkCache working so that we can track whether new changes break anything. I guess we first need to decide where to do special handling for these cache types :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.43.2

Who can help?
@zucchini-nlp @gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run the code below on CPU (using GPU hides the actual error behind a RuntimeError: CUDA error: device-side assert triggered).

*** Works fine in v4.42.4, the error appears only in 4.43.0+ ***
*** Tested Llama-2, Mistral, and Qwen1.5 models. The issue only appears to affect Qwen1.5, but it may impact other models that I didn't test. ***
This appears to be a separate issue from #31381
Traceback:
Expected behavior
Generation should complete with no error.