huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

SinkCache with Qwen1.5 broken in 4.43.0+ #32233

Closed: AbrahamSanders closed this issue 3 weeks ago

AbrahamSanders commented 2 months ago

System Info

Who can help?

@zucchini-nlp @gante

Information

Tasks

Reproduction

Run the code below on CPU (using GPU hides the actual error behind a RuntimeError: CUDA error: device-side assert triggered).

*** Works fine in v4.42.4; the error appears only in 4.43.0+ ***

*** Tested Llama-2, Mistral, and Qwen1.5. The issue appears to affect only Qwen1.5, but it may impact other models I didn't test. ***

This appears to be a separate issue from #31381.

from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache
from transformers.trainer_utils import set_seed

# Load the model and tokenizer
# model_name = "meta-llama/Llama-2-7b-hf" # <-- this works!
# model_name = "mistralai/Mistral-7B-v0.1" # <-- this works!
model_name = "Qwen/Qwen1.5-1.8B" # <-- this doesn't work!

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare the input
input_text = "The quick brown fox jumps over the lazy"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate the output
set_seed(42)
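# Note: window_length (50) is smaller than max_new_tokens (100), so
# generation has to run past the point where the cache starts evicting tokens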
sink_cache = SinkCache(window_length=50, num_sink_tokens=8)
output = model.generate(
    **inputs,
    use_cache=True,
    past_key_values=sink_cache,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
)

# Decode the output
output_text = tokenizer.decode(output[0])
print(output_text)

Traceback:

Traceback (most recent call last):
  File "/remote/ayuser/user/sinkcache_test.py", line 19, in <module>
    output = model.generate(
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1054, in forward
    outputs = self.model(
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 856, in forward
    layer_outputs = decoder_layer(
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 596, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 496, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/home/user/anaconda3/envs/bark/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 149, in apply_rotary_pos_emb
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
IndexError: index 50 is out of bounds for dimension 0 with size 50
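For context, the failing frame reduces to the indexing pattern below. A minimal standalone sketch (the head_dim of 64 is illustrative, not taken from the Qwen1.5 config):

import torch

# SinkCache keeps the rotary cos/sin tables at window_length positions,
# but the position ids keep growing once the window is full.
window_length = 50
cos = torch.randn(window_length, 64)            # cos table: [window_length, head_dim]
position_ids = torch.tensor([[window_length]])  # first position past the full window

cos[position_ids]  # IndexError: index 50 is out of bounds for dimension 0 with size 50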

Expected behavior

Generation should complete with no error.

zucchini-nlp commented 2 months ago

Seems like it is caused by https://github.com/huggingface/transformers/pull/31898, which removed the cropping of the attention mask and position ids to the cache's max length when SinkCache is used. It fails for Qwen because Qwen still has the old RoPE implementation, while Llama uses a slightly improved version. But note that not failing on Llama doesn't mean the generation is correct: the position embeddings being applied are still not the ones expected by SinkCache.
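Roughly, the idea behind the removed cropping (a hypothetical sketch, not the actual code deleted in #31898): once the cache is full, position ids have to stay within the cached rotary table.

import torch

# Hypothetical helper, for illustration only: keep positions within the
# cache's maximum length so indexing into the rotary table cannot overflow.
def crop_position_ids(position_ids: torch.Tensor, max_cache_length: int) -> torch.Tensor:
    return position_ids.clamp(max=max_cache_length - 1)

print(crop_position_ids(torch.tensor([[50]]), max_cache_length=50))  # tensor([[49]])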

cc @gante let's get SinkCache working so that we can track whether new changes break anything. I guess we first need to decide where to do the special handling for these cache types :)

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.