tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Error when using Qwen-14B #24

Open sun1092469590 opened 9 months ago

sun1092469590 commented 9 months ago

Hello,

When using attention sink with Qwen-14B, I get the following error: TypeError: 'NoneType' object is not subscriptable

My script is as follows:

```python
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    # for efficiency:
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,  # <- Low for the sake of faster generation
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "保持身体健康有多种方式"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    streamer = TextStreamer(tokenizer)
    generation_config = GenerationConfig(
        use_cache=True,
        min_new_tokens=100_000,
        max_new_tokens=1_000_000,
        penalty_alpha=0.6,
        top_k=5,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    generated_tokens = model.generate(
        input_ids,
        generation_config,
        streamer=streamer,
    )

# Decode the final generated text
output_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
```

The error appears in model.generate(); I'd like to know why this happens.

tomaarsen commented 9 months ago

Will download the model and try and reproduce this, but I'm noticing that trust_remote_code=True is not added in the AutoModelForCausalLM.from_pretrained, which means that the model should not be loaded at all, as it's a model with remote code. So, perhaps not having trust_remote_code=True causes the issue?

sun1092469590 commented 9 months ago

> Will download the model and try and reproduce this, but I'm noticing that trust_remote_code=True is not added in the AutoModelForCausalLM.from_pretrained, which means that the model should not be loaded at all, as it's a model with remote code. So, perhaps not having trust_remote_code=True causes the issue?
>
> • Tom Aarsen

Sorry, I have added trust_remote_code=True to both AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained, but the error still happens.

tomaarsen commented 9 months ago

The 14B model is downloading, but it will take a while. Until then, this is my output with your script using Qwen/Qwen-7B:

保持身体健康有多种方式,以下是一些建议:

1. 均衡饮食:饮食应包括五大类食物,即谷物、蔬菜、水果、蛋白质和脂肪。避免高糖、高盐、高脂肪和加工食品。

2. 锻炼身体:每周至少进行150分钟的中等强度有氧运动,如快走、跑步、游泳等。此外,还应进行力量训练,如举重、俯卧撑等。

3. 充足睡眠:每晚应保证7-8小时的睡眠时间,以帮助身体恢复和修复。

4. 减少压力:压力是导致许多健康问题的主要原因之一。可以通过冥想、瑜伽、深呼吸等方式来减轻压力。

5. 戒烟限酒:吸烟和过量饮酒都会对身体健康造成负面影响。应尽量避免吸烟和过量饮酒。

6. 定期体检:定期进行体检可以帮助发现潜在的健康问题,并及早进行治疗。

希望这些建议对您有所帮助。如果您有任何其他问题,请随时问我。

<more text>

For reference, I am using transformers==4.34.0, maybe that's the issue?

But I'll try with Qwen-14B too to see if I can reproduce the problem.

tomaarsen commented 9 months ago

This is my output for Qwen-14B:

保持身体健康有多种方式,包括饮食、运动和睡眠。饮食方面,我们应该多吃水果、蔬菜和全谷类食品,少吃高热量、高脂肪和高糖分的食品。运动方面,每周至少进行150分钟的中等强度
有氧运动,如快走、跑步、游泳等。此外,还应该进行力量训练,以增强肌肉和骨骼。睡眠方面,每晚应该保证7-8 小时的睡眠时间,避免熬夜和过度使用电子设备。<more text>

It seems to work just fine for me. Perhaps you can 1) verify that you have the right transformers version and 2) post the full traceback here, so I can see where it actually throws an error.
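
A quick way to double-check the installed version (sketch):

```python
# Print the installed transformers version; this issue assumes 4.34.0
import transformers

print(transformers.__version__)
```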

sun1092469590 commented 9 months ago

Thank you very much. My current transformers version is also 4.34.0, and I can run Qwen-14B normally when attention_sinks is not used. (Screenshots attached.)

tomaarsen commented 9 months ago

I see now that you're using Flash Attention. The current Attention Sinks implementation for Qwen doesn't work with FA. I'll try to see whether I can extend the implementation so it does work, but I'm still in the process of getting FA installed, so it's not easy to test.
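
In the meantime, a possible workaround is to load the model with Flash Attention disabled, roughly like this (sketch only; `use_flash_attn` is a flag handled by Qwen's remote modeling code, and the other parameters just mirror your script):

```python
import torch
from attention_sinks import AutoModelForCausalLM

# Same load call as before, but with Flash Attention turned off so that the
# regular attention path (which attention_sinks patches) is used instead.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_attn=False,
)
```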

tomaarsen commented 9 months ago

I'll be testing here: https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa

tomaarsen commented 9 months ago

Sadly, I can't reasonably test this without investing more time into WSL or a dual boot, as I'm on Windows. Colab also doesn't work: `RuntimeError: FlashAttention only supports Ampere GPUs or newer`. Perhaps you can run

```
pip install git+https://github.com/tomaarsen/attention_sinks.git@model/qwen_fa
```

and check if it works. It would be very helpful.

sun1092469590 commented 9 months ago

Thank you very much for your detailed answer. I will first try your method, and if it doesn't work I will stop using Flash Attention and test again.

sun1092469590 commented 9 months ago

1) I stopped using Flash Attention by adding the parameter `use_flash_attn=False` to `AutoModelForCausalLM.from_pretrained()`, and the result is normal, like the one you showed me:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_attn=False,
)
```

2) I downloaded the new branch version (https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa) and tested it. The code runs without any error, but the result is not very good; maybe flash_attn is not installed correctly. With text = "保持身体健康有多种方式", the result is as follows: (screenshot attached)

sun1092469590 commented 9 months ago

Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.

tomaarsen commented 9 months ago
> 1. I stopped using Flash Attention by adding the parameter use_flash_attn=False to AutoModelForCausalLM.from_pretrained(), and the result is normal, like the one you showed me.

Awesome! I'm glad.

> 2. I downloaded the new branch version (https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa) and tested it. The code runs without any error, but the result is not very good; maybe flash_attn is not installed correctly. With text = "保持身体健康有多种方式", the result is as follows:
>    ![image](https://user-images.githubusercontent.com/19388387/277531786-90fc7232-6953-4dc6-a1c2-1f3b72a733ff.png)

That's a shame. There must be a bug in there somewhere. I opened #25 to raise an error when Flash Attention is used. Perhaps in the future I can fix Flash Attention support for Qwen.

> Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sinks, and how to use it with a chat model.

You can indeed; see for example this script: https://github.com/tomaarsen/attention_sinks/blob/main/demo/streaming.py. In that file, the LLM is continuously given prompts from a dataset of prompts; in practice, you could receive these prompts from the user on the fly. You can then generate tokens with this loop: https://github.com/tomaarsen/attention_sinks/blob/1f17f70d9fb8e47bf123e599a557f6faffc2520e/demo/streaming.py#L37-L45

Note: The streamer just writes the text to a file and the terminal, that line is optional.
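
For reference, here is a rough, self-contained sketch of that pattern with a Qwen chat model (assumptions: greedy decoding, Flash Attention disabled, and the same sink/window sizes as earlier; the demo itself differs in some details):

```python
import torch
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_attn=False,
    trust_remote_code=True,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

past_key_values = None  # the windowed (sink + recent tokens) cache, reused across prompts


@torch.no_grad()
def answer(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a reply to a single user turn, token by token (greedy)."""
    global past_key_values
    # Format the user turn with the chat template, then tokenize it
    text = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    generated = []
    for _ in range(max_new_tokens):
        outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values  # keep the fixed-size cache for the next step
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if tokenizer.eos_token_id is not None and next_token.item() == tokenizer.eos_token_id:
            break
        generated.append(next_token.item())
        input_ids = next_token  # only the newest token is fed in; context lives in the cache
    return tokenizer.decode(generated, skip_special_tokens=True)


print(answer("保持身体健康有多种方式"))
```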

sun1092469590 commented 9 months ago

OK, thank you very much, I will try it. I tried another method and it does produce output; is this method right or wrong? This is my code:

```python
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_attn=False,
    trust_remote_code=True,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "保持身体健康的几种方式"
response, history = model.chat(tokenizer, text, history=None)
print(response)
```

tomaarsen commented 9 months ago

I'm afraid not. It's important to pass the old past_key_values to every forward call, which isn't done with model.chat.

sun1092469590 commented 9 months ago

I see. I will try your method, thank you for the quick reply.

sun1092469590 commented 9 months ago

I used Qwen-14B-Chat and some of the script from demo/streaming.py to get a result, but it very easily hits OOM, even though max_new_tokens=256 is not very large. My GPUs are 4×80 GB. This is my script and the error log:

```python
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM
from typing import Any, Dict, List

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=256,
    use_flash_attn=False,
    trust_remote_code=True,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "保持身体健康有多种方式"
past_key_values = None
new_line_tokens = tokenizer("\n\n", return_tensors="pt", add_special_tokens=False).input_ids

prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda:3")

max_new_tokens = 256
output = ""
for _ in range(max_new_tokens):
    outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    output += tokenizer.decode(pred_token_idx.cpu()[0], skip_special_tokens=True)
    input_ids = pred_token_idx
    if pred_token_idx == tokenizer.eos_token_id:
        break

print(output)
```

Here is some of the error log: (screenshots attached)