Open sun1092469590 opened 9 months ago
Will download the model and try to reproduce this, but I'm noticing that trust_remote_code=True is not passed to AutoModelForCausalLM.from_pretrained, which means that the model should not be loaded at all, as it's a model with remote code. So, perhaps not having trust_remote_code=True causes the issue?
- Tom Aarsen
Sorry, I have added trust_remote_code=True to AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained, but the error still happens.
The 14B model is downloading, but it will take a while. Until then, this is my output with your script with QWen/QWen-7B:
保持身体健康有多种方式,以下是一些建议:
1. 均衡饮食:饮食应包括五大类食物,即谷物、蔬菜、水果、蛋白质和脂肪。避免高糖、高盐、高脂肪和加工食品。
2. 锻炼身体:每周至少进行150分钟的中等强度有氧运动,如快走、跑步、游泳等。此外,还应进行力量训练,如举重、俯卧撑等。
3. 充足睡眠:每晚应保证7-8小时的睡眠时间,以帮助身体恢复和修复。
4. 减少压力:压力是导致许多健康问题的主要原因之一。可以通过冥想、瑜伽、深呼吸等方式来减轻压力。
5. 戒烟限酒:吸烟和过量饮酒都会对身体健康造成负面影响。应尽量避免吸烟和过量饮酒。
6. 定期体检:定期进行体检可以帮助发现潜在的健康问题,并及早进行治疗。
希望这些建议对您有所帮助。如果您有任何其他问题,请随时问我。
<more text>
For reference, I am using transformers==4.34.0, maybe that's the issue?
But I'll try with QWen-14B too to see if I can reproduce the problem.
This is my output for QWen-14B:
保持身体健康有多种方式,包括饮食、运动和睡眠。饮食方面,我们应该多吃水果、蔬菜和全谷类食品,少吃高热量、高脂肪和高糖分的食品。运动方面,每周至少进行150分钟的中等强度
有氧运动,如快走、跑步、游泳等。此外,还应该进行力量训练,以增强肌肉和骨骼。睡眠方面,每晚应该保证7-8 小时的睡眠时间,避免熬夜和过度使用电子设备。<more text>
It seems to work just fine for me. Perhaps you can 1) verify that you have the right transformers version and 2) post the full traceback here, so I can see where it actually throws an error.
Thank you very much. My current transformers version is also 4.34.0, and I can run QWen-14B normally when attention_sink is not added.
I see now that you're using Flash Attention. The current Attention Sinks implementation for QWen doesn't work with FA. I'll try to see if I can extend the implementation so it does work, but I'm still in the process of getting FA installed, so it's not easy to test.
I'll be testing here: https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa
Sadly, I can't reasonably test this without investing some more time into WSL or dualboot, as I'm on Windows. Colab also doesn't work: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Perhaps you can run
pip install git+https://github.com/tomaarsen/attention_sinks.git@model/qwen_fa
and check if it works. It would be very helpful.
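On the Colab error above: the "Ampere GPUs or newer" requirement corresponds to CUDA compute capability 8.0 and up. Below is a minimal sketch of a pre-check you could run before installing Flash Attention. The helper name supports_flash_attention is hypothetical and not part of attention_sinks or flash-attn; it takes the (major, minor) tuple that torch.cuda.get_device_capability() returns.

```python
from typing import Tuple


def supports_flash_attention(capability: Tuple[int, int]) -> bool:
    """Return True for compute capability 8.0 (Ampere) or newer."""
    major, _minor = capability
    return major >= 8


# With PyTorch and a GPU available, the tuple would come from
# torch.cuda.get_device_capability(0). Plain tuples are used here so the
# sketch runs anywhere.
print(supports_flash_attention((7, 5)))  # Tesla T4 (Turing), common on Colab
print(supports_flash_attention((8, 0)))  # A100 (Ampere)
```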
Thank you very much for your detailed answer. I will first try your method, and if it does not work I will stop using Flash Attention and test again.
1) I stopped using Flash Attention by adding the parameter use_flash_attn=False to AutoModelForCausalLM.from_pretrained(), and the result is normal, as you showed me. The call is:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_attn=False
)

2) I downloaded the new branch version, https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa, and tested it. The code runs without any error, but the result is not very good; maybe flash_attn is not installed correctly. With text = "保持身体健康有多种方式", the result is as follows:

Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sink, and how to use it in a chat model.
> 1) I stopped using Flash Attention by adding the parameter use_flash_attn=False to AutoModelForCausalLM.from_pretrained(), and the result is normal, as you showed me. The call is:
>
> model = AutoModelForCausalLM.from_pretrained(
>     model_id,
>     device_map="auto",
>     torch_dtype=torch.float16,
>     attention_sink_size=4,
>     attention_sink_window_size=252,
>     use_flash_attn=False
> )
Awesome! I'm glad.
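For readers wondering what attention_sink_size=4 and attention_sink_window_size=252 mean in the call above: the idea behind attention sinks is to keep the first few tokens plus a sliding window of the most recent tokens in the key/value cache, and drop everything in between. The following is only a toy sketch of that eviction policy on a list of token positions, not the library's actual implementation; the function name evict is hypothetical.

```python
from typing import List


def evict(cache: List[int], sink_size: int = 4, window_size: int = 252) -> List[int]:
    """Keep the first `sink_size` entries and the last `window_size` entries."""
    if len(cache) <= sink_size + window_size:
        # Nothing to drop yet: the cache still fits in sink + window.
        return list(cache)
    # Slice from an explicit start index so window_size=0 keeps no recent tokens.
    return cache[:sink_size] + cache[len(cache) - window_size:]


positions = list(range(300))  # pretend 300 tokens have been processed
kept = evict(positions)
print(len(kept), kept[:6], kept[-1])  # 256 [0, 1, 2, 3, 48, 49] 299
```

With these settings the cache never holds more than 4 + 252 = 256 entries, however long generation runs.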
> 2) I downloaded the new branch version, https://github.com/tomaarsen/attention_sinks/tree/model/qwen_fa, and tested it. The code runs without any error, but the result is not very good; maybe flash_attn is not installed correctly. With text = "保持身体健康有多种方式", the result is as follows: ![image](https://user-images.githubusercontent.com/19388387/277531786-90fc7232-6953-4dc6-a1c2-1f3b72a733ff.png)
That's a shame. There must be a bug there somewhere. I made #25 to add an error when flash attention is used. Perhaps in the future I can try to fix the support for flash attention with QWen.
> Also, I want to know whether a chat model such as Qwen-14B-Chat can use attention_sink, and how to use it in a chat model.
You can indeed, see for example this script: https://github.com/tomaarsen/attention_sinks/blob/main/demo/streaming.py In this file, the LLM is continuously given a prompt from a dataset of prompts. In practice, you could wait and receive these prompts from the user on the fly. Then you can generate tokens with this loop: https://github.com/tomaarsen/attention_sinks/blob/1f17f70d9fb8e47bf123e599a557f6faffc2520e/demo/streaming.py#L37-L45
Note: The streamer just writes the text to a file and the terminal, that line is optional.
OK, thank you very much, I will try it. I also tried another method and it produces output. This is my code; is this method right or wrong?

import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=252,
    use_flash_att=False,
    trust_remote_code=True
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "保持身体健康的几种方式"
response, history = model.chat(tokenizer, text, history=None)
print(response)
I'm afraid not. It's important to pass the old past_key_values to every forward call, which isn't done with model.chat.
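To illustrate the point about past_key_values, here is a toy sketch of the pattern, with a stand-in step function instead of a real model. The names greedy_decode and toy_step are hypothetical. In a real loop, step would wrap a call like model(input_ids, past_key_values=past_key_values, use_cache=True) and return the argmax token together with outputs.past_key_values.

```python
from typing import Callable, List, Optional, Tuple

Cache = List[int]
StepFn = Callable[[List[int], Optional[Cache]], Tuple[int, Cache]]


def greedy_decode(step: StepFn, prompt: List[int], max_new_tokens: int, eos_id: int) -> List[int]:
    """Generate tokens one at a time, feeding the cache back into every call."""
    cache: Optional[Cache] = None
    inputs = list(prompt)  # the first call sees the whole prompt
    generated: List[int] = []
    for _ in range(max_new_tokens):
        next_token, cache = step(inputs, cache)  # the old cache goes in, the new one comes out
        if next_token == eos_id:
            break
        generated.append(next_token)
        inputs = [next_token]  # later calls only see the newest token
    return generated


def toy_step(inputs: List[int], cache: Optional[Cache]) -> Tuple[int, Cache]:
    """Stand-in for a model: the 'prediction' is how many tokens the cache has seen."""
    new_cache = (cache or []) + list(inputs)
    return len(new_cache), new_cache


print(greedy_decode(toy_step, [10, 11, 12], max_new_tokens=5, eos_id=6))  # [3, 4, 5]
```

Because each call after the first receives only one new token, the earlier context reaches the model only through the cache. If the cache is dropped between calls, that context is lost.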
I see. I will try your method, thank you for the quick reply.
I used Qwen-14B-Chat with part of the script in demo/streaming.py to get a result, but it runs out of memory (OOM) very easily. Here max_new_tokens=256, which is not very large, and I have 4 × 80 GB GPUs. This is my script and error log:
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM
from typing import Any, Dict, List

model_id = "Qwen/Qwen-14B-Chat"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,
    attention_sink_window_size=256,
    use_flash_att=False,
    trust_remote_code=True
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "保持身体健康有多种方式"
past_key_values = None
new_line_tokens = tokenizer("\n\n", return_tensors="pt", add_special_tokens=False).input_ids

prompt = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda:3")

max_new_tokens = 256
output = ""
for _ in range(max_new_tokens):
    outputs = model(input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    output += tokenizer.decode(pred_token_idx.cpu()[0], skip_special_tokens=True)
    input_ids = pred_token_idx
    if pred_token_idx == tokenizer.eos_token_id:
        break

print(output)
Here is part of the error log:
Hello,
When using attention sink with Qwen-14B, I get the following error: TypeError: 'NoneType' object is not subscriptable
My script is:
import torch
from transformers import AutoTokenizer, TextStreamer, GenerationConfig
from attention_sinks import AutoModelForCausalLM

model_id = "Qwen/Qwen-14B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    # for efficiency:
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

text = "保持身体健康有多种方式"
input_ids = tokenizer.encode(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    streamer = TextStreamer(tokenizer)
    generation_config = GenerationConfig(
        use_cache=True,
        min_new_tokens=100_000,
        max_new_tokens=1_000_000,
        penalty_alpha=0.6,
        top_k=5,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    generated_tokens = model.generate(
        input_ids,
        generation_config,
        streamer=streamer,
    )
    # Decode the final generated text
The error appears in model.generate(). I want to know why this happens.