Closed: sunzhaoyang1 closed this issue 9 months ago
try: from airllm import AirLLMQWen instead of AirLLMLlama2
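For reference, a minimal sketch of what that swap looks like (the model id comes from the traceback below; everything else about the script is assumed):

from airllm import AirLLMQWen  # QWen-specific class instead of AirLLMLlama2

# The QWen class is meant to handle Qwen checkpoint layouts, which is where
# the Llama2 class fails for Qwen/Qwen-14B-Chat.
model = AirLLMQWen("Qwen/Qwen-14B-Chat")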
I'm closing this; feel free to reopen if this doesn't fix it.
try: from airllm import AirLLMQWen instead of AirLLMLlama2
Still getting an error:
Fetching 37 files: 100%|███████████████████████████████████████████████████| 37/37 [00:00<00:00, 7404.07it/s]
The model is automatically converting to fp16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Traceback (most recent call last):
  File "D:\TCM_LLM\TCM-CHAT.PY", line 29, in <module>
    ... pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
QWen's tokenizer doesn't have a pad token by default, so you can remove the padding line:

input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               # padding=True  <----------
                               )
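If you do need padding (for example, batched prompts), here is a minimal sketch of the alternative the traceback itself suggests; reusing the eos token as the pad token is an assumption on my side, and it only works if the loaded Qwen tokenizer actually exposes one:

# Assumption: the tokenizer has an eos_token we can reuse as padding.
# If it doesn't, dropping padding=True as above is the safer route.
if model.tokenizer.pad_token is None and model.tokenizer.eos_token is not None:
    model.tokenizer.pad_token = model.tokenizer.eos_token

input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=True)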
Thank you very much.
modeling_qwen.py: 100%|█████████████████████████████████████████████████| 55.6k/55.6k [00:00<00:00, 1.48MB/s]
Fetching 37 files: 100%|█████████████████████████████████████████████████████| 37/37 [00:02<00:00, 15.20it/s]
0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\TCM_LLM\TCM-CHAT.PY", line 21, in <module>
    model = AirLLMLlama2(r"Qwen/Qwen-14B-Chat")
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\airllm\airllm.py", line 75, in __init__
    self.model_local_path, self.checkpoint_path = find_or_create_local_splitted_path(model_local_path_or_repo_id,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\airllm\utils.py", line 289, in find_or_create_local_splitted_path
    return Path(hf_cache_path), split_and_save_layers(hf_cache_path, layer_shards_saving_path,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\airllm\utils.py", line 220, in split_and_save_layers
    if max(shards) > shard:
ValueError: max() arg is an empty sequence
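The "max() arg is an empty sequence" means split_and_save_layers collected no checkpoint shards it recognizes from the downloaded snapshot, which is why the QWen-specific class is suggested above. If you want to see what the Llama2 code path was handed, a small hypothetical check (plain huggingface_hub, not part of airllm) is:

from huggingface_hub import list_repo_files

# Print the checkpoint files the repo actually ships; the Llama2-oriented
# splitting step evidently finds nothing it can treat as a layer shard here.
for name in list_repo_files("Qwen/Qwen-14B-Chat"):
    if name.endswith((".bin", ".safetensors")):
        print(name)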
Name: airllm
Version: 2.3.1
Summary: AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.
Home-page: https://github.com/lyogavin/Anima/tree/main/air_llm
Author: Gavin Li
Author-email: gavinli@animaai.cloud
License:
Location: c:\users\administrator\appdata\local\programs\python\python310\lib\site-packages
Requires: accelerate, huggingface-hub, optimum, safetensors, scipy, torch, tqdm, transformers
Required-by:
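Before rerunning, a quick hypothetical sanity check that this installed airllm build (2.3.1) actually exposes the QWen class suggested above:

import airllm

# Expect True if "from airllm import AirLLMQWen" will work with this install;
# if False, upgrading airllm is the next step.
print(hasattr(airllm, "AirLLMQWen"))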