lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

Importing Qwen fails with ValueError: max() arg is an empty sequence; airllm is the latest version. #55

Closed sunzhaoyang1 closed 9 months ago

sunzhaoyang1 commented 9 months ago

modeling_qwen.py: 100%|█████████████████████████████████████████████████| 55.6k/55.6k [00:00<00:00, 1.48MB/s]
Fetching 37 files: 100%|█████████████████████████████████████████████████████| 37/37 [00:02<00:00, 15.20it/s]
  0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\TCM_LLM\TCM-CHAT.PY", line 21, in <module>
    model = AirLLMLlama2(r"Qwen/Qwen-14B-Chat")
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\airllm\airllm.py", line 75, in __init__
    self.model_local_path, self.checkpoint_path = find_or_create_local_splitted_path(model_local_path_or_repo_id,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\airllm\utils.py", line 289, in find_or_create_local_splitted_path
    return Path(hf_cache_path), split_and_save_layers(hf_cache_path, layer_shards_saving_path,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\airllm\utils.py", line 220, in split_and_save_layers
    if max(shards) > shard:
ValueError: max() arg is an empty sequence

Name: airllm
Version: 2.3.1
Summary: AirLLM allows single 4GB GPU card to run 70B large language models without quantization, distillation or pruning.
Home-page: https://github.com/lyogavin/Anima/tree/main/air_llm
Author: Gavin Li
Author-email: gavinli@animaai.cloud
License:
Location: c:\users\administrator\appdata\local\programs\python\python310\lib\site-packages
Requires: accelerate, huggingface-hub, optimum, safetensors, scipy, torch, tqdm, transformers
Required-by:

lyogavin commented 9 months ago

Try: from airllm import AirLLMQWen instead of AirLLMLlama2
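
For context, a minimal end-to-end sketch of the corrected script, following the usage pattern from the AirLLM README; the prompt, MAX_LENGTH and generation parameters below are placeholders, not taken from this issue:

from airllm import AirLLMQWen

MAX_LENGTH = 128

# Use the QWen-specific class so the Qwen checkpoint layout is split correctly
model = AirLLMQWen("Qwen/Qwen-14B-Chat")

input_text = "What is the capital of France?"

# Tokenize without padding=True; Qwen's tokenizer has no pad token (see below)
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

print(model.tokenizer.decode(generation_output.sequences[0]))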

lyogavin commented 9 months ago

I'm closing this; feel free to reopen if this doesn't fix it.

sunzhaoyang1 commented 9 months ago

Try: from airllm import AirLLMQWen instead of AirLLMLlama2

Still getting an error:

Fetching 37 files: 100%|███████████████████████████████████████████████████| 37/37 [00:00<00:00, 7404.07it/s]
The model is automatically converting to fp16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Traceback (most recent call last):
  File "D:\TCM_LLM\TCM-CHAT.PY", line 29, in <module>
    input_tokens = model.tokenizer(input_text,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 2798, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 2884, in _call_one
    return self.batch_encode_plus(
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 3066, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 2703, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

lyogavin commented 9 months ago

QWen's tokenizer doesn't have a pad token by default; you can remove the padding line:

input_tokens = model.tokenizer(input_text,
    return_tensors="pt", 
    return_attention_mask=False, 
    truncation=True, 
    max_length=MAX_LENGTH, 
    # padding=True   <----------
)
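
If padding is actually needed (e.g. to batch several prompts), the other route suggested by the error message itself is to give the tokenizer a pad token first. A minimal sketch, assuming the tokenizer exposes a usable eos_token; if it is None for Qwen, a concrete token already in its vocabulary would have to be chosen instead:

# Assign a pad token before enabling padding=True.
# Assumption: eos_token is set; for Qwen it may be None, in which case pick an
# existing special token (e.g. '<|endoftext|>') rather than relying on eos_token.
if model.tokenizer.pad_token is None:
    model.tokenizer.pad_token = model.tokenizer.eos_token

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
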
sunzhaoyang1 commented 9 months ago

Thank you very much!
