werruww opened 8 hours ago
max_new_tokens=N: the higher the number, the heavier the load on the GPU and the longer generation takes.
generation_output = model.generate( input_tokens['input_ids'].cuda(), max_new_tokens=12, use_cache=True, return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
It runs on a Colab T4.
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B",
    compression='4bit',    # specify '8bit' for 8-bit block-wise quantization
    delete_original=True)
input_text = [
'Who invented the electric light bulb?',
]
input_tokens = model.tokenizer(input_text, return_tensors="pt", return_attention_mask=False, truncation=True, max_length=MAX_LENGTH,
)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    no_repeat_ngram_size=3,   # prevents repeating three consecutive words
    repetition_penalty=1.2,   # repetition penalty to avoid repeating words
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
bitsandbytes installed
cache_utils installed
Fetching 20 files: 100% 20/20 [00:00<00:00, 1025.63it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, ..., 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}  (all 35 layers found)
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.70it/s]
(the same bf16/FlashAttention warnings and the "running layers" pass repeat once per generated token; duplicates omitted)
Who invented the electric light bulb? A. Thomas Edison
How do I prevent the question from being repeated, so that instead of "Who invented the electric light bulb? A. Thomas Edison" the output is only the answer, "A. Thomas Edison"?
Is there an echo=True option in airllm, i.e. a flag that controls whether the prompt is echoed back in the output?
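As far as I can tell, airllm simply forwards these arguments to the underlying transformers generate(), which has no echo flag; for decoder-only models the returned sequences always start with the prompt tokens, so the prompt has to be trimmed after generation (see the sketch after my workaround below).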
Fetching 20 files: 100% 20/20 [00:00<00:00, 1258.81it/s]
found_layers: all 35 layers found (same as above)
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
not support prefetching for compression for now. loading with no prepetching mode.
(same bf16/FlashAttention import warnings as above; running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, ~2.85it/s] repeats once per generated token)
What is the capital of United States? The answer is:
The output was: "What is the capital of the United States? The answer is:"
How do I prevent this and make the answer direct, without repeating the question?
Could the airllm maintainers change the generation/decoding step so that the question does not appear in the returned answer and the prompt tokens are not counted toward the number of tokens to be generated? In other words, return only the newly generated text, without repeating the question in the answer.
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B", compression='4bit')
input_text = ['Who invented the electric lamp?']
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
response = model.tokenizer.decode(generation_output.sequences[0])
cleaned_response = response.replace(input_text[0], "").strip()  # remove the question
print(cleaned_response)
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.80it/s]
A. Edison
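A more robust version of my workaround, assuming generation_output.sequences[0] starts with the prompt tokens (standard Hugging Face generate() behavior for decoder-only models): slice off the prompt tokens before decoding instead of string-replacing the question, since the decoded text may not match the original prompt string exactly.

# Minimal sketch: decode only the newly generated tokens.
prompt_length = input_tokens['input_ids'].shape[1]           # number of prompt tokens
new_tokens = generation_output.sequences[0][prompt_length:]  # everything after the prompt
answer = model.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(answer)

max_new_tokens already counts only the new tokens, so nothing changes there; only the decoding step needs to skip the prompt.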
!pip install -U airllm
!pip install -U bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install tiktoken
!pip install transformers_stream_generator
from airllm import AutoModel
MAX_LENGTH = 128
could use hugging face model repo id:
model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)
model = AutoModel.from_pretrained("Qwen/Qwen-7B", compression='4bit', delete_original=True # specify '8bit' for 8-bit block-wise quantization )
or use model's local path...
model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
input_text = [
    'What is the capital of China?',
]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
Fetching 20 files: 100% 20/20 [00:00<00:00, 1217.93it/s]
found_layers: all 35 layers found (same as above)
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
not support prefetching for compression for now. loading with no prepetching mode.
(same bf16/FlashAttention import warnings as above; running layers(cuda:0): 100%|██████████| 35/35 [00:12-00:13<00:00, ~2.8it/s] repeats once per generated token)
Who is Napoleon Bonaparte؟" The answer is:\nA:\n\nNapoleon Bon