lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

It runs #192

Open werruww opened 8 hours ago

werruww commented 8 hours ago

!pip install -U airllm

!pip install -U bitsandbytes

!pip install git+https://github.com/huggingface/transformers.git

!pip install git+https://github.com/huggingface/accelerate.git

!pip install tiktoken

!pip install transformers_stream_generator

from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:

model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)

model = AutoModel.from_pretrained("Qwen/Qwen-7B", compression='4bit', delete_original=True # specify '8bit' for 8-bit block-wise quantization )

# or use the model's local path...

model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
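
A minimal sketch, assuming huggingface_hub is installed and there is enough disk space, for resolving that snapshot path programmatically instead of hard-coding it:

from huggingface_hub import snapshot_download
from airllm import AirLLMLlama2

# download the repo (or reuse the local cache) and get its snapshot directory
local_path = snapshot_download("garage-bAInd/Platypus2-70B-instruct")
model = AirLLMLlama2(local_path)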

input_text = [
    'What is the capital of China?',
    'Who is Napoleon Bonaparte؟',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

Fetching 20 files: 100%  20/20 [00:00<00:00, 1217.93it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, ..., 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:13<00:00, 2.68it/s]
[the bf16/flash_attn warnings and the running layers(cuda:0) 35/35 progress bar repeat for every decoding step, at roughly 2.7–2.9 it/s]
Who is Napoleon Bonaparte؟" The answer is:\nA:\n\nNapoleon Bon

werruww commented 8 hours ago

max_new_tokens=<number of new tokens to generate>,

The higher the value, the heavier the GPU load and the longer generation takes.

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])
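
A small sketch of that effect, reusing the model and input_tokens defined above (the timings are illustrative only and depend on the GPU and model):

import time

# each extra token requires another pass through all layers, so time grows roughly linearly
for n in (5, 12, 32):
    start = time.time()
    model.generate(
        input_tokens['input_ids'].cuda(),
        max_new_tokens=n,
        use_cache=True,
        return_dict_in_generate=True)
    print(f"max_new_tokens={n}: {time.time() - start:.1f}s")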

werruww commented 8 hours ago

It runs on a Colab T4.
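
A quick sanity check (a minimal sketch using PyTorch, which is already available on Colab) to confirm the GPU runtime is active before loading the model:

import torch

print(torch.cuda.is_available())      # should print True on a GPU runtime
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"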

werruww commented 8 hours ago

from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:

model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)

model = AutoModel.from_pretrained("Qwen/Qwen-7B",
                                  compression='4bit',  # specify '8bit' for 8-bit block-wise quantization
                                  delete_original=True)

# or use the model's local path...

model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of China?',
    'Who invented the electric light bulb?',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    no_repeat_ngram_size=3,   # prevents repeating any three-word sequence
    repetition_penalty=1.2,   # penalizes repetition to avoid repeated words
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

bitsandbytes installed
cache_utils installed
Fetching 20 files: 100%  20/20 [00:00<00:00, 1025.63it/s]
[found_layers: all layers found, and saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit, as in the previous run]
[same bf16 and flash_attn warnings as in the previous run]
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.70it/s]
[warnings and progress bar repeat for each decoding step, at roughly 2.7–2.9 it/s]
Who invented the electric light bulb? A. Thomas Edison

How do I prevent the question "Who invented the electric light bulb?" from being repeated in the output, so that only the answer "A. Thomas Edison" is returned?

werruww commented 8 hours ago

How do I prevent the question from being repeated?

werruww commented 7 hours ago

Is there an echo=True option in airllm?

werruww commented 7 hours ago

That is, something that controls whether the prompt is echoed back in the output.
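
As far as I can tell, airllm itself does not expose an echo flag, but since generate returns the prompt ids followed by the newly generated ids (as the outputs above show), a workaround is to decode only the tokens after the prompt. A sketch for a single prompt, reusing input_tokens and generation_output from the snippets above (batched, padded prompts need more careful offset handling):

prompt_len = input_tokens['input_ids'].shape[1]
new_tokens = generation_output.sequences[0][prompt_len:]  # drop the echoed prompt ids
print(model.tokenizer.decode(new_tokens))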

werruww commented 7 hours ago

Fetching 20 files: 100%  20/20 [00:00<00:00, 1258.81it/s]
[found_layers: all layers found, and saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit, as in the previous runs]
[same bf16 and flash_attn warnings as in the previous runs]
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.72–2.88it/s] (repeats for each decoding step)
What is the capital of United States? The answer is:

The output was "What is the capital of the United States? The answer is:". How do I prevent this and get a direct answer without the question being repeated?

werruww commented 7 hours ago

Could the airllm maintainers change the answer-decoding step so that the question does not appear in the answer, and so that the echoed prompt is not counted toward the number of tokens to be generated?

werruww commented 6 hours ago

Without repeating the question in the answer:

from airllm import AutoModel

MAX_LENGTH = 128

model = AutoModel.from_pretrained("Qwen/Qwen-7B", compression='4bit')

input_text = ['Who invented the electric lamp?']

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)

response = model.tokenizer.decode(generation_output.sequences[0])
cleaned_response = response.replace(input_text[0], "").strip()  # remove the question
print(cleaned_response)

either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.80it/s]
A. Edison
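
Note that stripping the prompt with a string replace is brittle: the decoded text may not reproduce the prompt character for character (tokenization, whitespace, special tokens), so slicing the generated ids at the prompt length, as sketched earlier in the thread, is usually the more reliable cleanup.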