lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

It runs #192

Open werruww opened 8 hours ago

werruww commented 8 hours ago

!pip install -U airllm

!pip install -U bitsandbytes

!pip install git+https://github.com/huggingface/transformers.git

!pip install git+https://github.com/huggingface/accelerate.git

!pip install tiktoken

!pip install transformers_stream_generator

from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:

model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)

model = AutoModel.from_pretrained("Qwen/Qwen-7B", compression='4bit', delete_original=True # specify '8bit' for 8-bit block-wise quantization )

# or use the model's local path...

model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
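
A minimal sketch, assuming huggingface_hub is installed and there is enough disk space, for resolving that snapshot path programmatically instead of hard-coding it:

from huggingface_hub import snapshot_download
from airllm import AirLLMLlama2

# download the repo (or reuse the local cache) and get its snapshot directory
local_path = snapshot_download("garage-bAInd/Platypus2-70B-instruct")
model = AirLLMLlama2(local_path)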

input_text = [
    'What is the capital of China?',
    'Who is Napoleon Bonaparte؟',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

Fetching 20 files: 100%  20/20 [00:00<00:00, 1217.93it/s]
found_layers:{'transformer.wte.': True, 'transformer.h.0.': True, 'transformer.h.1.': True, ..., 'transformer.h.31.': True, 'transformer.ln_f.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Try importing flash-attention for faster inference...
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
WARNING:transformers_modules.ef3c5c9c57b252f3149c1408daf4d649ec8b6c85.modeling_qwen:Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:13<00:00, 2.68it/s]
[the bf16/flash_attn warnings and the running layers(cuda:0) 35/35 progress bar repeat for every decoding step, at roughly 2.7–2.9 it/s]
Who is Napoleon Bonaparte؟" The answer is:\nA:\n\nNapoleon Bon

werruww commented 8 hours ago

max_new_tokens=<number of new tokens to generate>,

The higher the value, the heavier the GPU load and the longer generation takes.

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=12,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])
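
A small sketch of that effect, reusing the model and input_tokens defined above (the timings are illustrative only and depend on the GPU and model):

import time

# each extra token requires another pass through all layers, so time grows roughly linearly
for n in (5, 12, 32):
    start = time.time()
    model.generate(
        input_tokens['input_ids'].cuda(),
        max_new_tokens=n,
        use_cache=True,
        return_dict_in_generate=True)
    print(f"max_new_tokens={n}: {time.time() - start:.1f}s")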

werruww commented 8 hours ago

It runs on a Colab T4.
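
A quick sanity check (a minimal sketch using PyTorch, which is already available on Colab) to confirm the GPU runtime is active before loading the model:

import torch

print(torch.cuda.is_available())      # should print True on a GPU runtime
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"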

werruww commented 8 hours ago

from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:

model = AutoModel.from_pretrained("Qwen/Qwen-7B", profiling_mode=True)

model = AutoModel.from_pretrained("Qwen/Qwen-7B",
                                  compression='4bit',  # specify '8bit' for 8-bit block-wise quantization
                                  delete_original=True)

# or use the model's local path...

model = AirLLMLlama2("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of China?',
    'Who invented the electric light bulb?',
]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    no_repeat_ngram_size=3,   # prevents repeating any three-word sequence
    repetition_penalty=1.2,   # penalizes repetition to avoid repeated words
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])

bitsandbytes installed
cache_utils installed
Fetching 20 files: 100%  20/20 [00:00<00:00, 1025.63it/s]
[found_layers: all layers found, and saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit, as in the previous run]
[same bf16 and flash_attn warnings as in the previous run]
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.70it/s]
[warnings and progress bar repeat for each decoding step, at roughly 2.7–2.9 it/s]
Who invented the electric light bulb? A. Thomas Edison

How do I prevent the question "Who invented the electric light bulb?" from being repeated in the output, so that only the answer "A. Thomas Edison" is returned?

werruww commented 8 hours ago

How do I prevent the question from being repeated?

werruww commented 7 hours ago

Is there an echo=True option in airllm?

werruww commented 7 hours ago

That is, something that controls whether the prompt is echoed back in the output.
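
As far as I can tell, airllm itself does not expose an echo flag, but since generate returns the prompt ids followed by the newly generated ids (as the outputs above show), a workaround is to decode only the tokens after the prompt. A sketch for a single prompt, reusing input_tokens and generation_output from the snippets above (batched, padded prompts need more careful offset handling):

prompt_len = input_tokens['input_ids'].shape[1]
new_tokens = generation_output.sequences[0][prompt_len:]  # drop the echoed prompt ids
print(model.tokenizer.decode(new_tokens))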

werruww commented 7 hours ago

Fetching 20 files: 100%  20/20 [00:00<00:00, 1258.81it/s]
[found_layers: all layers found, and saved layers already found in /root/.cache/huggingface/hub/models--Qwen--Qwen-7B/snapshots/ef3c5c9c57b252f3149c1408daf4d649ec8b6c85/splitted_model.4bit, as in the previous runs]
[same bf16 and flash_attn warnings as in the previous runs]
either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
not support prefetching for compression for now. loading with no prepetching mode.
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.72–2.88it/s] (repeats for each decoding step)
What is the capital of United States? The answer is:

The output was "What is the capital of the United States? The answer is:". How do I prevent this and get a direct answer without the question being repeated?

werruww commented 7 hours ago

Could the airllm maintainers change the answer-decoding step so that the question does not appear in the answer, and so that the echoed prompt is not counted toward the number of tokens to be generated?

werruww commented 6 hours ago

Without repeating the question in the answer:

from airllm import AutoModel

MAX_LENGTH = 128

model = AutoModel.from_pretrained("Qwen/Qwen-7B", compression='4bit')

input_text = ['Who invented the electric lamp?']

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)

response = model.tokenizer.decode(generation_output.sequences[0])
cleaned_response = response.replace(input_text[0], "").strip()  # remove the question
print(cleaned_response)

either BetterTransformer or attn_implementation='sdpa' is available, creating model directly
running layers(cuda:0): 100%|██████████| 35/35 [00:12<00:00, 2.80it/s]
A. Edison
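
Note that stripping the prompt with a string replace is brittle: the decoded text may not reproduce the prompt character for character (tokenization, whitespace, special tokens), so slicing the generated ids at the prompt length, as sketched earlier in the thread, is usually the more reliable cleanup.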