Closed: ByungKwanLee closed this issue 6 months ago.
There are many different formats for quantizing models, so stating that you are trying 4-bit is not helpful here without the type of quantization being defined.
Based on what format the model is quantized in, you will need to use that format's library instead of transformers directly.
For example, if you use AWQ, then you would only use transformers for the tokenizer, not the model, like this:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
model_path = "solidrust/Flora-7B-DPO-AWQ"
system_message = "You are Flora, a helpful AI assistant."
# Load model
model = AutoAWQForCausalLM.from_quantized(model_path,
                                          fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
streamer = TextStreamer(tokenizer,
                        skip_prompt=True,
                        skip_special_tokens=True)
# Convert prompt to tokens
prompt_template = """\
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"""
prompt = "You're standing on the surface of the Earth. "\
"You walk one mile south, one mile west and one mile north. "\
"You end up exactly where you started. Where are you?"
tokens = tokenizer(prompt_template.format(system_message=system_message, prompt=prompt),
                   return_tensors='pt').input_ids.cuda()
# Generate output
generation_output = model.generate(tokens,
                                   streamer=streamer,
                                   max_new_tokens=512)
GGUF, EXL2, GPTQ, and HQQ are other quantization formats, and you can find many examples for them on hf.co.
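For instance, a GPTQ checkpoint can usually be loaded through transformers itself; a minimal sketch, assuming the optimum and auto-gptq packages are installed (the repo id below is only an illustration):
# Load a GPTQ-quantized checkpoint; the quantization config stored in the repo is picked up automatically
from transformers import AutoModelForCausalLM, AutoTokenizer
gptq_model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # illustrative GPTQ repo on the Hub
model = AutoModelForCausalLM.from_pretrained(gptq_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(gptq_model_id)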
Oh, I quantized the model with just bitsandbytes 4-bit and saved the model by using the save_pretrained function.
Is it compatible with AutoAWQ?
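For reference, the workflow is roughly the following; a minimal sketch, where the exact quantization config, model id, and output directory are placeholders:
# Quantize with bitsandbytes 4-bit and save the result
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
# saving 4-bit weights requires a recent transformers/bitsandbytes with 4-bit serialization support
model.save_pretrained("mistral-7b-bnb-4bit")  # placeholder output directory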
@ByungKwanLee So that we can help you, could you please run:
transformers-cli env
in the terminal and copy-paste the output?
cc @younesbelkada
Thanks! This is a duplicate of https://github.com/TimDettmers/bitsandbytes/issues/1123 - let me close that issue and we can continue the discussion here as it's transformers related. @ByungKwanLee could you elaborate more on the issue? I second what @amyeroberts and @suparious said.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
I saved a 4-bit quantized model.
Then, how do I load the 4-bit quantized model directly with 'from_pretrained'?
It is normal to save large models in float16, float32, or bfloat16.
But in my case, I saved the model in 4-bit directly and want to load the 4-bit quantized model.
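A minimal sketch of the loading path being asked about, assuming a transformers/bitsandbytes version that supports 4-bit serialization (the directory name is a placeholder matching the saving sketch above):
# Load the saved 4-bit checkpoint directly; the quantization_config stored in
# the checkpoint's config.json should be picked up automatically
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistral-7b-bnb-4bit", device_map="auto")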
Who can help?
No response
Information
Tasks
Reproduction
Expected behavior
Wrong uint8 values are loaded.