huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Loading directly 4bit quantized model #29604

Closed: ByungKwanLee closed this issue 6 months ago

ByungKwanLee commented 7 months ago

System Info

I saved a 4-bit quantized model.

Then, how do I load the 4-bit quantized model directly with 'from_pretrained'?

It is normal to save large models in float16, float32, or bfloat16, but in my case I saved the model directly in 4-bit and want to load it back as a 4-bit quantized model.

Who can help?

No response

Information

Tasks

Reproduction

  1. Save a 4-bit quantized model with save_pretrained() (see the sketch below).
  2. Load it with from_pretrained() and the 4-bit quantization config.
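
For reference, a minimal sketch of the two steps above (assuming versions of transformers and bitsandbytes recent enough to support serializing 4-bit checkpoints; the model id facebook/opt-125m and the output path are only illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Step 1: quantize on load, then save the 4-bit checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=bnb_config,
    device_map="auto",
)
model.save_pretrained("opt-125m-4bit")

# Step 2: reload the saved checkpoint; the quantization config is written to
# the saved config.json, so from_pretrained should pick it up automatically
reloaded = AutoModelForCausalLM.from_pretrained("opt-125m-4bit", device_map="auto")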

Expected behavior

The model should reload with the saved 4-bit quantized weights, but instead wrong uint8 values are loaded.

suparious commented 7 months ago

There are many different formats for quantizing models, so saying you used 4-bit is not helpful here without also defining the type of quantization.

Based on what format the model is quantized in, you will need to use that format's library instead of transformers directly.

For example, if you use AWQ, then you would only use transformers for the tokenizer, not the model, like this:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

model_path = "solidrust/Flora-7B-DPO-AWQ"
system_message = "You are Flora, a helpful AI assistant."

# Load model
model = AutoAWQForCausalLM.from_quantized(model_path,
                                          fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
streamer = TextStreamer(tokenizer,
                        skip_prompt=True,
                        skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokens = tokenizer(prompt_template.format(system_message=system_message, prompt=prompt),
                   return_tensors='pt').input_ids.cuda()

# Generate output
generation_output = model.generate(tokens,
                                  streamer=streamer,
                                  max_new_tokens=512)
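
Note that from_quantized here comes from the autoawq package rather than transformers, and fuse_layers=True turns on AutoAWQ's fused modules for faster generation.
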
suparious commented 7 months ago

GGUF, EXL2, GPTQ, and HQQ are other quant formats, and you can find many examples for them on hf.co
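
For instance, GPTQ checkpoints can usually be loaded through transformers directly once the optimum and auto-gptq packages are installed; a minimal sketch (the model id below is only an illustrative example):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative GPTQ checkpoint id; any GPTQ-quantized repo works the same way
model_id = "TheBloke/Llama-2-7B-GPTQ"

# The GPTQ quantization config is stored in the checkpoint, so no extra
# quantization arguments are needed at load time
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)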

ByungKwanLee commented 7 months ago

Oh, I quantized the model with just bitsandbytes 4-bit and saved it using the save_pretrained function.

Is it compatible with autoawq?

amyeroberts commented 7 months ago

@ByungKwanLee So that we can help you, could you please provide:

cc @younesbelkada

younesbelkada commented 7 months ago

Thanks! This is a duplicate of https://github.com/TimDettmers/bitsandbytes/issues/1123 - let me close that issue and we can continue the discussion here, as it's transformers-related. @ByungKwanLee could you elaborate more on the issue? I second what @amyeroberts and @suparious said.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.