unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
15.41k stars · 1.04k forks

Converting unsloth finetuned model to AWQ using autoawq package. #913

Open · fusesid opened this issue 4 weeks ago

fusesid commented 4 weeks ago

Firstly, I saved the finetuned LoRA model as merged_16bit to my Hugging Face repo, but the repo only contains adapter_config.json and adapter_model.safetensors. Now, when I try to load it with AutoAWQForCausalLM, it looks for config.json and throws an error.

So what I did is load the model with FastLanguageModel and apply the merge_and_unload method, as below:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my-model",          # placeholder: repo with the finetuned adapters
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = True,
    resize_model_vocab = 128257,
    token = "my-token",               # placeholder: Hugging Face access token
)

model = model.merge_and_unload()

tokenizer.push_to_hub(repo)
model.push_to_hub(repo)

Then, I tried to quantize it:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, token=access_token
)
tokenizer = AutoTokenizer.from_pretrained(model_path, token=access_token)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

But this raises: RuntimeError: output with shape [8388608, 1] doesn't match the broadcast shape [8388608, 4096]

Am I doing the quantization correctly? Or what is causing this issue? Please help.
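
For reference, a typical end-to-end AutoAWQ flow also saves the quantized weights afterwards. A minimal sketch (the paths below are placeholders, not taken from this thread):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-merged-16bit-model"   # placeholder: repo containing config.json and full weights
quant_path = "your-model-awq"            # placeholder: output directory for the quantized model

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize, then persist the AWQ weights and tokenizer
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)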

danielhanchen commented 4 weeks ago

Oh, you need to use model.save_pretrained_merged for the AWQ conversion.
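
That is, roughly (a minimal sketch; the output directory name is a placeholder):

# Merge the LoRA adapters into the base weights and save a full 16-bit checkpoint.
# This should write config.json, the tokenizer files, and the full safetensors shards,
# which is what AutoAWQ expects to find.
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")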

sids07 commented 4 weeks ago

@danielhanchen, yeah, I have tried that as well. But the issue is that autoawq searches for config.json, while we only have adapter_config.json, since I finetuned using LoRA PEFT.

mahiatlinux commented 4 weeks ago

> @danielhanchen, yeah, I have tried that as well. But the issue is that autoawq searches for config.json, while we only have adapter_config.json, since I finetuned using LoRA PEFT.

You are currently only saving the LoRA adapters. I think you need to do this to save the full model (with all the needed files like config.json, etc.) for conversion to AWQ:

model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")

Right @danielhanchen?

fusesid commented 4 weeks ago

@mahiatlinux, yeah, I have used that same command to save my model, but it didn't produce all the files. Using the same command, i.e.

model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")

my Hugging Face repo looks like this (see attached screenshot from 2024-08-14 14-16-07).

I guess it just saved the model with 16-bit weights, which is why the safetensors files are so large.

This is the reason I shifted to:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my-model",          # placeholder: repo with the finetuned adapters
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = True,
    resize_model_vocab = 128257,
    token = "my-token",               # placeholder: Hugging Face access token
)

model = model.merge_and_unload()

tokenizer.push_to_hub(repo)
model.push_to_hub(repo)

After this, my repo looked like this (see attached screenshot from 2024-08-14 14-19-09).

But using autoawq to quantize this model gives: RuntimeError: output with shape [8388608, 1] doesn't match the broadcast shape [8388608, 4096]
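
One thing that may be worth ruling out (an assumption on my part, not something confirmed in this thread): the snippet above loads the model with load_in_4bit = True before calling merge_and_unload, so the pushed weights may still be bitsandbytes-packed 4-bit tensors, which AutoAWQ cannot quantize and which would be consistent with the [8388608, 1] shape in the error. A minimal sketch of the same flow with a 16-bit load (model and token names are placeholders):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "my-model",          # placeholder
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = False,             # key difference from the snippet above
    resize_model_vocab = 128257,
    token = "my-token",               # placeholder
)

# Push a fully merged 16-bit copy, including config.json and the tokenizer files.
model.push_to_hub_merged("your-username/your-model", tokenizer,
                         save_method = "merged_16bit", token = "my-token")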

mahiatlinux commented 4 weeks ago

You have to push it using this to get all the files needed:

model.push_to_hub_merged("model", tokenizer, save_method = "merged_16bit", token = "token")

Try it and let me know @sids07.

mahiatlinux commented 4 weeks ago

> @mahiatlinux, yeah, I have used that same command to save my model, but it didn't produce all the files. [...]

Hmmm... Sorry about that. I don't know. Maybe @danielhanchen knows?

fusesid commented 4 weeks ago

To replicate the issue: the model saved with unsloth's merged_16bit save method is at https://huggingface.co/hiiamsid/llama3.1-finetuned-effie, and the PEFT model trained with unsloth and merged via merge_and_unload is at https://huggingface.co/hiiamsid/effie-llama-merged.
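
A minimal repro sketch against the second repo linked above (same quantization settings as earlier in the thread; AutoAWQ's default calibration data is used):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "hiiamsid/effie-llama-merged"   # the merge_and_unload'ed model linked above
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the merged model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantizing should reproduce the reported RuntimeError on this checkpoint
model.quantize(tokenizer, quant_config=quant_config)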