huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Correctly Uploading PEFT/QLoRA model to HuggingFace #1812

Closed taoofstefan closed 2 months ago

taoofstefan commented 4 months ago

Hey,

I am fairly new to fine-tuning my own models and working with HuggingFace. Yesterday I finished fine-tuning a Llama 2 model with my custom dataset, but I couldn't figure out how to properly push it to my HuggingFace profile.

Below is part of the fine-tuning code. Let me know if you need more input.

from peft import PeftModel
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer

# Base model configuration
base_model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)

# Load your fine-tuned model
ft_model = PeftModel.from_pretrained(base_model, "llama2-7B-MT-FT-llama2-S/checkpoint-400")

# Merge adapters with the base model
merged_model = ft_model.merge_and_unload()

# Save the merged model to a directory
output_dir = "./merged_model"
merged_model.save_pretrained(output_dir)
eval_tokenizer.save_pretrained(output_dir)

from transformers import AutoModelForCausalLM, LlamaTokenizer

model = AutoModelForCausalLM.from_pretrained(output_dir)

tokenizer = LlamaTokenizer.from_pretrained(output_dir, trust_remote_code=True)

# HuggingFace repository ID (run_name is defined earlier in the fine-tuning script)
repo_id = f"taoofstefan/{run_name}"

# Push the model and tokenizer to HuggingFace Hub
model.push_to_hub(repo_id, token=True, max_shard_size="5GB", safe_serialization=True)
tokenizer.push_to_hub(repo_id, token=True)

I managed to push something (taoofstefan/llama2-7B-MT-FT-llama2-S), but it doesn't seem to be correct: I get an error when I try to convert it to a GGUF file via the HuggingFace website, and inference with it gives me terrible results (similar to the base Llama 2).

It's probably super easy but I cannot seem to figure it out. How do I push the model with the merged adapter (checkpoint-400) to HuggingFace?

Thanks in advance :)

Wauplin commented 4 months ago

Hi @taoofstefan, if you are able to train and save your finetuned model locally (i.e. in a local folder on your machine), then you can upload it to the Hub using the huggingface-cli (see docs). If your question is about how to use PEFT to finetune and use your model, then it's better to ask on https://github.com/huggingface/peft (potentially with the error stacktrace you are getting, if any). Finally, if the problem is about model results being worse than expected, I would advise asking around in the HF Discord to find other community members who could help you.
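For illustration, a minimal upload of a locally saved folder with the CLI could look like this (a sketch only; the folder and repo names are taken from your snippet above and may need adjusting to your setup):

huggingface-cli login
huggingface-cli upload taoofstefan/llama2-7B-MT-FT-llama2-S ./merged_model .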

taoofstefan commented 4 months ago

Hello @Wauplin, I don't even know which files to upload in the first place, as I am not at all sure whether the code shared above is the right way to push to HF. Judging from the fact that I cannot convert the model to a GGUF, I assume I did something wrong.

Wauplin commented 4 months ago

@taoofstefan I'm transferring your issue to the peft repo as it is not a huggingface_hub-related question. If your script comes from the docs then it should be correct. If you want more help, you would need to provide more information (the folder structure or a link to the pushed repo, for instance). Also, I still don't know what problem you are facing. If it's a GGUF conversion issue, then it is most likely not an upload issue.

taoofstefan commented 4 months ago

This is the link to the repo I managed to push to: repo

My issue is that I don't even know if the code I used to push is correct. It would be great if someone who has done this before could have a look and let me know. I basically hacked it together from the documentation and tutorials I saw.

I don't know why the conversion to GGUF isn't working. When I try it, I get this error:

Error: Error converting to fp16: b'INFO:hf-to-gguf:Loading model: llama2-7B-MT-FT-llama2-S
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 4096
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 11008
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 32
INFO:hf-to-gguf:gguf: rope theta = 10000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:hf-to-gguf:Exporting model to 'llama2-7B-MT-FT-llama2-S.fp16.gguf'
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight, torch.float16 --> F16, shape = {4096, 32000}
INFO:hf-to-gguf:token_embd.weight, torch.float16 --> F16, shape = {4096, 32000}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.float16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight, torch.uint8 --> F32, shape = {22544384}
Traceback (most recent call last):
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 2865, in 
 main()
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 2859, in main
 model_instance.write()
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 325, in write
 self.write_tensors()
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 1385, in write_tensors
 super().write_tensors()
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 262, in write_tensors
 for new_name, data in ((n, d.squeeze().numpy()) for n, d in self.modify_tensors(data_torch, name, bid)):
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 1382, in modify_tensors
 return [(self.map_tensor_name(name), data_torch)]
 File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 180, in map_tensor_name
 raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.weight.absmax'

Is this because I did the upload wrong, or because I forgot to do something? I have no idea and I hope someone can help me out.

I hope I managed to explain it better.

Wauplin commented 4 months ago

@taoofstefan your repo seems correct at first glance. You can use it with transformers using the instructions in https://huggingface.co/taoofstefan/llama2-7B-MT-FT-llama2-S/tree/main?library=transformers. If it loads correctly it means the upload was correct. Then, if the GGUF conversion is raising this error, it's best to report your error and ask for assistance in either https://github.com/ggerganov/llama.cpp or https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions. It could also be a very good question to ask in the HF Discord, where you will find other users who are doing similar things.

As general advice, it's better to try to isolate the problem. Here, testing that it works with transformers would confirm that the finetuning / upload was correct. Since you are using several different projects/tools (peft, transformers, GGUF my repo, llama.cpp, etc.), it's important to know which part is causing trouble.
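For example, a quick sanity check with transformers could look like the following (a rough sketch, assuming the pushed repo from above and enough memory to load the 7B model):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "taoofstefan/llama2-7B-MT-FT-llama2-S"
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Generate a short completion to check that the finetuned weights load and produce sensible text
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If this produces reasonable output, the upload itself is fine and the remaining problem is on the GGUF conversion side.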

BenjaminBossan commented 4 months ago

From the PEFT side of things, the code generally looks good. What surprises me though is that the model.safetensors file is only 4GB, whereas the original model size is 13.5GB. After merging a PEFT adapter, the model size should be the same as the size of the base model.

On how to proceed, I agree with Lucain that you should break down your steps and check after each if the model still works.
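For reference, one way to end up with a full-size merged model is to merge the adapter into a base model loaded without 4-bit quantization. Here is a minimal sketch (it reuses the base model id and checkpoint path from the snippet above, so treat the exact paths as assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in fp16 without the BitsAndBytes 4-bit config,
# so the merged weights are stored as plain fp16 tensors
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Apply the trained adapter and fold it into the base weights
model = PeftModel.from_pretrained(base_model, "llama2-7B-MT-FT-llama2-S/checkpoint-400")
merged = model.merge_and_unload()

# The saved weights should now be roughly the size of the original fp16 model
merged.save_pretrained("./merged_model_fp16")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("./merged_model_fp16")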

taoofstefan commented 4 months ago

@BenjaminBossan / @Wauplin first of all, thanks for your help so far! I am totally on board with doing this step by step. I honestly didn't know how to articulate my issue in the first place or where it might come from.

I uploaded the notebook with all the code I used for fine-tuning and pushing to HF. I hope this makes my process and potentially my mistakes clearer.

Also, a quick rundown on my goal (this is part of my master's thesis):

I hope you can spot a step I did wrong. I am not particularly attached to this way of fine-tuning the model, but using PEFT and LoRA seemed to be a solid approach (QLoRA seemed to help save space). If there is another efficient way of doing this that ends up with a small quantized finetuned model in a GGUF file, I am open to doing that.

Wauplin commented 4 months ago

Thanks for breaking down the process @taoofstefan :) However, GitHub issues are usually meant to report bugs or request new features for a given library. Here I feel that what you are looking for is help with using the different tools together, which I cannot help with. Library maintainers are not always the best suited for this kind of question, so I'd rather look into the HF forum and HF Discord to discuss this with actual ML practitioners like you who might have encountered similar difficulties.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

YaBoyBigPat commented 1 month ago

I'm also having this same issue.