pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
BSD 3-Clause "New" or "Revised" License

QLoRA Inference #1020

Open jeff52415 opened 1 month ago

jeff52415 commented 1 month ago

Can I load QLoRA fine-tuning weights into a Hugging Face model as shown below?

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# 4-bit NF4 quantization config (bitsandbytes)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_id,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,  # match the 4-bit compute dtype
    device_map="auto",
)

# Attach the QLoRA adapter weights saved from torchtune
model = PeftModel.from_pretrained(model, "qlora_finetune_folder/")

I have changed the checkpointer to FullModelHFCheckpointer. The resulting checkpoint is loadable and runnable, but I am curious whether it reflects the same structure as qlora_llama3_8b. Thanks.
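
For what it's worth, one way to sanity-check this is to compare logits from the two models on the same prompt. A rough sketch only: tune_model below is a placeholder for a torchtune qlora_llama3_8b() instance with the fine-tuned checkpoint already loaded, and hf_model / tokenizer are the objects from the snippet above.

import torch

@torch.no_grad()
def logits_close(hf_model, tune_model, tokenizer, prompt: str, atol: float = 1e-1) -> bool:
    # Both models share the Llama 3 vocabulary, so the HF tokenizer's IDs can drive both
    tokens = tokenizer(prompt, return_tensors="pt").input_ids
    hf_logits = hf_model(tokens.to(hf_model.device)).logits
    # torchtune decoders take a [batch, seq] token tensor and return logits directly
    tune_logits = tune_model(tokens.to(next(tune_model.parameters()).device))
    return torch.allclose(hf_logits.float().cpu(), tune_logits.float().cpu(), atol=atol)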

ebsmothers commented 1 month ago

Hi @jeff52415, thanks for opening this issue; this is a really good question. One possible source of discrepancy is the different NF4 quantization implementations used by torchtune and Hugging Face. To be more explicit, torchtune's QLoRA relies on the NF4Tensor class from torchao, rather than the bitsandbytes version that Hugging Face uses. I need to verify that quantizing a torchtune checkpoint with bitsandbytes yields the same result as quantizing with torchao. Let me look into it and get back to you. Also cc @rohan-varma, who may have some insights here.
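
A quick way to test this would be to round-trip the same weight through both NF4 implementations and compare the reconstructions. A rough sketch, assuming a CUDA device, torchao's to_nf4 / NF4Tensor.get_original_weight, and bitsandbytes.functional.quantize_nf4 / dequantize_nf4 (block_size=64 should be what torchtune's QLoRA uses by default, if I'm not mistaken):

import torch
import bitsandbytes.functional as bnbF
from torchao.dtypes.nf4tensor import to_nf4

# A dummy weight standing in for one linear layer of the model
weight = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# torchao: quantize to NF4, then dequantize back to bf16
ao_dequant = to_nf4(weight, block_size=64, scaler_block_size=256).get_original_weight()

# bitsandbytes: quantize to NF4 with the same block size, then dequantize
bnb_quant, quant_state = bnbF.quantize_nf4(weight, blocksize=64)
bnb_dequant = bnbF.dequantize_nf4(bnb_quant, quant_state)

# If the two implementations agree, the reconstructed weights should be close
max_diff = (ao_dequant.float() - bnb_dequant.float()).abs().max().item()
print(f"max abs difference between torchao and bnb NF4 round-trips: {max_diff:.6f}")

Note that bitsandbytes' double quantization (bnb_4bit_use_double_quant=True) and torchao's scaler block quantization handle the per-block scales a bit differently, so small differences here wouldn't be surprising.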