unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Saving to GGUF llama.cpp / merging to 16bit for VLLM #114

Open danielhanchen opened 7 months ago

danielhanchen commented 7 months ago

Fully supported! Scroll down on our latest Mistral notebook: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing

For 16bit merging:

model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("dir", tokenizer, save_method = "merged_16bit")
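
To actually serve the merged folder afterwards, something like the sketch below should work - a minimal illustration only, assuming vLLM is installed and that "dir" is the local path (or Hub repo) produced by the merge calls above:

from vllm import LLM, SamplingParams

# "dir" is the folder/repo written by save_pretrained_merged / push_to_hub_merged above
llm = LLM(model = "dir")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens = 64))
print(outputs[0].outputs[0].text)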

For GGUF merging:

model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.push_to_hub_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("dir", tokenizer, quantization_method = "f16")

We support all GGUF configs:

Choose `quantization_method` to be one of:
"not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
"f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k"    : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s"  : "Uses Q3_K for all tensors",
"q4_0"    : "Original quant method, 4-bit.",
"q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s"  : "Uses Q4_K for all tensors",
"q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
"q5_1"    : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s"  : "Uses Q5_K for all tensors",
"q6_k"    : "Uses Q8_K for all tensors",
JacksonCakes commented 7 months ago

Hey @danielhanchen, thanks for the amazing work! However, I tried saving a 4-bit model but encountered the following error:

NotImplementedError: You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported

Here is my implementation:

from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 32768 # Supports RoPE Scaling internally, so choose any!

url = "/home/jackson/unsloth_ft/notebook/output.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/jackson/OpenHermes-2.5-Mistral-7B/",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 1,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)

Do you have any idea?

danielhanchen commented 7 months ago

@JacksonCakes Oh my, I might disable 4bit from now on!! My apologies - I think you're the 4th person to hit that one!!! I shouldn't have used that name - HF 4bit merging only works on the latest HF branch, sadly.

Also, 4bit saving is only useful for HF inference. vLLM does not allow 4bit loading, so I advise people to use GGUF or pure 16bit saving instead.

Unless, of course, you actually want a 4bit saved model rather than a GGUF 4bit model.

But in a quick patch, maybe tomorrow, I will error out if your transformers version is not the latest before even allowing you to call "merged_4bit".
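
A rough sketch of what such a guard could look like (an illustration only, not the actual Unsloth patch - the 4.37.0 cutoff is an assumption):

from packaging import version
import transformers

# Hypothetical guard - the version cutoff is an assumption, not the real Unsloth check
if version.parse(transformers.__version__) < version.parse("4.37.0"):
    raise RuntimeError(
        "save_method = 'merged_4bit' needs a newer transformers. "
        "Upgrade transformers, or use 'merged_16bit' / GGUF instead."
    )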

JacksonCakes commented 7 months ago

Ah, it's okay! I was planning to save it in 16-bit but was getting errors, which is why I tried 4bit saving. But now I am able to save it to 16bit (after updating my transformers version to 4.37.1). Thanks for the quick response!!

danielhanchen commented 7 months ago

@JacksonCakes :)

dongxiaolong commented 5 months ago

> Fully supported! Scroll down on our latest Mistral notebook: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing […]

Is it possible that a model fine-tuned with load_in_4bit, which can't use merged_16bit as a save method, produces strange output that can't be used with vLLM inference?

danielhanchen commented 5 months ago

@dongxiaolong So all GGUF methods can't be used with vLLM. vLLM accepts 16 bit - are you saying there are gibberish outputs? Can you provide the model you're using, and as much detail as possible (Python version, and maybe a screenshot of Unsloth's training info part)?

dongxiaolong commented 5 months ago

> @dongxiaolong So all GGUF methods can't be used with vLLM. vLLM accepts 16 bit - are you saying there are gibberish outputs? Can you provide the model you're using, and as much detail as possible (Python version, and maybe a screenshot of Unsloth's training info part)?

I am using the Mistral model for fine-tuning. When I set load_in_4bit=True, the model output is good. However, when I set load_in_4bit=False, the model generates repetitive outputs. The same issue occurs when I follow the notebook to convert the model to vLLM format. Even when I change the LoRA precision to fp16, it still doesn't work. The situation only improves when I use TGI's bitsandbytes quantization. I suspect there might be an issue with the precision conversion of QLoRA's LoRA adapters.
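
For reference, one way I could cross-check this outside Unsloth is to fold the LoRA adapter into an fp16 base with plain PEFT/transformers and compare that against the merged_16bit output - a rough sketch only, where the base model name and adapter path are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "teknium/OpenHermes-2.5-Mistral-7B"   # placeholder base model
adapter_dir = "outputs/checkpoint-60"             # placeholder LoRA adapter path

base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype = torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()                 # fold the LoRA weights into the fp16 base
merged.save_pretrained("merged_fp16_for_vllm")
AutoTokenizer.from_pretrained(base_name).save_pretrained("merged_fp16_for_vllm")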