Open danielhanchen opened 7 months ago
Hey @danielhanchen, thanks for the amazing work! However, I tried saving a 4-bit model but encountered the following error:
NotImplementedError: You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported
Here is my implementation:
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 32768 # Supports RoPE Scaling internally, so choose any!
url = "/home/jackson/unsloth_ft/notebook/output.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# Load Llama model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/home/jackson/OpenHermes-2.5-Mistral-7B/",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none", # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 1,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()
model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
Do you have any idea?
@JacksonCakes Oh my I might disable 4bit from now on!! My apologies on my side - I think you're the 4th person to use that one!!! I should have not used that as the name - HF 4bit merging only works on the latest HF branch sadly.
Also, 4bit is only useful for HF inference. vLLM does not allow 4bit loading, so I advise people to use GGUF or pure 16bit saving instead.
Unless of course you actually want a 4bit saved model, rather than a GGUF 4bit model.
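For anyone skimming, the advice above can be written down as a small lookup. This is only a rough sketch: the helper `pick_save_call` and the target labels (`"vllm"`, `"hf_4bit"`, `"gguf"`) are made up for illustration and are not part of Unsloth. The method names and `save_method` / `quantization_method` values are the ones used in this thread, and `"q4_k_m"` is just one example GGUF setting.

```python
# Hypothetical helper: maps an inference target to the save call suggested in this thread.
# The target labels are illustrative assumptions, not Unsloth options.
SAVE_CALL_BY_TARGET = {
    # vLLM does not load 4bit models here, so merge to 16bit.
    "vllm": ("save_pretrained_merged", {"save_method": "merged_16bit"}),
    # A 4bit merged model is only useful for HF inference.
    "hf_4bit": ("save_pretrained_merged", {"save_method": "merged_4bit"}),
    # GGUF export; "q4_k_m" is one example quantization_method.
    "gguf": ("save_pretrained_gguf", {"quantization_method": "q4_k_m"}),
}

def pick_save_call(target):
    """Return (method_name, keyword_arguments) for the given target label."""
    try:
        return SAVE_CALL_BY_TARGET[target]
    except KeyError:
        known = ", ".join(sorted(SAVE_CALL_BY_TARGET))
        raise ValueError(f"Unknown target {target!r}. Known targets: {known}") from None

# Usage with a trained model (not executed here):
#   method_name, kwargs = pick_save_call("vllm")
#   getattr(model, method_name)("dir", tokenizer, **kwargs)
```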
But in a quick patch maybe tomorrow I will error out if your transformers version is not the latest before even allowing you to call "merged_4bit"
Ah, it's okay! I was planning to save it in 16-bit but was getting errors, which is why I tried 4bit saving. But now I am able to save it in 16bit (after updating my transformers version to 4.37.1). Thanks for the quick response!!
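The "error out early if transformers is too old" idea from this thread could look roughly like the sketch below. It is not the actual Unsloth patch. The function names are hypothetical, and the minimum version is a parameter because the thread does not say which release `merged_4bit` needs; 4.37.1 is only the version that made 16bit saving work above.

```python
import re

def parse_version(text):
    """Turn '4.37.1' or '4.38.0.dev0' into a 3-tuple of ints, ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    parts = parts[:3]
    parts += [0] * (3 - len(parts))  # pad so that '4.37' compares equal to '4.37.0'
    return tuple(parts)

def require_min_version(installed, minimum, feature):
    """Raise RuntimeError before saving if the installed version is older than `minimum`."""
    if parse_version(installed) < parse_version(minimum):
        raise RuntimeError(
            f"{feature} needs transformers >= {minimum}, but {installed} is installed. "
            "Please upgrade transformers and try again."
        )

# Usage (not executed here; the threshold is an assumption taken from the post above):
#   import transformers
#   require_min_version(transformers.__version__, "4.37.1", "merged_16bit")
```

Checking before calling the save method gives a clear upgrade message instead of the `NotImplementedError` from the first post.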
@JacksonCakes :)
Fully supported! Scroll down on our latest Mistral notebook: https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing
For 16bit merging:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("dir", tokenizer, save_method = "merged_16bit")
For GGUF merging:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.push_to_hub_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("dir", tokenizer, quantization_method = "f16")
We support all GGUF configs:
Choose `quantization_method` from:
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
"f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0" : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k" : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s" : "Uses Q3_K for all tensors",
"q4_0" : "Original quant method, 4-bit.",
"q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s" : "Uses Q4_K for all tensors",
"q5_0" : "Higher accuracy, higher resource usage and slower inference.",
"q5_1" : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s" : "Uses Q5_K for all tensors",
"q6_k" : "Uses Q8_K for all tensors",
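Since a GGUF conversion can take a while, it may be worth checking the `quantization_method` string before starting it. The sketch below is an illustration only: `check_quantization_method` is a hypothetical helper, not an Unsloth function, and the set of names is simply copied from the list above.

```python
# Names copied from the list in this thread; the helper itself is hypothetical.
GGUF_QUANTIZATION_METHODS = {
    "not_quantized", "fast_quantized", "quantized",
    "f32", "f16", "q8_0",
    "q4_k_m", "q5_k_m", "q2_k",
    "q3_k_l", "q3_k_m", "q3_k_s",
    "q4_0", "q4_1", "q4_k_s",
    "q5_0", "q5_1", "q5_k_s", "q6_k",
}

def check_quantization_method(name):
    """Return the lowercased name, or raise ValueError if it is not in the list."""
    normalized = name.strip().lower()
    if normalized not in GGUF_QUANTIZATION_METHODS:
        options = ", ".join(sorted(GGUF_QUANTIZATION_METHODS))
        raise ValueError(
            f"Unknown quantization_method {name!r}. Choose one of: {options}"
        )
    return normalized

# Usage with a trained model (not executed here):
#   method = check_quantization_method("q4_k_m")
#   model.save_pretrained_gguf("dir", tokenizer, quantization_method = method)
```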
Is it possible that a model fine-tuned with load_in_4bit can't use merged_16bit as the save method, and that this produces strange output that can't be used for vLLM inference?
@dongxiaolong So none of the GGUF methods can be used with vLLM. vLLM accepts 16 bit - are you saying there are gibberish outputs? Can you provide the model you're using, and as many details as possible (Python version, maybe a screenshot of Unsloth's training info part)?
I am using the Mistral model for fine-tuning. When I set load_in_4bit=True, the model output is good. However, when I set load_in_4bit=False, the model generates repetitive outputs. The same issue occurs when I follow the notebook to convert the model to a vLLM-compatible format. Even when I try changing the LoRA precision to fp16, it still doesn't work. The situation only improves when I use TGI's bitsandbytes quantization. I suspect there might be an issue with the precision conversion of the QLoRA adapter weights.