richardxoldman opened this issue 3 months ago (status: Open)
Ok so I was able to solve the problem.
In the gemma-2-9b fine-tuning notebook there is a cell:
Now if you want to load the LoRA adapters we just saved for inference, set False to True:
```python
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
```
I followed this cell (after removing the "if False"), but before inference you also need to call the get_peft_model function (which was not stated in the Colab notebook) to make the model work correctly:
```python
model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_RANK,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj", "lm_head", "embed_tokens"],
    lora_alpha = LORA_ALPHA,
    lora_dropout = 0.0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
```
If the only difference is the dtype (before calling get_peft_model, lm_head/embed_tokens are bfloat16; after calling it they are float32), then for my task that dtype difference is apparently quite significant.
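For reference, a quick way to inspect those dtypes is via the standard Hugging Face accessors (a minimal sketch; `model` is the model loaded as above, and the exact wrapping by Unsloth/PEFT may vary):

```python
# Check the dtypes of the embedding and output-head weights
# before and after calling get_peft_model.
print("embed_tokens dtype:", model.get_input_embeddings().weight.dtype)
print("lm_head dtype:", model.get_output_embeddings().weight.dtype)
```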
I trained LoRA adapters for the unsloth/gemma-2-9b-it-bnb-4bit model, but I also added lm_head and embed_tokens to the adapter.
Then after training I saved it on my drive using
When I load it using
I get much worse outputs, as if only the LoRA adapters were loaded without lm_head/embed_tokens. The adapter_model.safetensors file is ~7.3 GB (when I was not saving lm_head/embed_tokens it was around 500 MB), so I think lm_head and embed_tokens were saved, but they are not being loaded correctly.
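One way to confirm what actually ended up in the saved adapter is to list the tensor names in adapter_model.safetensors (a small sketch; the path `lora_model/adapter_model.safetensors` is just the example save directory used above):

```python
from safetensors import safe_open

# List the tensors stored in the saved adapter and check whether the full
# embed_tokens / lm_head weights are present alongside the LoRA A/B matrices.
with safe_open("lora_model/adapter_model.safetensors", framework="pt") as f:
    keys = list(f.keys())

print(len(keys), "tensors in the adapter file")
print([k for k in keys if "embed_tokens" in k or "lm_head" in k])
```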
Adapter_config.json content:
UPDATE:
Actually, I checked and it does load successfully.
But the inference results after loading the model are still very different from what I had before saving it.
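To narrow down where the mismatch comes from, it might help to compare the embedding/head weights of the model right after training against the model reloaded from disk (a hypothetical diagnostic sketch; `trained_model` and `reloaded_model` are assumed names for those two objects):

```python
def compare_embed_and_head(trained_model, reloaded_model):
    # Compare embed_tokens and lm_head weights between the in-memory trained
    # model and the model reloaded from "lora_model".
    pairs = {
        "embed_tokens": (trained_model.get_input_embeddings().weight,
                         reloaded_model.get_input_embeddings().weight),
        "lm_head": (trained_model.get_output_embeddings().weight,
                    reloaded_model.get_output_embeddings().weight),
    }
    for name, (a, b) in pairs.items():
        # Cast to float32 on CPU so a bfloat16/float32 mismatch does not mask real differences.
        a32, b32 = a.detach().float().cpu(), b.detach().float().cpu()
        print(name, "max abs diff:", (a32 - b32).abs().max().item())
```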