unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

After LoRA training (or loading the checkpoint), consecutive inference gives different results even if do_sample is False #279

Open ziemowit-s opened 3 months ago

ziemowit-s commented 3 months ago

Hi there,

I noticed another critical bug (at least from my point of view): after LoRA training, and even with do_sample set to False, consecutive inference runs produce different results:

Loading the base model:

from unsloth import FastLanguageModel
import torch

model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

tokenizer.padding_side = 'left'  # right padding is used for training, left for inference
tokenizer.pad_token = tokenizer.eos_token
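
As a baseline, greedy decoding on the freshly loaded base model should give identical token ids across repeated calls. A sketch I'd run here (the prompt is a placeholder, and you may want to call FastLanguageModel.for_inference first):

prompt_text = "Hello, how are you?"  # placeholder prompt
inputs = tokenizer([prompt_text], return_tensors = "pt").to("cuda")

# Two greedy generations from the same inputs; torch.equal is False if they differ.
out1 = model.generate(**inputs, max_new_tokens = 32, do_sample = False)
out2 = model.generate(**inputs, max_new_tokens = 32, do_sample = False)
print("identical:", torch.equal(out1, out2))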

Setting up LoRA:

from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0.05, 
    bias = "none",    
    use_gradient_checkpointing = True,
    use_rslora = False,  
    loftq_config = None, 
)

model.print_trainable_parameters()
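
As a quick sanity check that the adapters landed on the intended projections, here is a sketch using plain PyTorch module introspection (nothing unsloth-specific):

# List the modules that received LoRA weights by scanning module names.
lora_modules = sorted({
    name.split(".lora_")[0]
    for name, _ in model.named_modules()
    if "lora_A" in name or "lora_B" in name
})
print(f"{len(lora_modules)} modules carry LoRA weights, e.g. {lora_modules[:3]}")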

Training:

from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 6,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 1,
        warmup_steps = 5,
        max_steps = 10000,
        learning_rate = lr,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        save_steps=save_steps,
        logging_steps=logging_steps,
        optim = "adamw_8bit",
        logging_dir=f'logs/{output_dir}_{get_date_time()}',
        weight_decay = 0.005,
        lr_scheduler_type = "linear",
        output_dir = output_dir,
        report_to="tensorboard"
    ),
)

trainer_stats = trainer.train()
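
One thing I would add when chasing this down (my addition, not part of the original run) is pinning the RNG state, so that any remaining variation cannot come from unseeded randomness:

from transformers import set_seed

# Seeds Python's random, NumPy and torch (CPU and CUDA) in one call;
# re-call it right before each generate() when comparing runs.
set_seed(42)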

I trained for 10K steps and then ran inference:

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer([txt], return_tensors = "pt").to("cuda")
response = model.generate(**inputs, max_new_tokens = max_new_tokens,  do_sample=False).cpu().numpy()

token_ids_list = response.squeeze().tolist()
text = tokenizer.decode(token_ids_list, skip_special_tokens=True)

Even though do_sample is False, the responses are different across runs (even if I reload the checkpoint).
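
To make the divergence concrete, this sketch (same txt, inputs and max_new_tokens as above) runs the identical greedy generation twice and reports the first token position where the outputs disagree:

out1 = model.generate(**inputs, max_new_tokens = max_new_tokens, do_sample = False)[0]
out2 = model.generate(**inputs, max_new_tokens = max_new_tokens, do_sample = False)[0]

# Compare token ids up to the shorter of the two outputs.
min_len = min(len(out1), len(out2))
mismatch = (out1[:min_len] != out2[:min_len]).nonzero()
if len(mismatch) == 0 and len(out1) == len(out2):
    print("outputs identical")
else:
    first = mismatch[0].item() if len(mismatch) else min_len
    print("outputs diverge at token position", first)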

But if I save the model:

model.save_pretrained_merged("lora", tokenizer, save_method = "lora",)

and then load it:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

then all outputs are consistently the same:

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer([txt], return_tensors = "pt").to("cuda")
response = model.generate(**inputs, max_new_tokens = max_new_tokens,  do_sample=False).cpu().numpy()

token_ids_list = response.squeeze().tolist()
text = tokenizer.decode(token_ids_list, skip_special_tokens=True)
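
My guess (unconfirmed) about the difference: with lora_dropout = 0.05, dropout is a source of randomness only while its modules are in training mode, so the in-memory model may still have active dropout while a freshly reloaded model starts in eval mode. A minimal check before generating:

# Print any dropout module that is still in training mode; greedy decoding
# can only vary run-to-run if something stochastic like this is still active.
print("model.training:", model.training)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Dropout) and module.training:
        print("still in train mode:", name)
model.eval()  # force eval mode if anything above printed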

danielhanchen commented 3 months ago

Is this single-batch inference?

ziemowit-s commented 3 months ago

If you mean a single element in the batch, then yes, it is, since the txt variable is just a string.

danielhanchen commented 3 months ago

@ziemowit-s I might have solved it with yesterday's patch, but I'm not sure.