
`merge_and_unload` for a quantized model ruins its quality #31293

Open · Aktsvigun opened this issue 5 months ago

Aktsvigun commented 5 months ago

System Info

trl==0.9.3

Who can help?

@ArthurZucker, @younesbelkada

Reproduction

Hi, I found really strange behaviour when calling the .merge_and_unload() method. This step is a must if you want to use the model with other frameworks afterwards (e.g. with vllm for inference), yet it dramatically impairs the model's performance. I tested this in 6 settings on a grammar-checking task with the Phi-3 model (QLoRA and plain LoRA, each with float32, fp16, and bf16).

These observations are robust across different tasks, models, and even architectures (e.g. in the example I'm using a CausalLM, yet the same observations hold for sequence-classification models).

I believe there may be a bug related to the bf16=True training argument. Still, with QLoRA the performance drop occurs for other dtypes as well.

For convenience, I attach the .ipynb notebooks for all 6 settings (GitHub won't let me upload .ipynb, so please download these .txt files and change their extension to .ipynb). I used trl here to make the code easier to follow; I observe exactly the same behaviour with a plain transformers implementation (TrainingArguments, Trainer, etc.). Below is the code for the first setting (the most "erroneous" one, in my view), QLoRA + bf16=True:

import numpy as np
from datasets import load_dataset, DatasetDict
from peft import LoraConfig
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
    BitsAndBytesConfig,
    EarlyStoppingCallback
)
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

### Model & tokenizer loading part
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = 'microsoft/Phi-3-mini-4k-instruct'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    cache_dir='../al-nlg/cache',
    attn_implementation='eager',
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name, model_max_length=128
)
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=['o_proj', 'qkv_proj', 'gate_up_proj', 'down_proj', 'lm_head'],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

### Data loading part
data = load_dataset('juancavallotti/multilingual-gec')
data = DatasetDict({
    'train': data['train'].select(range(1000)),
    'eval': data['train'].select(range(1000, 2000))
})

data = data.map(
    lambda x: {
        'messages': [
            {'role': 'user', 'content': x['modified']},
            {'role': 'assistant', 'content': x['sentence']},
        ]

    },
    batched=False,
    remove_columns=data['train'].column_names
)

### Trainer setting part
set_seed(42)

train_args = SFTConfig(
    output_dir='tmp',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    bf16=True,
    bf16_full_eval=False,
    evaluation_strategy="epoch",
    report_to='none',
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

collator = DataCollatorForCompletionOnlyLM(
    instruction_template='<|user|>',
    response_template='<|assistant|>',
    tokenizer=tokenizer,
    mlm=False
)

trainer = SFTTrainer(
    model,
    args=train_args,
    train_dataset=data['train'],
    eval_dataset=data['eval'],
    data_collator=collator,
    peft_config=peft_config
)

trainer.train()

### Evaluation of a PeftModel after training (should coincide with the score we got during `trainer.train`)
trainer.evaluate()['eval_loss']
>>> 0.1241287887096405

### Merge and unload, and re-evaluate for a model after merge
merged_model = trainer.model.merge_and_unload()
trainer.evaluate()['eval_loss']
>>> 0.5563217997550964
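
If it helps triage, here is a minimal extra check one could run in place of the two cells above (a sketch only; the example sentence is arbitrary). It captures reference logits from the un-merged PeftModel and compares them with the merged model on the same input, so any gap is purely merge / re-quantization error rather than an evaluation artefact.

### Hypothetical extra check (not in the notebooks): isolate the merge itself
import torch

trainer.model.eval()  # turn off LoRA dropout for a deterministic comparison
sample = tokenizer('This are a example sentence .', return_tensors='pt').to(trainer.model.device)

with torch.no_grad():
    ref_logits = trainer.model(**sample).logits.clone()   # adapter still un-merged

merged_model = trainer.model.merge_and_unload()            # folds LoRA into the 4-bit base

with torch.no_grad():
    merged_logits = merged_model(**sample).logits

print('max |logit diff| after merge:', (ref_logits - merged_logits).abs().max().item())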

Attachments: qlora_fp16.txt, qlora_float32.txt, qlora_bf16.txt, lora_fp16.txt, lora_float32.txt, lora_bf16.txt

Expected behavior

I could expect a minor drop in performance, but definitely not a 4x increase in the loss. I suspect there are bugs:

  1. In the quantization / merge_and_unload implementation.
  2. In the handling of bf16=True: even without quantization, enabling it increases the model's loss (which does not happen when the option is disabled).

Kindly tell me if I can help here further.
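
To narrow down point 1, a rough single-layer probe might also be informative. This is only a sketch: the attribute names (base_layer, lora_A / lora_B, scaling, the 'default' adapter key) follow peft's current LoRA layout and may differ between versions, and is_quantized_lora_layer is just a hypothetical helper. It dequantizes one frozen 4-bit base weight, builds the float merged weight W + scaling * B @ A, re-quantizes it to NF4 (which is effectively what merge_and_unload stores for a 4-bit base), and measures how much of the LoRA update survives the round-trip.

### Rough per-layer probe (sketch; relies on peft/bitsandbytes internals)
import bitsandbytes.functional as bnb_F
import torch

def is_quantized_lora_layer(m):
    return (hasattr(m, 'lora_A') and hasattr(m, 'base_layer')
            and hasattr(m.base_layer.weight, 'quant_state'))

layer = next(m for m in trainer.model.modules() if is_quantized_lora_layer(m))

# Dequantize the frozen 4-bit base weight and build the float merged weight
W = bnb_F.dequantize_4bit(layer.base_layer.weight.data,
                          layer.base_layer.weight.quant_state).float()
delta = layer.scaling['default'] * (
    layer.lora_B['default'].weight.float() @ layer.lora_A['default'].weight.float()
)
W_merged = W + delta

# Re-quantize the merged weight to NF4 and measure the rounding error
q, state = bnb_F.quantize_4bit(W_merged.to(torch.bfloat16), quant_type='nf4')
W_requant = bnb_F.dequantize_4bit(q, state).float()

print('mean |LoRA update|:        ', delta.abs().mean().item())
print('mean re-quantization error:', (W_merged - W_requant).abs().mean().item())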

ArthurZucker commented 5 months ago

cc @SunMarc if you can have a look!

SunMarc commented 4 months ago

Hi @Aktsvigun, thanks for this detailed report! I'll have a look ASAP. cc @danielhanchen, did you ever run into this issue? cc @matthewdouglas @Titus-von-Koeller if you have some time.

Titus-von-Koeller commented 4 months ago

Yes, agreed, this is a nice bug report!

@SunMarc Unfortunately, I'm not free for this in the coming weeks, unless it's quite high impact. I have to focus on bringing the multi-backend-refactor and some related things across the finish line.

AtsunoriFujita commented 3 months ago

Hi, I am facing the same issue.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

gpadres commented 1 month ago

+1

BenjaminBossan commented 1 month ago

One thing you could try is to load the non-quantized model (or dequantize the quantized model), merge the LoRA weights into the floats, and then quantize the model again.
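
For reference, a rough sketch of that workaround (paths such as 'adapter_out' and 'merged-bf16' are placeholders, and it assumes the LoRA adapter was saved, e.g. with trainer.save_model('adapter_out')):

### Sketch of the suggested workaround: merge in float, re-quantize afterwards
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'microsoft/Phi-3-mini-4k-instruct'

# 1. Reload the base model in half precision instead of 4-bit
base = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# 2. Attach the trained LoRA adapter and merge it into the float weights
merged = PeftModel.from_pretrained(base, 'adapter_out').merge_and_unload()
merged.save_pretrained('merged-bf16')

# 3. Re-quantize the merged checkpoint only if/when needed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized = AutoModelForCausalLM.from_pretrained(
    'merged-bf16', quantization_config=bnb_config, trust_remote_code=True
)

A merged bf16 checkpoint saved this way can also be loaded directly with vllm, so step 3 is only needed when memory is a constraint.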

benjamin-marie commented 1 month ago

The LoRA adapter is not quantized during fine-tuning; after merging and re-quantization, its weights become part of the model and are quantized as well. I would expect some natural degradation in performance. Moreover, I suspect that some of these parameters become outliers in the merged model, i.e., harder to quantize with a technique like bitsandbytes.

After merging, I would recommend a more accurate method like AWQ.
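
If going that route, a sketch of post-merge quantization might look like the following; this assumes the autoawq library (one possible AWQ implementation), reuses the placeholder 'merged-bf16' path from the sketch above, and uses common default values for quant_config.

### Sketch: post-merge AWQ quantization with autoawq (paths/config illustrative)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = 'merged-bf16'
quant_config = {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)

# AWQ calibrates on sample data internally, then writes a quantized checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized('merged-awq')
tokenizer.save_pretrained('merged-awq')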

SunMarc commented 1 month ago

> One thing you could try is to load the non-quantized model (or dequantize the quantized model), merge the LoRA weights into the floats, and then quantize the model again.

One thing that could be nice to try out is to allow fake quantization in the LoRA forward pass: during the forward, we quantize the weights and then immediately dequantize them, so that training takes the quantization error into account. This way we might see less degradation after merging. This is something we can probably test with diffusers models cc @sayakpaul
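
A very rough sketch of what such fake quantization could look like, using bitsandbytes' NF4 quantize/dequantize round-trip; fake_quantize_nf4 is a hypothetical helper and this is not an existing peft feature.

### Sketch: NF4 "fake quantization" round-trip (not an existing peft option)
import torch
import bitsandbytes.functional as bnb_F

def fake_quantize_nf4(weight: torch.Tensor) -> torch.Tensor:
    # Quantize to 4-bit NF4 blocks, then dequantize straight back, so the
    # forward pass sees the rounding error a merged 4-bit model would see.
    q, state = bnb_F.quantize_4bit(weight, quant_type='nf4')
    return bnb_F.dequantize_4bit(q, state).to(weight.dtype)

# Inside a custom LoRA forward, the frozen base weight would be replaced by its
# fake-quantized version (gradients still flow through x and the A/B matrices):
# def forward(self, x):
#     w = fake_quantize_nf4(self.base_layer.weight)
#     return x @ w.T + self.lora_B(self.lora_A(x)) * self.scaling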

sayakpaul commented 1 month ago

Looks like the perfect timing doesn't exist: https://x.com/RisingSayak/status/1849019148585885815

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.