Open Aktsvigun opened 5 months ago
cc @SunMarc if you can have a look!
Hi @Aktsvigun, thanks for this detailed report! I'll have a look asap! Did you have this issue as well, @danielhanchen? If you have some time, also cc @matthewdouglas @Titus-von-Koeller
Yes, agreed, this is a nice bug report!
@SunMarc Unfortunately, I'm not free for this and the coming weeks, unless it's quite high impact. Gotta focus on bringing the multi-backend-refactor and some related things across the finishing line.
Hi, I am facing the same issue.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
+1
One thing you could try is to load the non-quantized model (or dequantize the quantized model), merge the LoRA weights into the floats, and then quantize the model again.
The LoRA adapter weights are not quantized during fine-tuning; after merging and then quantizing, they become part of the model and are quantized as well. I would expect some natural degradation in performance from that. Moreover, I suspect that some of these parameters will be outliers in the model after merging, i.e., more difficult to quantize with a technique like bitsandbytes.
After merging, I would recommend a more accurate method like AWQ.
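To make that concrete, here is a minimal sketch of the dequantize-merge-requantize route; the model name and paths are placeholders, not taken from this issue:

```python
# Sketch of "merge into full-precision weights, then re-quantize".
# Assumes a trained LoRA adapter saved at ./adapter; model id and paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_id = "microsoft/Phi-3-mini-4k-instruct"  # placeholder base model

# 1. Load the base model in full/half precision (NOT quantized).
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# 2. Attach the trained LoRA adapter and merge it into the float weights.
model = PeftModel.from_pretrained(base, "./adapter")
merged = model.merge_and_unload()
merged.save_pretrained("./merged-bf16")

# 3. Quantize the merged checkpoint again, only at load time for inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized = AutoModelForCausalLM.from_pretrained(
    "./merged-bf16", quantization_config=bnb_config
)
```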
One thing that could be nice to try out is to allow fake quantization in the LoRA forward pass. During the forward, we quantize the weights and then immediately dequantize them, so that training takes the quantization error into account. This way we might have less degradation after merging. This is something we can probably test with diffusers models, cc @sayakpaul
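For illustration only, a rough sketch of what such fake quantization could look like in a custom LoRA layer. FakeQuantLoRALinear is a hypothetical module, not an existing peft or bitsandbytes class, and the bitsandbytes 4-bit kernels require CUDA tensors:

```python
# Fake quantization in a LoRA forward: the frozen base weight goes through a
# quantize -> dequantize round trip every forward, so training "sees" the
# quantization error while the stored weights stay in full precision.
import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes.functional as bnbF


class FakeQuantLoRALinear(nn.Module):  # hypothetical module, not a peft API
    def __init__(self, base_linear: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base_linear                 # frozen full-precision layer
        self.base.requires_grad_(False)
        self.lora_A = nn.Linear(base_linear.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)      # LoRA starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # NF4 round trip on the frozen weight (needs a CUDA tensor).
        w = self.base.weight.data
        q, state = bnbF.quantize_4bit(w, quant_type="nf4")
        w_fq = bnbF.dequantize_4bit(q, state).to(x.dtype)
        out = F.linear(x, w_fq, self.base.bias)
        return out + self.lora_B(self.lora_A(x)) * self.scaling
```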
Looks like the perfect timing doesn't exist: https://x.com/RisingSayak/status/1849019148585885815
System Info
transformers version: 4.41.2
trl version: 0.9.3
Who can help?
@ArthurZucker, @younesbelkada
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Hi, I found really strange behaviour when calling the .merge_and_unload() method. More precisely, this is a must-have step if you want to further use the model with other frameworks (e.g. with vllm for inference), but it dramatically impairs the model performance. I tested this in 6 settings on a grammar-checking task with the Phi-3 model:

1. QLoRA + bf16=True in training arguments: model quality is severely damaged (0.12 loss for a PeftModel vs 0.56 loss for a merged model).
2. QLoRA + fp16=True in training arguments: model quality is severely damaged (0.12 loss for a PeftModel vs 0.56 loss for a merged model).
3. QLoRA without fp16 / bf16 (correct me if I'm wrong, I believe such a setting preserves the usage of torch.float32): model quality is severely damaged (0.12 loss for a PeftModel vs 0.56 loss for a merged model).
4. LoRA + bf16=True in training arguments: model quality is slightly damaged (0.125 loss for a PeftModel vs 0.135 loss for a merged model).
5. LoRA + fp16=True in training arguments: model quality is NOT damaged (0.11791 loss for a PeftModel vs 0.11796 loss for a merged model, which is due to the dtype change).
6. LoRA without fp16 / bf16: model quality is NOT damaged (0.11793883 vs 0.11793886).

These observations are robust across different tasks, models, and even architectures (e.g. in the example I'm using a CausalLM, yet the observations also hold for sequence classification models).

I believe there may be a bug with the bf16=True parameter in training arguments. Still, for QLoRA the performance decrease occurs with other dtypes as well.

For convenience, I attach the .ipynb notebooks for all 6 settings (GitHub won't let me upload .ipynb, so please download these .txt files and change their extension to .ipynb). I used trl here to make the code easier to follow; I observe exactly the same behaviour when using a plain transformers implementation (with TrainingArguments, Trainer, etc.). Below I attach the code for the first setting (I'd call it the most "erroneous" one), QLoRA + bf16=True:

qlora_fp16.txt qlora_float32.txt qlora_bf16.txt lora_fp16.txt lora_float32.txt lora_bf16.txt
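For readers who don't want to open the notebooks, here is a condensed sketch of the first setting (QLoRA + bf16=True). The dataset, LoRA hyperparameters, and training length are illustrative placeholders, not the exact values from the attached notebooks:

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

model_id = "microsoft/Phi-3-mini-4k-instruct"

# QLoRA: 4-bit NF4 base weights with bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder data standing in for the grammar-checking dataset.
train_ds = Dataset.from_dict({"text": ["input sentence -> corrected sentence"] * 32})
eval_ds = Dataset.from_dict({"text": ["input sentence -> corrected sentence"] * 8})

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
args = SFTConfig(
    output_dir="qlora_bf16",
    bf16=True,                  # the flag that seems to matter most
    dataset_text_field="text",
    max_steps=50,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
print("PeftModel eval loss:", trainer.evaluate()["eval_loss"])

# Merging the adapter into the 4-bit base is the step after which the eval loss jumps.
merged = trainer.model.merge_and_unload()
merged.save_pretrained("qlora_bf16_merged")
```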
Expected behavior
I can expect a minor drop in performance, but definitely not a 4x increase in loss. I bet there are bugs in:

1. The merge_and_unload implementation.
2. bf16=True, which produces errors since even without quantization it increases the model's loss (which does not happen when this option is disabled).

Kindly tell me if I can help here further.
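One way to narrow down which of the two it is (a hypothetical diagnostic, not part of the attached notebooks) would be to compare the logits of the PeftModel and its merged copy on the same batch; for plain LoRA in float32 they should match almost exactly, while a large gap in the QLoRA / bf16 settings would point at the merge itself:

```python
import copy
import torch

@torch.no_grad()
def compare_merge(peft_model, batch):
    """Print the max absolute logit difference before vs. after merging."""
    ref = peft_model(**batch).logits
    # deepcopy keeps the original PeftModel usable; for 4-bit models,
    # reloading from disk may be safer than deepcopy.
    merged = copy.deepcopy(peft_model).merge_and_unload()
    out = merged(**batch).logits
    print("max abs logit diff:", (ref - out).abs().max().item())
```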