Closed: Haakooto closed this issue 1 week ago
Thanks a lot for reporting. OLoRA should work with quantization, but as your reproducer shows, there is something amiss.
Interestingly, when using the OLoRA example script (which is not all that different), it can work with quantization, but not with all models it seems. E.g.:
python examples/olora_finetuning/olora_finetuning.py --base_model "facebook/opt-125m" --quantize
runs, whereas
python examples/olora_finetuning/olora_finetuning.py --base_model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --quantize
errors: File "/home/name/work/forks/peft/examples/olora_finetuning/olora_finetuning.py", line 166, in <module>
train(
File "/home/name/work/forks/peft/examples/olora_finetuning/olora_finetuning.py", line 130, in train
trainer.train()
File "/home/name/work/clones/transformers/src/transformers/trainer.py", line 1945, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/name/work/clones/transformers/src/transformers/trainer.py", line 2286, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[...]
File "/home/name/work/forks/peft/src/peft/tuners/lora/bnb.py", line 467, in forward
result = self.base_layer(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 477, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/name/anaconda3/envs/peft/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 509, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x2048 and 256x2048)
I didn't have time to investigate yet, but will probably have time next week. Pinging @tokenizer-decode
Weird. We don't do anything model-specific; we rely on PEFT's quantization. It should be related to that.
I did some further research, though unfortunately I still haven't found the solution.
One difference that I spotted between the scripts is the usage of bnb_4bit_quant_storage
, but even when removed, there is still an error.
Digging deeper, it appears that something goes wrong during the dequantization step:
The output is a 2048x2048 shaped float tensor. However, I think it should be a float tensor of shape 2097152x1, because bnb uses flat tensors (and 2097152 == 2048**2 / 2). For reference, I checked the LoftQ implementation and there I get exactly this flat shape. It is unclear to me why the shapes are different, even though both code paths call bnb.functional.dequantize_4bit(weight.data, weight.quant_state). I checked the quant_state too, but AFAICT it's the same between the two.
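To make the size relation concrete, here is a small sketch (an illustration of the packing arithmetic behind the flat shape mentioned above, not actual bnb internals):

```python
# bitsandbytes packs two 4-bit values into each uint8 byte, so the stored
# tensor for a 2048x2048 4-bit weight has half as many entries as the
# full-precision weight, laid out flat.
rows = cols = 2048
n_values = rows * cols          # number of 4-bit weight values
packed_len = n_values // 2      # two nibbles per byte -> flat (2097152, 1)
print(packed_len)               # 2097152 == 2048**2 / 2
```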
Anyway, because of the wrong shape, we assign an incorrectly shaped (and all zeros) tensor as the new bnb weight here:
which is most likely the reason for the shape error later during forward.
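As a quick illustration (a numpy stand-in, not the actual bnb code path): F.linear computes x @ W.T, so a wrongly shaped weight surfaces as exactly the kind of shape error seen in the traceback above.

```python
import numpy as np

x = np.zeros((4096, 2048))      # activations: (batch * seq_len, in_features)
w_bad = np.zeros((2048, 256))   # weight dequantised to the wrong shape
try:
    x @ w_bad.T                 # F.linear computes x @ W.T -> (4096, 2048) @ (256, 2048)
except ValueError as err:
    print("matmul failed:", err)
```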
When I have time to further investigate, I'll let you know. If any of you have an idea what I'm missing, please let me know.
I poked around in the olora_init function in peft/src/peft/tuners/lora/layer.py. As @BenjaminBossan mentions, the difference between my script and the olora example script is the bnb_4bit_quant_storage flag in the BitsAndBytesConfig. Both scripts crash because of a shape mismatch, both with bnb_4bit_quant_storage=None (default) and with bnb_4bit_quant_storage=torch.bfloat16 (which I used because of this page).
However, the relevant shapes are different. As can be seen in the tracebacks posted by me and @BenjaminBossan, with bnb_4bit_quant_storage=torch.bfloat16, mat2 has size (1x1), while with bnb_4bit_quant_storage=None it has size (NxM). This is because olora_init does not identify the Linear4bit instance as quantised: the only check for this is its dtype, which is bfloat16, so the original weights are not dequantised before computing the QR decomposition.
While this does not shed light on the ultimate issue, this is another bug.
Thanks for the further investigation @Haakooto. I have a tentative solution in #2011, it would be great if you could test it. I consider this WIP as I'm uncertain if the way the PR fixes the issue is optimal.
Finally had a chance to look at it now. This does indeed work. I noticed the solution only covers OLoRA, not PiSSA. It was straightforward to copy over the code that fixes the issue. Though if further improvements are coming, as you say, this is not urgent. Thank you!
Thanks @Haakooto for testing, this is good to know. Let's focus on OLoRA for now and deal with PiSSA later.
I discovered a bug in the PR as it didn't correctly deal with 8bit bnb, but that should now also be fixed.
I have also met this issue with bnb+olora
@44670 The PR is still not merged, but you could install from the https://github.com/BenjaminBossan/peft/tree/fix-olora-bnb branch if you want to test it right now.
System Info
Who can help?
@BenjaminBossan @sayakpaul
Information
Tasks
examples folder

Reproduction
The problem arises when using "olora" or "pissa" as adapter initialisation when loading the model in 4bit.
Expected behavior
Traceback (not full traceback):
RuntimeError: mat1 and mat2 shapes cannot be multiplied (3x2048 and 1x1)
While I have not read the code to make sure, I suspect the issue can be explained as follows: the quantisation process happens before the lora-adapter is applied, and it flattens the original weight matrices. When initialising the lora-adapter with random entries, this flattening causes no problem, as (I suspect) only the original shape is used to create it. However, olora and pissa initialisation compute the QR and SVD decompositions of the original matrix. It seems that the flattening is not respected, so the decomposition is calculated from the (N x 1)-tensor. The resulting lora_A-matrix is a (1x1) tensor, while the base tensor is filled with nan. This causes no issues when initialising, and the code only crashes when an input is passed to the model.
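The suspicion above can be reproduced with a numpy stand-in: the QR decomposition of a flattened (N, 1) tensor yields a (1, 1) factor, matching the (1x1) mat2 in the error (small stand-in sizes here; only the shapes matter).

```python
import numpy as np

flat = np.random.rand(64 * 64 // 2, 1)   # flattened packed weight, shape (2048, 1)
q, r = np.linalg.qr(flat)
print(q.shape, r.shape)                  # (2048, 1) (1, 1) -> a (1x1) lora_A-like factor

full = np.random.rand(64, 64)            # properly dequantised weight of original shape
q2, r2 = np.linalg.qr(full)
print(q2.shape, r2.shape)                # (64, 64) (64, 64)
```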