unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

"multiplicative LoRAs" for LLMs? #991

Closed: jukofyork closed this 2 months ago

jukofyork commented 2 months ago

@danielhanchen

Sorry to raise an issue, but I don't know any other way to contact you, and thought that if anyone knows about this it might be you! :)

I just wondered if you've come across any research related to "multiplicative LoRAs" for LLMs?

I have found these two papers that use the same idea for image models (where LoRA seems way more popular):

https://arxiv.org/abs/2306.07280 https://arxiv.org/abs/2311.06243

But they parametrise the update using a block-diagonal matrix instead of a pair of factor matrices, which likely makes sense for images (where block-diagonal structure is related to small-window 1D convolution filters), but not for LLMs...
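Just to pin down what I mean by that distinction (my own notation; W is the frozen weight, and the side/transpose conventions vary between the papers):

```latex
% standard additive LoRA: a low-rank delta added to the frozen weight
W' = W + B A

% "multiplicative LoRA": a low-rank update that multiplies the frozen weight
W' = W (I + B A)

% OFT / BOFT instead constrain the multiplicative factor to be a
% block-diagonal (orthogonal) matrix rather than a low-rank product
W' = W R, \qquad R = \operatorname{diag}(R_1, \dots, R_k)
```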

EDIT: I actually just found that these two papers are implemented in the PEFT library: boft and oft, but I assume these are really aimed at image models only?
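For anyone landing here later, a minimal sketch of using that via PEFT (assuming the current OFTConfig API; exact argument names may differ between versions, and the model name is just an example):

```python
from transformers import AutoModelForCausalLM
from peft import OFTConfig, get_peft_model  # BOFTConfig is the butterfly-factorised variant

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# OFT/BOFT parametrise the multiplicative factor as a block-diagonal
# (orthogonal) matrix, not the pair of low-rank factor matrices I'm after.
config = OFTConfig(
    r=8,                                  # number of orthogonal blocks (check the PEFT docs for exact semantics)
    target_modules=["q_proj", "v_proj"],  # which weights get the multiplicative update
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```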

I'm specifically interested in the 2-matrix factorisation of these, and I outlined why I think this would be particularly interesting for the outputs of the down_proj matrix here:

https://github.com/turboderp/exllamav2/discussions/500#discussioncomment-10532104

But I would like to save myself a lot of work if this has already been implemented for LLMs, or if there is already a PEFT LoRA format that stores these "multiplicative LoRAs".

Have you come across this before?


I think I can easily write the code to extract these in a similar way to Thomas Gauthier's LoRD - just with a couple of extra steps, where A is the base model matrix and B is the delta of the fine-tuned matrix:

A + B = A (I + X) = A I + A X = A + A X
=> B = A X
=> A^+ B = A^+ A X = X

(or likely solve using least-squares instead of the pseudo-inverse)
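For concreteness, a minimal PyTorch sketch of that step (W_base and W_tuned are placeholder names, and the layout may need transposing depending on whether the weight is stored as (out_features, in_features)):

```python
import torch

# W_base: frozen base-model matrix A; W_tuned: fine-tuned matrix A + B (same shape).
delta = (W_tuned - W_base).float()

# We want X such that W_tuned ≈ W_base @ (I + X), i.e. delta ≈ W_base @ X.
# Least-squares solve (usually better conditioned than forming the pseudo-inverse):
X = torch.linalg.lstsq(W_base.float(), delta).solution

# Equivalent pseudo-inverse form:
# X = torch.linalg.pinv(W_base.float()) @ delta
```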

and then take the SVD of X:

X = U^T S V

truncate it:

U'^T S' V'

and either save this as a custom "multiplicative LoRA" (which would need custom code to merge back onto A),

or multiply it out:

B' = A U'^T S' V'

Then take the SVD of B' and save a really high-rank version of it in the same way as LoRD already does, meaning it would still work as a standard "additive LoRA", but unfortunately take up way more space than needed due to B' being close to full rank.
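A rough sketch of those remaining steps (again just illustrative; note torch.linalg.svd returns X = U diag(S) Vh, so the transposes sit a little differently from my notation above, and the rank r is an arbitrary choice here):

```python
import torch

r = 64  # truncation rank for the "multiplicative LoRA" (arbitrary here)

# SVD of X, truncated to rank r.
U, S, Vh = torch.linalg.svd(X, full_matrices=False)
U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r, :]

# Option 1: save (U_r, S_r, Vh_r) as a custom "multiplicative LoRA".
# Merging it back onto the base weight needs custom code:
#   W_merged = W_base @ (torch.eye(X.shape[0]) + U_r @ torch.diag(S_r) @ Vh_r)

# Option 2: multiply it out into an ordinary additive delta...
B_prime = W_base @ (U_r * S_r) @ Vh_r   # == W_base @ U_r @ diag(S_r) @ Vh_r

# ...then SVD B_prime and store it like a standard (but high-rank) additive LoRA,
# the same way LoRD does; B_prime is typically close to full rank, so the rank
# kept here has to be much larger than r.
U2, S2, Vh2 = torch.linalg.svd(B_prime, full_matrices=False)
```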

jukofyork commented 2 months ago

I followed those two papers back to find this: https://arxiv.org/abs/2004.04690

and then this links all the ideas together: https://arxiv.org/abs/2405.17484

They are all to do with image-model LoRAs though.


But equation (5) clearly shows that their "Householder Reflection Adaptation" is the same multiplicative idea with a different parametrisation.

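For my own reference, the update in that equation is (as I read it, and up to which side it multiplies on) the frozen weight times a chain of r Householder reflections:

```latex
W' = W \prod_{i=1}^{r} \left( I - 2 \, \frac{u_i u_i^{\top}}{\lVert u_i \rVert^{2}} \right)
```

i.e. still the W (I + X) form from above, just with X constrained so that the overall factor stays orthogonal.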

I'll close this now as I think I have my answer, and will spend a couple of days reading through all of these! :)