huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Improving generalization of LoRA with wise-ft #1940

Open SimJeg opened 1 month ago

SimJeg commented 1 month ago

Feature request

Include the wise-ft method in LoraModel models.

Motivation

Wise-ft interpolates between the weights $W_{base}$ of a base model and the weights $W_{ft}$ of a finetuned model using the following formula: $$W_{wise} = (1-\alpha) * W_{base} + \alpha * W_{ft} \tag{1}$$ where $\alpha \in [0,1]$. This approach effectively balances the trade-off between the zero-shot capabilities of the base model ($\alpha=0$) and the task-specific performance of the finetuned model ($\alpha=1$). It is particularly well suited when there is a distribution shift between the training data used for fine-tuning and the test data used during inference.

For LoRA, as $W_{ft} = W_{base} + W_{delta}$, we can rewrite formula (1) as: $$W_{wise} = W_{base} + \alpha * W_{delta} \tag{2}$$
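
As a quick numerical sanity check of the equivalence between (1) and (2) for plain LoRA (illustrative tensors only, nothing PEFT-specific):

import torch

torch.manual_seed(0)
W_base = torch.randn(16, 16)
W_A = torch.randn(16, 4)   # LoRA factors with illustrative shapes (rank 4)
W_B = torch.randn(4, 16)
W_delta = W_A @ W_B        # low-rank update
W_ft = W_base + W_delta    # fine-tuned weights

alpha = 0.3
W_wise_1 = (1 - alpha) * W_base + alpha * W_ft  # formula (1)
W_wise_2 = W_base + alpha * W_delta             # formula (2)
assert torch.allclose(W_wise_1, W_wise_2, atol=1e-6)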

Note that the formula above does not apply wise-ft correctly to DoRA, as for DoRA, $W_{ft} = m_{dora} * (W_{base} + W_{delta})$.

We successfully applied this method to win all 5 tracks of the Amazon KDD Cup 2024, achieving performance improvements of up to 1.5%. In this competition, no training data was provided, so there was a distribution shift between the training data built by the participants and the hidden test data used for evaluation.

Your contribution

There are several ways to implement wise-ft for LoRA adapters:

  1. (no code) Update the configuration of the adapters and replace the parameter lora_alpha with α * lora_alpha (which is no longer an int)
  2. (no code) Rescale the adapter weights by $\sqrt{\alpha}$ if no bias is used, as $W_{delta} = W_A \cdot W_B$. We did that in the KDD Cup (see the sketch after this list).
  3. Add a new attribute to the LoraLayer layers to update the scaling parameter of the adapters
  4. Add a new method in the LoraModel class that updates the scaling parameter of the LoRA adapters as follows:
for module in self.model.modules():
    if isinstance(module, LoraLayer):
        module.scaling = {k: v * alpha for k, v in module.scaling.items()}
        # might be better to redefine scaling = alpha * lora_alpha / r OR alpha * lora_alpha / sqrt(r)? How to access use_rslora?
  5. Add a new argument to the merge method of LoraLayer and to the merge_and_unload method of LoraModel to apply wise-ft when merging the adapters.
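
For illustration, here is a minimal sketch of option 2 (a hypothetical helper, not part of PEFT): multiplying both LoRA factors by $\sqrt{\alpha}$ scales $W_{delta} = W_A \cdot W_B$, and hence the whole update, by $\alpha$.

import math

import torch
from peft.tuners.lora import LoraLayer

def rescale_lora_weights_(peft_model, alpha: float) -> None:
    # Hypothetical helper: permanently rescale every LoRA A/B matrix in place by sqrt(alpha).
    factor = math.sqrt(alpha)
    with torch.no_grad():
        for module in peft_model.modules():
            if isinstance(module, LoraLayer):
                for adapter_name in module.lora_A:  # iterate over adapter names
                    module.lora_A[adapter_name].weight.mul_(factor)
                    module.lora_B[adapter_name].weight.mul_(factor)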

The advantage of methods 3. and 5. would be to properly apply wise-ft to DoRA models as well (following formula (1)).

I believe wise-ft can be applied to many use cases where the user knows there might be a shift between the training and test distributions or a need to recover the zero-shot capabilities of the base model.

BenjaminBossan commented 1 month ago

Thanks for proposing this method. As you described, it could be useful to have and should not be very difficult to implement.

Regarding your proposal of how to implement these:

  1. Changing the config after the PEFT model is initialized won't have any effect.
  2. I'd prefer to leave the weights untouched.
  3. For LoRA, we already have methods to set the scale. These are not really advertised, as they were added specifically for the diffusers integration. By themselves, these methods are not very user friendly; we would need something that does not require the user to loop through all layers.
  4. Could build on top of 3.
  5. Not really a fan of baking this into merging, because there could be reasons for wanting to use this feature without merging. It should, however, be possible to use this feature and also to merge, since merging already takes the scaling into account.

To me, it's not clear why 3 and 5 work with DoRA but 4 wouldn't. Could you elaborate on that?

Regarding rslora: After initialization, the scaling will have the value according to rslora. Since alpha and rank are known, the new scaling could be calculated.
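
For reference, a tiny sketch of that recomputation (a hypothetical helper, not a PEFT API), using PEFT's definitions scaling = lora_alpha / r and, with rslora, lora_alpha / sqrt(r):

import math

def wise_ft_scaling(lora_alpha: float, r: int, alpha: float, use_rslora: bool = False) -> float:
    # Recompute the scaling from known config values instead of multiplying the current value.
    base_scaling = lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r
    return alpha * base_scaling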

If I could come up with a proposal of my own: What do you think about a context manager:

with set_adapter_scale(peft_model, 0.5):
    output = peft_model(**inputs)

The reason why I like this is that the context manager would ensure that the scales are always restored to their original values upon leaving the context. Otherwise, I could imagine this getting messy really quickly, making it hard for users to go back to the original values (short of completely reloading the model).
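
For illustration, a minimal sketch of what such a context manager could look like, matching the usage above (an assumption on my part, not necessarily the final implementation):

from contextlib import contextmanager

from peft.tuners.lora import LoraLayer

@contextmanager
def set_adapter_scale(peft_model, alpha):
    # Remember the original scaling of every LoRA layer so it can be restored later.
    originals = []
    for module in peft_model.modules():
        if isinstance(module, LoraLayer):
            originals.append((module, module.scaling.copy()))
            module.scaling = {k: v * alpha for k, v in module.scaling.items()}
    try:
        yield peft_model
    finally:
        # Restore the original values even if the forward pass raises.
        for module, scaling in originals:
            module.scaling = scaling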

SimJeg commented 1 month ago

Hello @BenjaminBossan ,

Thanks for your quick answer.

About DoRA, $W_{ft} = m_{dora} * (W_{base} + W_{delta})$ and $W_{wise} = (1-\alpha) * W_{base} + \alpha * W_{ft}$, so $W_{wise} = (1-\alpha + \alpha * m_{dora}) * W_{base} + \alpha * m_{dora} * W_{delta}$, which is different from $W_{wise} = W_{base} + \alpha * m_{dora} * W_{delta}$.

So the trick I mentioned does not work for DoRA. It might still work in practice, but it's not wise-ft. With methods 3. and 5. you can manually rewrite how the orig_weights and new_weights are combined. But this makes the implementation more complex. One solution to keep it simple would be to warn the user when use_dora=True that it's not doing exactly what they think, and rework the implementation if people find it useful for LoRA and open issues to make it work with DoRA too.

About your proposal, it's really nice; I also like the fact that it's temporary. I guess it would also work if you want to get the merged model?

with set_adapter_scale(peft_model, 0.5):
    merged_model = peft_model.merge_and_unload()

I'm interested in this because in the KDD Cup, we had to merge the models in order to quantize them later.

Following your proposal, how would you advertise the method?

BenjaminBossan commented 1 month ago

With methods 3. and 5. you can manually rewrite how the orig_weights and new_weights are combined.

I see, thanks. Yes, let's try to avoid this if possible.

But this makes the implementation more complex. One solution to keep it simple would be to warn the user when use_dora=True that it's not doing exactly what they think, and rework the implementation if people find it useful for LoRA and open issues to make it work with DoRA too.

Honestly, I'd even be fine with raising an error and telling users that this method is not supported for DoRA. We could go with a warning if we can show that it still works reasonably well with DoRA (I haven't read the linked paper yet, so not sure if it was already tested).

I guess it would work if you want to get the merged model too ?

Yes, exactly.

I have interest in this because in the KDD cup, we had to merge the models to later quantize them.

Does that mean you'd be interested in contributing this to PEFT? ;-)

Following your proposal, how would you advertise the method ?

Do you mean how we let users know about this type of new feature? Generally, there should be documentation and examples for users to inform themselves and discover this. Moreover, we will mention the new method in the release notes and maybe tweet about it. Of course, you can also advertise it any way you like.

SimJeg commented 1 month ago

I'm confident it will work reasonably well with DoRA too, as for $\alpha=0$ and $\alpha=1$ it returns the right results. However, I do not have any experimental data to prove it to you.

Does that mean you'd be interested in contributing this to PEFT? ;-)

Well the KDD cup is over haha, I should have said, "it would have been useful for KDD cup".

Do you mean how we let users know about this type of new feature?

yes exactly and what you propose is great :)

BenjaminBossan commented 1 month ago

Okay, too bad :-] Still I agree it would be useful to have. I'll put this item on the backlog then and also add a few labels to invite contributors to add this feature.

SimJeg commented 1 month ago

I might work on it but I don't have immediate bandwidth

ariG23498 commented 1 month ago

Hey folks! I would love to contribute this feature to the library. Unfortunately I am new to PEFT (both the library and the method). I have a basic intuition as to what is happening.

Following the conversation I think what needs to be done is to build a context manager which handles the scaling? How would you like me to approach this?

BenjaminBossan commented 1 month ago

Thanks a lot for offering to work on the feature. Besides implementing the scaling and context manager itself, these TODOs come to mind:

  1. Ensuring that the original scaling is always restored correctly
  2. Performing checks on whether the model supports this feature, to be able to error out as early as possible if it doesn't
  3. Possibly supporting transformers and diffusers models with PEFT adapters, not only PEFT models (could be a separate PR though)
  4. Documentation
  5. Comprehensive testing (probably the hardest part to be honest)

If you feel this is something you can do (of course with my support), feel free to go ahead and create a PR. Doing a draft PR for early feedback is also an option.

BenjaminBossan commented 1 month ago

@SimJeg Aritra implemented the feature and it's been merged. Maybe you can give it a try by installing from source.

from peft.helpers import set_adapter_scale

with set_adapter_scale(model, 0.5):
    outputs = model(**inputs)
SimJeg commented 1 month ago

Hello,

Many thanks to @ariG23498 for working on this feature and @BenjaminBossan for reviewing it. I used the code at the end of this message for some sanity checks:

It works... but not perfectly. I had to add a time.sleep to make it work. I believe the issue is that the context manager might not restore the original scaling values "fast enough", but I don't know the cleaner way to solve this issue.

Beyond this bug, I have 2 suggestions.

  1. Rename the context manager. To me, set_adapter_scale(model, 0.5) would set the adapter scale to 0.5, while what it actually does is rescale the existing scaling factor by a factor of 0.5. Maybe rescale_adapter_scale?
  2. Wise-ft does not appear in the docstring, while it was the motivation for this feature. This technique is not widely known, and it's not obvious that you just need to rescale the scaling factor to apply it. Here is a proposal (not sure it respects the expected formatting):

"""Context manager to temporarily rescale the scaling of the LoRA adapters in a model.
The original scaling values are restored when the context manager exits. This context manager works with
transformers and diffusers models that have directly loaded LoRA adapters.

For LoRA, applying this context manager with alpha in [0, 1] is strictly equivalent to applying [wise-ft](https://arxiv.org/abs/2109.01903) (see [#1940](https://github.com/huggingface/peft/issues/1940) for details).
It can improve the performance of the model if there is a distribution shift between the training data used
for fine-tuning and the test data used during inference."""

Code for the sanity check (remove the time.sleep line to reproduce the bug):

import torch
from time import sleep
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

from peft import PeftModel
from peft.helpers import set_adapter_scale

device = "..."
model_id = "mistralai/Mistral-7B-v0.1"
peft_model_id = "typeof/zephyr-7b-beta-lora"
tokenizer_id = "HuggingFaceH4/zephyr-7b-beta"

# Load tokens
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
messages = [{"role": "user", "content": "What's your name ?"}]
tokens = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(
    device
)

outputs = {}

# Model without adapter
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to(device)
with torch.no_grad():
    outputs["base"] = model(tokens).logits[0, -1].float().cpu()

# Model with adapter
model = PeftModel.from_pretrained(model, peft_model_id)
with torch.no_grad():
    outputs["merged"] = model(tokens).logits[0, -1].float().cpu()

# Model with re-scaled adapter
plt.figure(figsize=(20, 5))
n = 6
for i, scale in enumerate(range(n)):
    s = scale / (n - 1)
    sleep(1)
    with set_adapter_scale(model, s), torch.no_grad():
        outputs[f"scale {s}"] = model(tokens).logits[0, -1].float().cpu()
        plt.subplot(2, n, i + 1)
        plt.scatter(outputs["base"], outputs[f"scale {s}"], s=1)
        plt.title(f"Scale {s}")
        if s == 0.0:
            plt.ylabel("base logits")
        plt.subplot(2, n, n + i + 1)
        plt.scatter(outputs["merged"], outputs[f"scale {s}"], s=1)
        if s == 0.0:
            plt.ylabel("merged logits")

assert torch.allclose(outputs["base"], outputs["scale 0.0"])
assert torch.allclose(outputs["merged"], outputs["scale 1.0"])
plt.show()

[image: result]

BenjaminBossan commented 1 month ago

I had to add a time.sleep to make it work. I believe the issue is that the context manager might not restore the original scaling values "fast enough" but I don't know what the cleaner way to solve the issue.

Could you please elaborate on that? I tried your script and got the same results with and without sleep(1). What error do you get, what device are you using?

Otherwise, the suggestions look reasonable to me, @ariG23498 what do you think?

ariG23498 commented 1 month ago

I like the renaming and docstring suggestion. I can open a PR for the same.

SimJeg commented 1 month ago

@BenjaminBossan I'm running it on a macbook with device="mps". The issue I get is simply:

assert torch.allclose(outputs["merged"], outputs["scale 1.0"])
AssertionError

When I get this error, the plot shows it's not a small error (i.e. not something that adding atol=1e-4 would solve).

BenjaminBossan commented 1 month ago

I like the renaming and docstring suggestion. I can open a PR for the same.

That would be great, thanks.

I'm running it on a macbook with device="mps". The issue I get is simply:

Wait, and that goes away when you add sleep? Or is that a separate issue.

SimJeg commented 1 month ago

Here is the plot I get without sleep (moving plt.show before the asserts). Yes, with sleep the error goes away.

[image: result_nosleep]

BenjaminBossan commented 1 month ago

Hmm, this is super strange. I double checked, my graphs look identical with and without sleep (tested both CUDA and CPU). Maybe this is an MPS-specific issue? Perhaps we should add a warning to the docstring then.

SimJeg commented 1 month ago

I just tried with CPU and I don't get the error, so it's very likely related to the mps device. I tried adding torch.mps.synchronize in different places in the code, and using torch.no_grad() on a different line, but it did not help.

BenjaminBossan commented 1 month ago

Interesting. If anyone else could try on their machine, so that we can collect more data on this issue, it would be great.

Anyway, for now I guess the best we can do is to document this; trying to find the root cause will be very hard. I also wonder if more PEFT code is affected by this, as the context manager isn't doing anything fancy that isn't also happening in other parts of PEFT.

SimJeg commented 1 month ago

@BenjaminBossan do you know when the next release of PEFT will be? I'd like to present this feature during the KDD conference on August 28th.

SimJeg commented 1 month ago

@BenjaminBossan The method add_weighted_adapter here does something very similar. However, it seems it only works for a PeftModel, i.e. a model where you already added an adapter and want to add another one (maybe I'm wrong). The models from transformers do have an add_adapter method but no add_weighted_adapter method. Should I open another issue in the transformers repo? What do you recommend?

BenjaminBossan commented 1 month ago

do you know when will be the next release of PEFT ? I'd like to present this feature during the KDD conference on 28th August

There is no concrete plan yet but there is a good chance that there will be something before August 28th. But you could always install PEFT directly from source (you can pin the hash for reproducibility).

The method add_weighted_adapter here does something very similar. However it seems it works only for a PeftModel so a model where you already added an adapter and want to add another one (maybe I'm wrong).

This method is really for merging different LoRA adapters together into a new one. Yes, technically I could just pass a single adapter and choose a weight != 1 to scale that adapter, but that's certainly not the intended use case. Moreover, this will create a new copy, which requires more memory. Therefore, I consider this method orthogonal.
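
For completeness, a rough sketch of that single-adapter workaround (adapter names are illustrative, and as noted it creates an extra copy of the adapter):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "typeof/zephyr-7b-beta-lora", adapter_name="task")

# Create a new adapter that is a downscaled copy of the single existing one, then activate it.
model.add_weighted_adapter(adapters=["task"], weights=[0.5], adapter_name="task_scaled", combination_type="linear")
model.set_adapter("task_scaled")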

The models from transformers do have a add_adapter method but no add_weighted_adapter method. Should I open another issue in the transformers repo ? What do you recommend ?

The idea for the transformers integration of PEFT was never to have full feature parity, but rather to provide only the most commonly used methods. If you want to make use of the full breadth of features, it is recommended to use PEFT directly; it needs to be installed anyway for the transformers integration. The context manager added by Aritra should also work with transformers models.

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.