EngEmmanuel opened 2 weeks ago
Update: I've found some discussions on a similar issue
Indeed, for `MultiheadAttention`, we have to jump through some hoops to make it work. Hopefully that PR can be merged soon, but there might still be some edge cases we haven't accounted for. If you can give that branch a try and report back whether it worked for you, that would help us determine if the branch works correctly.
Thanks for your reply and your work on this library. Below are the results I am seeing for different inputs to the `target_modules` arg:

main branch:
- "out_proj": the RuntimeError from the original problem; `len([m for m in peft_model.modules() if isinstance(m, PeftMha)])` equates to 0 (the full check is written out after this list)
- "attn": ValueError: Target module MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True) ) is not supported. Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv2d`, `transformers.pytorch_utils.Conv1D`.

your branch:
- "out_proj": the RuntimeError from the original problem; `len([m for m in peft_model.modules() if isinstance(m, PeftMha)])` equates to 12
- "attn": Trains! (Thank you)
Q1) So that I understand the consequences for my purposes, could I please clarify a few things? Looking at my model print in the original issue, am I right in thinking that "out_proj" corresponds to the W_o matrix and "attn" corresponds to W_{q,k,v,o}? If so, the above result would mean I basically have to apply LoRA to all the attention weight matrices instead of having the option to select just the query matrix (W_q), for example?
Q2) What's the main difference between `inject_adapter_in_model(...)` and `get_peft_model(...)`? I just want a model that has extra LoRA layers added and that I can then immediately start training. Is there a perk to using either for my purpose?
To your questions: `inject_adapter_in_model` is a low-level API that adds the PEFT layers to the model but leaves the model as is otherwise. With `get_peft_model`, you get a `PeftModel` instance back that wraps the original model. This `PeftModel` has a bunch of convenience methods you will most likely want to use at a later point, like merging the layers. If you're sure you don't need those, you can use `inject_adapter_in_model` instead, but I would recommend `get_peft_model`.
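A minimal sketch to make the difference concrete (the toy model and config values here are illustrative, not taken from the issue):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model, inject_adapter_in_model


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(16, 16)
        self.proj_out = nn.Linear(16, 4)

    def forward(self, x):
        return self.proj_out(self.proj_in(x))


# Option 1: low-level injection. The returned object is still a TinyModel,
# just with LoRA layers swapped in; no extra methods are added.
model = inject_adapter_in_model(
    LoraConfig(r=8, lora_alpha=16, target_modules=["proj_in", "proj_out"]),
    TinyModel(),
)

# Option 2: get_peft_model. The model is wrapped in a PeftModel, which adds
# convenience methods such as print_trainable_parameters(), save_pretrained()
# and (for LoRA) merge_and_unload().
peft_model = get_peft_model(
    TinyModel(),
    LoraConfig(r=8, lora_alpha=16, target_modules=["proj_in", "proj_out"]),
)
peft_model.print_trainable_parameters()
```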
System Info
peft = 0.13.2
python = 3.12.7
transformers = 4.45.2
Who can help?
@sayakpaul
I am using `inject_adapter_in_model(...)` to finetune a model from OpenCLIP using LoRA layers. I am able to finetune the model by modifying `Linear()` layers and other supported types as expected. However, there is a model that I am currently training that has an attention module called "out_proj" with the layer type `NonDynamicallyQuantizableLinear(Linear)`. I may be mistaken, but from my understanding of the source code for `NonDynamicallyQuantizableLinear` (https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/linear.py#L136), I should be able to treat it as just a typical `torch.nn.Linear` layer for my purposes. However, I always get the following error: "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn". The LoRA layers are added as expected, both when I target them via target_modules and when I use `register_custom_modules` with the mapping `torch.nn.modules.linear.NonDynamicallyQuantizableLinear` -> `peft.tuners.lora.layer.Linear`. However, neither case trains. Furthermore, the model trains when I include any other layers, e.g. a fully-connected one of type `torch.nn.Linear`.

target_modules =

Any idea why this may be the case? Your help would be truly appreciated.
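For reference, a sketch of the custom-module mapping described above. I am assuming the experimental hook exposed on `LoraConfig` as `_register_custom_module` is what `register_custom_modules` refers to; the exact name may differ between PEFT versions, and the r/alpha values are placeholders:

```python
from torch.nn.modules.linear import NonDynamicallyQuantizableLinear
from peft import LoraConfig, inject_adapter_in_model
from peft.tuners.lora.layer import Linear as LoraLinear

# Illustrative hyperparameters; only the mapping itself is taken from the issue.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["out_proj"])

# Dispatch NonDynamicallyQuantizableLinear to PEFT's LoRA Linear layer.
# NOTE: this relies on an experimental PEFT hook; adjust the call to your
# PEFT version if the method name differs.
config._register_custom_module({NonDynamicallyQuantizableLinear: LoraLinear})

# model = ...  # the OpenCLIP model being finetuned
# model = inject_adapter_in_model(config, model)
```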
Reproduction
Train step:
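The train step is a standard loop along these lines (an illustrative sketch: the loss and batch handling are placeholders, not necessarily the actual training code from the run):

```python
def train_step(model, batch, optimizer):
    images, texts = batch  # assumed to be preprocessed/tokenized and on the right device
    optimizer.zero_grad()
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)
    # Placeholder objective; the real run uses a CLIP-style contrastive loss.
    loss = (image_features - text_features).pow(2).mean()
    loss.backward()  # the RuntimeError described above surfaces here
    optimizer.step()
    return loss.item()
```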
Model structure near a layer of interest:
('transformer.resblocks.11.attn', <class 'torch.nn.modules.activation.MultiheadAttention'>)
('transformer.resblocks.11.attn.out_proj', <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>)
('transformer.resblocks.11.ls_1', <class 'torch.nn.modules.linear.Identity'>)
Injection code:
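A sketch of the injection along these lines; the OpenCLIP checkpoint and the r/alpha values are illustrative assumptions, only `target_modules=["out_proj"]` is taken from the issue:

```python
import open_clip
from peft import LoraConfig, inject_adapter_in_model

# Illustrative checkpoint; the issue does not state which OpenCLIP model is used.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    # "out_proj" is the NonDynamicallyQuantizableLinear inside each MultiheadAttention
    target_modules=["out_proj"],
)
model = inject_adapter_in_model(lora_config, model)
```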
Expected behavior
I would expect it to begin training. Here are the first few printouts of a typical run: