huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

fine-tuning OpenCLIP with Hugging Face's PEFT (such as LoRA) #761

KyanChen opened this issue 1 year ago

KyanChen commented 1 year ago

Feature request

fine-tuning OpenCLIP with Hugging Face's PEFT (such as LoRA)

Motivation

fine-tuning OpenCLIP with Hugging Face's PEFT (such as LoRA)

Your contribution

refer to https://github.com/KyanChen/MakeMultiHeadNaive/tree/master for help!

BenjaminBossan commented 1 year ago

Sorry, could you please provide more details? Are you looking for help with how to achieve that, or are you suggesting that it doesn't work right now?

KyanChen commented 1 year ago

Currently, Hugging Face's PEFT (e.g. LoRA) cannot fine-tune the linear layers inside torch.nn.MultiheadAttention-based transformer models (such as OpenCLIP). If I want to use LoRA, I have to replace the torch.nn.MultiheadAttention layer with a self-implemented naive multi-head attention layer. Can you help integrate support for this into the official PEFT library?
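
For context, here is roughly what fails today (a minimal sketch; the checkpoint is just an example and the exact error message may differ):

import open_clip
from peft import LoraConfig, get_peft_model

# open_clip's transformer blocks use torch.nn.MultiheadAttention modules named "attn",
# which the standard PEFT LoRA layers do not know how to wrap.
model, preprocess = open_clip.create_model_from_pretrained("hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
config = LoraConfig(target_modules=["attn"])

# Expected to raise a ValueError along the lines of
# "Target module MultiheadAttention(...) is not supported", since only
# Linear/Conv/Embedding-style layers are currently handled.
peft_model = get_peft_model(model, config)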

BenjaminBossan commented 1 year ago

I see, thanks for explaining. Indeed, right now it is impossible as a user to change what type of LoRA layer is being used. We have ideas about exposing a "low level" API that would allow users more fine-grained control, including the possibility of using custom layers, as you suggest. I cannot say yet if it will really work out or when it will be ready, but I'll let you know.

duchenzhuang commented 1 year ago

Thanks for your efforts!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

ambroser53 commented 8 months ago

I'd like to bump this: being unable to put LoRA weights on anything that uses nn.MultiheadAttention is a real pain, and using a naive implementation is clunky and cumbersome. It seems strange that LoRA-Torch can do it but PEFT cannot.

BenjaminBossan commented 8 months ago

Hey, I created a PR to add MHA: #1324. The implementation was a bit tricky because this layer is not very "friendly" for LoRA-adaptation, but I think I got it working.

For now, this is just a rough draft, so it would be great if you could test it and tell me if it works for your use case. To install from this branch, run:

python -m pip install git+https://github.com/BenjaminBossan/peft.git@feat-add-lora-multihead-attention

So far, I did the following testing:

import torch
from torch import nn
import open_clip
from peft import LoraConfig, get_peft_model
from PIL import Image
import requests

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
# example LoraConfig; open_clip names its nn.MultiheadAttention modules "attn"
config = LoraConfig(target_modules=["attn"])
peft_model = get_peft_model(model, config)
opt = torch.optim.SGD(peft_model.parameters(), 0.1)

# text encoder
text = tokenizer(["a diagram", "a dog", "a cat"])
text_features = peft_model.encode_text(text)
loss = text_features.sum()
loss.backward()
opt.step()

# image encoder
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
image_features = model.encode_image(image)
image_features.sum().backward()
opt.step()

BenjaminBossan commented 8 months ago

@ambroser53 I think the linked LoRA-torch library has some bugs. For instance:

import torch, loratorch
import torch.nn as nn

model_torch = loratorch.Linear(5, 6, r=4, lora_alpha=1)
loratorch.mark_only_lora_as_trainable(model_torch)
print(model_torch.state_dict().keys())
# prints odict_keys(['weight', 'bias', 'w_lora_A', 'w_lora_B'])

optimizer_torch = torch.optim.SGD(model_torch.parameters(), lr=0.1)

for _ in range(3):
    model_torch.train()
    x = torch.rand(2, 5)

    loss2 = model_torch(x).sum()
    optimizer_torch.zero_grad()
    loss2.backward()
    optimizer_torch.step()

print(model_torch.state_dict().keys())
# odict_keys(['bias', 'w_lora_A', 'w_lora_B'])
# note the missing 'weight' key!

As you can see, the weight is dropped from the state_dict, making it impossible to save the model. Same is true for named_parameters(). So if you're using this package, you should be aware of this.

ambroser53 commented 8 months ago

Hey @BenjaminBossan, cheers for the fork, I'll run some tests on Tuesday. I realised that LoRA-Torch was a bit buggy after I started trying to combine it with PEFT's LoraLayer, but if there's a way to do it without it, that'd be much better.

BenjaminBossan commented 8 months ago

@ambroser53 Did you have time to give it a try?

ambroser53 commented 8 months ago

Hi, sorry, I meant to get back to you sooner. The layers appear to be placed on the nn.MultiheadAttention blocks just fine in my model. My use case is very complicated, though, as it's a custom architecture, so I will need to get back to you on how effective it is and whether the OpenCLIP fine-tuning is bottlenecked or underperforming in some way. Once I have these answers I'll report back.

BenjaminBossan commented 8 months ago

Great, thanks for testing. Do you have an ETA for when these tests finish?

Regarding performance, I would expect a larger overhead than for simple LoRA layers like Linear because of the merging-unmerging roundtrip we have to take, but I'm not sure if it makes a difference in the grand scheme of things.

ambroser53 commented 8 months ago

Should get initial results early next week if there are no disasters.

Out of curiosity, is said overhead computational or memory?

BenjaminBossan commented 8 months ago

Should get initial results early next week if there are no disasters.

Thanks!

Out of curiosity, is said overhead computational or memory?

It should be computational only. However, since we take the same approach here as LoRA-torch, it shouldn't be better or worse than using that.

ambroser53 commented 8 months ago

I've dug deeper in my testing. Mine is a very specific case where LoRA weights are only placed on certain layers and the model uses mixed quantisation, so the placement needed further tinkering. However, now that I've made sure the LoRA layers land where they need to, there's a logic error that only seems to occur some of the time. Essentially, say you have an nn.MultiheadAttention module called attn; it will have the submodule attn.out_proj, which is an nn.Linear (or at least it should be; there's this weird NonDynamicallyQuantizableLinear thing going on, but let's not get into that). If the target_modules in your LoraConfig point to both attn and attn.out_proj, and attn gets turned into a LoraLayer first, then when PEFT tries to find attn.out_proj it's now under attn.base_layer.out_proj.

It doesn't look like out_proj is taken into account by the merge and unmerge, which seem to deal only with in_proj_weight. In the implementation of nn.MultiheadAttention, the forward of out_proj is never actually used; only its weight and bias tensors are passed along. I thought this could be fixed just by forcing the LoraLayer onto attn.out_proj before attn, but I think that would create problems, since nn.MultiheadAttention never calls out_proj's forward and the LoRA weights would then be ignored entirely.

Could there be a simple fix that just does for out_proj.weight what is already done for in_proj_weight?
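
To illustrate the point about out_proj, here is a small plain-PyTorch check (the spy function is only for demonstration) showing that nn.MultiheadAttention never calls out_proj's forward and only reads its weight and bias, so a LoRA wrapper around out_proj alone would never run:

import torch
from torch import nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

called = {"forward": False}
orig_forward = mha.out_proj.forward

def spy_forward(x):
    called["forward"] = True
    return orig_forward(x)

# shadow out_proj.forward with the spy
mha.out_proj.forward = spy_forward

x = torch.randn(1, 4, 8)
mha(x, x, x)
print(called["forward"])  # False: MHA only reads out_proj.weight and out_proj.bias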

BenjaminBossan commented 8 months ago

Thanks a lot @ambroser53, your analysis is 100% correct. I pushed a new commit to the PR that now takes into account out_proj.

As is, we now apply LoRA to both in_proj and out_proj. There is currently no way to specify only in_proj or only out_proj. That wouldn't be easy to achieve; we would probably have to implement a new argument (or even several) on LoraConfig to allow it, which seems like overkill for this rather niche feature. My reasoning for applying LoRA to both instead of only in_proj is that the consensus recently seems to have converged towards applying LoRA to as many Linear layers as possible. LMK what you think.

I'll be out of office starting next week, so that PR may stall for a while unless one of the other maintainers has time to take over. Still, please try out this new PR and give us feedback if it works for you.

ambroser53 commented 8 months ago

No, that sounds perfect; I don't think having one or the other would make sense. I should be able to give it a go now and give results next week.

BenjaminBossan commented 8 months ago

I should be able to give it a go now and give results next week.

Nice. If you can give some early feedback today, I may still have time to react to it :)

ambroser53 commented 8 months ago

This may be a problem with my own complex setup, so it could be out of scope here, but does PEFT automatically cast parameters to int8 if the underlying model is loaded in int8? I'm asking because part of the model is in int8 while the rest is skipped via int8_quant_skip_modules. Now that out_proj is implemented, calling get_peft_model throws an error inside _restore_weights for lora.MultiheadAttention: registering out_proj as "weight" seems to cast it to int8 when it was supposed to have been skipped and left as float16. Do you have any insights, or is mixed quantisation something wholly unwieldy that I'm unlikely to find a quick fix for?

BenjaminBossan commented 8 months ago

Hmm, normally the weights should not be automatically cast to int8. If you have some way to reproduce this error, I could investigate.

Looking at this issue in general, I think, however, that this implementation will not work correctly with quantized weights. As is, we merge the LoRA weights into the base weights. When the latter are quantized, this requires special treatment, similar to the bnb layers we have for LoRA, a normal merge would surely fail. So I think we would need a completely separate MHA class for quantized layers.

I'm not exactly sure what it is that you're doing with quantization, but as you've remarked earlier, the out_proj actually uses NonDynamicallyQuantizableLinear, which from my understanding exists to prevent some kind of error with quantization. I wonder if that could be related.

ambroser53 commented 8 months ago

I understand that, but the point is that the MHA isn't quantised at all. The confusing part is that the MHA and out_proj nn.Linear are being passed to int8_quant_skip_modules. It should be okay for now; I'll train on two cards since it can't all fit on one. Hopefully I'll have some results soon.

BenjaminBossan commented 8 months ago

I understand that, but the point is that the MHA isn't quantised at all.

Ah I see, that is indeed very strange and should not happen.

The confusing part is that the MHA and out_proj nn.Linear are being passed to int8_quant_skip_modules

Can you point me to a reference for int8_quant_skip_modules?

ambroser53 commented 8 months ago

Here's the code for the BitsAndBytesConfig configuration object where you can specify int8_quant_skip_modules, but there's no further documentation beyond what is in the initialisation comment. It does seem to be working, as prior to calling get_peft_config the correct modules are in the correct datatype.

I'll try to put together a code sample that reproduces the issue (the code I'm referring to right now is proprietary, for a company).

ambroser53 commented 8 months ago

One more potential bug. It seems that when using get_peft_model on a large model with an MHA inside, the internal parameters of the MHA (i.e. in_proj_weight and out_proj.weight) end up with requires_grad=True. It's actually really hard to force them back to False, and I don't quite know why. I wonder whether it's because of the nested LoraLayers, or because something that ensures the base weights don't require gradients in other LoraLayers is missing here.

ambroser53 commented 8 months ago

It is very bizarre. The following code is from my script. attn_pool.attn is the (only) MHA:

model.base_model.model.model.vision_model.attn_pool.attn.base_layer.in_proj_weight.requires_grad = False
model.base_model.model.model.vision_model.attn_pool.attn.base_layer.out_proj.base_layer.weight.requires_grad = False

trainable_params = [name for name, param in model.named_parameters() if param.requires_grad]

print(model.base_model.model.model.vision_model.attn_pool.attn.base_layer.in_proj_weight.requires_grad)

This outputs True, and both in_proj_weight and out_proj.weight end up in trainable_params. It's almost as if iterating through the module names causes them to be made trainable. This doesn't happen with any other parameters in the wrapped model, only these two that reside in the MHA.

ambroser53 commented 8 months ago

This repo is a self-contained case that reproduces the error when using the MHA PEFT branch.

This takes priority over the int8 stuff.

BenjaminBossan commented 7 months ago

Hi @ambroser53 I'm back in office. Thanks a lot for figuring out this bug and providing a reproducer. I could identify the issue and it should now be fixed. When running your example locally, I now get the correct gradients. Please take a look.

It's almost as if iterating through the module names causes them to be made trainable.

This was indeed the case! The reason for this is explained here:

https://github.com/huggingface/peft/pull/1324/files#diff-24a141c266b7b714ae8fcc470f31bc283f7b0f5a671bbf6d5f092741fc374104R899-R903

Here's the code for the BitsAndBytesConfig configuration object

Sorry, did you mean to include a link here?

ambroser53 commented 7 months ago

Thank you for this. If I get a chance I'll test it in my use case. The problem is that it's been a while and we have started using a different approach, so I'm not sure how fast I'll be. Do you think this hacky way will have any additional speed inefficiencies?

The link I meant to put was this one here: https://github.com/huggingface/transformers/blob/v4.37.2/src/transformers/utils/quantization_config.py#L150

BenjaminBossan commented 7 months ago

If I get a chance I'll test it in my use case. The problem is that it's been a while and we have started using a different approach, so I'm not sure how fast I'll be.

Whenever you have some time, or some code you could share for us to test, it would be great.

Do you think this hacky way will have any additional speed inefficiencies?

Not 100% sure. I'd say probably not, because _restore_weights is only called when named_modules, modules, or state_dict is called, so not during regular training or inference. But the top priority is to get this running at all (as long as it's not super slow); performance can be improved later if necessary.

The link I meant to put was this one here:

I see. It appears the option was renamed to llm_int8_skip_modules, which is why I couldn't find it. I skimmed the bnb code base and from what I can tell, the full names are matched (so not like target_modules in PEFT), so maybe that's the reason why it didn't work? Otherwise, I'd need code to reproduce the issue to investigate further.
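
For reference, this is roughly what I mean (a sketch only; the module names below are hypothetical placeholders taken from your description, not submodules of any particular checkpoint):

from transformers import BitsAndBytesConfig

# llm_int8_skip_modules lists the submodules that bitsandbytes should leave un-quantized.
# Since the name matching differs from PEFT's target_modules, verify the entries against
# model.named_modules() if a module unexpectedly ends up in int8.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["attn_pool", "attn_pool.attn.out_proj"],
)
# then pass quantization_config=bnb_config to the corresponding from_pretrained(...) call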

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

BenjaminBossan commented 6 months ago

not stale

PoopBear1 commented 5 months ago

Thank you for this amazing contribution!
I am currently researching this topic; can I use the multi-head LoRA on the original CLIP as well?

BenjaminBossan commented 5 months ago

I am currently researching this topic; can I use the multi-head LoRA on the original CLIP as well?

Do you mean this model? I just checked it and it doesn't use the MultiHeadAttention layer, so the discussion here does not apply to that model. However, it does use normal Linear layers, so using LoRA from PEFT should work.
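
For reference, something like this should be all that is needed with the Hugging Face CLIP implementation (a sketch; the checkpoint and target_modules below are illustrative choices, not a recommendation):

from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

# The transformers CLIP implementation builds attention from plain nn.Linear layers
# (q_proj, k_proj, v_proj, out_proj), so stock PEFT LoRA applies without any extra PR.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()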

PoopBear1 commented 5 months ago

Thank you so much for the prompt reply.

Here I am referring to the ViT-B/16-based CLIP. It indeed uses MultiheadAttention, and its q/k/v projections are raw parameters instead of being part of a linear layer. However, its output projection is of the NonDynamicallyQuantizableLinear type, to which LoRA cannot be applied either. So I am curious whether this issue can handle this situation as well.

Appreciated!

BenjaminBossan commented 5 months ago

I haven't checked that model, but if it uses the standard nn.MultiHeadAttention from PyTorch, then it would require the PR #1324 to run. If you want, you could give it a try and report back the results.
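
Roughly, something like this (an untested sketch; it assumes the PR branch from above is installed and that the original CLIP repo names its nn.MultiheadAttention blocks "attn"):

import clip  # from https://github.com/openai/CLIP
from peft import LoraConfig, get_peft_model

model, preprocess = clip.load("ViT-B/16", device="cpu")
config = LoraConfig(target_modules=["attn"])
peft_model = get_peft_model(model, config)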

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

sailfish009 commented 4 months ago

It's still buggy. I think we should keep this open until it's fixed. I can reproduce the issue with at least two pieces of code that I can test:

  1. https://github.com/xirui24/finetune-clip-with-fashion-dataset

  2. https://www.kaggle.com/code/zacchaeus/clip-finetune

BenjaminBossan commented 4 months ago

@sailfish009 Could you please provide more details? What is buggy exactly, what error did you get, what PEFT branch did you try?

sailfish009 commented 4 months ago

@BenjaminBossan I tested this by applying the code in the link below. https://velog.io/@blackeyes0u0/LoRA%EB%A5%BC-%ED%86%B5%ED%95%9C-PEFT

A runtime error occurs: (has no attribute 'weight')

This code works fine for similar clip models like the one below. https://dienhoa.github.io/dhblog/posts/finetune_clip.html

If you have working sample code (based on the official one), I'd be happy to test it.

## lora.py
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRA_Config:
    def __init__(self, r, lora_alpha, lora_dropout, merge_weights, target_modules):
        self.r = r
        self.lora_alpha = lora_alpha
        self.lora_dropout = lora_dropout
        self.merge_weights = merge_weights
        self.target_modules = target_modules

class LoRALayer(nn.Module):
    def __init__(self, original_layer, config: LoRA_Config):
        super(LoRALayer, self).__init__()
        self.original_layer = original_layer
        input_dim = original_layer.weight.size(1)
        output_dim = original_layer.weight.size(0)

        # Initialize and then apply kaiming_uniform_
        lora_A_tensor = torch.empty(input_dim, config.r)
        torch.nn.init.kaiming_uniform_(lora_A_tensor)
        self.lora_A = nn.Parameter(lora_A_tensor)

        self.lora_B = nn.Parameter(torch.zeros(config.r, output_dim))
        self.scaling = config.lora_alpha/config.r
        if config.lora_dropout > 0:
            self.dropout = nn.Dropout(p=config.lora_dropout)
        else:
            self.dropout = lambda x: x  # No-op

    def forward(self, x):
        # Apply dropout before the matrix multiplication
        A_dropout = self.dropout(self.lora_A)
        B_dropout = self.dropout(self.lora_B)
        W_prime = self.original_layer.weight + self.scaling * (A_dropout @ B_dropout).t()  # transpose so the update matches weight's (out_features, in_features) shape
        return F.linear(x, W_prime, self.original_layer.bias)

        # This version failed:
    # def forward(self, x):
    #     delta_W = self.dropout(self.lora_B(self.lora_A(x)))
    #     W = self.original_layer(x)
    #     return self.scaling*delta_W + W

# Why the error occurred:
# self.lora_A and self.lora_B are defined as nn.Parameter objects, which are not callable.
# These parameters have to be used in a matrix multiplication instead; in PyTorch that is done with torch.matmul or the @ operator, so they need to be set up correctly in __init__.

    def __repr__(self):
        return f'{self.__class__.__name__}(\n  (original_layer): {self.original_layer},\n  (lora_A): Parameter of size {self.lora_A.size()},\n  (lora_B): Parameter of size {self.lora_B.size()}\n)'

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    #for param in model.parameters():
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad: # if True, count it as a trainable parameter
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable: {100 * trainable_params / all_param:.2f} %"
    )
    return trainable_params, all_param

# Helper function that applies LoRA to each matching layer
def apply_lora_to_model(model, config):
    for name, module in model.named_modules():
        hierarchy = name.split('.')
        if len(hierarchy) > 1:  # Ensure the module is not the top-level module
            parent_module = model
            for submodule_name in hierarchy[:-1]:  # Navigate to the parent module
                parent_module = getattr(parent_module, submodule_name)

            layer_name = hierarchy[-1]
            for target_module in config.target_modules:
                if target_module in layer_name:
                    original_layer = getattr(parent_module, layer_name)
                    if isinstance(original_layer, nn.Linear):
                        setattr(parent_module, layer_name, LoRALayer(original_layer, config))
                        print(f"Replaced {name} with LoRALayer")
    return model

# Apply LoRA modifications to the model
model = apply_lora_to_model(model, lora_config)

# Helper function that makes only the LoRA parameters trainable
def mark_only_lora_as_trainable(model: nn.Module, bias: str = 'none') -> None:
    for n, p in model.named_parameters():
        if 'lora_' not in n:
            p.requires_grad = False
    if bias == 'none':
        return
    elif bias == 'all':
        for n, p in model.named_parameters():
            if 'bias' in n:
                p.requires_grad = True
    elif bias == 'lora_only':
        for m in model.modules():
            if isinstance(m, LoRALayer) and \
                hasattr(m, 'bias') and \
                m.bias is not None:
                    m.bias.requires_grad = True
    else:
        raise NotImplementedError

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    #for param in model.parameters():
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad: # if True, count it as a trainable parameter
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )
    return trainable_params, all_param
# from peft import get_peft_model#, LoraConfig, TaskType
from transformers import CLIPModel,AutoProcessor, AutoTokenizer
from lora import LoRA_Config,LoRALayer

model_name ="openai/clip-vit-base-patch32"
tokenizer = AutoTokenizer.from_pretrained(model_name)
preprocessor = AutoProcessor.from_pretrained(model_name)
model = CLIPModel.from_pretrained(
    model_name,
    # load_in_8bit=True,
    # device_map='auto',
)

# Check the original number of parameters
origin_num = sum(p.numel() for p in model.parameters())
print("Original number of parameters:", origin_num)

# Configuration for LoRA
lora_config = LoRA_Config(
    r=16, 
    lora_alpha=64, 
    lora_dropout=0.03, 
    merge_weights=False, 
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
)

# Apply LoRA to the model
model = apply_lora_to_model(model, lora_config)
mark_only_lora_as_trainable(model,bias='lora_only')
_,basic_model_params_num =  print_trainable_parameters(model)

### Success if the output looks like the following!
trainable params: 1328640 || all params: 157005953 || trainable: 1.42 %

BenjaminBossan commented 4 months ago

@sailfish009 Sorry, I can't read that blogpost, but it seems to be using some custom code based on (some rather old) PEFT code. In general, if you want to apply LoRA to OpenCLIP, you have to use PEFT based on the PR #1324. This is because OpenCLIP uses nn.MultiHeadAttention, which we don't support in PEFT but which is added in said PR.

sailfish009 commented 4 months ago

@BenjaminBossan Thank you. I checked with the branch you provided, and it's working fine. There was one error, but it was solved by following the link below. https://github.com/TimDettmers/bitsandbytes/issues/240

miguelalba96 commented 4 months ago

I use the LAION Hugging Face CLIP checkpoints instead of the ones from OpenCLIP; with those you can use the PEFT package without any effort:

https://huggingface.co/collections/laion/openclip-laion-2b-64fcade42d20ced4e9389b30

BenjaminBossan commented 4 months ago

I checked with the branch you provided, and it's working fine.

Thanks for testing.

mm-tpx commented 2 weeks ago

I am not sure why this issue is marked as closed, because from my understanding the PR to add MultiheadAttention support has not been merged to main :(

mm-tpx commented 1 week ago

Are there any plans to merge https://github.com/huggingface/peft/pull/1324 in the near future?

BenjaminBossan commented 1 week ago

Thanks for the reminder @mm-tpx.

Just to explain: The PR is not merged yet as the solution is kind of hacky (due to how MHA is implemented in torch) and people who tested it have reported a few issues, although I tried my best to address them over time, as witnessed in the discussion. The more confirmation I get that people used this implementation successfully, the higher the confidence that the "hack" works and the PR can be merged.

So if you gave that PR a try and it worked for you, please let me know about it. If not, it would be great if you could test it out and give me feedback.