huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

MobileViT does not work with Inference with different LoRA adapters in the same batch #1967

Open saeid93 opened 3 months ago

saeid93 commented 3 months ago

System Info

Python 3.11.9
transformers==4.40.2
peft==0.11.2

Who can help?

@BenjaminBossan

Information

Tasks

Reproduction

The MobileViT model is not compatible with using multiple LoRA adapters in the same batch. Running inference on a batch that mixes adapters via the adapter_names argument triggers the following exception:

https://github.com/huggingface/peft/blob/273acf059e0f1f8bff1a3889f901475e9eb3b7ee/src/peft/tuners/lora/layer.py#L308
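
That check is essentially of the following form (a paraphrased sketch for context, not the exact PEFT source):

def _check_forward_args(self, x, *args, **kwargs):
    adapter_names = kwargs.get("adapter_names", None)
    if adapter_names is None:
        return
    # inside the MobileViT backbone, len(x) becomes batch_size * patch_size ** 2 after
    # unfolding, so this length check fails even though adapter_names is per-sample
    if len(x) != len(adapter_names):
        raise ValueError(
            "Length of `adapter_names` should be the same as the number of inputs, "
            f"but got {len(adapter_names)} and {len(x)} respectively."
        )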

The root cause is that during the unfolding operation in the transformers MobileViT implementation, the first dimension of the input changes from batch_size to batch_size * patch_size**2. This makes it inconsistent with adapter_names, which has length batch_size and whose entries each name the adapter for the corresponding batch item.
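
As a shape-only illustration of the mismatch (the numbers below are made up; this is not the actual unfolding code from transformers):

import torch

batch_size, channels, height, width = 4, 96, 32, 32
patch_size = 2  # config.patch_size
hidden = torch.randn(batch_size, channels, height, width)

# after MobileViT's unfolding, the leading dimension grows by patch_size ** 2
num_patches = (height // patch_size) * (width // patch_size)
unfolded = hidden.reshape(batch_size * patch_size ** 2, num_patches, channels)

adapter_names = ["a", "b", "a", "b"]  # one entry per original batch item
print(unfolded.shape[0], len(adapter_names))  # 16 vs. 4 -> the length check above fails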

Expected behavior

I solved this with a hack that resizes the adapter_names input before passing it to the model and then reverts it to the original size for the classifier. It makes the number of entries proportional to the size produced by the unfolding operation.

Also, as we already discussed, there is another bug, https://github.com/huggingface/peft/issues/1960, separate from this MobileViT-specific problem. The script below contains the modifications needed for both https://github.com/huggingface/peft/issues/1960 and the problem described here.

However, this is just a hack, and I think this should work out of the box. I'm happy to investigate further once I get a chance to first solve https://github.com/huggingface/peft/issues/1960.

# -------- changing the size of the adapter_names input ----------
if model.base_model.model.base_model_prefix == "mobilevit":
    patch_size = model.config.patch_size
    multiply = patch_size ** 2
    # repeat each adapter name patch_size ** 2 times to match the unfolded batch dimension
    resized_adapters_names = []
    for item in batch["adapter_names"]:
        multiplied = [item] * multiply
        resized_adapters_names += multiplied
    batch["adapter_names"] = resized_adapters_names
outputs = model(**batch)

# -------- rest of the code ----------

"""
added this to solve https://github.com/huggingface/peft/issues/1960
"""

from typing import Any, Optional, Union
import torch
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from peft.peft_model import PeftModel
from transformers.modeling_outputs import ImageClassifierOutput, ImageClassifierOutputWithNoAttention
from transformers import ViTForImageClassification, MobileViTForImageClassification
from functools import partial

class PeftModelFixed(PeftModel):
    def forward(self, *args: Any, **kwargs: Any):
        """
        Forward pass of the model.
        """
        with self._enable_peft_forward_hooks(*args, **kwargs):
            # TODO removed this to avoid mixing
            # kwargs = {k: v for k, v in kwargs.items() if k not in self.special_peft_forward_args}
            return self.get_base_model()(*args, **kwargs)

class MobileViTForImageClassificationFixed(MobileViTForImageClassification):
    def forward(
        self,
        pixel_values: Optional[torch.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        labels: Optional[torch.Tensor] = None,
        return_dict: Optional[bool] = None,
        **kwargs # TODO added kwargs
    ) -> Union[tuple, ImageClassifierOutputWithNoAttention]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # TODO here
        outputs = self.mobilevit(pixel_values, output_hidden_states=output_hidden_states, return_dict=return_dict)

        pooled_output = outputs.pooler_output if return_dict else outputs[1]

        # TODO here
        adapter_names = kwargs["adapter_names"]
        patch_size = self.config.patch_size
        multiply = patch_size ** 2
        adapter_names_original = []
        for i in range(0, len(adapter_names), multiply):
            adapter_names_original.append(adapter_names[i])
        logits = self.classifier(self.dropout(pooled_output), adapter_names=adapter_names_original)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ImageClassifierOutputWithNoAttention(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
        )

def peftforward(self, *args, **kwargs):
    if self.disable_adapters or (self.active_adapter not in self.modules_to_save):
        return self.original_module(*args, **kwargs)

    # TODO changed to support LoRA
    adapter_names = kwargs["adapter_names"]
    kwargs = {}
    batch = args[0]
    unique_adapters = set(adapter_names)
    sub_batch_indices_list = []
    for adapter in unique_adapters:
        sub_batch_indices_list.append([index for index, item in enumerate(adapter_names) if item == adapter])

    results = [0 for i in range(len(batch))]
    for i, active_adapter in enumerate(unique_adapters):
        sub_batch = batch[sub_batch_indices_list[i]]
        output = self.modules_to_save[active_adapter](*(sub_batch,), **kwargs)
        for index, j in enumerate(sub_batch_indices_list[i]):
            results[j] = output[index]
    return torch.stack(results)

def change_forward_dynamically(model: PeftModel):
    model.classifier.forward = partial(peftforward, model.classifier)
    return model

BenjaminBossan commented 3 months ago

However, this is just a hack and I think this should work out of the box. I'm happy to investigate further when I get a chance to first solve #1960 .

Thanks a lot again for this detailed analysis, and again I would be very happy to accept a PR to fix this. Regarding the question of how to fix this: I wonder if it would be easier to change the logic inside of _check_forward_args. Maybe we can relax the len(x) != len(adapter_names) check or even (re)move it entirely if that's enough to fix the situation.

saeid93 commented 3 months ago

No problem! About your question: as far as I understand, removing the check won't solve the problem, since there will still be a mismatch between the input size and the number of items in adapter_names. In that case the lora_layer only applies the LoRA weights to a number of rows equal to the number of entries in adapter_names (i.e. the input batch size), which is fine for most models. However, in MobileViT the unfolding operation makes the input's leading dimension differ from the batch size, and since the length of adapter_names matches the batch size, LoRA is not applied to all of the inputs.

I also checked the accuracy against single-LoRA inference, and the numbers support the explanation above: with the check removed, mixed-batch inference does not match the accuracy of single-LoRA inference.

BenjaminBossan commented 3 months ago

I see, I thought it would be possible to remove the check or at least make it optional. The user then needs to ensure that the correct adapter_names are passed so that they are lined up with the way that MobileVit unfolds the 0th dimension. The changes to MobileViTForImageClassification would probably still be necessary (though the mixed batch feature is intended for inference only, not sure if that simplifies things). But I'm probably missing something. Anyway, a PR to fix the situation would be welcome.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

saeid93 commented 1 month ago

Hi @BenjaminBossan, my recent pull request will not solve this issue; that pull request solved https://github.com/huggingface/peft/issues/1960. This one is a MobileViT-specific problem, which I was planning to look into next for a potential solution. Please let me know if you think it is no longer necessary; I just want to note that it is not related to #1990.

BenjaminBossan commented 1 month ago

This issue was auto-closed by merging said PR because you wrote

I'm happy to go through models one by one and also fix #1967

and GH automatically parses "fix #XXXX" and closes the corresponding issue :)

Yes, let's leave this open. If you have time, it would be great if you could work on solving this issue here as well. We can further discuss potential solutions.

saeid93 commented 1 month ago

ah, I see :) Sure, I'll have a look when I get a chance.

saeid93 commented 4 weeks ago

Hi @BenjaminBossan,

I looked into this issue in more depth, but I'm still a bit unsure of the best way to implement a solution. I explored three different approaches, but each has its own challenges, which I've explained below. I would appreciate your opinion on these and any other solutions you might suggest.

Background

As mentioned above, the problem is that the unfolding operation changes the dimensions of the input in MobileViT. As a result, we need to scale the adapter_names in proportion to the patch size of the input:

# -------- Adjusting the size of the adapter_names input ----------
if model.base_model.model.base_model_prefix == "mobilevit":
    patch_size = model.config.patch_size
    multiply = patch_size ** 2
    resized_adapters_names = []
    for item in batch["adapter_names"]:
        multiplied = [item] * multiply
        resized_adapters_names += multiplied
    batch["adapter_names"] = resized_adapters_names
outputs = model(**batch)

Note that after the fixes in https://github.com/huggingface/peft/pull/1990, this solution will no longer work out of the box since the MobileViT part expects the modified format above, while the classifier part expects the original length for the adapter_names input.

Solution 1

Attempt to modify the code so that adapter_names is adjusted differently for the ViT (backbone) part and for the classifier part.

This solution aims to apply the same workaround I'm currently doing (subclassing MobileViT) but without subclassing, instead injecting the modified logic dynamically—similar to this approach—by using a pre-hook that can adjust adapter_names if the model type is mobilevit. The challenge I encountered was determining how much to scale adapter_names. We need access to the patch_size variable in a function like this one, which can then be added as a pre-hook. However, passing down the patch_size variable required substantial changes to the existing function signatures, which complicates this approach.
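
A rough sketch of what such a pre-hook could look like (the helper below is hypothetical and glosses over how patch_size and adapter_names actually reach each LoRA layer, which is exactly the difficulty described above):

from functools import partial

def expand_adapter_names_hook(module, args, kwargs, patch_size):
    # repeat each per-sample adapter name once per unfolded patch position
    adapter_names = kwargs.get("adapter_names")
    if adapter_names is not None:
        kwargs["adapter_names"] = [name for name in adapter_names for _ in range(patch_size ** 2)]
    return args, kwargs

# hypothetical registration on the LoRA layers inside the MobileViT backbone;
# threading config.patch_size down to this point is the hard part
# layer.register_forward_pre_hook(
#     partial(expand_adapter_names_hook, patch_size=config.patch_size), with_kwargs=True
# )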

Solution 2

Rewrite how the PEFT library applies LoRA layers.

In this approach, I considered rewriting the SelfAttention layer of MobileViT to account for the size change when LoRA is applied, potentially by adding a dispatcher for MobileViT. However, this required significant changes to how LoRA layers are added, which could potentially disrupt other parts of the model.

Solution 3

Reimplement MobileViT with an inherited function in the PEFT library (similar to the workaround I used earlier, but with modifications to account for the fixes in #1990). The downside of this solution is that it involves adding special-case logic for a specific model type in the PEFT library, which feels overly hacky.

Please let me know if you have any suggestions for a better approach or any comments on the solutions discussed. I'm happy to proceed based on your recommendations.

BenjaminBossan commented 3 weeks ago

Thanks for digging deeper into this issue and thinking of a few possible solutions. As you discussed, each of them has its own drawbacks, so it's not clear how to proceed.

Something that came to my mind is the following solution: Let's say we have n entries in adapter_names but len(x) is k * n. Could we just "broadcast" adapter_names to repeat each entry k times? Since this would be done on a per-layer basis, it should hopefully not interfere with layers that don't need it. Of course, this is a bit "magic" and could potentially misfire when the two lengths just happen to match by accident, but maybe we can live with that.

LMK what you think of this solution.

saeid93 commented 3 weeks ago

Thank you for the suggestion! I'll take a closer look when I get the chance. One quick question that comes to mind: since adapter_names is a list and doesn't support broadcasting, are you suggesting that everything should be converted to NumPy arrays before attempting this solution?

BenjaminBossan commented 3 weeks ago

No, I don't really mean broadcasting in the sense of numpy, hence why I wrote "broadcasting" :) What I mean is repeating the same items multiple times. Simplified code would be something like this:

adapter_names = ["a", "b", "a", "c"]
x = range(12)  # 3 times the size of adapter_names
quot, remainder = divmod(len(adapter_names), len(x))  # 3, 0
if remainder != 0:
    raise ...
adapter_names = sum([[i] * quot for i in adapter_names], [])
print(adapter_names)  # ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'c', 'c', 'c']
saeid93 commented 3 weeks ago

Thank you for your clarification, I'll work on it when I get a chance.

BenjaminBossan commented 3 weeks ago

Great, thanks. Of course I might be missing something and one of your proposals could make more sense.

github-actions[bot] commented 3 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.