rasbt / dora-from-scratch

LoRA and DoRA from Scratch Implementations
https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
MIT License

Clarification on the Difference Between LoRA Implementations #2

monk1337 opened this issue 6 months ago

monk1337 commented 6 months ago

Hi @rasbt, I was going through the code you provided and I'm confused about one part.

In the LoRA implementation, your code looks like this:

import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        # A starts with small random values and B with zeros,
        # so the LoRA update is zero at the start of training
        self.A = nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # low-rank update, scaled by alpha
        x = self.alpha * (x @ self.A @ self.B)
        return x

class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        return self.linear(x) + self.lora(x)

Here, linear is the pre-trained model's layer that gets wrapped. Now, when comparing it to the lit-gpt LoRA implementation

https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/lora.py#L173

the input x is passed through a separate linear layer, and its output is then added to the LoRA output.

[Screenshot: the lit-gpt LoRA forward pass]

I'm confused about where this extra linear layer comes from in the lit-gpt code, because in your implementation, rather than passing the input through a new linear layer, we add LoRA directly on top of the model's existing linear layer. Could you please clarify the difference between these two approaches and explain the purpose of the additional linear layer in the lit-gpt implementation?

rasbt commented 6 months ago

This is a good point. I think you are wondering why there is this self.linear layer initialized on line 121, right?

[Screenshot: lit-gpt lora.py, the self.linear initialization on line 121]

In my implementation in this repo, we assume we have loaded a given model and then apply LoRA on top of the existing linear layers.
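
As a rough sketch (this is not code from the repo; load_pretrained_model and the rank/alpha values are just placeholders), that order of operations looks like this:

import torch.nn as nn

model = load_pretrained_model()  # hypothetical loader; the pretrained weights are already in place
for name, module in model.named_children():
    if isinstance(module, nn.Linear):
        # wrap the existing, already-pretrained linear layer with the LinearWithLoRA class from above
        setattr(model, name, LinearWithLoRA(module, rank=8, alpha=16))

(A real model would of course need to recurse into nested submodules; this sketch only touches the top-level children.)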

In Lit-GPT, we would first initialize a model including the LoRA linear layers, and then we load the pretrained weights into the model. So the linear layer on line 121 above would then get its weights from the checkpoint files. This is done via line 163 (https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/finetune/lora.py#L163) in the finetune/lora.py script:

[Screenshot: finetune/lora.py line 163, loading the pretrained checkpoint]

I agree it's more complicated. I think the reason was to make the model initialization more memory-efficient this way.
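
To make that order of operations concrete, here is a minimal sketch; the class name, config, and checkpoint path are placeholders and not the actual Lit-GPT code:

import torch

model = GPTWithLoRALinearLayers(config)  # hypothetical model class whose linear layers are LoRA-augmented and randomly initialized
state_dict = torch.load("checkpoint.pth")  # hypothetical path to the pretrained checkpoint
model.load_state_dict(state_dict, strict=False)  # strict=False because the LoRA A/B matrices are not in the checkpoint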

monk1337 commented 6 months ago

Thank you, @rasbt, for your scratch implementations, I am really enjoying them. Yes, you got it, I am confused about line 121's linear layer transformation. Please forgive my naive questions, as I am still a little bit in doubt. I understood the first part of your reply: in your implementation, we are directly applying LoRA to the existing model's layer, like this:

[Screenshot: applying LoRA directly to the existing model layers]

However, in the Lit-GPT implementation, even though the weights come from the checkpoint files, x still goes through a linear transformation, which seems to be missing in the scratch implementation. If we follow Lit-GPT, then the scratch implementation should look like this:

class LinearWithLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        # Original linear layer
        self.linear = nn.Linear(in_features, out_features)
        # LoRA layer
        self.lora = LoRALayer(in_features, out_features, rank, alpha)

    def forward(self, x):
        # Combining the outputs of the original linear layer and LoRA layer
        return self.linear(x) + self.lora(x)

I didn't get this part:

In Lit-GPT, we would first initialize a model including the LoRA linear layers, and then we load the pretrained weights into the model. So the linear layer on line 121 above would then get its weights from the checkpoint files.

Could you please clarify this part in a bit more detail?

rasbt commented 6 months ago

I think your concern is that, in your code, we would use random weights here:

class LinearWithLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        # Original linear layer
        self.linear = nn.Linear(in_features, out_features) # <------------- HERE
        # LoRA layer
        self.lora = LoRALayer(in_features, out_features, rank, alpha)

    def forward(self, x):
        # Combining the outputs of the original linear layer and LoRA layer
        return self.linear(x) + self.lora(x)

So, if you were to use this implementation in Lit-GPT, it would get initialized like you show above, but then there is a follow-up step where the weights from the original pretrained model's linear layer are loaded into that self.linear layer.
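
In other words, for each layer, something like the following happens (a minimal sketch; pretrained_layer and layer_with_lora are placeholder names, assuming layer_with_lora was built with the LinearWithLoRA class above):

import torch

with torch.no_grad():
    # copy the pretrained weights into the randomly initialized self.linear
    layer_with_lora.linear.weight.copy_(pretrained_layer.weight)
    if pretrained_layer.bias is not None:
        layer_with_lora.linear.bias.copy_(pretrained_layer.bias)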

I tried to outline it here. Please don't hesitate to ask for more clarification:

[Screenshot: outline of the initialization and weight-loading steps]

d-kleine commented 5 months ago

@monk1337 @rasbt Just a short note that an official implementation of DoRA has been released in the meantime: https://github.com/nbasyl/DoRA

rasbt commented 5 months ago

Nice, thanks for sharing. But wait a sec, there is no code in this repo, and they simply use HF in the Readme?

d-kleine commented 5 months ago

Nice, thanks for sharing. But wait a sec, there is no code in this repo, and they simply use HF in the Readme?

Yes, it seems like it has only been implemented in HF PEFT for now, see here. There, it seems like they have used LoRA as a base and added the changes from DoRA, similar to your implementation in the blog article.
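
For reference, the core change DoRA makes on top of LoRA is to re-express the (merged) weight as a learnable magnitude times a normalized direction. A minimal sketch of that idea, reusing the LoRALayer from above (this is not the exact PEFT or blog code), could look like this:

import torch.nn as nn
import torch.nn.functional as F

class LinearWithDoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)
        # one learnable magnitude per column of the (out_features, in_features) weight matrix
        self.m = nn.Parameter(linear.weight.norm(p=2, dim=0, keepdim=True).detach())

    def forward(self, x):
        lora_update = self.lora.alpha * (self.lora.A @ self.lora.B).T  # shape: (out_features, in_features)
        combined = self.linear.weight + lora_update
        direction = combined / combined.norm(p=2, dim=0, keepdim=True)
        return F.linear(x, self.m * direction, self.linear.bias)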

As written in the README, the official code will soon be released at https://github.com/NVlabs. This makes sense to me, as most of the authors worked at Nvidia (and probably conducted their research on DoRA there), as the DoRA paper implies.

d-kleine commented 5 months ago

Officially released now at https://github.com/NVlabs/DoRA

rasbt commented 4 months ago

Woohoo, finally!!