monk1337 opened this issue 6 months ago
This is a good point. I think you are wondering why there is this `self.linear` layer initialized on line 121, right?
In my implementation in this repo, we assume we have loaded a given model and then apply LoRA on top of the existing linear layers.
In Lit-GPT, we would first initialize a model that includes the LoRA linear layers, and then load the pretrained weights into the model. So the linear layer on line 121 above gets its weights from the checkpoint files at that point. This is done via line 163 (https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/finetune/lora.py#L163) in the finetune/lora.py script.
I agree it's more complicated. I think the reason was to make the model initialization more memory-efficient this way.
Thank you, @rasbt, for your scratch implementations. I am really enjoying them. Yes, you got it: I am confused about line 121's linear-layer transformation. Please forgive my naive questions, as I am still a little bit in doubt. I understood the first part of your reply: in your implementation, we apply LoRA directly on top of the existing model's layer, like this
However, in the Lit-GPT implementation, even if the weights come from the checkpoint files, `x` still passes through a linear transformation, which seems to be missing in the scratch implementation. If we follow Lit-GPT, the scratch implementation should look like this:
```python
class LinearWithLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        # Original linear layer
        self.linear = nn.Linear(in_features, out_features)
        # LoRA layer
        self.lora = LoRALayer(in_features, out_features, rank, alpha)

    def forward(self, x):
        # Combining the outputs of the original linear layer and LoRA layer
        return self.linear(x) + self.lora(x)
```
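The `LoRALayer` referenced above is not shown in this snippet. A minimal sketch of such a layer, following the usual LoRA parameterization (illustrative only, not the exact code from either repo), could be:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        # A gets small random values, B is zero-initialized, so the
        # LoRA branch contributes nothing before training starts.
        self.A = nn.Parameter(torch.randn(in_features, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_features))
        self.alpha = alpha

    def forward(self, x):
        # Low-rank update: x @ A @ B, scaled by alpha
        return self.alpha * (x @ self.A @ self.B)

layer = LoRALayer(8, 4, rank=2, alpha=1.0)
x = torch.randn(3, 8)
out = layer(x)
# With B zero-initialized, the output is all zeros at initialization
assert out.shape == (3, 4)
assert torch.all(out == 0)
```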
I didn't get this part:

> In Lit-GPT, we would first initialize a model including the LoRA linear layers, and then we load the pre-trained weights into the model. So that linear layer in line 121 above would get the weights from the checkpoint files then.
Could you please clarify this part in a bit more detail?
I think your concern is that in this code, we use random weights here:
```python
class LinearWithLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        # Original linear layer
        self.linear = nn.Linear(in_features, out_features)  # <------------- HERE
        # LoRA layer
        self.lora = LoRALayer(in_features, out_features, rank, alpha)

    def forward(self, x):
        # Combining the outputs of the original linear layer and LoRA layer
        return self.linear(x) + self.lora(x)
```
So, if you were to use this implementation in Lit-GPT, it would get initialized as you show above, but then there is a follow-up step where the weights from the original pretrained model's linear layer are loaded into that `self.linear` layer.
I tried to outline it here. Please don't hesitate to ask for more clarification.
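The two-step flow described above can be sketched as follows. This is a hypothetical minimal example with illustrative names, not Lit-GPT's actual API:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(in_features, rank) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(rank, out_features))  # zero init
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)  # random at first
        self.lora = LoRALayer(in_features, out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# Step 1: the model is initialized with random weights everywhere.
layer = LinearWithLoRA(16, 8, rank=4, alpha=1.0)

# Step 2: the pretrained checkpoint weights are loaded into self.linear.
pretrained = nn.Linear(16, 8)  # stand-in for a checkpoint's linear layer
layer.linear.load_state_dict(pretrained.state_dict())

# Because the LoRA branch is zero at init, the wrapped layer now
# reproduces the pretrained layer exactly:
x = torch.randn(2, 16)
assert torch.allclose(layer(x), pretrained(x))
```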
@monk1337 @rasbt Just a short information that an official implementation of DoRA has been released in the meantime: https://github.com/nbasyl/DoRA
Nice, thanks for sharing. But wait a sec, there is no code in this repo, and they simply use HF in the Readme?
Yes, it seems like it has only been implemented in HF PEFT for now; see here. It looks like they used LoRA as a base and added the changes from DoRA, similar to your implementation in the blog article.
As written in the README, the official code will soon be released at https://github.com/NVlabs. This makes sense to me, as most of the authors worked at Nvidia (and probably conducted their DoRA research there), as the DoRA paper implies.
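For context, the core DoRA idea from the paper can be sketched roughly as follows. This is a hypothetical illustration, not the official NVlabs or HF PEFT code: the pretrained weight is decomposed into a magnitude vector and a direction; LoRA updates the direction, while the magnitude is trained as a separate parameter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinearSketch(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear  # pretrained layer (kept frozen in practice)
        out_f, in_f = linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) / rank**0.5)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init
        self.alpha = alpha
        # Magnitude vector, initialized to the column-wise weight norm
        self.m = nn.Parameter(linear.weight.norm(p=2, dim=0, keepdim=True))

    def forward(self, x):
        v = self.linear.weight + self.alpha * (self.B @ self.A)  # new direction
        v = v / v.norm(p=2, dim=0, keepdim=True)                 # normalize columns
        w = self.m * v                                           # rescale by magnitude
        return F.linear(x, w, self.linear.bias)

pretrained = nn.Linear(6, 4)
dora = DoRALinearSketch(pretrained, rank=2, alpha=1.0)
x = torch.randn(2, 6)
# With B zero-initialized, DoRA reproduces the pretrained layer at the start:
assert torch.allclose(dora(x), pretrained(x), atol=1e-6)
```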
Officially released now at https://github.com/NVlabs/DoRA
Wohoo finally!!
Hi @rasbt , I was going through the code you provided and I am confused about a part.
In the LoRA implementation, your code looks like this:
Here, `x` is applied to the pre-trained model's layer. Now, when comparing it to the lit-gpt LoRA implementation
https://github.com/Lightning-AI/lit-gpt/blob/f241d94df59d82b2017bfdcd3800ac8779eb45f5/lit_gpt/lora.py#L173
the input is passed through a linear layer, and the LoRA output is then added to it.
I'm confused about where this extra linear layer comes from in the lit-gpt code, because in your implementation, rather than passing `x` through an extra layer, we add LoRA directly to the model's existing linear layer. Could you please clarify the difference between these two approaches and explain the purpose of the additional linear layer in the lit-gpt implementation?
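For reference, the two forward computations being compared here are mathematically equivalent. A hypothetical minimal sketch (not code from either repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# (a) add the LoRA output to the output of an explicit linear layer, vs.
# (b) merge the low-rank update into the weight matrix and do one matmul.
# Both compute the same thing:
#   x @ W.T + alpha * (x @ A @ B)  ==  x @ (W + alpha * (A @ B).T).T

in_f, out_f, rank, alpha = 8, 4, 2, 1.0
linear = nn.Linear(in_f, out_f)
A = torch.randn(in_f, rank)
B = torch.randn(rank, out_f)
x = torch.randn(3, in_f)

# (a) separate branches, summed at the output
out_a = linear(x) + alpha * (x @ A @ B)

# (b) merged weight, single linear transformation
merged_w = linear.weight + alpha * (A @ B).T
out_b = F.linear(x, merged_w, linear.bias)

assert torch.allclose(out_a, out_b, atol=1e-6)
```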