Vital1162 opened this issue 4 days ago
I think it's because in LoRA, alpha effectively scales the learning rate of the adapter, which (with rank-stabilized scaling) is given by
$$ LR_{LoRA} = \frac{\alpha}{\sqrt{r}} \times LR $$
But in finetuning, you might want to update the adapter aggressively, since your dataset is usually much smaller than the pretraining corpus.
My intuition is that as long as $\frac{\alpha}{\sqrt{r}}$ is greater than one, you're good to go.
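To make the scaling concrete, here's a minimal numpy sketch of a LoRA forward pass with the rank-stabilized $\frac{\alpha}{\sqrt{r}}$ factor applied to the low-rank update. The function and variable names (`lora_forward`, `A`, `B`) are illustrative, not from any particular library:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen base projection W plus a low-rank update scaled by alpha / sqrt(r)."""
    r = A.shape[0]                      # LoRA rank (A is r x d_in, B is d_out x r)
    scale = alpha / np.sqrt(r)          # rank-stabilized scaling
    return x @ W.T + scale * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in))         # adapter down-projection
B = np.zeros((d_out, r))               # B starts at zero, so the adapter is a no-op at init
x = rng.normal(size=(1, d_in))

# With B = 0 the adapter contributes nothing, regardless of alpha
assert np.allclose(lora_forward(x, W, A, B, alpha=32), x @ W.T)
```

Since `scale` multiplies the adapter's output, the gradients flowing into `A` and `B` are multiplied by the same factor, which is why it acts like an effective learning-rate multiplier for the adapter.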
Thank you for your response @Erland366, but does dataset size affect these parameters?
I've heard in the Discord that if you have a smaller dataset, you should use a smaller rank and alpha, but I haven't tested this much myself.
How does $\alpha$ in LoRA affect training performance? I usually see everyone set it to $2r$, but why? As for the rank, I always set it to 128-256 if the dataset quantity is good.