In the example config `lora_config.yaml`, a hyperparameter called scale is defined. What does this mean, and are there any descriptions of it in the technical or research literature? Which values does the research literature recommend as optimal? Unfortunately, I couldn't really find anything about it. Most articles I found when searching for scale describe it as the rank divided by the alpha factor and recommend values between 0.5 and 2. But I have seen that `mlx_lm.lora` defines 20 as the default value for scale, so I assume the scale hyperparameter is something else? It would be really cool if someone could provide some details on this.

Furthermore, I get an error if I omit the value. I don't know if this is intended.
The scale parameter is exactly the factor $\frac{\alpha} {r}$ you noticed in the original LoRA paper. You can see how it is used here. In terms of guidelines to set it:
- I haven't noticed the fine-tuning to be too sensitive to the value, within reason. It sometimes impacts the rate of learning, so it can be worth tuning for your use case.
- It can make a difference for fusing if there is precision loss when you fuse adapters into the base model. That's one reason it is set on the high side in MLX LM.
That's just my own experience; you may have more luck searching other literature/implementations to see how they handle it.
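To make that concrete, here is a minimal sketch of where such a scale factor enters a LoRA forward pass. This is plain NumPy for illustration, not MLX LM's actual implementation, and the names `lora_a`, `lora_b`, and `scale` are assumptions:

```python
import numpy as np

def lora_forward(x, W, lora_a, lora_b, scale):
    """Hypothetical LoRA forward pass.

    The base projection uses the frozen pretrained weights W; the
    low-rank update is multiplied by `scale`, which plays the role
    of alpha / r from the LoRA paper.
    """
    base = x @ W.T                      # frozen path: (batch, out)
    update = (x @ lora_a.T) @ lora_b.T  # low-rank path: in -> r -> out
    return base + scale * update

# Toy dimensions: in_dim = 4, out_dim = 3, rank r = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
W = rng.normal(size=(3, 4))
lora_a = rng.normal(size=(2, 4))  # (r, in): maps input into rank space
lora_b = np.zeros((3, 2))         # (out, r): zero-initialized, as in LoRA
print(lora_forward(x, W, lora_a, lora_b, scale=20.0))
```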
Thank you for your quick reply. That helps me a lot.
Does this mean that alpha is calculated at runtime on the basis of the scale and rank parameters, using the formula alpha = r * scale? In your example, alpha would then be 8 * 20 = 160?
We don't ever explicitly compute $\alpha$. The scale parameter is used directly: `W + scale * a @ b.T`. If you wanted to work out the implied $\alpha$ you could divide the scale by the LoRA rank you are using, e.g. `scale / rank`.
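For what it's worth, a small sketch of how that update can be fused into the base weights, which is the fusing scenario mentioned above. Again NumPy for illustration, with hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, r = 3, 4, 2
scale = 20.0

W = rng.normal(size=(out_dim, in_dim))    # frozen base weights
a = rng.normal(size=(out_dim, r)) * 0.01  # low-rank factor, (out, r)
b = rng.normal(size=(in_dim, r)) * 0.01   # low-rank factor, (in, r)

# Fusing folds the scaled adapter into the base matrix once, so
# inference no longer needs the separate low-rank path.
W_fused = W + scale * a @ b.T

# The fused weights match base + scaled adapter on any input.
x = rng.normal(size=(1, in_dim))
unfused = x @ W.T + scale * (x @ b) @ a.T
assert np.allclose(x @ W_fused.T, unfused)
```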
Thank you for your answer. Sorry that I have to ask again, but are you sure that I have to divide scale by rank? You wrote above:
> The scale parameter is exactly the factor $\alpha / r$ you noticed in the original LoRA paper. ...
So if `scale = alpha / rank`, I would expect that transforming the formula to solve for alpha would give `alpha = scale * rank`? Or am I wrong?
I would like to orient myself by the recommendations of various papers and train with alpha = 8 and rank = 16. According to `scale = alpha / rank`, that gives `alpha = rank * scale`, i.e. 8 = 16 * scale, so scale would be 0.5. However, if instead `alpha = scale / rank`, scale would be 128 (8 = 128 / 16)?
What would be correct here?
> So if `scale = alpha / rank`, I would expect that transforming the formula to solve for alpha would give `alpha = scale * rank`? Or am I wrong?
Sorry that was a typo! You are right.
> What would be correct here?
0.5 is the right answer there.
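To put the corrected relationship and the numbers from this thread in one place, a quick sanity check in plain Python:

```python
# scale = alpha / rank, so alpha = scale * rank.

# Target from the literature: alpha = 8, rank = 16.
alpha, rank = 8, 16
scale = alpha / rank
assert scale == 0.5            # the answer confirmed above

# MLX LM's default scale of 20 with rank 8 implies:
implied_alpha = 20 * 8
assert implied_alpha == 160    # the 8 * 20 = 160 from earlier
```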