ml-explore / mlx-examples


details on scale parameter for LoRA #982

Closed hschaeufler closed 2 months ago

hschaeufler commented 2 months ago

In the example config lora_config.yaml a hyperparameter called scale is defined. What does this mean, and are there any descriptions of it in the technical or research literature? Which values are recommended as optimal in the research literature? Unfortunately, I couldn't really find anything about it. When searching for scale, most articles describe it as alpha divided by the rank and recommend values between 0.5 and 2. But I have seen that mlx_lm.lora defines 20 as the default value for scale, so I assume this scale hyperparameter is something else? It would be really cool if someone could provide some details on this.

Furthermore, I get an error if I omit the value. I don't know if this is intended:

  File "/Users/admin/.local/share/virtualenvs/dartgen-xL9Vbcux/lib/python3.12/site-packages/mlx_lm/tuner/utils.py", line 147, in linear_to_lora_layers
    lora_layers = [(k, to_lora(m)) for k, m in l.named_modules() if k in keys]
                       ^^^^^^^^^^
  File "/Users/admin/.local/share/virtualenvs/dartgen-xL9Vbcux/lib/python3.12/site-packages/mlx_lm/tuner/utils.py", line 84, in to_lora
    scale=config["scale"],
          ~~~~~~^^^^^^^^^
For reference, the relevant section of my config:

lora_parameters:
  keys: ["self_attn.q_proj", "self_attn.v_proj"]
  scale: 20
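As the traceback shows, mlx_lm reads config["scale"] directly, so the key is currently required. A minimal sketch of the difference, assuming one wanted the documented default of 20 as a fallback (hypothetical; not the actual mlx_lm code):

  # Hypothetical: dict.get with a default tolerates a missing key,
  # whereas config["scale"] raises a KeyError when "scale" is omitted.
  config = {"keys": ["self_attn.q_proj", "self_attn.v_proj"]}  # no "scale"

  scale = config.get("scale", 20.0)  # falls back to the default of 20
  print(scale)  # 20.0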
awni commented 2 months ago

The scale parameter is exactly the factor $\frac{\alpha}{r}$ you noticed in the original LoRA paper. You can see how it is used here. In terms of guidelines to set it:

  • I haven't noticed the fine-tuning to be too sensitive to the value within reason. It sometimes impacts the rate of learning, so it can be worth tuning for your use case.
  • It can make a difference for fusing if there is precision loss when you fuse adapters into the base model. That's one reason it is set on the high side in MLX LM.

That's just my own experience; you may have more luck searching other literature / implementations to see how they handle it.
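To make concrete where scale enters, here is a minimal sketch of a LoRA forward pass in plain mlx.core (hypothetical shapes and initialization; the actual mlx_lm LoRALinear differs in details):

  import mlx.core as mx

  d_in, d_out, rank, scale = 512, 512, 8, 20.0

  W = mx.random.normal((d_out, d_in))        # frozen base weight
  a = mx.random.normal((d_in, rank)) * 0.01  # adapter A, small random init
  b = mx.zeros((rank, d_out))                # adapter B, zero init

  x = mx.random.normal((1, d_in))

  # Forward pass: base output plus the scaled low-rank update.
  y = x @ W.T + scale * ((x @ a) @ b)

  # Fusing folds the same scaled update into the base weight,
  # which is where precision loss can matter:
  W_fused = W + scale * (a @ b).T
  y_fused = x @ W_fused.T  # equivalent to y up to float error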

hschaeufler commented 2 months ago

The scale parameter is exactly the factor α / r you noticed in the original LoRA paper. You can see how it is used here. In terms of guidelines to set it:

  • I haven't noticed the fine-tuning to be too sensitive to the value within reason. It sometimes impacts the rate of learning, so it can be worth tuning for your use case.
  • It can make a difference for fusing if there is precision loss when you fuse adapters into the base model. That's one reason it is set on the high side in MLX LM.

That's just my own experience; you may have more luck searching other literature / implementations to see how they handle it.

Thank you for your quick reply. That helps me a lot.

Does this mean that alpha is calculated at runtime on the basis of the scale and rank parameters using the formula alpha = r * scale? In your example, alpha would then be 8 * 20 = 160?

awni commented 2 months ago

Does this mean that alpha is calculated at runtime on the basis of the scale and rank parameters using the formula alpha = r * scale? In your example, alpha would then be 8 * 20 = 160?

We don't ever explicitly compute $\alpha$. The scale parameter is used directly: W + scale * a @ b.T. If you wanted to work out the implied $\alpha$ you could divide the scale by the lora rank you are using e.g. scale / rank.

hschaeufler commented 2 months ago

Does this mean that alpha is calculated at runtime on the basis of the scale and rank parameters using the formula alpha = r * scale? In your example, alpha would then be 8 * 20 = 160?

We don't ever explicitly compute α. The scale parameter is used directly: W + scale * a @ b.T. If you wanted to work out the implied α you could divide the scale by the lora rank you are using e.g. scale / rank.

Thank you for your answer, and sorry that I have to ask again: are you sure that I have to divide scale by rank? You wrote above:

The scale parameter is exactly the factor α / r you noticed in the original LoRA paper. ...

So if scale = alpha / rank, I would expect that rearranging the formula for alpha gives alpha = scale * rank? Or am I wrong?

I would like to follow the recommendations of various papers and train with alpha = 8 and rank = 16. According to scale = alpha / rank, alpha = rank * scale, so 16 * scale = 8 and scale would be 0.5.

However, if I instead calculate alpha = scale / rank, scale would have to be 128 (8 = 128 / 16)?

What would be correct here?

awni commented 2 months ago

So if scale = alpha / rank, I would expect that rearranging the formula for alpha gives alpha = scale * rank? Or am I wrong?

Sorry that was a typo! You are right.

What would be correct here?

0.5 is the right answer there.
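To tie the thread together, a small worked example (a quick sketch; the numbers are the ones discussed above):

  # scale = alpha / rank, so the implied alpha of a given scale is scale * rank.

  def scale_from(alpha: float, rank: int) -> float:
      return alpha / rank

  def implied_alpha(scale: float, rank: int) -> float:
      return scale * rank

  print(scale_from(alpha=8, rank=16))     # 0.5 -> the value to set in lora_config.yaml
  print(implied_alpha(scale=20, rank=8))  # 160 -> implied by the defaults discussed above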