Both of these options are configurable on top of the regular LoRA adapter, but they serve different purposes. They can be enabled together, and each has strengths in different circumstances.
### Rank Stabilized LoRA (RSLoRA)
When set to `True`, we use Rank-Stabilized LoRA, which sets the adapter scaling factor to `lora_alpha / sqrt(r)`, since this was shown to work better. Otherwise, the original default value of `lora_alpha / r` is used.

In equation form, default LoRA computes `W0·X + (lora_alpha / r)·(B·A·X)`, while RSLoRA computes `W0·X + (lora_alpha / sqrt(r))·(B·A·X)`, where `W0` is the base model weight matrix, `B` and `A` are the LoRA weight matrices, and `X` is the input from the embedding layer or the previous transformer layer.
In particular, this is useful when using larger ranks, since it prevents the gradient from collapsing as the rank increases. As a result, higher ranks can actually lead to better performance, which is not the case with the default scaling used today and in the original LoRA paper. Paper: https://arxiv.org/pdf/2312.03732.pdf.
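As a minimal sketch of how the two scaling factors compare in practice, assuming the Hugging Face `peft` library (where the toggle is exposed as `use_rslora` on `LoraConfig`; the flag name used by this repo's config may differ):

```python
import math

from peft import LoraConfig

r = 64
lora_alpha = 16

# Default LoRA scaling vs. rank-stabilized scaling for the same alpha:
#   lora_alpha / r       = 16 / 64 = 0.25  (shrinks quickly as r grows)
#   lora_alpha / sqrt(r) = 16 / 8  = 2.0   (decays much more slowly)
print(lora_alpha / r, lora_alpha / math.sqrt(r))

# With use_rslora=True, peft scales the B·A·X update by
# lora_alpha / sqrt(r) instead of the default lora_alpha / r.
config = LoraConfig(r=r, lora_alpha=lora_alpha, use_rslora=True)
```

Note that at `r=8` the two factors coincide less dramatically (`2.0` vs. `~5.66`), which is why the benefit shows up mainly at large ranks.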
### Weight-Decomposed Low-Rank Adaptation (DoRA)
This technique decomposes the weight updates into two parts: magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers, and it introduces more overhead than pure LoRA. For more information, see https://arxiv.org/abs/2402.09353.
In practice, this is what the difference looks like when a model is loaded with regular LoRA vs. DoRA. In particular, note the new `lora_magnitude_vector` learnable layer of size `r` (the rank) when DoRA is enabled; a sketch that reproduces this comparison follows the screenshots below.

*Screenshot: Tiny-Random Llama with LoRA*

*Screenshot: Tiny-Random Llama with DoRA*
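A minimal sketch that reproduces the comparison above, assuming Hugging Face `transformers` and a recent `peft` release that supports `use_dora` (the tiny-random checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative tiny checkpoint; any small Llama-style model works here.
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceM4/tiny-random-LlamaForCausalLM"
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # decompose updates into magnitude + direction
)

model = get_peft_model(base, config)
# With use_dora=True, each adapted layer gains a `lora_magnitude_vector`
# parameter in addition to the usual lora_A / lora_B matrices.
print(model)
```

Re-running the same snippet with `use_dora=False` yields the plain-LoRA printout, which makes the extra `lora_magnitude_vector` entries easy to spot side by side.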