Both of these options are configurable on top of the regular LoRA adapter, but they serve different purposes. They can be enabled together, and each has strengths in different circumstances.
### Rank Stabilized LoRA (RSLoRA)
When set to `True`, we use Rank-Stabilized LoRA, which sets the adapter scaling factor to `lora_alpha / sqrt(r)`, since this was shown to work better. Otherwise, the original default value of `lora_alpha / r` is used.

In equation form, default LoRA computes `W0·X + (lora_alpha / r)·(B·A·X)`, while RSLoRA computes `W0·X + (lora_alpha / sqrt(r))·(B·A·X)`, where `W0` is the base model weight matrix, `B` and `A` are the LoRA weight matrices, and `X` is the input from the embedding layer or the previous transformer layer.
In particular, this is useful when using larger ranks, since it prevents the gradient from collapsing as the rank increases. As a result, higher ranks can actually lead to better performance, which is not the case with the default scaling used today and in the original LoRA paper. Paper: https://arxiv.org/pdf/2312.03732.pdf.
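As a minimal sketch of how the two scaling factors compare in practice, assuming the Hugging Face `peft` library (where the toggle is exposed as `use_rslora` on `LoraConfig`; the flag name used by this repo's config may differ):

```python
import math

from peft import LoraConfig

r = 64
lora_alpha = 16

# Default LoRA scaling vs. rank-stabilized scaling for the same alpha:
#   lora_alpha / r       = 16 / 64 = 0.25  (shrinks quickly as r grows)
#   lora_alpha / sqrt(r) = 16 / 8  = 2.0   (decays much more slowly)
print(lora_alpha / r, lora_alpha / math.sqrt(r))

# With use_rslora=True, peft scales the B·A·X update by
# lora_alpha / sqrt(r) instead of the default lora_alpha / r.
config = LoraConfig(r=r, lora_alpha=lora_alpha, use_rslora=True)
```

Note that at `r=8` the two factors coincide less dramatically (`2.0` vs. `~5.66`), which is why the benefit shows up mainly at large ranks.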
### Weight-Decomposed Low-Rank Adaptation (DoRA)
This technique decomposes the weight updates into two parts: magnitude and direction. Direction is handled by normal LoRA, whereas the magnitude is handled by a separate learnable parameter. This can improve the performance of LoRA, especially at low ranks. Right now, DoRA only supports non-quantized linear layers, and it introduces more overhead than pure LoRA. For more information, see https://arxiv.org/abs/2402.09353.
In practice, this is what the difference looks like when a model is loaded with regular LoRA vs. DoRA. In particular, note the new `lora_magnitude_vector` learnable layer of size `r` (the rank) when DoRA is enabled; a sketch that reproduces this comparison follows the screenshots below.

*Screenshot: Tiny-Random Llama with LoRA*

*Screenshot: Tiny-Random Llama with DoRA*
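A minimal sketch that reproduces the comparison above, assuming Hugging Face `transformers` and a recent `peft` release that supports `use_dora` (the tiny-random checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative tiny checkpoint; any small Llama-style model works here.
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceM4/tiny-random-LlamaForCausalLM"
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    use_dora=True,  # decompose updates into magnitude + direction
)

model = get_peft_model(base, config)
# With use_dora=True, each adapted layer gains a `lora_magnitude_vector`
# parameter in addition to the usual lora_A / lora_B matrices.
print(model)
```

Re-running the same snippet with `use_dora=False` yields the plain-LoRA printout, which makes the extra `lora_magnitude_vector` entries easy to spot side by side.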