catid opened this issue 5 months ago
Thanks for sharing the results and advice.
I have tested adding the decompressed MoRA back into the weight before, but it can be slow for large language models because it needs to copy the entire weight during the forward pass (maybe this can be further optimized, since MoRA can copy its weights directly into the original linear layer instead of multiplying two matrices to merge back, as LoRA does).
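Roughly, the contrast is the following (a sketch only; the shapes and tensors are illustrative and not taken from the MoRA codebase):

```python
import torch

d_out, d_in, r = 1024, 1024, 128

W = torch.zeros(d_out, d_in)   # base linear weight to merge into
A = torch.randn(r, d_in)       # LoRA down-projection
B = torch.zeros(d_out, r)      # LoRA up-projection
M = torch.zeros(r, r)          # MoRA square block

# LoRA merge-back: an unavoidable (d_out x r) @ (r x d_in) matmul.
W += B @ A

# MoRA merge-back: the update is just M's entries laid out over W
# (here tiled block-wise), so it can be done with copies alone.
W += M.repeat(d_out // r, d_in // r)
```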
For ReMoRA + DoRA, are you adding both DoRA and MoRA to a linear layer? That seems to use more trainable parameters than ReMoRA alone. Still, the idea of using both MoRA and LoRA in one linear layer is interesting and might take advantage of both.
Thank you for sharing your results. In return I will share my own:
If you reformulate the code so that, during the forward pass, it adds the decompressed MoRA weights into the nn.Linear weights, you reduce the number of multiplies back to the normal count. Furthermore, it becomes compatible with DoRA. In my testing, alternating between repeat and repeat_interleave (ReMoRA) improves on MoRA for continued training, and ReMoRA + DoRA improves on ReMoRA.
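A minimal sketch of this reformulation (the module and helper names are illustrative, the dimensions are assumed to divide evenly by r, and this is not the reference MoRA implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReMoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable r x r MoRA-style block.

    The block is decompressed to the full weight shape and added to the
    base weight inside forward(), so the layer still performs a single
    matmul, and a DoRA-style magnitude/norm step could be applied to the
    merged weight.
    """

    def __init__(self, base: nn.Linear, r: int, mode: str = "repeat"):
        super().__init__()
        assert base.out_features % r == 0 and base.in_features % r == 0
        self.base = base
        self.base.weight.requires_grad_(False)
        self.mora = nn.Parameter(torch.zeros(r, r))  # trainable square block
        self.mode = mode  # "repeat" or "repeat_interleave"; alternate for ReMoRA

    def decompress(self) -> torch.Tensor:
        out_rep = self.base.out_features // self.mora.shape[0]
        in_rep = self.base.in_features // self.mora.shape[1]
        if self.mode == "repeat":
            # Tile M block-wise: [[M, M, ...], [M, M, ...], ...]
            return self.mora.repeat(out_rep, in_rep)
        # Interleave rows/columns instead of tiling whole blocks.
        return (self.mora.repeat_interleave(out_rep, dim=0)
                         .repeat_interleave(in_rep, dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One matmul against the merged weight instead of extra multiplies.
        w = self.base.weight + self.decompress()
        return F.linear(x, w, self.base.bias)


# Usage: flip the decompression mode between training segments.
layer = ReMoRALinear(nn.Linear(1024, 1024, bias=False), r=128)
y = layer(torch.randn(2, 1024))
layer.mode = "repeat_interleave"
```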