peterjc123 opened this issue 3 months ago
We also observed the same phenomenon; maybe we should increase galore_scale for faster convergence in SFT.
@hiyouga Yeah, I had to set the scale to 4.0 to get a loss curve similar to that of LoRA/DoRA with lora_rank=64 and lora_alpha=16 (for the first 100 iters).
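For reference, the LoRA/DoRA baseline being compared against would correspond to something like the following PEFT config. This is a minimal sketch only: `target_modules` and `task_type` are assumptions and depend on the actual model, and `use_dora` toggles between DoRA and plain LoRA.

```python
# Minimal sketch of the LoRA/DoRA baseline (assumes the `peft` package is installed).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                                  # lora_rank
    lora_alpha=16,                         # lora_alpha
    target_modules=["q_proj", "v_proj"],   # assumed; adjust for the actual architecture
    use_dora=True,                         # set False for plain LoRA
    task_type="CAUSAL_LM",
)
```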
After some trials, I was able to find a set of hyper-parameters that performs on par with LoRA/DoRA in terms of training loss (with lora_rank=64, lora_alpha=16, lora_dropout=0.5):
lr: 3e-5
galore_rank: 256
galore_update_proj_gap: 200
galore_scale: 4.0
galore_proj_type: std
weight_decay: 1e-4
lr_schedule: cosine
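Assuming these names map onto the param-group keys used by the `galore_torch` optimizer (`rank`, `update_proj_gap`, `scale`, `proj_type`), the setup would look roughly like the sketch below. The stand-in model, the rule for selecting which parameters go through GaLore (here: all 2-D weights), and the `T_max` step count are assumptions for illustration, not a prescription.

```python
# Minimal sketch of a GaLore optimizer setup with the hyper-parameters listed above
# (assumes the `galore_torch` package is installed).
import torch
from torch import nn
from galore_torch import GaLoreAdamW

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))  # stand-in for the LLM

# Route only the 2-D weight matrices through GaLore; everything else stays dense.
galore_params = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

param_groups = [
    {"params": regular_params},
    {"params": galore_params,
     "rank": 256,               # galore_rank
     "update_proj_gap": 200,    # galore_update_proj_gap
     "scale": 4.0,              # galore_scale
     "proj_type": "std"},       # galore_proj_type
]

optimizer = GaLoreAdamW(param_groups, lr=3e-5, weight_decay=1e-4)
# lr_schedule: cosine — T_max is a placeholder for the total number of training steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
```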
Thanks for the great work. One thing I'm curious about is whether it actually works well for SFT of LLMs; this is not covered in the paper either. I tried the following parameters on a 2B-sized model, but they lead to very slow convergence. Could you please give me some advice?