jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0

Hyperparameters for SFT? #15

Open peterjc123 opened 3 months ago

peterjc123 commented 3 months ago

Thanks for the great work. One thing I'm curious about is whether it actually works well for SFT on LLMs; this isn't covered in the paper either. I tried the following hyperparameters on a 2B model, but they lead to very slow convergence. Could you give me some advice?

lr: 5e-5
galore_rank: 64
galore_update_proj_gap: 200
scale: 0.25
proj_type: std
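
For context, a minimal sketch of how hyperparameters like these plug into the GaLore optimizer from `galore_torch` (following the param-group pattern in the GaLore README). The model name `"gpt2"` and the module-name filter are only illustrative stand-ins for the 2B model discussed here.

```python
from transformers import AutoModelForCausalLM
from galore_torch import GaLoreAdamW

# Stand-in model; replace with the actual 2B causal LM being fine-tuned.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Apply GaLore to the 2D weight matrices of the attention/MLP linear layers;
# all other parameters fall back to regular AdamW updates.
galore_params = [p for n, p in model.named_parameters()
                 if p.dim() == 2 and ("attn" in n or "mlp" in n)]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    {"params": galore_params,
     "rank": 64, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=5e-5)
```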
hiyouga commented 3 months ago

We also observed the same phenomenon; perhaps galore_scale should be increased for faster convergence in SFT.

peterjc123 commented 3 months ago

@hiyouga Yeah, I had to set scale to 4.0 to get a loss curve similar to LoRA/DoRA with lora_rank=64, lora_alpha=16 (over the first 100 iterations).
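
For reference, a sketch of a PEFT LoRA baseline roughly matching the settings mentioned above (lora_rank=64, lora_alpha=16); `target_modules` are an assumption and depend on the architecture of the model being fine-tuned.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["c_attn", "c_proj"],  # GPT-2-style names; adjust per model
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)  # `model` from the sketch above
```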

peterjc123 commented 3 months ago

After some trials, I was able to find a set of hyperparameters that performs on par with LoRA/DoRA in terms of training loss (with lora_rank=64, lora_alpha=16, lora_dropout=0.5).

lr: 3e-5
galore_rank: 256 
galore_update_proj_gap: 200
galore_scale: 4.0
galore_proj_type: std
weight_decay: 1e-4
lr_schedule: cosine
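
A sketch (assumed wiring, not the exact training script) of how these final settings could be passed to `GaLoreAdamW` together with a cosine schedule from `transformers`; `galore_params` / `regular_params` are split as in the earlier sketch, and the warmup/total step counts are illustrative placeholders.

```python
from galore_torch import GaLoreAdamW
from transformers import get_cosine_schedule_with_warmup

param_groups = [
    {"params": regular_params},
    {"params": galore_params,
     "rank": 256, "update_proj_gap": 200, "scale": 4.0, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=3e-5, weight_decay=1e-4)

# Illustrative step counts; set them from the actual dataset/epoch budget.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000)
```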
pixeli99 commented 1 month ago

mark