jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0
1.24k stars 131 forks

Does galore save gradient memory? #53

Open jinqixiao opened 2 weeks ago

jinqixiao commented 2 weeks ago

Dear Author, I am truly grateful for your outstanding work. Please allow me to raise a small question about gradient memory: as I understand it, the LOMO method only ensures that gradients are updated layer by layer, but the gradient memory for each weight matrix is not compressed — its shape remains the same as the original weight's. I'm not sure if I'm misusing it.

jiaweizzhao commented 4 days ago

That's correct. LOMO does not directly compress gradients. GaLore should be able to compress gradients to reduce their memory (less memory is required if we disable LOMO and enable gradient accumulation). We will include this in our next version.
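To illustrate the compression being discussed, here is a minimal NumPy sketch of gradient low-rank projection, the core idea behind GaLore. This is not the repo's implementation or API — the shapes, rank, and variable names are illustrative assumptions:

```python
# Sketch of gradient low-rank projection (the idea behind GaLore).
# Illustrative only: shapes, rank, and names are assumptions, not the repo's API.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 512, 512, 32             # weight matrix shape and projection rank

G = rng.standard_normal((m, n))    # full gradient: m*n floats

# Build a rank-r projector from the top-r left singular vectors of G.
# (GaLore refreshes this projector periodically during training.)
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :r]                       # projection matrix: m*r floats

R = P.T @ G                        # compressed gradient: only r*n floats.
                                   # Optimizer states (e.g. Adam moments) can
                                   # live in this r*n space as well.

G_back = P @ R                     # project back before the weight update

print(G.size, R.size)              # 262144 131072 -> wait, see note below
```

With `m = 512` and `r = 32`, the compressed gradient `R` is `m / r = 16x` smaller than `G`, so gradient and optimizer-state memory shrink by roughly that factor; LOMO, by contrast, keeps each layer's gradient at full shape and only changes *when* it is consumed.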