jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Confusion about the paper #14

Closed. CrazyElements closed this issue 3 months ago.

CrazyElements commented 3 months ago

Impressive and insightful work, hooray to the authors! I recently read your paper, but I'm confused about the following points.

  1. In the abstract, you discuss how memory-reduction approaches like LoRA underperform full-rank training because they constrain the search space to a low-rank subspace. I quite agree with this perspective. But in the methodology section, the gradient is computed by projection and back-projection, which seems to make $\Delta W$ lie in a low-rank subspace as well, potentially conflicting with the initial motivation. Could you please elaborate on how GaLore aligns with the overarching goals of the study? (A sketch of my current reading follows this list.)
  2. Despite the reduction in optimizer state, performing SVD on the gradients seems to introduce a considerable peak-memory demand due to the high-dimensional singular-vector matrices U and V. I'm not sure whether I have understood the method properly.
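
To make question 1 (and the memory worry in question 2) concrete, here is a minimal sketch of how I currently read the per-matrix update. Everything here (`galore_like_step`, `rank`, `update_proj_gap`, the momentum-style state) is a placeholder for my own understanding, not the repo's actual code:

```python
import torch

def galore_like_step(W, grad, state, step, rank=128, update_proj_gap=200, lr=1e-3, beta=0.9):
    """One update for a single m x n weight matrix, as I currently read the paper (sketch only)."""
    # Refresh the projector every update_proj_gap steps from an SVD of the current gradient.
    if "P" not in state or step % update_proj_gap == 0:
        # U is m x min(m, n) before slicing, so the full left singular matrix is
        # materialized here -- this is where my peak-memory worry in question 2 comes from.
        U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]  # m x r projector
        state["M"] = torch.zeros(rank, grad.shape[1], device=grad.device, dtype=grad.dtype)

    P = state["P"]
    R = P.T @ grad                 # project the gradient into the r x n subspace
    state["M"].mul_(beta).add_(R)  # optimizer state (momentum here) is kept only at r x n

    # Back-project and apply: this is the step that makes Delta W look low-rank to me,
    # since every individual update lies in the column space of P (question 1).
    W.add_(P @ state["M"], alpha=-lr)
    return W

# Toy usage with random tensors, just to show the shapes involved.
W = torch.randn(4096, 1024)
state = {}
for step in range(3):
    grad = torch.randn_like(W)  # stand-in for the real gradient of the loss w.r.t. W
    galore_like_step(W, grad, state, step)
```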

Sorry if these are naive questions. Thank you for your time and consideration.

RobertBiehl commented 3 months ago
  1. Not the author, but my understanding is that their argument is that the gradients can be shown to be low-rank (in many cases), while the weights are not.
  2. I wondered the same thing. I guess one could stagger the computation across parameter groups: e.g. with 200 parameter groups that should all be refreshed every 200 steps, each group could be refreshed on its own offset by changing the condition to something like `if self.ortho_matrix is None or iter % self.update_proj_gap == group_idx:` (see the sketch below).

     Update: staggering probably doesn't even affect the peak. I would assume the peak(s) aren't problematic, as each parameter group is handled independently, so memory can be reused. Would be great to quantify that somehow.
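
For concreteness, the staggered refresh condition I had in mind would look roughly like this (a sketch only, not the repo's actual optimizer code; `group_idx` is an illustrative name):

```python
def should_refresh_projector(step, group_idx, update_proj_gap):
    """Staggered variant of the projector refresh.

    Original condition (all groups recompute their SVD on the same step):
        step % update_proj_gap == 0
    Staggered: group g recomputes on steps congruent to g modulo the gap,
    so with e.g. 200 groups and update_proj_gap=200, at most one SVD runs per step.
    """
    return step % update_proj_gap == (group_idx % update_proj_gap)
```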
CrazyElements commented 3 months ago

Hi @RobertBiehl, I understand what you mean now. Thank you for your response!