jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Double approximation of second moment in Adafactor #8

Open threewayhandshake opened 3 months ago

threewayhandshake commented 3 months ago

Adafactor already performs its own approximation of the second moment. But when GaLore is enabled, that approximation is computed from the low-rank gradient produced by GaLore's projection instead of from the raw gradient. This double approximation may have a slightly negative impact.
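For concreteness, here is a minimal sketch of how the two approximations stack (hypothetical code, not taken from this repository; the SVD-based projector and the rank-1 reconstruction follow the standard GaLore and Adafactor formulations):

```python
import torch

torch.manual_seed(0)

m, n, rank = 64, 128, 8
grad = torch.randn(m, n)        # raw full-rank gradient

# Approximation 1: GaLore-style projection (minimal illustrative version).
# The top-r left singular vectors give a rank-r projector P.
U, _, _ = torch.linalg.svd(grad, full_matrices=False)
P = U[:, :rank]                 # m x r
low_rank_grad = P.T @ grad      # r x n gradient actually seen by the optimizer

# Approximation 2: Adafactor's factored second moment, here computed from
# the already-projected gradient. Following the Adafactor paper, the second
# moment is reconstructed as outer(R, C) / sum(R) from row/column statistics.
beta2, eps = 0.999, 1e-30
sq = low_rank_grad.pow(2) + eps
row = sq.mean(dim=-1)           # r row statistics
col = sq.mean(dim=-2)           # n column statistics
approx_sq = torch.outer(row, col) / row.mean()   # rank-1 reconstruction
print(approx_sq.shape)          # torch.Size([8, 128]): stats of a projected grad
```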

jiaweizzhao commented 3 months ago

Thanks for pointing this out; you are correct. Vanilla Adafactor can face this issue, but GaLore still reduces memory for the Adafactor-with-momentum variant.

threewayhandshake commented 3 months ago

Thank you for your reply. And I'm sorry that my earlier comment was unclear on the main point.

My understanding is that GaLore can reduce the memory cost of exp_avg (when beta1 is not None), but not of exp_avg_sq, because Adafactor already keeps a rank-1 (row/column factored) approximation of exp_avg_sq. (If this understanding is wrong, the rest of my comments are largely moot.)

Therefore, applying GaLore to exp_avg_sq risks increasing the approximation error without offering much benefit in memory efficiency.
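To make the trade-off concrete, here is a rough element-count comparison for a hypothetical 4096 x 4096 weight matrix with GaLore rank 128 (the shapes are illustrative, not from the repository):

```python
m, n, r = 4096, 4096, 128      # hypothetical weight shape and GaLore rank

exp_avg_full        = m * n    # first moment on the raw gradient
exp_avg_galore      = r * n    # first moment on the projected gradient
exp_avg_sq_adam     = m * n    # unfactored second moment (Adam-style)
exp_avg_sq_factored = m + n    # Adafactor's row + column factors

print(f"exp_avg:    {exp_avg_full:>10,} -> {exp_avg_galore:>10,}  (GaLore helps a lot)")
print(f"exp_avg_sq: {exp_avg_sq_adam:>10,} -> {exp_avg_sq_factored:>10,}  (already factored)")
# Applying GaLore on top only shrinks the factors from m + n to r + n,
# a marginal saving compared to the exp_avg reduction above.
```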

So I thought it might be better to have a bypass mechanism that computes the approximation of exp_avg_sq directly from the full gradient instead of from the GaLore-projected gradient, along the lines of the sketch below.
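As a rough illustration, here is a hypothetical sketch of such a bypass (the function name, state layout, and fixed projector are all made up for this example; a real optimizer would also need bias correction, clipping, and projector updates):

```python
import torch

def adafactor_step_with_bypass(p, grad, state, P, beta1=0.9, beta2=0.999,
                               lr=1e-3, eps=1e-30):
    """Hypothetical single step sketching the proposed bypass: exp_avg lives
    in GaLore's rank-r space, while the factored exp_avg_sq statistics are
    tracked on the raw full gradient."""
    # First moment: projected (this is where GaLore saves memory).
    low_rank_grad = P.T @ grad                              # r x n
    state["exp_avg"].mul_(beta1).add_(low_rank_grad, alpha=1 - beta1)
    update = P @ state["exp_avg"]                           # back to m x n

    # Second moment: bypass -- factored statistics of the raw m x n gradient,
    # not of the projected one.
    sq = grad.pow(2) + eps
    state["row"].mul_(beta2).add_(sq.mean(dim=-1), alpha=1 - beta2)  # m
    state["col"].mul_(beta2).add_(sq.mean(dim=-2), alpha=1 - beta2)  # n
    denom = torch.outer(state["row"], state["col"]).div_(state["row"].mean()).sqrt_()

    p.add_(update / denom, alpha=-lr)

# Usage with made-up shapes: a fixed orthonormal rank-r projector.
m, n, r = 64, 128, 8
p = torch.randn(m, n)
P = torch.linalg.qr(torch.randn(m, r)).Q
state = {"exp_avg": torch.zeros(r, n), "row": torch.zeros(m), "col": torch.zeros(n)}
adafactor_step_with_bypass(p, torch.randn(m, n), state, P)
```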

But honestly, I have no idea whether the double approximation really has a negative effect in practice, and if so, whether it is non-negligible.

Thus I think there are three possible actions for this issue:

  1. add a bypass mechanism, because there is a clear need to deal with this
  2. hold off on a decision, because the concern is valid but the need is unclear
  3. close this, because there is no need right now (or my understanding is partially wrong)

Any of these would be acceptable to me.