jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0

Seems not compatible with DeepSpeed (perhaps also FSDP) #2

Open SparkJiao opened 4 months ago

SparkJiao commented 4 months ago

Hi, I appreciate your awesome work!

When I try to use the GaLore AdamW optimizer for Gemma training, it seems it is not compatible with DeepSpeed at ZeRO stage 0 or 1 (error screenshot attached).

I guess this is because DeepSpeed's BF16_Optimizer flattens the parameters for memory efficiency. This will probably also affect FSDP.
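For reference, this is roughly how I build the optimizer, following the param-group style from the GaLore README (the toy model and hyperparameters below are just illustrative). Since GaLore projects the gradient of each 2D weight matrix separately, a flattened parameter buffer hides the matrix shapes it needs:

```python
import torch.nn as nn
from galore_torch import GaLoreAdamW  # pip install galore-torch

# Toy stand-in for the actual Gemma model; only the grouping logic matters here.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# GaLore projects the gradients of 2D weight matrices into a low-rank subspace,
# so the optimizer must see each matrix as its own (unflattened) parameter tensor.
galore_params = [p for p in model.parameters() if p.requires_grad and p.dim() == 2]
regular_params = [p for p in model.parameters() if p.requires_grad and p.dim() != 2]

param_groups = [
    {"params": regular_params},
    {"params": galore_params,
     "rank": 128,             # low-rank projection rank
     "update_proj_gap": 200,  # steps between re-computing the projector
     "scale": 0.25,
     "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=1e-5)
```

When DeepSpeed's BF16_Optimizer (or FSDP's flat-parameter wrapper) takes over, the optimizer no longer receives these per-matrix tensors, which is where the failure shows up.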

pengzhangzhi commented 4 months ago

Had the same problem... Any solutions? @jiaweizzhao

jiaweizzhao commented 3 months ago

Thanks for your interest. We are getting in touch with the FSDP team and will post an update soon.

thepowerfuldeez commented 3 months ago

Hi! Following up on this: is this feature currently in development?

jiaweizzhao commented 3 months ago

Yes, this feature is still in development. Please stay tuned!