jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0

The first optimizer.step() execution cost extremely long time #16

Closed xikaluo closed 3 months ago

xikaluo commented 3 months ago

Hello, thank you for providing the implementation of the paper. When I run the code, I noticed that the first call to optimizer.step() takes an extremely long time. For me, when pretraining the llama_1b model on one A100 with batch_size == 1, the first optimizer.step() took about 70 seconds, but the time returned to normal (~30 ms) after the first step. Is this caused by some tensor-registration step?

jiaweizzhao commented 3 months ago

It is because of the SVD operations that compute the projection matrices at the beginning.
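To illustrate the answer above: the one-time cost comes from running a full SVD per weight matrix to build the low-rank projector, which is then reused on subsequent steps. The following is a minimal sketch of that idea in NumPy; the matrix size, rank, and variable names are illustrative assumptions, not GaLore's actual defaults or API.

```python
import numpy as np

# Hypothetical stand-in for one layer's weight gradient.
rng = np.random.default_rng(0)
grad = rng.standard_normal((1024, 1024))
rank = 128  # illustrative projection rank

# The expensive one-time step: SVD of the gradient to get the
# projection matrix. This is what dominates the first optimizer.step().
U, S, Vt = np.linalg.svd(grad, full_matrices=False)
P = U[:, :rank]  # projector, cached and reused for many steps

# Subsequent steps only pay for a cheap matrix multiply.
low_rank_grad = P.T @ grad  # shape: (rank, 1024)
print(low_rank_grad.shape)
```

Once the projector is cached, each step costs only the projection matmul, which is why timings drop from tens of seconds to milliseconds after the first step (until the projector is periodically refreshed).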