jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

the problem of warmup step and num training step #62

Closed BIGKnight closed 2 months ago

BIGKnight commented 2 months ago

In torchrun_main.py, the number of warmup steps appears to be counted in global steps, whereas the number of training steps is tracked in update steps. This means that, measured in update steps, the effective warmup length is the configured warmup steps divided by the gradient accumulation steps. For example, with gradient_accumulations = 32 and warmup_steps = 15,000, the warmup would effectively end around update_step = 469 (disregarding the current issue of scheduler steps being double-counted in the code). Is this behavior expected?
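
A minimal standalone sketch of the mismatch being described (not the repo's actual torchrun_main.py code; the counter names and example values are just taken from this issue): if the scheduler warms up once per global step while training length is measured in optimizer update steps, warmup covers only warmup_steps / gradient_accumulation update steps.

```python
# Sketch only: simulate the two counters to show why a warmup measured in
# global (per-micro-batch) steps ends early when training length is
# measured in update (per-optimizer-step) steps.

gradient_accumulation = 32    # example value from this issue
warmup_steps = 15_000         # scheduler warmup, counted in global steps
num_training_steps = 150_000  # hypothetical total, counted in update steps

global_step = 0
update_step = 0

while update_step < num_training_steps:
    global_step += 1                      # one micro-batch processed
    # scheduler.step() would be called here, i.e. once per global step
    if global_step % gradient_accumulation == 0:
        update_step += 1                  # one optimizer update
    if global_step >= warmup_steps:
        break

# Warmup ends after only ~warmup_steps / gradient_accumulation update steps.
print(update_step)                            # 468
print(warmup_steps / gradient_accumulation)   # 468.75, i.e. roughly 469
```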