In torchrun_main.py, the number of warmup steps appears to be evaluated against global steps, whereas the total number of training steps is tracked in update steps. As a result, the effective warmup length, measured in update steps, ends up being the configured warmup_steps divided by the number of gradient accumulation steps. For example, with gradient_accumulations = 32 and warmup_steps = 15,000, warmup would effectively end around update_step = 469 (setting aside the separate issue of the scheduler steps being double-counted in the current code). Is this behavior expected?
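For concreteness, here is a minimal standalone sketch of the mismatch. It uses a plain linear warmup via PyTorch's `LambdaLR` and assumed variable names, not the repo's actual scheduler or training loop, and only illustrates that stepping the scheduler once per global step ends warmup after roughly warmup_steps / gradient_accumulations optimizer updates:

```python
import torch

warmup_steps = 15_000          # as in the example above
gradient_accumulations = 32    # micro-batches per optimizer update

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
# Plain linear warmup for illustration; the scheduler in the repo may differ.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, step / warmup_steps)
)

update_step = 0
for global_step in range(1, 20_001):
    # ... forward / backward on one micro-batch would go here ...
    if global_step % gradient_accumulations == 0:
        optimizer.step()       # one optimizer update per accumulation window
        optimizer.zero_grad()
        update_step += 1
    scheduler.step()           # stepped once per global step (the pattern described above)
    if scheduler.get_last_lr()[0] >= 1.0:
        # Warmup finishes at global_step = 15,000, i.e. after only
        # warmup_steps / gradient_accumulations optimizer updates
        # rather than warmup_steps of them.
        print(f"warmup done at global_step={global_step}, update_step={update_step}")
        break
```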