An update: I tried with --max-train-steps 1000; it seems that the model stops updating somewhere between step 100 and 200. Checkpoints saved at step 200 and later are identical when checked with the cmp command.
An update: It seems to have something to do with the scheduler in accelerate_configs/zero3_offload.json. I set --max-train-steps 100 and printed the learning rate value at each iteration. I got the following result:
rank7_input.log
As you can see, the learning rate decays to 0 from step 13 onward, so nearly 90% of the training process is wasted.
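(For reproducibility, a self-contained toy version of this check looks roughly like the following; LambdaLR here is only a stand-in for the schedule DeepSpeed builds from the json config, not the code I actually ran.)

```python
import torch

# Toy reproduction of the lr-printing check; LambdaLR stands in for the
# WarmupDecayLR schedule that DeepSpeed builds from zero3_offload.json.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / 100)
)

for step in range(20):
    optimizer.step()   # one (dummy) training step
    scheduler.step()
    print(f"step {step}: lr = {optimizer.param_groups[0]['lr']:.2e}")
```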
I suggest using a constant schedule, following arXiv:2402.10171. That is, set accelerate_configs/zero3_offload.json as follows:
```json
{
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 2e-5,
      "warmup_max_lr": 2e-5,
      "warmup_num_steps": 0,
      "warmup_type": "linear"
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  }
}
```
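(With warmup_min_lr equal to warmup_max_lr and warmup_num_steps set to 0, WarmupLR has nothing to ramp, so this effectively pins the learning rate at a constant 2e-5 for the entire run.)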
Thanks for spotting this!! This is really weird...
I can verify on my end that the lr indeed becomes zero after several steps.
I mainly followed YaRN's training and its deepspeed config and did not expect this would happen.
I thought arXiv:2402.10171 also used WarmupDecayLR? https://github.com/FranxYao/Long-Context-Data-Engineering/issues/8#issuecomment-2003917697
@Bostoncake I've updated the config based on your suggestions. Do you have any idea why this could happen?
I think it might have something to do with "total_num_steps": "auto". Maybe you need to explicitly specify the total number of steps when initializing the scheduler. I haven't tried this, though.
@jzhang38 @Bostoncake It is because scheduler.step() is called num_gpu times for each step. So with --max-train-steps 1000, the lr decays to 0 at step 125 with 8 GPUs.
For now the best solution IMHO is indeed to specify min_lr in zero3_offload.json instead of using auto.
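To make the arithmetic concrete, here is a minimal sketch (not DeepSpeed's actual implementation) that assumes WarmupDecayLR decays the learning rate roughly linearly to zero over total_num_steps scheduler ticks; since scheduler.step() fires once per GPU per training step, the schedule runs out after max_train_steps / num_gpus training steps:

```python
# Minimal sketch (not DeepSpeed's actual code) assuming WarmupDecayLR decays the lr
# roughly linearly to 0 over total_num_steps scheduler ticks.
def decayed_lr(scheduler_ticks, max_lr=2e-5, total_num_steps=1000):
    return max_lr * max(0.0, 1.0 - scheduler_ticks / total_num_steps)

num_gpus = 8
for train_step in (50, 100, 125, 200):
    ticks = train_step * num_gpus  # scheduler.step() fires once per GPU per training step
    print(train_step, decayed_lr(ticks))  # hits 0.0 at train step 125 = 1000 / 8
```

Under that reading, explicitly setting total_num_steps (rather than auto) so that it covers all the extra scheduler ticks might be another workaround, though I have not verified this.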
+1. When I use 2×A100, the model parameters are no longer updated starting from max-train-steps/2.
I am training with the following script:
The difference between the above command and train_scripts/EasyContext-1M-Llama-2-7B.sh is that I changed --max-train-steps and --rope-theta. Additionally, I modified the if block in lines 161-165 of train.py to enable model saving every 100 steps (I set --save_interval=100; sketched below).
All saved models are evaluated on the latest version of lm_eval. I found that all models saved at step 400 and later (step 400 included) are identical. That is, when checked with cmp <model_at_step_400-0000X-of-0000X.safetensors> <model_at_step_T(T>400)-0000X-of-0000X.safetensors>, no differences are reported. Besides, when evaluated with lm_eval, these models give identical results on all datasets tested (including MMLU, TQA, Hellaswag, Winogrande, etc.).
The models are all trained on 8 A800 GPUs (80G), and this issue can be reproduced with different model structures (YaRN, which is LLaMA-2 with a different positional embedding). I wonder if you have any insights into this issue. Thanks!
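For completeness, here is a self-contained sketch of the kind of change I mean for the saving logic; the names (save_interval, max_train_steps, save_checkpoint) are illustrative stand-ins, not the actual identifiers in train.py:

```python
# Hypothetical sketch of the periodic-saving change; save_checkpoint is a stand-in
# for however train.py actually writes the safetensors shards.
def save_checkpoint(step: int) -> None:
    print(f"would save checkpoint at step {step}")

save_interval = 100      # corresponds to --save_interval=100
max_train_steps = 1000   # corresponds to --max-train-steps 1000
for completed_steps in range(1, max_train_steps + 1):
    # ... one optimization step happens here ...
    if completed_steps % save_interval == 0 or completed_steps == max_train_steps:
        save_checkpoint(completed_steps)
```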