Closed: youningnihaobang closed this issue 1 month ago.
Hey @youningnihaobang, thanks for the report! Can you share the DeepSpeed config and TrainingArguments config you used before and after? Also, could you share a minimal reproducer? That would help us fix the issue!
Sure. I may not be able to provide a minimal reproducer, but I will do my best to share more details, such as the configs. Before the change, the DeepSpeed config was:
# DeepSpeed JSON config:
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "autotuning": {
    "enabled": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": [0.8, 0.999],
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupCosineLR",
    "params": {
      "warmup_num_steps": "auto",
      "total_num_steps": "auto"
    }
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false
  },
  "bf16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
After the change:
# DeepSpeed JSON config:
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "autotuning": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
The TrainingArguments did not change between the two runs; I mostly used the defaults, like this:
training_args = Seq2SeqTrainingArguments(
    deepspeed=DS_CONFIG,
    model=MODEL_PATH, output_dir=OUTPUT_DIR,
    per_device_train_batch_size=15,
    ddp_backend="hccl", gradient_accumulation_steps=2,
    fp16_opt_level="O3",
    do_eval=False, do_train=True,
    do_predict=False, num_train_epochs=30,
    learning_rate=1e-3, adam_beta1=0.8,
    adam_beta2=0.999, weight_decay=3e-06,
    fp16=True, logging_steps=5,
    save_strategy="steps", save_steps=SAVE_STEPS,
    logging_dir=f"{OUTPUT_DIR}/tensorboard_log",
    resume_from_checkpoint=RESUME_MODEL_PATH,
    report_to="tensorboard",
)
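For context, this is roughly how the pieces are wired together on my side. It is only a sketch, not my exact script; MODEL_PATH, DS_CONFIG, train_dataset, and RESUME_MODEL_PATH are placeholders:

# Minimal sketch (placeholders, not the exact training script).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,           # the Seq2SeqTrainingArguments shown above
    train_dataset=train_dataset,  # placeholder dataset
    tokenizer=tokenizer,
)
trainer.train(resume_from_checkpoint=RESUME_MODEL_PATH)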
Thanks for sharing! Could you also try with this config to see if it works? cc @muellerzr
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
If the DeepSpeed config contains optimizer/scheduler/fp16 sections, the following warning is shown and the loss does not converge during training: tried to get lr value before scheduler/optimizer started stepping, returning lr=0
After I removed the optimizer/scheduler/fp16 sections from the DeepSpeed config and configured them through TrainingArguments instead, the warning no longer appears and training converges normally.
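Roughly speaking, the sections I removed from the JSON now map onto the following TrainingArguments fields. This is only an illustrative sketch (in my actual run I kept the Trainer's default scheduler settings), and OUTPUT_DIR / DS_CONFIG are placeholders:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,      # placeholder
    deepspeed=DS_CONFIG,        # the trimmed JSON (ZeRO-3 only)
    # takes over from the removed "optimizer" section
    learning_rate=1e-3,
    adam_beta1=0.8,
    adam_beta2=0.999,
    weight_decay=3e-06,
    # takes over from the removed "scheduler" section (illustrative only;
    # I actually left the scheduler at its default)
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    # takes over from the removed "fp16" section
    fp16=True,
)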
transformers version: 4.42.4
Who can help?
@muellerzr @SunMarc
Reproduction
Expected behavior
The warning should not be shown and training should converge.