Hi @jiangix-paper, thanks for raising this issue.
Without a code snippet that we can use to reproduce the issue on our end, more information about the running environment, e.g. the deepspeed version and hardware (run transformers-cli env in the terminal and copy-paste the output), and more details about what's observed (specific numbers / outputs), it's not possible for us to help you.
@amyeroberts Sorry for the incomplete details. My deepspeed config file is as follows (screenshot). The deepspeed version is 0.9.0. Running "transformers-cli env", the output is as follows (screenshot):
My training arguments are as follows (screenshot):
First, I run the following code to get a DeepSpeed-saved model (screenshot). The saved model files are as follows (screenshot):
The loss values are as follows (screenshot):
But when I resume from the saved checkpoint using trainer.train(resume_from_checkpoint="xxx"), I expected the learning rate to continue from step 10 (1.4999e-05) and the loss to continue from that point (10.4141). Instead, I found the learning rate starts from scratch.
Finally, I loaded the "zero_pp_rank_0_mp_rank_00_model_states.pt" file from checkpoint 10 and found that lr_scheduler is None. Although I do not define the lr_scheduler in the deepspeed config file, I do define it in the training arguments. Why is the lr_scheduler not saved?
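For reference, this is roughly how I inspected that file (a minimal sketch; the path is illustrative and should be adjusted to your own output_dir):

import torch

# Path is illustrative -- adjust the output dir / step to your own run.
state = torch.load(
    "output_dir/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt",
    map_location="cpu",
)
print(state.keys())
print(state["lr_scheduler"])  # prints None here, even though lr_scheduler_type is set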
Thanks a lot. If any other details are missing, please let me know.
Can you help me please? Thanks a lot. @ydshieh
@jiangix-paper I am not familiar with DeepSpeed, but I can tag someone in the team.
However, please don't upload screenshots as code snippets. Use text format (with proper formatting) so we can copy-paste. Alternatively, consider using a Colab notebook.
Sorry for that. I will paste my code in text format. My deepspeed config is:
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
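For context, this JSON is handed to the HF Trainer through the --deepspeed flag in the command below; a minimal sketch of the equivalent in code (the values here are illustrative, taken from the arguments below) is:

from transformers import TrainingArguments

# Illustrative only; the real run passes the JSON file via the --deepspeed CLI flag.
training_args = TrainingArguments(
    output_dir="output_dir",
    deepspeed="ds_config_zero3.json",  # the JSON above; "auto" fields are resolved by the HF DeepSpeed integration
    lr_scheduler_type="cosine",        # scheduler is created on the HF side, not by DeepSpeed
    learning_rate=0.00015,
    warmup_ratio=0.01,
    bf16=True,
    save_strategy="steps",
    save_steps=1000,
)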
The training args are:
run_cmd="torchrun --master_addr localhost --nnodes 1 --nproc_per_node 8 --master_port 9001 \
pretrain.py \
--deepspeed ${deepspeed_config_file} \
--config_name ${llama_path} \
--tokenizer_name_or_path ${llama_path} \
--validation_split_percentage 0.000001 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--seed 2023 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--learning_rate 0.00015 \
--max_grad_norm 1.0 \
--weight_decay 0.1 \
--warmup_ratio 0.01 \
--logging_strategy steps \
--logging_steps 1 \
--save_strategy steps \
--save_total_limit 100 \
--save_steps 1000 \
--bf16 True \
--tf32 True \
--optim adamw_apex_fused \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--report_to tensorboard \
--evaluation_strategy no \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 100 \
--block_size 2048 \
--output_dir ${output_dir} \
--overwrite_output_dir \
--ddp_timeout 360000 \
--logging_first_step True \
--torch_dtype bfloat16 \
--gradient_checkpointing True \
--ddp_find_unused_parameters False"
The pretrain.py code is:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=fault_tolerance_data_collator,
    compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
    if training_args.do_eval and not is_torch_tpu_available()
    else None,
)
rank0_print('Start Training')
if training_args.do_train:
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()
    trainer.save_state()
Can you help me tag someone on your team? @ydshieh Thanks a lot.
@jiangix-paper Thank you for updating.
pretrain.py is not self-contained: please include the necessary import statements and all the variable definitions that are used.
${llama_path} is missing: please specify it.
But looking at

if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
    checkpoint = last_checkpoint
train_result = trainer.train(resume_from_checkpoint=checkpoint)

have you verified that the checkpoint passed to trainer.train has the desired value?
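For example, a quick check (just a sketch, reusing the rank0_print helper from your script) would be to log the resolved value right before the call:

if training_args.do_train:
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    # Log the resolved value so we can rule out checkpoint being None
    rank0_print(f"Resuming from checkpoint: {checkpoint}")
    train_result = trainer.train(resume_from_checkpoint=checkpoint)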
I have checked the checkpoint, and I find the lr_scheduler in the checkpoint is None. But I specified lr_scheduler_type as 'cosine' in the parameter settings. I do not know why it is not saved.
Nice! Would you like to fill in the remaining missing info so we can take a look 🙏. Perhaps this issue is not even related to DeepSpeed (?)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is a bug in Hugging Face Transformers. When using DeepSpeed, we can either have HF create the lr_scheduler by passing lr_scheduler_type to training_args, or specify scheduler in ds_config. Under the first condition, when resuming from a checkpoint, HF skips the HF lr_scheduler loading pipeline and instead calls DeepSpeed to restore the lr_scheduler. However, the lr_scheduler is never saved into the DeepSpeed checkpoint, because DeepSpeed does not know about the HF lr_scheduler.
The newest version of Transformers has now fixed it.
Check this: the scheduler is loaded from the resuming checkpoint here: https://github.com/huggingface/transformers/blob/5936c8c57ccb2bda3b3f28856a7ef992c5c9f451/src/transformers/trainer.py#L1750 and then here: https://github.com/huggingface/transformers/blob/5936c8c57ccb2bda3b3f28856a7ef992c5c9f451/src/transformers/trainer.py#L2503-L2514
In the old version (4.32.1): the loading is skipped...
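To make the two behaviours explicit, here is a simplified illustration of the resume logic described above (this is not the actual Trainer source, only a sketch of the described control flow; scheduler.pt is the file the non-DeepSpeed path uses):

import os
import torch

def restore_lr_scheduler(trainer, checkpoint_dir, is_deepspeed_run):
    # Sketch only -- not the real Trainer implementation.
    if is_deepspeed_run:
        # Old behaviour (<= 4.32.1): skip the HF-side load and assume DeepSpeed
        # restores the scheduler. But DeepSpeed never saved the HF-created
        # scheduler, so the schedule silently restarts from step 0.
        return
    # HF-side load (what the fix re-enables for HF-created schedulers):
    state = torch.load(os.path.join(checkpoint_dir, "scheduler.pt"), map_location="cpu")
    trainer.lr_scheduler.load_state_dict(state)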
Sadly, up to now, the latest version 4.33.2 breaks it again. See my issue raised here: https://github.com/huggingface/transformers/issues/26384
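A possible interim workaround, following the second option mentioned above (a sketch, not verified on this exact setup): let DeepSpeed own the scheduler by declaring it in ds_config, so it is saved and restored inside the DeepSpeed checkpoint. Note this replaces the cosine schedule with one of DeepSpeed's built-in schedulers such as WarmupDecayLR (linear decay):

from transformers import TrainingArguments

# Sketch of a DeepSpeed-owned scheduler; parameters are illustrative, not tuned.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 3},
    "scheduler": {
        "type": "WarmupDecayLR",  # note: DeepSpeed has no built-in cosine scheduler
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
        },
    },
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(
    output_dir="output_dir",
    deepspeed=ds_config,  # a dict is accepted as well as a JSON path
    learning_rate=0.00015,
    warmup_ratio=0.01,
)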
System Info
transformers 4.30.2, pytorch 2.0.1
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I use DeepSpeed stage 3 and the Hugging Face Trainer to resume from my past checkpoint (which finished running step 1000). My warmup steps are 2000. My total training epochs is 1. But when I resume from my past checkpoint, the learning rate starts from scratch. I expect it to start from the learning rate at step 1000.
Expected behavior
Thanks