huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Learning rate discontinuity when resuming from checkpoint #24656

Closed jiangix-paper closed 1 year ago

jiangix-paper commented 1 year ago

System Info

transformers 4.30.2 pytorch 2.0.1

Who can help?

No response

Information

Tasks

Reproduction

I use DeepSpeed stage 3 with the Hugging Face Trainer to resume from a past checkpoint (training had finished step 1000). My warmup is 2000 steps and I train for 1 epoch in total. But when I resume from that checkpoint, the learning rate starts from scratch. I expect it to continue from the learning rate at step 1000.
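
For reference, a minimal sketch (not my training script) of the learning rate value I expect the resumed run to continue from at step 1000 under a 2000-step warmup. The total step count is a hypothetical placeholder and the 0.00015 base learning rate is taken from the training arguments posted later in this thread:

import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 100_000  # hypothetical; the real value depends on the dataset size
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=0.00015)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=total_steps
)
for _ in range(1000):   # advance the schedule to step 1000
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())  # ~7.5e-05; a resumed run should pick up from here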

Expected behavior

Thanks

amyeroberts commented 1 year ago

Hi @jiangix-paper, thanks for raising this issue.

Without a code snippet that we can use to reproduce the issue on our end, more information about the running environment, e.g. DeepSpeed version and hardware (run transformers-cli env in the terminal and copy-paste the output), and more details about what's observed (specific numbers / outputs), it's not possible for us to help you.

jiangix-paper commented 1 year ago

@amyeroberts Sorry for the incomplete details. My DeepSpeed config file is as follows: (screenshot). The DeepSpeed version is 0.9.0. Running "transformers-cli env" gives the following output: (screenshot)

My training arguments are as follows: (screenshot)

First, I run the following code to get a DeepSpeed-saved model: (screenshot) The saved model files are as follows: (screenshots)

The loss values are as follows: (screenshot)

But when I resume from the saved checkpoint using trainer.train(resume_from_checkpoint="xxx"), I expected the learning rate to continue from step 10 (1.4999e-05) and the loss to continue from that point (10.4141). Instead, I found that the learning rate starts from scratch. (screenshot)

Finally, I loaded "zero_pp_rank_0_mp_rank_00_model_states.pt" from checkpoint 10 and found that the lr_scheduler entry is None. Although I do not define the lr_scheduler in the DeepSpeed config file, I do define it in the training arguments. Why is the lr_scheduler not saved?
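
For completeness, this is roughly how I inspected it (a quick sketch; the exact checkpoint path is from my run and may differ):

import torch

# Load the per-rank model states file saved by DeepSpeed inside the Trainer checkpoint folder.
state = torch.load(
    "checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt",
    map_location="cpu",
)
print(sorted(state.keys()))   # top-level entries saved by DeepSpeed
print(state["lr_scheduler"])  # prints None in my run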

Thanks a lot. If any other details are missing, please let me know.

jiangix-paper commented 1 year ago

Can you help me please. Thanks a lot. @ydshieh

ydshieh commented 1 year ago

@jiangix-paper I am not familiar with deepspeed. But I can tag someone in the team.

However, please don't upload screenshots as code snippets. Use text format (with proper formatting) so we can copy-paste it. Alternatively, consider sharing a Colab notebook.

jiangix-paper commented 1 year ago

Sorry for that. I will paste my code in text format. My DeepSpeed config is:

{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 1,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

The training args are:

run_cmd="torchrun --master_addr localhost --nnodes 1 --nproc_per_node 8 --master_port 9001 \
    pretrain.py \
    --deepspeed ${deepspeed_config_file} \
    --config_name ${llama_path} \
    --tokenizer_name_or_path ${llama_path} \
    --validation_split_percentage 0.000001 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --seed 2023 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate 0.00015 \
    --max_grad_norm 1.0 \
    --weight_decay 0.1 \
    --warmup_ratio 0.01 \
    --logging_strategy steps \
    --logging_steps 1 \
    --save_strategy steps \
    --save_total_limit 100 \
    --save_steps 1000 \
    --bf16 True \
    --tf32 True \
    --optim adamw_apex_fused \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --report_to tensorboard \
    --evaluation_strategy no \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 100 \
    --block_size 2048 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 360000 \
    --logging_first_step True \
    --torch_dtype bfloat16 \
    --gradient_checkpointing True \
    --ddp_find_unused_parameters False"

The relevant part of pretrain.py is:

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=fault_tolerance_data_collator,
        compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
        preprocess_logits_for_metrics=preprocess_logits_for_metrics
        if training_args.do_eval and not is_torch_tpu_available()
        else None,
    )
    rank0_print('Start Training')
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model()
        trainer.save_state()

Can you help me to tag someone in your team? @ydshieh Thanks a lot

ydshieh commented 1 year ago

@jiangix-paper Thank you for updating.

ydshieh commented 1 year ago

But looking at

        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)

Have you verified that the checkpoint passed to trainer.train has the desired value?

jiangix-paper commented 1 year ago

But looking at

        if training_args.resume_from_checkpoint is not None:
            checkpoint = training_args.resume_from_checkpoint
        elif last_checkpoint is not None:
            checkpoint = last_checkpoint
        train_result = trainer.train(resume_from_checkpoint=checkpoint)

Have you verified that the checkpoint passed to trainer.train has the desired value?

I have checked the checkpoint, and I found that the lr_scheduler in the checkpoint is None. But I specified lr_scheduler_type as 'cosine' in the parameter settings. I do not know why it is not saved.

ydshieh commented 1 year ago

Nice! Would you like to fill in the rest of the missing info so we can take a look 🙏. Probably this issue is not even related to DeepSpeed(?)

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

wizyoung commented 1 year ago

This is a bug in Hugging Face Transformers. When using DeepSpeed, we can either let HF create the lr_scheduler by passing lr_scheduler_type to training_args, or specify the scheduler in ds_config. In the first case, when resuming from a checkpoint, HF skips its own lr_scheduler loading path and relies on DeepSpeed to restore the lr_scheduler. However, the lr_scheduler is never saved into the checkpoint, because DeepSpeed does not know about the HF lr_scheduler. The newest version of Transformers has fixed this.

Check this: the scheduler is loaded from the resuming checkpoint at https://github.com/huggingface/transformers/blob/5936c8c57ccb2bda3b3f28856a7ef992c5c9f451/src/transformers/trainer.py#L1750 and then at https://github.com/huggingface/transformers/blob/5936c8c57ccb2bda3b3f28856a7ef992c5c9f451/src/transformers/trainer.py#L2503-L2514
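
To illustrate the pattern in plain code (this is a sketch, not the actual trainer.py logic linked above): when the scheduler is created on the HF side rather than by DeepSpeed, its state has to be saved and restored explicitly, because the DeepSpeed checkpoint does not contain it. The file name and helper functions below are just for illustration:

import os
import torch

SCHEDULER_NAME = "scheduler.pt"  # illustrative file name

def save_hf_scheduler(lr_scheduler, output_dir):
    # Persist the HF-created scheduler next to the DeepSpeed checkpoint files.
    torch.save(lr_scheduler.state_dict(), os.path.join(output_dir, SCHEDULER_NAME))

def load_hf_scheduler(lr_scheduler, checkpoint_dir):
    # On resume, restore the scheduler only if a saved state exists.
    # If this step is skipped (the bug described above), the schedule restarts from step 0.
    path = os.path.join(checkpoint_dir, SCHEDULER_NAME)
    if os.path.isfile(path):
        lr_scheduler.load_state_dict(torch.load(path, map_location="cpu"))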

In the old version (4.32.1), the loading is skipped (screenshot)...

wizyoung commented 1 year ago

Sadly, as of now, the latest version 4.33.2 breaks it again. See the issue I raised here: https://github.com/huggingface/transformers/issues/26384