huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

When max_steps < save_steps with deepspeed zero3 stage #31624

Open macheng6 opened 2 days ago

macheng6 commented 2 days ago

System Info

transformers 4.41.1

Who can help?

No response

Information

Tasks

Reproduction

When I set max_steps=100 and save_steps=200, I found that the trainer could not save the trained weights and blocked somewhere, so the program was unable to stop. The code is:

```python
if training_args.do_train:
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    trainer.train(args.resume_from_checkpoint)

    # For convenience, we also re-save the tokenizer to the same directory,
    # so that you can share your model easily on huggingface.co/models =)
    if trainer.is_world_process_zero():
        trainer.save_model()
        tokenizer.save_pretrained(training_args.output_dir)
```
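For context, the settings described above would correspond to something like the following (the output directory and ZeRO-3 config path are placeholders, not taken from the report):

```python
from transformers import TrainingArguments

# Training stops at step 100, but checkpoints are only scheduled every
# 200 steps, so nothing is saved during training and only the explicit
# save_model() call at the end runs.
training_args = TrainingArguments(
    output_dir="output",               # placeholder path
    max_steps=100,
    save_steps=200,
    deepspeed="ds_zero3_config.json",  # placeholder ZeRO-3 config path
)
```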

Expected behavior

I hope that training can finish and the final model weights can be saved normally, even when max_steps is smaller than save_steps.

amyeroberts commented 2 days ago

cc @muellerzr @SunMarc

macheng6 commented 2 days ago

When I removed the `trainer.is_world_process_zero()` check, the code ran normally. If it is not removed, the code blocks at `state_dict = self.accelerator.get_state_dict(self.deepspeed)` inside `trainer.save_model()`.
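A likely explanation, assuming the usual ZeRO-3 behavior: under ZeRO-3 the model weights are partitioned across ranks, and `accelerator.get_state_dict(...)` gathers them with a collective call that every rank must enter. If only the main process reaches it (because `save_model()` is wrapped in `is_world_process_zero()`), the other ranks never join the collective and rank 0 waits forever. A minimal sketch of a pattern that avoids this, assuming `trainer.save_model()` coordinates ranks internally:

```python
# Sketch of a save pattern that avoids the hang, assuming
# trainer.save_model() coordinates ranks internally: every process
# must enter it so ZeRO-3 can gather the partitioned weights, and
# only the main process actually writes files to disk.
if training_args.do_train:
    trainer.train(args.resume_from_checkpoint)

    # Called on every rank, not guarded by is_world_process_zero():
    trainer.save_model()

    # Saving the tokenizer touches no distributed state, so it is
    # safe to restrict it to the main process.
    if trainer.is_world_process_zero():
        tokenizer.save_pretrained(training_args.output_dir)
```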

SunMarc commented 2 days ago

Hi @macheng6, are you sure that this is an issue with max_steps and save_steps? If you set max_steps > save_steps, do you get the issue, and does the code below `trainer.is_world_process_zero()` run fine? If you have a minimal reproducer, that would be great!
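For reference, a hypothetical minimal reproducer sketch along these lines might help confirm it (the model name, dataset, and DeepSpeed config path are placeholders, not taken from the report); launched with e.g. `deepspeed repro.py`:

```python
# Hypothetical minimal reproducer sketch: train for fewer steps than the
# save interval under ZeRO-3, then save only from the main process -- the
# pattern reported above to hang.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sshleifer/tiny-gpt2"  # placeholder tiny model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="repro_output",          # placeholder path
    max_steps=100,                      # fewer steps than save_steps
    save_steps=200,
    per_device_train_batch_size=1,
    deepspeed="ds_zero3_config.json",   # placeholder ZeRO-3 config path
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

if trainer.is_world_process_zero():     # the guard reported to cause the hang
    trainer.save_model()
    tokenizer.save_pretrained(training_args.output_dir)
```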