huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

When max_steps < save_steps with deepspeed zero3 stage #31624

Open macheng6 opened 2 days ago

macheng6 commented 2 days ago

System Info

transformers 4.41.1

Who can help?

No response

Information

Tasks

Reproduction

When I set max_steps=100 and save_steps=200, I found that the trainer could not save the trained weights and blocked somewhere, so the program was unable to stop. The code is:

```python
if training_args.do_train:
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    trainer.train(args.resume_from_checkpoint)

    # For convenience, we also re-save the tokenizer to the same directory,
    # so that you can share your model easily on huggingface.co/models =)
    if trainer.is_world_process_zero():
        trainer.save_model()
        tokenizer.save_pretrained(training_args.output_dir)
```
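For context, the settings described above would correspond to something like the following (the output directory and ZeRO-3 config path are placeholders, not taken from the report):

```python
from transformers import TrainingArguments

# Training stops at step 100, but checkpoints are only scheduled every
# 200 steps, so nothing is saved during training and only the explicit
# save_model() call at the end runs.
training_args = TrainingArguments(
    output_dir="output",               # placeholder path
    max_steps=100,
    save_steps=200,
    deepspeed="ds_zero3_config.json",  # placeholder ZeRO-3 config path
)
```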

Expected behavior

I hope that training can finish and the final model weights can be saved normally, even when max_steps is smaller than save_steps.

amyeroberts commented 2 days ago

cc @muellerzr @SunMarc

macheng6 commented 2 days ago

When I removed the `trainer.is_world_process_zero()` check, the code ran normally. If it is not removed, the code blocks at `state_dict = self.accelerator.get_state_dict(self.deepspeed)` inside `trainer.save_model()`.
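A likely explanation, assuming the usual ZeRO-3 behavior: under ZeRO-3 the model weights are partitioned across ranks, and `accelerator.get_state_dict(...)` gathers them with a collective call that every rank must enter. If only the main process reaches it (because `save_model()` is wrapped in `is_world_process_zero()`), the other ranks never join the collective and rank 0 waits forever. A minimal sketch of a pattern that avoids this, assuming `trainer.save_model()` coordinates ranks internally:

```python
# Sketch of a save pattern that avoids the hang, assuming
# trainer.save_model() coordinates ranks internally: every process
# must enter it so ZeRO-3 can gather the partitioned weights, and
# only the main process actually writes files to disk.
if training_args.do_train:
    trainer.train(args.resume_from_checkpoint)

    # Called on every rank, not guarded by is_world_process_zero():
    trainer.save_model()

    # Saving the tokenizer touches no distributed state, so it is
    # safe to restrict it to the main process.
    if trainer.is_world_process_zero():
        tokenizer.save_pretrained(training_args.output_dir)
```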

SunMarc commented 2 days ago

Hi @macheng6, are you sure that this is an issue with max_steps and save_steps? If you set max_steps > save_steps, do you get the issue, and does the code below `trainer.is_world_process_zero()` run fine? If you have a minimal reproducer, that would be great!
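For reference, a hypothetical minimal reproducer sketch along these lines might help confirm it (the model name, dataset, and DeepSpeed config path are placeholders, not taken from the report); launched with e.g. `deepspeed repro.py`:

```python
# Hypothetical minimal reproducer sketch: train for fewer steps than the
# save interval under ZeRO-3, then save only from the main process -- the
# pattern reported above to hang.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sshleifer/tiny-gpt2"  # placeholder tiny model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="repro_output",          # placeholder path
    max_steps=100,                      # fewer steps than save_steps
    save_steps=200,
    per_device_train_batch_size=1,
    deepspeed="ds_zero3_config.json",   # placeholder ZeRO-3 config path
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

if trainer.is_world_process_zero():     # the guard reported to cause the hang
    trainer.save_model()
    tokenizer.save_pretrained(training_args.output_dir)
```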