Open macheng6 opened 2 days ago
cc @muellerzr @SunMarc
When I removed the `trainer.is_world_process_zero()` check, the code ran normally. If it is not removed, the code blocks at `state_dict = self.accelerator.get_state_dict(self.deepspeed)` inside `trainer.save_model()`.
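One plausible mechanism for this kind of hang (a simulated sketch, not the Trainer's actual code): `accelerator.get_state_dict()` is a collective operation that every rank must enter, so guarding the call with a rank-zero check leaves rank 0 waiting for peers that never arrive. The toy below uses a `threading.Barrier` to stand in for the collective:

```python
import threading

# Toy simulation: a Barrier(2) plays the role of a collective op
# (like get_state_dict) that BOTH "ranks" must enter to complete.
# Names and structure here are illustrative only.
def run(guard_rank_zero_only):
    barrier = threading.Barrier(2)
    results = []

    def worker(rank):
        if guard_rank_zero_only and rank != 0:
            results.append((rank, "skipped"))
            return  # rank 1 never enters the collective
        try:
            barrier.wait(timeout=0.5)  # the "collective" call
            results.append((rank, "ok"))
        except threading.BrokenBarrierError:
            # stuck waiting for the other rank; timeout stands in for a hang
            results.append((rank, "hang"))

    threads = [threading.Thread(target=worker, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results)

print(run(guard_rank_zero_only=False))  # {0: 'ok', 1: 'ok'}
print(run(guard_rank_zero_only=True))   # {0: 'hang', 1: 'skipped'}
```

In a real distributed run there is no timeout, so rank 0 simply blocks forever, which matches the symptom described above.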
Hi @macheng6, are you sure this is an issue with max_steps and save_steps? If you set max_steps > save_steps, do you still get the issue, and does the code below `trainer.is_world_process_zero()` run fine? If you have a minimal reproducer, that would be great!
System Info
transformers 4.41.1
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
When I set `max_steps=100` and `save_steps=200`, I found that the trainer could not save the trained weights: it blocked somewhere, and the program was unable to stop. The code is:

```python
if training_args.do_train:
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    trainer.train(args.resume_from_checkpoint)
    # For convenience, we also re-save the tokenizer to the same directory,
```
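A side note on the max_steps/save_steps interaction described above: with `save_steps=200` and `max_steps=100`, no step-interval checkpoint ever fires during training, so the only save attempt is the explicit `trainer.save_model()` at the end, which is exactly where the block occurs. A tiny illustration (hypothetical helper, not Trainer internals):

```python
# Hypothetical helper, not Trainer internals: the steps at which a
# step-interval checkpoint would trigger during training.
def checkpoint_steps(max_steps, save_steps):
    return [s for s in range(save_steps, max_steps + 1, save_steps)]

print(checkpoint_steps(100, 200))  # no intermediate checkpoint fires -> []
print(checkpoint_steps(400, 200))  # [200, 400]
```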
Expected behavior
I hope that: