Open helloworld1 opened 3 weeks ago
@helloworld1 I believe you need to pass checkpoint-2000/pytorch_model_fsdp_0 explicitly, not just checkpoint-2000, since we can't tell whether you mean the model or the optimizer state (and we support merging both).
(We can probably make that more explicit by improving that error message, cc @SunMarc )
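A minimal sketch of the ambiguity described above, assuming the checkpoint layout from this thread (sub-directories named like pytorch_model_fsdp_0 and optimizer_0); the helper resolve_merge_target is hypothetical and not accelerate's actual API:

```python
import os
import tempfile

def resolve_merge_target(checkpoint_dir: str) -> str:
    """Hypothetical helper: given a checkpoint directory, decide which shard
    sub-directory to merge. If both model and optimizer shard directories
    are present, the caller must pick one explicitly."""
    candidates = [
        d for d in os.listdir(checkpoint_dir)
        if os.path.isdir(os.path.join(checkpoint_dir, d))
        and (d.startswith("pytorch_model_fsdp") or d.startswith("optimizer"))
    ]
    if len(candidates) != 1:
        raise ValueError(
            f"Ambiguous checkpoint {checkpoint_dir!r}: found {sorted(candidates)}; "
            "pass the sub-directory explicitly, e.g. checkpoint-2000/pytorch_model_fsdp_0"
        )
    return os.path.join(checkpoint_dir, candidates[0])

# A checkpoint-2000 layout holding both model and optimizer shards is ambiguous:
ambiguous_error = None
with tempfile.TemporaryDirectory() as root:
    ckpt = os.path.join(root, "checkpoint-2000")
    os.makedirs(os.path.join(ckpt, "pytorch_model_fsdp_0"))
    os.makedirs(os.path.join(ckpt, "optimizer_0"))
    try:
        resolve_merge_target(ckpt)
    except ValueError as err:
        ambiguous_error = str(err)
print(ambiguous_error)
```

Passing the pytorch_model_fsdp_0 path directly sidesteps the guess entirely, which is what the suggestion above amounts to.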
I can now successfully merge the weights. Thanks!
I noticed that the weights are saved in pickle format, not safetensors. I created this PR to default to safetensors: https://github.com/huggingface/accelerate/pull/2853
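The motivation for that PR is that pickle can execute arbitrary code on load, while safetensors is a data-only format. A stdlib-only sketch of the "safe by default, opt into pickle" switch (save_state and its unsafe_serialization flag are hypothetical illustrations, and JSON stands in for the real safetensors encoding):

```python
import json
import pickle

def save_state(state: dict, unsafe_serialization: bool = False) -> bytes:
    """Hypothetical sketch: default to a restricted, code-free encoding and
    only use pickle when explicitly requested."""
    if unsafe_serialization:
        # Pickle payloads can run arbitrary code when deserialized.
        return pickle.dumps(state)
    # Stand-in for safetensors: real safetensors stores raw tensor bytes
    # plus a JSON header, and cannot embed executable code.
    return json.dumps(state).encode("utf-8")

blob_safe = save_state({"weight": [0.1, 0.2]})
blob_pickle = save_state({"weight": [0.1, 0.2]}, unsafe_serialization=True)
print(blob_safe.startswith(b"{"), blob_pickle[:1] == b"\x80")
```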
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
When I try to merge the checkpoint, I get the following error:

The checkpoint itself can be restored with trainer.train(resume_from_checkpoint="./checkpoint-2000"). The content of the checkpoint directory looks like this:
CC: @muellerzr
Expected behavior
The merge-weights utility should be able to merge the sharded weights into a FULL_STATE_DICT model.
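Conceptually, the expected merge consolidates each rank's shard back into one full state dict. A stdlib-only sketch of that idea, assuming one pickled shard file per rank (this is not accelerate's implementation; real FSDP shards are flat tensor chunks, modeled here as Python lists):

```python
import os
import pickle
import tempfile

def merge_shards(shard_dir: str) -> dict:
    """Load every rank's shard file and concatenate each parameter's chunks
    into the full, FULL_STATE_DICT-style parameter."""
    full: dict = {}
    for fname in sorted(os.listdir(shard_dir)):  # rank order: shard_0, shard_1, ...
        with open(os.path.join(shard_dir, fname), "rb") as f:
            shard = pickle.load(f)
        for name, chunk in shard.items():
            full.setdefault(name, []).extend(chunk)
    return full

with tempfile.TemporaryDirectory() as d:
    # Two ranks each hold half of the "layer.weight" parameter.
    for rank, chunk in enumerate(([0.0, 1.0], [2.0, 3.0])):
        with open(os.path.join(d, f"shard_{rank}.bin"), "wb") as f:
            pickle.dump({"layer.weight": chunk}, f)
    merged = merge_shards(d)
print(merged)
```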