This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello @fabianlim, I think the PR https://github.com/huggingface/transformers/pull/28297 should resolve this.
@pacman100 yes I think so too, closing this issue.
System Info
transformers==4.35.2 accelerate==0.23.0 peft==0.5.0
accelerate.yaml
Who can help?
@pacman100 Following the recommendation in https://huggingface.co/docs/trl/v0.7.4/en/sft_trainer#training-adapters, I install a `PeftSavingCallback` to ensure that `adapter.bin` is saved. This is needed when using `FSDP`, since the wrapped model is not a `PreTrainedModel`, in which case only the `state_dict` will be saved.
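For context, a minimal sketch of such a callback, following the pattern shown in the linked TRL docs (the exact snippet there may differ slightly):

```python
import os

from transformers import TrainerCallback


class PeftSavingCallback(TrainerCallback):
    """Save only the PEFT adapter weights at each checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # save_pretrained on a PeftModel writes adapter_config.json plus the adapter weights
        kwargs["model"].save_pretrained(checkpoint_path)

        # Drop the full model weights if the Trainer also wrote them
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))
```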
The recommendation above works great for saving the checkpoint, but it does not work when resuming from one. This is because `model_wrapped` is neither a `PreTrainedModel` nor a `PeftModel`, so the `if-else` conditions in `Trainer._load_from_checkpoint` fall all the way through to `load_sharded_checkpoint`, which then fails with an error.

The second issue with the recommendation is that the FSDP optimizer states are not saved by the `PeftSavingCallback`, so it will not be a clean fix.

I was wondering if you have any thoughts on this. A possible hacky solution would be to override `Trainer._load_from_checkpoint`, use `FSDP.summon_full_params` to unshard the LoRA weights, and then call `load_adapter`, but that does not sound very clean given that it still would not resume the FSDP optimizer. A rough sketch of this idea follows.
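An untested sketch of that override (the class name `ResumableTrainer` is made up here; whether `load_adapter` cleanly overwrites the existing default adapter inside `summon_full_params` is exactly the part I am unsure about, and the FSDP optimizer state is still not restored):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import Trainer


class ResumableTrainer(Trainer):
    def _load_from_checkpoint(self, resume_from_checkpoint, model=None):
        model = self.model_wrapped if model is None else model
        if isinstance(model, FSDP):
            # Gather the full (unsharded) parameters on every rank; writeback=True
            # (the default) writes in-place updates back to the local shards.
            with FSDP.summon_full_params(model):
                # model.module is the PeftModel that FSDP wrapped; reload the
                # LoRA weights saved by PeftSavingCallback into the default adapter.
                model.module.load_adapter(resume_from_checkpoint, adapter_name="default")
            return
        super()._load_from_checkpoint(resume_from_checkpoint, model=model)
```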
Information
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
Reproduction
1. Train a PEFT model under FSDP using the `accelerate.yaml` configurations above.
2. After at least 100 steps, when an `adapter.bin` checkpoint has been populated, stop.
3. Call `trainer.train(resume_from_checkpoint=True)`. A minimal sketch of such a run is below.
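For illustration, a hypothetical minimal script along those lines (model, dataset, and hyperparameters are placeholders, not the ones from my actual run), launched with `accelerate launch --config_file accelerate.yaml train.py`:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/opt-350m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(model_name),
    LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16),
)

dataset = load_dataset("imdb", split="train").map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text", "label"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", max_steps=200, save_steps=100),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[PeftSavingCallback()],  # the callback sketched earlier in this issue
)

# Steps 1-2: run once without resuming and stop after checkpoint-100 contains
# the saved adapter.
# Step 3: rerun with the line below; under FSDP this is where
# Trainer._load_from_checkpoint falls through to load_sharded_checkpoint and fails.
trainer.train(resume_from_checkpoint=True)
```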
.Expected behavior
`resume_from_checkpoint=True` will resume from the PEFT checkpoint recorded by `PeftSavingCallback`.