Open alvarobartt opened 1 month ago
cc @muellerzr and @SunMarc, with larger models becoming the norm it seems like a worthwhile issue to tackle
Hey @alvarobartt, thanks for the detailed report! I was able to reproduce and fix the issue. Check the above PR.
System Info
transformers version: 4.40.2
Additionally:
Who can help?
@muellerzr and @pacman100, also cc @philschmid and @lewtun as a follow up on a recent conversation about this issue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
The script that I'm using can be found at trl/examples/scripts/sft.py, but any fine-tuning script, in trl at least, will not push the model to the Hugging Face Hub after every epoch when using save_strategy="epoch", push_to_hub=True, hub_strategy="every_save"; only the tokenizer and configuration files are pushed instead.
To run the script mentioned above under the same settings:
I've seen that happening for SFTTrainer, DPOTrainer, and ORPOTrainer in both single and multi-GPU setups. What's pushed to the Hub after every epoch is the following:
The model is indeed properly pushed when calling trainer.push_to_hub explicitly once the training has finished.
Expected behavior
Ideally, when setting save_strategy="epoch", push_to_hub=True, hub_strategy="every_save", and assuming that the Hugging Face authentication is properly done, the model weights available under the checkpoint-<STEP_NUM> directory within the output_dir should be pushed along with the rest of the files (tokenizer and configuration). But apparently only the latter are uploaded, while the model is not. So that combination of flags should also upload the model to the Hub after every epoch.
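For reference, the flag combination discussed above can be written down as plain keyword arguments. This is only a minimal sketch: output_dir is a placeholder value that does not come from the report, and expects_push_every_epoch is a hypothetical helper added for illustration, not part of transformers or trl.

```python
# Illustrative only: the flag combination from this report as keyword arguments.
training_kwargs = {
    "output_dir": "./sft-output",  # placeholder path (assumption, not from the report)
    "save_strategy": "epoch",      # save a checkpoint-<STEP_NUM> directory after every epoch
    "push_to_hub": True,           # needs Hugging Face authentication to be set up
    "hub_strategy": "every_save",  # push on every save rather than only at the end
}


def expects_push_every_epoch(kwargs):
    """Hypothetical helper: True when kwargs match the combination in this issue."""
    return (
        kwargs.get("save_strategy") == "epoch"
        and kwargs.get("push_to_hub") is True
        and kwargs.get("hub_strategy") == "every_save"
    )


print(expects_push_every_epoch(training_kwargs))
```

In a real script these would typically be passed on as transformers.TrainingArguments(**training_kwargs), or set through the equivalent command-line flags of the fine-tuning script.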
We've tried the same setup with smaller models and it does work as expected, but whenever the model is either over 5GB or requires sharding, it won't work.
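To check quickly whether a given push contained the weights at all, and whether the checkpoint was sharded, something along these lines might help. It is a rough sketch that assumes the usual transformers file naming (model.safetensors, pytorch_model.bin, model-0000X-of-0000Y.safetensors plus an index JSON for sharded checkpoints). summarize_pushed_files is a hypothetical helper, and the file lists in the example are made up for illustration, not the ones from the affected repository.

```python
import re

# Assumed weight file naming, single-file and sharded variants.
WEIGHT_PATTERNS = [
    re.compile(r"^model\.safetensors$"),
    re.compile(r"^pytorch_model\.bin$"),
    re.compile(r"^model-\d{5}-of-\d{5}\.safetensors$"),
    re.compile(r"^pytorch_model-\d{5}-of-\d{5}\.bin$"),
]
# Index files that accompany a sharded checkpoint.
INDEX_FILES = {"model.safetensors.index.json", "pytorch_model.bin.index.json"}


def summarize_pushed_files(filenames):
    """Hypothetical helper: report which weight files are in a list of file names."""
    weights = sorted(
        name for name in filenames if any(p.match(name) for p in WEIGHT_PATTERNS)
    )
    sharded = any(name in INDEX_FILES for name in filenames)
    return {"weights": weights, "sharded": sharded, "has_weights": bool(weights)}


# Made-up example: a push with tokenizer and configuration files only.
print(summarize_pushed_files(["config.json", "tokenizer.json", "tokenizer_config.json"]))
```

The list of file names could come from a local checkpoint-<STEP_NUM> directory via os.listdir, or from the Hub repository via huggingface_hub.list_repo_files.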
Feel free to let me know if there's anything else you'd like me to do to help debug this issue further! Thanks in advance 🤗