huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`hub_strategy="every_save"` won't push the model to the Hub if large #30724

Open alvarobartt opened 1 month ago

alvarobartt commented 1 month ago

System Info

Who can help?

@muellerzr and @pacman100, also cc @philschmid and @lewtun as a follow up on a recent conversation about this issue

Information

Tasks

Reproduction

[!NOTE] The script used to identify this issue is not an official `transformers` script but one from `trl`. Since `SFTTrainer`, `DPOTrainer`, and the other `trl` trainers just subclass `Trainer`, I've decided to open the issue here, following Philipp's recommendation.

The script I'm using is trl/examples/scripts/sft.py, but the same happens with any fine-tuning script in `trl`, at least. With `save_strategy="epoch"`, `push_to_hub=True`, and `hub_strategy="every_save"`, the model is not pushed to the Hugging Face Hub after every epoch; only the tokenizer and configuration files are pushed.

To run the script mentioned above under the same settings:

python sft.py --model_name_or_path="mistralai/Mistral-7B-v0.1" --report_to="tensorboard" --learning_rate=5e-5 --dataset_name="timdettmers/openassistant-guanaco" --dataset_train_split="train" --dataset_test_split="test" --torch_dtype="bfloat16" --per_device_train_batch_size=16 --gradient_accumulation_steps=2 --output_dir="sft_openassistant-guanaco" --logging_steps=1 --num_train_epochs=3 --push_to_hub --gradient_checkpointing --hub_strategy="every_save" --hub_private_repo --save_strategy="epoch" --hub_repo_id="hub-strategy-every-save-mistral-sft" --optim=adamw_bnb_8bit
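The Hub-related flags in the command above correspond to `TrainingArguments` fields. The following is a minimal sketch of that mapping, not the script's actual code: `HUB_SETTINGS` and `build_training_args` are hypothetical names introduced for illustration, and the `transformers` import is deferred so the settings can be inspected without the library installed.

```python
# Hub-related settings from the report, expressed as TrainingArguments keyword arguments.
HUB_SETTINGS = {
    "output_dir": "sft_openassistant-guanaco",
    "num_train_epochs": 3,
    "save_strategy": "epoch",      # write a checkpoint at the end of every epoch
    "push_to_hub": True,           # enable pushing to the Hugging Face Hub
    "hub_strategy": "every_save",  # push each time a checkpoint is saved
    "hub_private_repo": True,
}


def build_training_args(**overrides):
    """Build TrainingArguments from HUB_SETTINGS (hypothetical helper).

    The import is local so that this module loads even when transformers
    is not installed; calling the function does require it.
    """
    from transformers import TrainingArguments

    return TrainingArguments(**{**HUB_SETTINGS, **overrides})
```

With these settings, the expectation described in this issue is one push per epoch, so three pushes over the run.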

I've seen this happen with `SFTTrainer`, `DPOTrainer`, and `ORPOTrainer`, in both single- and multi-GPU setups. What's pushed to the Hub after every epoch is the following:

image

The model is pushed properly when `trainer.push_to_hub` is called explicitly once training has finished.

Expected behavior

When setting `save_strategy="epoch"`, `push_to_hub=True`, and `hub_strategy="every_save"`, and assuming Hugging Face authentication is set up correctly, the model weights under the `checkpoint-<STEP_NUM>` directory within `output_dir` should be pushed to the Hub after every epoch, along with the rest of the files (tokenizer and configuration).

Currently, only the tokenizer and configuration files are uploaded; the model weights are not.
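One way to confirm the gap described above is to compare the weight files in a local checkpoint directory with the file list of the Hub repository. The sketch below is an illustration under assumptions, not part of `transformers`: the helper names are hypothetical, weight files are recognized by the usual `model*`, `pytorch_model*`, and `adapter_model*` naming, and files are matched by base name only. `HfApi.list_repo_files` is the `huggingface_hub` call used to fetch the remote listing.

```python
from pathlib import Path, PurePosixPath

# Assumption: weight files and shard indexes follow these naming conventions.
WEIGHT_PREFIXES = ("model", "pytorch_model", "adapter_model")
WEIGHT_SUFFIXES = (".safetensors", ".bin", ".index.json")


def local_weight_files(checkpoint_dir):
    """Return the sorted names of weight files and shard indexes in a checkpoint directory."""
    return sorted(
        p.name
        for p in Path(checkpoint_dir).iterdir()
        if p.is_file()
        and p.name.startswith(WEIGHT_PREFIXES)
        and p.name.endswith(WEIGHT_SUFFIXES)
    )


def missing_on_hub(checkpoint_dir, remote_files):
    """Return local weight files whose base name does not appear in the Hub file list."""
    remote_names = {PurePosixPath(f).name for f in remote_files}
    return [name for name in local_weight_files(checkpoint_dir) if name not in remote_names]


def fetch_remote_files(repo_id):
    """List the files in a Hub repository (requires huggingface_hub and network access)."""
    from huggingface_hub import HfApi

    return HfApi().list_repo_files(repo_id)
```

For example, `missing_on_hub("sft_openassistant-guanaco/checkpoint-<STEP_NUM>", fetch_remote_files("<user>/<repo>"))` would return the weight files that were saved locally but never reached the Hub; an empty list means everything was pushed.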

We've tried the same setup with smaller models and it works as expected, but whenever the model is either over 5GB or requires sharding, the weights are not pushed.
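To tell which case a given checkpoint falls into, the directory can be inspected for sharded weight files and total weight size. This is a rough sketch under assumptions: `describe_checkpoint` is a hypothetical helper, shards are detected by the `model-00001-of-0000N.safetensors` / `pytorch_model-00001-of-0000N.bin` naming pattern, and 5GB is taken as 5 * 10^9 bytes.

```python
import re
from pathlib import Path

# Assumption: sharded weights are named like model-00001-of-00003.safetensors.
SHARD_PATTERN = re.compile(r"^(model|pytorch_model)-\d{5}-of-\d{5}\.(safetensors|bin)$")
FIVE_GB = 5 * 10**9  # assumption: decimal gigabytes


def describe_checkpoint(checkpoint_dir):
    """Summarize whether a checkpoint is sharded and how large its weights are."""
    files = [p for p in Path(checkpoint_dir).iterdir() if p.is_file()]
    shards = sorted(p.name for p in files if SHARD_PATTERN.match(p.name))
    # Count only model weights, not optimizer state or training_args.bin.
    weights = [
        p
        for p in files
        if p.name.startswith(("model", "pytorch_model"))
        and p.name.endswith((".safetensors", ".bin"))
    ]
    total_bytes = sum(p.stat().st_size for p in weights)
    return {
        "sharded": bool(shards),
        "num_shards": len(shards),
        "total_weight_bytes": total_bytes,
        "over_5gb": total_bytes > FIVE_GB,
    }
```

A checkpoint for which `sharded` or `over_5gb` is true is the kind that, per this report, does not get its weights pushed.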

Feel free to let me know if there's anything else you'd like me to do to help debug this issue further! Thanks in advance 🤗

LysandreJik commented 2 days ago

cc @muellerzr and @SunMarc; with larger models becoming the norm, this seems like a worthwhile issue to tackle.

SunMarc commented 2 days ago

Hey @alvarobartt, thanks for the detailed report! I was able to reproduce and fix the issue. Check the above PR.