`hub_strategy="every_save"` won't push the model to the Hub if large

alvarobartt commented 1 month ago

System Info

transformers version: 4.40.2
Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.23.0
Safetensors version: 0.4.3
Accelerate version: 0.30.0
Accelerate config: not found
PyTorch version (GPU?): 2.1.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: Yes (A100 80GB SXM)
Using distributed or parallel set-up in script?: NA

Additionally:

trl version: 0.8.7.dev0

Who can help?

@muellerzr and @pacman100, also cc @philschmid and @lewtun as a follow-up to a recent conversation about this issue

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

[!NOTE] The script used to identify this issue is not an official transformers script but one from trl. Since SFTTrainer, DPOTrainer, and the other trl trainers just subclass the Trainer, I've decided to open the issue here following Philipp's recommendation.

The script that I'm using can be found at trl/examples/scripts/sft.py, but the same happens with any fine-tuning script in trl, at least: when using save_strategy="epoch", push_to_hub=True, and hub_strategy="every_save", the model is not pushed to the Hugging Face Hub after every epoch, and only the tokenizer and configuration files are pushed instead.

To run the script mentioned above under the same settings:

python sft.py --model_name_or_path="mistralai/Mistral-7B-v0.1" --report_to="tensorboard" --learning_rate=5e-5 --dataset_name="timdettmers/openassistant-guanaco" --dataset_train_split="train" --dataset_test_split="test" --torch_dtype="bfloat16" --per_device_train_batch_size=16 --gradient_accumulation_steps=2 --output_dir="sft_openassistant-guanaco" --logging_steps=1 --num_train_epochs=3 --push_to_hub --gradient_checkpointing --hub_strategy="every_save" --hub_private_repo --save_strategy="epoch" --hub_repo_id="hub-strategy-every-save-mistral-sft" --optim=adamw_bnb_8bit
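For anyone reproducing this outside the trl script, here is a minimal sketch of how the relevant flags from the command above map onto keyword arguments. The field names are the ones used by transformers.TrainingArguments; the dict itself and the `trigger` variable are only illustrative, and passing the dict to TrainingArguments is left as a comment so the snippet runs without transformers installed.

```python
# Keyword arguments mirroring a subset of the CLI flags above.
# Values are copied from the command in this report.
training_kwargs = {
    "output_dir": "sft_openassistant-guanaco",
    "num_train_epochs": 3,
    "save_strategy": "epoch",      # write a checkpoint at the end of every epoch
    "push_to_hub": True,           # enable uploads to the Hub
    "hub_strategy": "every_save",  # upload each time a checkpoint is saved
    "hub_private_repo": True,
}

# Assumption: with transformers installed, these could be used as
#   TrainingArguments(**training_kwargs)

# The three settings whose combination is discussed in this issue.
trigger = {
    key: training_kwargs[key]
    for key in ("save_strategy", "push_to_hub", "hub_strategy")
}
print(trigger)
```

The remaining flags in the command (dataset, dtype, optimizer, and so on) are left out here because they only describe the setup used for the report.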

I've seen that happening for SFTTrainer, DPOTrainer, and ORPOTrainer in both single and multi-GPU setups. What's pushed to the Hub after every epoch is only the tokenizer and configuration files, without the model weights.

The model is indeed properly pushed when calling trainer.push_to_hub explicitly once the training has finished.

Expected behavior

Ideally when setting save_strategy="epoch", push_to_hub=True, hub_strategy="every_save", assuming that the Hugging Face authentication is properly done, the model weights available under the checkpoint-<STEP_NUM> directory within the output_dir should be pushed along with the rest of the files (tokenizer and configuration).

Currently only the tokenizer and configuration files are uploaded, while the model weights are not. That combination of flags should also upload the model to the Hub after every epoch.
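To make the difference between "expected" and "observed" easy to check, here is a rough sketch that picks the weight files out of a list of file names, whether that list comes from a local checkpoint-<STEP_NUM> directory or from the files shown in the Hub repo. The helper names are hypothetical; the file-name patterns are the usual ones written by save_pretrained (model.safetensors, pytorch_model.bin, and their sharded variants), and the example lists are made up for illustration.

```python
import re
from pathlib import Path

# Usual single-file weight names written by save_pretrained.
SINGLE_FILE_WEIGHTS = {"model.safetensors", "pytorch_model.bin"}
# Usual sharded weight names, e.g. model-00001-of-00003.safetensors.
SHARD_PATTERN = re.compile(
    r"^(model|pytorch_model)-\d{5}-of-\d{5}\.(safetensors|bin)$"
)


def weight_files(filenames):
    """Return the sorted subset of file names that look like model weights."""
    return sorted(
        name
        for name in filenames
        if name in SINGLE_FILE_WEIGHTS or SHARD_PATTERN.match(name)
    )


def checkpoint_weight_files(checkpoint_dir):
    """Same check, applied to the files inside a local checkpoint directory."""
    return weight_files(
        path.name for path in Path(checkpoint_dir).iterdir() if path.is_file()
    )


# Illustrative file lists (not copied from an actual run).
local_checkpoint = [
    "config.json",
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
    "model.safetensors.index.json",
    "tokenizer.json",
]
pushed_to_hub = ["config.json", "tokenizer.json", "tokenizer_config.json"]

print(weight_files(local_checkpoint))  # the three shard files
print(weight_files(pushed_to_hub))     # empty list: no weights present
```

An empty result for the Hub listing next to a non-empty result for the local checkpoint would match the behavior described in this issue.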

We've run the same setup with smaller models and it works as expected, but as long as the model is either over 5GB or requires sharding, the weights won't be pushed.
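As a back-of-the-envelope check of which models fall on which side of that threshold, here is a small sketch. It assumes the checkpoint size is roughly the parameter count times the bytes per parameter, takes the 5GB limit from the observation above (interpreted as 5 * 10**9 bytes), and uses rounded parameter counts that are illustrative rather than exact. The function names are hypothetical.

```python
def estimated_checkpoint_bytes(num_parameters, bytes_per_parameter):
    """Rough size of the saved weights, ignoring metadata and non-weight files."""
    return num_parameters * bytes_per_parameter


def exceeds_threshold(total_bytes, threshold_bytes=5 * 10**9):
    """True if the weights are larger than the assumed 5GB limit."""
    return total_bytes > threshold_bytes


# Assumption: a ~7B-parameter model stored in bfloat16 (2 bytes per parameter).
large = estimated_checkpoint_bytes(7_000_000_000, 2)
# Assumption: a ~125M-parameter model stored in float32 (4 bytes per parameter).
small = estimated_checkpoint_bytes(125_000_000, 4)

print(large / 10**9, exceeds_threshold(large))  # 14.0 True
print(small / 10**9, exceeds_threshold(small))  # 0.5 False
```

Under these assumptions the Mistral-7B run from the reproduction lands well above the threshold, while a small model stays below it, which is consistent with what we observed.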

Feel free to let me know if there's anything else you'd like me to do to help debug this issue further! Thanks in advance 🤗

LysandreJik commented 2 days ago

cc @muellerzr and @SunMarc, with larger models becoming the norm it seems like a worthwhile issue to tackle

SunMarc commented 2 days ago

Hey @alvarobartt, thanks for the detailed report! I was able to reproduce and fix the issue. Check the above PR.
