huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Clarification on saving model checkpoints #32639

Closed: vidyasiv closed this 1 day ago

vidyasiv commented 3 months ago

System Info

- `transformers` version: 4.45.0.dev0
- Platform: Linux-5.15.0-117-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.24.5
- Safetensors version: 0.4.4
- Accelerate version: 0.33.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.4.0+cu121 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

cd examples/pytorch/question-answering/
python run_qa.py \
  --model_name_or_path google-bert/bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ \
  --max_steps 50 \
  --save_steps 5000

Expected behavior

I'm not sure what the expected behavior is, but I see the model checkpoint saved twice (also reproducible on v4.43.3):

[INFO|trainer.py:3510] 2024-08-13 00:03:33,083 >> Saving model checkpoint to /tmp/debug_squad/checkpoint-50
<snip>
<snip>
...
[INFO|trainer.py:3510] 2024-08-13 00:03:34,323 >> Saving model checkpoint to /tmp/debug_squad/
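
For anyone reproducing this, a quick way to see what each of the two saves actually wrote is to list both directories. This is a minimal sketch that assumes the reproduction command above has already been run locally, so both /tmp/debug_squad/ and /tmp/debug_squad/checkpoint-50/ exist:

# Minimal sketch (assumption: the repro command above was run and both
# directories exist); lists the files written by each of the two saves.
import os

for d in ["/tmp/debug_squad", "/tmp/debug_squad/checkpoint-50"]:
    print(d)
    for name in sorted(os.listdir(d)):
        print("  ", name)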

When I go back to transformers v4.40.2, I only see a single save, coming from trainer.save_model():

[INFO|trainer.py:3305] 2024-08-13 00:21:02,182 >> Saving model checkpoint to /tmp/debug_squad/

I suspect the first "Saving model checkpoint" comes from the trainer.train() call at https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py#L656 and the second from the trainer.save_model() call at https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa.py#L657.
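
For context, a minimal sketch of that part of run_qa.py, paraphrased from the two linked lines (variable names may differ slightly from the actual script):

# Paraphrased sketch of the linked lines in run_qa.py; names may differ slightly.
if training_args.do_train:
    # On recent versions this call also appears to write
    # /tmp/debug_squad/checkpoint-50 at the end of training
    # (the first "Saving model checkpoint" log line above).
    train_result = trainer.train(resume_from_checkpoint=checkpoint)

    # Explicitly saves the final model to --output_dir, i.e. /tmp/debug_squad/
    # (the second "Saving model checkpoint" log line).
    trainer.save_model()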

Can you clarify whether something changed in train() so that a model checkpoint is now saved as part of it? Why did the behavior change, and was it intentional?

cc: @jiminha, @emascare, @libinta, @regisss

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

vidyasiv commented 2 months ago

Can we get a clarification on this?

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.