huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Mismatch with epoch when using gradient_accumulation #31677

Open SangbumChoi opened 3 weeks ago

SangbumChoi commented 3 weeks ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

This issue is about a mismatch between the number of epochs defined in TrainingArguments and the number of epochs actually trained. Even though I set 24 epochs in TrainingArguments and gradient_accumulation_steps to 2, max_steps is miscalculated when it is not set explicitly.

https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/trainer.py#L1983

Expected behavior

https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/trainer.py#L1975

If we just use a regular division here, the issue is solved. Is there a specific reason that `num_update_steps_per_epoch` should remain an integer? `num_update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps`
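For reference, a minimal sketch of how `max_steps` ends up being derived when it is not set (paraphrased, not the exact trainer source); the example values of 9 batches per epoch, accumulation 2, and 24 epochs are assumptions matching the run I log in the next comment:

```python
import math

# Minimal sketch (paraphrased, not the exact trainer source) of how max_steps is derived
# when args.max_steps is not set. Example values are assumptions matching my run below.
len_dataloader = 9                   # batches per process per epoch
gradient_accumulation_steps = 2
num_train_epochs = 24

# Current behavior: floor division drops the trailing partial accumulation window.
steps_per_epoch_floor = max(len_dataloader // gradient_accumulation_steps, 1)   # 4
max_steps_current = math.ceil(num_train_epochs * steps_per_epoch_floor)         # 96

# Suggested behavior: plain division keeps the fractional part, so max_steps covers all epochs.
steps_per_epoch_true = len_dataloader / gradient_accumulation_steps             # 4.5
max_steps_suggested = math.ceil(num_train_epochs * steps_per_epoch_true)        # 108

print(max_steps_current, max_steps_suggested)  # 96 108
```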

SangbumChoi commented 3 weeks ago

These are the intermediate values that I logged, with batch size = 4, num_gpus = 4, gradient_accumulation_steps = 2, num_train_images = 140.

Before the change, in `_inner_training_loop`:

- `has_length(train_dataloader)`: True, `args.max_steps > 0`: False, `max_steps`: 96, `args.num_train_epochs`: 24
- total epochs actually trained -> 22 (which is not expected!)

After the change, in `_inner_training_loop`:

- `has_length(train_dataloader)`: True, `args.max_steps > 0`: False, `max_steps`: 108, `args.num_train_epochs`: 24
- total epochs actually trained -> 24
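For completeness, the arithmetic behind these numbers, assuming the sampler splits the 140 images evenly across the 4 GPUs (the variable names and epoch counts below are my own illustration, not values logged by the trainer):

```python
import math

# Assumed setup matching the logged values: 140 images, 4 GPUs, per-device batch size 4,
# gradient_accumulation_steps = 2, num_train_epochs = 24.
len_dataloader = math.ceil(140 / (4 * 4))       # 9 batches per process per epoch
true_updates_per_epoch = len_dataloader / 2     # 4.5 optimizer updates per epoch

# max_steps as computed before and after the change (see the sketch above): 96 vs 108.
epochs_before = 96 / true_updates_per_epoch     # ~21.3 -> training stops early (22 in my log)
epochs_after = 108 / true_updates_per_epoch     # 24.0 -> matches args.num_train_epochs

print(len_dataloader, epochs_before, epochs_after)  # 9 21.33... 24.0
```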

SangbumChoi commented 2 weeks ago

@muellerzr Hi, what do you think about the suggested modification? Is there anything to be concerned about? If not, I can open a PR!