huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Mismatch with epoch when using gradient_accumulation #31677

Open SangbumChoi opened 3 weeks ago

SangbumChoi commented 3 weeks ago

System Info

Who can help?

@muellerzr @SunMarc

Information

Tasks

Reproduction

This issue is about a mismatch between the number of epochs defined in TrainingArguments and the number of epochs actually trained. Even though I set 24 epochs in TrainingArguments and gradient_accumulation_steps to 2, max_steps is miscalculated when it is not set explicitly.

https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/trainer.py#L1983

Expected behavior

https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/trainer.py#L1975

If we just use a regular division here, the issue is solved. Is there a specific reason that `num_update_steps_per_epoch` should remain an integer? `num_update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps`
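For reference, a minimal sketch of how `max_steps` ends up being derived when it is not set (paraphrased, not the exact trainer source); the example values of 9 batches per epoch, accumulation 2, and 24 epochs are assumptions matching the run I log in the next comment:

```python
import math

# Minimal sketch (paraphrased, not the exact trainer source) of how max_steps is derived
# when args.max_steps is not set. Example values are assumptions matching my run below.
len_dataloader = 9                   # batches per process per epoch
gradient_accumulation_steps = 2
num_train_epochs = 24

# Current behavior: floor division drops the trailing partial accumulation window.
steps_per_epoch_floor = max(len_dataloader // gradient_accumulation_steps, 1)   # 4
max_steps_current = math.ceil(num_train_epochs * steps_per_epoch_floor)         # 96

# Suggested behavior: plain division keeps the fractional part, so max_steps covers all epochs.
steps_per_epoch_true = len_dataloader / gradient_accumulation_steps             # 4.5
max_steps_suggested = math.ceil(num_train_epochs * steps_per_epoch_true)        # 108

print(max_steps_current, max_steps_suggested)  # 96 108
```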

SangbumChoi commented 3 weeks ago

These are the intermediate values that I logged, with batch size = 4, num_gpus = 4, gradient_accumulation_steps = 2, num_train_images = 140.

Before the change, in `_inner_training_loop`:

- `has_length(train_dataloader)`: True, `args.max_steps > 0`: False, `max_steps`: 96, `args.num_train_epochs`: 24
- total epochs actually trained -> 22 (which is not expected!)

After the change, in `_inner_training_loop`:

- `has_length(train_dataloader)`: True, `args.max_steps > 0`: False, `max_steps`: 108, `args.num_train_epochs`: 24
- total epochs actually trained -> 24
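For completeness, the arithmetic behind these numbers, assuming the sampler splits the 140 images evenly across the 4 GPUs (the variable names and epoch counts below are my own illustration, not values logged by the trainer):

```python
import math

# Assumed setup matching the logged values: 140 images, 4 GPUs, per-device batch size 4,
# gradient_accumulation_steps = 2, num_train_epochs = 24.
len_dataloader = math.ceil(140 / (4 * 4))       # 9 batches per process per epoch
true_updates_per_epoch = len_dataloader / 2     # 4.5 optimizer updates per epoch

# max_steps as computed before and after the change (see the sketch above): 96 vs 108.
epochs_before = 96 / true_updates_per_epoch     # ~21.3 -> training stops early (22 in my log)
epochs_after = 108 / true_updates_per_epoch     # 24.0 -> matches args.num_train_epochs

print(len_dataloader, epochs_before, epochs_after)  # 9 21.33... 24.0
```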

SangbumChoi commented 2 weeks ago

@muellerzr Hi, what do you think about the suggested modification? Is there anything to be concerned about? If not, I can open a PR!