SangbumChoi opened 3 weeks ago
These are the intermediate values I logged with batch_size = 4, num_gpus = 4, gradient_accumulation_steps = 2, and num_train_images = 140:
Before the change in _inner_training_loop:
- has_length(train_dataloader): True
- args.max_steps > 0: False
- max_steps: 96
- args.num_train_epochs: 24
- total_epoch -> 22 (which is not expected!)

After the change in _inner_training_loop:
- has_length(train_dataloader): True
- args.max_steps > 0: False
- max_steps: 108
- args.num_train_epochs: 24
- total_epoch -> 24
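For reference, here is the step math behind those numbers, assuming each of the 4 processes sees ceil(140 / (4 GPUs * batch 4)) = 9 batches per epoch:

```python
import math

# Assumed setup from the logs above: 140 images, 4 GPUs, per-device batch
# size 4 -> each process's dataloader yields ceil(140 / 16) = 9 batches.
len_dataloader = 9
gradient_accumulation_steps = 2
num_train_epochs = 24

# Before the change: floor division drops the leftover half update per epoch.
updates_per_epoch = max(len_dataloader // gradient_accumulation_steps, 1)  # 4
max_steps = math.ceil(num_train_epochs * updates_per_epoch)                # 96

# After the change: true division keeps the fractional update.
updates_per_epoch = len_dataloader / gradient_accumulation_steps           # 4.5
max_steps = math.ceil(num_train_epochs * updates_per_epoch)                # 108

# Gradient accumulation carries across epoch boundaries, so the loop really
# performs ~4.5 updates per epoch on average:
print(96 / 4.5)   # ~21.3 -> training halts in epoch 22 (the observed mismatch)
print(108 / 4.5)  # 24.0  -> training runs the configured 24 epochs
```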
@muellerzr Hi, what do you think about the suggested modification? Is there anything to be concerned about? If not, I can open a PR!
System Info
transformers version: 4.43.0.dev0
Who can help?
@muellerzr @SunMarc
Reproduction
This issue is about a mismatch between the configured number of epochs and the actual number of trained epochs. Even though I set 24 epochs in TrainingArguments and gradient_accumulation_steps to 2, max_steps is miscalculated when it is not set explicitly.
https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/trainer.py#L1983
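A minimal sketch of a script that reproduces the setup (the model and dataset here are hypothetical stand-ins, not my actual ones; any run with 140 training examples on 4 GPUs should show the same step math):

```python
import torch
from torch import nn
from transformers import Trainer, TrainingArguments

# Hypothetical stand-ins for the real model/dataset from my run.
class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 140  # num_train_images = 140

    def __getitem__(self, idx):
        return {"x": torch.randn(8), "labels": torch.tensor(0)}

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    def forward(self, x=None, labels=None):
        logits = self.linear(x)
        loss = nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=24,            # expect 24 epochs of training
    per_device_train_batch_size=4,  # batch size = 4
    gradient_accumulation_steps=2,
)
trainer = Trainer(model=ToyModel(), args=args, train_dataset=ToyDataset())
trainer.train()
# Launch on 4 GPUs, e.g.: torchrun --nproc_per_node=4 repro.py
# Training stops at max_steps = 96, ending in epoch 22 instead of 24.
```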
Expected behavior
https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/trainer.py#L1975
If we just use true division here, it solves the issue. Is there any specific reason that num_update_steps_per_epoch should remain an integer?
num_update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps
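For context, this is roughly how the change would sit in the surrounding logic of _inner_training_loop (a paraphrased sketch of the linked section, not a verbatim diff):

```python
import math

# Paraphrased from Trainer._inner_training_loop around the linked lines,
# with the proposed true division applied.
if has_length(train_dataloader):
    len_dataloader = len(train_dataloader)
    # was: num_update_steps_per_epoch = len_dataloader // args.gradient_accumulation_steps
    num_update_steps_per_epoch = max(len_dataloader / args.gradient_accumulation_steps, 1)
    if args.max_steps > 0:
        max_steps = args.max_steps
    else:
        # math.ceil turns the fractional per-epoch update count back into an
        # integer step budget, so the leftover updates are no longer dropped.
        max_steps = math.ceil(args.num_train_epochs * num_update_steps_per_epoch)
```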