kibitzing opened this issue 1 month ago
Upon further investigation, I found that this issue is not solely related to the callbacks. When performing gradient accumulation, if the total number of batches in an epoch is not divisible by `gradient_accumulation_steps`, the accumulation window shifts.
Specifically, after the update triggered by the last (non-divisible) batch of an epoch, the gradient accumulation count should restart from 0. However, since `total_batched_samples` is not reset at each epoch, the divisibility check drifts out of alignment, which leads to this shifting issue.
For example, if `batches_per_epoch = 7` and `gradient_accumulation_steps = 4`, the updates proceed as follows:
Epoch 1
Epoch 2
Epoch 3
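The drift described above can be sketched with a small counter simulation (plain Python, not the actual Trainer code). It assumes, per the description above, that an update also fires on the last batch of each epoch, and it records how many sub-steps went into each update:

```python
# Minimal sketch of the counter behavior described above (not the actual
# Trainer loop): an update fires when total_batched_samples is divisible by
# gradient_accumulation_steps, and also on the last (remainder) batch of the
# epoch. batches_per_epoch = 7, gradient_accumulation_steps = 4.
batches_per_epoch = 7
gradient_accumulation_steps = 4

total_batched_samples = 0   # initialized once at training start, never reset
accumulated = 0             # sub-steps accumulated since the last update
updates = []                # (epoch, sub-steps that went into this update)

for epoch in range(1, 4):
    for step in range(batches_per_epoch):
        total_batched_samples += 1
        accumulated += 1
        is_last_batch = step + 1 == batches_per_epoch
        if total_batched_samples % gradient_accumulation_steps == 0 or is_last_batch:
            updates.append((epoch, accumulated))
            accumulated = 0

print(updates)
# Epoch 1 behaves as expected (an update after 4 sub-steps, then the
# remainder), but epoch 2 starts with an update after a single sub-step,
# because total_batched_samples (8) is already divisible by 4.
```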
cc @SunMarc and @muellerzr
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

`transformers` version: 4.39.0

Who can help?

No response

Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Issue Analysis
The `on_step_begin` callback is invoked when `step` is divisible by `args.gradient_accumulation_steps` (i.e., `step % args.gradient_accumulation_steps == 0`). However, the `on_step_end` callback behaves differently. Its condition is as follows:

`total_batched_samples % args.gradient_accumulation_steps == 0 or is_last_step_and_steps_less_than_grad_acc`

Here, the `on_step_end` callback is triggered when `total_batched_samples` is divisible by `args.gradient_accumulation_steps`. It is important to note that `step` is reset at the beginning of each epoch, whereas `total_batched_samples` is initialized to 0 at the start of training and persists across all epochs until training ends.

Expected Behavior:
When `gradient_accumulation_steps = N`, there should be exactly N sub-steps between the `on_step_begin` and `on_step_end` callbacks. This ensures that gradients are accumulated correctly before an optimization step occurs. The only exception to this rule is the last step of an epoch or of the training run, where fewer sub-steps may exist.

Problematic Behavior Example
The issue arises when the number of steps per epoch is not divisible by `args.gradient_accumulation_steps`. For example, if `steps_per_epoch = 3` and `gradient_accumulation_steps = 2`, we observe the following behavior:

Epoch 1:
- `on_step_begin` called, `on_step_end` not called (expected behavior)
- `on_step_begin` not called, `on_step_end` called (expected behavior)
- `on_step_begin` called, `on_step_end` not called (expected behavior)

Epoch 2:

- `on_step_begin` called (0 % 2 == 0), `on_step_end` called (4 % 2 == 0) (incorrect: `on_step_end` is called after only one sub-step)
- `on_step_begin` not called (1 % 2 != 0), `on_step_end` not called (5 % 2 != 0) (incorrect)
- `on_step_begin` called (2 % 2 == 0), `on_step_end` called (6 % 2 == 0) (incorrect: `on_step_end` is called after only one sub-step)

Epoch 3:

- `on_step_begin` called, `on_step_end` not called (expected behavior)
- `on_step_begin` not called, `on_step_end` called (expected behavior)
- `on_step_begin` called, `on_step_end` not called (expected behavior)

Note:
`total_batched_samples` is incremented by 1 at the start of each step loop.

In this case, when the number of steps per epoch is not divisible by `gradient_accumulation_steps`, the callbacks only pair up correctly in the epochs where the running total happens to realign, leading to incorrect behavior during the other epochs.
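The callback schedule enumerated above can be reproduced with a short simulation of just the two divisibility checks (a sketch, not the actual Trainer loop):

```python
# Reproduce the callback schedule for steps_per_epoch = 3 and
# gradient_accumulation_steps = 2. on_step_begin fires when the per-epoch
# `step` is divisible by gradient_accumulation_steps; on_step_end fires when
# the training-wide `total_batched_samples` is.
steps_per_epoch = 3
gradient_accumulation_steps = 2

total_batched_samples = 0  # persists across epochs
schedule = []              # (epoch, begin_called, end_called) per step

for epoch in range(1, 4):
    for step in range(steps_per_epoch):  # `step` resets every epoch
        begin = step % gradient_accumulation_steps == 0
        total_batched_samples += 1       # incremented at the start of the loop
        end = total_batched_samples % gradient_accumulation_steps == 0
        schedule.append((epoch, begin, end))

for epoch, begin, end in schedule:
    print(epoch, "on_step_begin" if begin else "-", "on_step_end" if end else "-")
# In epoch 2, begin and end fire in the same step twice (after one sub-step
# each), matching the incorrect behavior listed above.
```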