huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

optimizer.step_was_skipped not correct in accelerator.accumulate #2828

Open Fadelis98 opened 1 month ago

Fadelis98 commented 1 month ago

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /root/miniconda3/envs/qec/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 1007.53 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: True
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: True
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

The value of optimizer.step_was_skipped should, as the name suggests, be True whenever optimizer.step() is called but the step is not actually applied to the parameters. The logic is implemented here, inside the if self.gradient_state.sync_gradients branch. The standard implementation of gradient accumulation uses that same condition to decide whether to actually step the optimizer, so optimizer.step_was_skipped is always False.
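
A minimal sketch of the behaviour described above (the gradient_accumulation_steps value and the toy Linear model are illustrative assumptions, not taken from the report):

```python
import torch
from accelerate import Accelerator

# Assumed setup: accumulate over 4 micro-batches on a single process.
accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

for step in range(8):
    with accelerator.accumulate(model):
        x = torch.randn(16, 8, device=accelerator.device)
        loss = model(x).mean()
        accelerator.backward(loss)
        optimizer.step()       # parameters only actually update every 4th call
        optimizer.zero_grad()
        # Reported behaviour: step_was_skipped stays False even on the
        # accumulation-only calls where no update is applied.
        print(step, optimizer.step_was_skipped)
```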

Expected behavior

If this is the expected behaviour, rename optimizer.step_was_skipped or note this behaviour in the docstring. Otherwise, fix its logic so that it returns True while the gradient is being accumulated.

muellerzr commented 1 month ago

step_was_skipped is only set if we hit overflow, not if we are accumulating gradients. We could potentially rename it to overflow_hit, but that is its true intention; it does not indicate that we are not yet syncing gradients.
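
For context, a hedged sketch of the overflow case described here, where step_was_skipped would be expected to become True (assumes fp16 mixed precision on a CUDA device; forcing a non-finite loss is just one way to trigger the GradScaler skip and is not from this thread):

```python
import torch
from accelerate import Accelerator

# Assumes a CUDA device; fp16 mixed precision uses a GradScaler under the hood.
accelerator = Accelerator(mixed_precision="fp16")
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(16, 8, device=accelerator.device)
loss = model(x).mean() * float("inf")  # force non-finite gradients
accelerator.backward(loss)
optimizer.step()    # the scaler detects inf/nan grads and skips the update
optimizer.zero_grad()
print(optimizer.step_was_skipped)  # expected: True, since overflow was hit
```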

github-actions[bot] commented 1 hour ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.