huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Learning Rate Scheduler Stepping too fast on MultiGPU #2926

Closed · priyammaz closed this 1 week ago

priyammaz commented 1 month ago

System Info

- `Accelerate` version: 0.22.0
- Platform: Linux-4.18.0-477.55.1.el8_8.x86_64-x86_64-with-glibc2.28
- Python version: 3.11.4
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 251.62 GB
- GPU type: NVIDIA A40
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: no
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0,1,2,3
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Reproduction

I am experimenting with different schedulers and noticed a small problem. Here is the skeleton of the training script, nothing fancy:


model = ...
optimizer = ...

main_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, total_iters=5)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup_scheduler, main_scheduler], milestones=[5])

model, optimizer, trainloader, testloader, scheduler = accelerator.prepare(model, optimizer, trainloader, testloader, scheduler)

for epoch in range(EPOCHS):

    ### Train Loop ###
    for images, labels in trainloader:
        out = model(images)
        loss = loss_fn(out, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    ### Validation Loop ###
    for images, labels in testloader:
        with torch.no_grad():
            out = model(images)

    ### Iterate Scheduler ###
    scheduler.step()

Expected behavior

What I want is basically this: over the 100 epochs I will train the model, the first 4 epochs should be a warmup, and then every 20 epochs after that the learning rate should be reduced by a factor of 0.1. This works totally fine on a single GPU, but when using two GPUs it goes through the schedule twice as fast, as if scheduler.step() is being called twice per epoch. Should I wrap scheduler.step() so it only occurs on the main GPU using if accelerator.is_local_main_process:, or multiply everything by the number of GPUs, or is there a better way to do this that I am missing?
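
One workaround, sketched below under the assumption that the scheduler is meant to advance once per epoch rather than once per batch: leave the scheduler out of accelerator.prepare(...) entirely. A prepared scheduler is wrapped for per-batch stepping and compensates for data parallelism, so an unprepared scheduler advances exactly once per scheduler.step() call on every process, keeping the schedule in sync across GPUs. This is a sketch of the idea, not necessarily the officially recommended pattern:

main_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, total_iters=5)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup_scheduler, main_scheduler], milestones=[5])

# Note: `scheduler` is deliberately not passed to `prepare`, so Accelerate
# never wraps it and it advances exactly one epoch of schedule per call.
model, optimizer, trainloader, testloader = accelerator.prepare(model, optimizer, trainloader, testloader)

for epoch in range(EPOCHS):

    for images, labels in trainloader:
        out = model(images)
        loss = loss_fn(out, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Every process calls this once per epoch, so the schedule stays aligned
    # across GPUs without any `is_main_process` guard.
    scheduler.step()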

priyammaz commented 1 month ago

[Screenshot: wandb learning-rate plot, 2024-07-09]

I have been logging the learning rate on wandb and it looks like this (training for 90 epochs and multiplying the learning rate by 0.1 every 30 epochs). But as you can see, I was training this model on 2 GPUs, so the scheduler is multiplying the learning rate by 0.1 every 15 epochs instead (going twice as fast).
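
This is consistent with how a prepared scheduler behaves: accelerator.prepare wraps it in an AcceleratedScheduler, which (when batches are not split) steps the underlying scheduler num_processes times per call, on the assumption that it is stepped once per batch and the effective batch size grew with the number of processes. With 2 GPUs, each per-epoch scheduler.step() therefore advances the schedule by 2 epochs, which is why the 30-epoch decay fires at epoch 15 in the plot above. A minimal sketch of an alternative fix that keeps the scheduler inside prepare, assuming per-epoch stepping is the goal:

from accelerate import Accelerator

# `step_scheduler_with_optimizer=False` tells Accelerate the scheduler is not
# stepped together with the optimizer (e.g. it is stepped at the end of each
# epoch), so the wrapped scheduler advances exactly once per
# `scheduler.step()` call instead of `num_processes` times.
accelerator = Accelerator(step_scheduler_with_optimizer=False)

model, optimizer, trainloader, testloader, scheduler = accelerator.prepare(model, optimizer, trainloader, testloader, scheduler)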

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.