huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Multi-GPU is slower than Single-GPU on nlp_example.py #2943

Closed: Duncanswilson closed this issue 1 month ago

Duncanswilson commented 1 month ago

System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-6.5.0-44-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/user/miniforge3/envs/pytorch_nightly/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.0.dev20240716+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 94.10 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 3
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

When running the examples on single vs multi GPU, I see significant slowdowns when using more than one GPU.

`time python ./nlp_example.py` produces:

real    0m51.772s                                                                                    
user    0m44.796s                                                                                    
sys     0m2.716s

while `time accelerate launch ./nlp_example.py` produces:

real    3m58.350s
user    11m28.994s
sys     0m6.990s

Expected behavior

I would expect a near-linear speedup for each GPU added to training. Since the dataloader is returned from accelerator.prepare(), my understanding is that the batch size it is initialized with (e.g. 16, on line 99 of the example script) is kept per process, i.e. each GPU processes its own batch of 16 (the memory usage shown in nvitop also confirms this).

For the nlp_example, this means an implicit global batch size of 16*3 on my setup, so I would hope to see roughly a 3x speedup rather than such a large slowdown.
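A minimal sketch of that behavior (illustrative only; it uses dummy data instead of the actual MRPC dataloaders from nlp_example.py): after prepare(), each process still draws batches of 16, but from disjoint shards of the dataset, so one optimizer step consumes 16 * num_processes samples.

```python
# Sketch (not the example script itself): per-process batch size vs. effective
# global batch size after accelerator.prepare() under multi-GPU/DDP.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

dataset = TensorDataset(torch.randn(1024, 8))     # dummy data for illustration
dataloader = DataLoader(dataset, batch_size=16)   # same batch_size as nlp_example.py
dataloader = accelerator.prepare(dataloader)

# The sampler is sharded across processes: each rank still sees batches of 16,
# but they are disjoint slices, so one step covers 16 * num_processes samples.
batch, = next(iter(dataloader))
print(f"rank {accelerator.process_index}: per-process batch {batch.shape[0]}, "
      f"effective batch {batch.shape[0] * accelerator.num_processes}")
```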

Duncanswilson commented 1 month ago

#2432 and #2056 both have @muellerzr saying that the batch size could be the issue, but as I said above, it looks like the effective batch size is already being multiplied implicitly when preparing the dataloader.

BenjaminBossan commented 1 month ago

The 4x slowdown looks quite strange. I tried to reproduce with a very similar setup (but only 2 GPUs), and for me single- and multi-GPU ran in approximately the same time. IMO that lack of speedup is not too surprising, given that this is a very small model and parallelism always incurs a communication overhead. I'd expect the advantages of DDP to only really manifest with bigger models on more GPUs.

Did you run a test with 2 GPUs instead of 3?
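One way to check whether inter-GPU communication itself is unusually slow is a small all_reduce micro-benchmark. The sketch below is illustrative only (the tensor size and iteration count are arbitrary) and would be launched the same way as the example, e.g. `accelerate launch --num_processes 2 benchmark.py`:

```python
# Sketch: time a repeated all_reduce to gauge inter-GPU communication cost.
import time
import torch
import torch.distributed as dist
from accelerate import Accelerator

accelerator = Accelerator()          # sets up the process group under accelerate launch
device = accelerator.device

tensor = torch.randn(16 * 1024 * 1024, device=device)  # ~64 MB of fp32

# Warm up so one-time NCCL setup cost is not included in the timing.
for _ in range(5):
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

if accelerator.is_main_process:
    size_mb = tensor.numel() * tensor.element_size() / 1e6
    print(f"all_reduce of {size_mb:.0f} MB: {elapsed * 1e3:.1f} ms per call")
```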

Duncanswilson commented 1 month ago

I've just tested the 2-GPU case and got even slower results:

real    5m7.818s
user    9m58.686s                                                                                    
sys     0m6.259s 

@BenjaminBossan can you post your `accelerate env` results so I can see if there are any differences?

BenjaminBossan commented 1 month ago

Hmm, very strange. Here is my env, which as mentioned is pretty similar.

- `Accelerate` version: 0.33.0.dev0
- Platform: Linux-6.8.0-38-generic-x86_64-with-glibc2.39
- `accelerate` bash location: .../anaconda3/envs/accelerate/bin/accelerate
- Python version: 3.10.13
- Numpy version: 1.26.0
- PyTorch version (GPU?): 2.3.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 93.51 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

The biggest difference IMO is the PyTorch version (2.3.1 stable here vs. a 2.5 nightly build in your setup).

Duncanswilson commented 1 month ago

Just in case anyone runs into similar issues, it had nothing to do with accelerate.

The change that fixed everything was a BIOS-level setting: raising the PCI Link Speed from Gen 1 to Gen 4.
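For anyone checking this on their own machine, the negotiated PCIe link generation can also be read without rebooting into the BIOS, e.g. via `nvidia-smi -q` or, as a sketch, the pynvml bindings (installed separately via `pip install nvidia-ml-py`):

```python
# Sketch: report current vs. maximum PCIe link generation per GPU. A current
# generation of 1 on a Gen4-capable board points at the BIOS/slot issue above.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    cur = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    mx = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle)
    print(f"GPU {i} ({name}): PCIe Gen {cur} (max supported: Gen {mx})")
pynvml.nvmlShutdown()
```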

I've attached the relevant BIOS changes if anyone else experiences this.

[Image attachment: IMG_0530 (screenshot of the BIOS PCI Link Speed setting)]