huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

PartialState().wait_for_everyone() hangs using NVIDIA-SMI 555.42.06 #2942

Closed ncchadwi closed 2 weeks ago

ncchadwi commented 1 month ago

System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-6.2.0-1014-aws-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/.venv/bin/accelerate
- Python version: 3.12.2
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        Not found

nvidia-smi

```shell
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   40C    P0             67W /  300W |    1360MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   42C    P0             68W /  300W |     456MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   33C    P8             17W /  300W |      17MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   39C    P0             60W /  300W |     283MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

Information

Tasks

Reproduction

  1. Spin up an instance (EC2 Ubuntu 22.04 Deep Learning AMI) with multiple GPUs, e.g. g5.24xlarge
  2. pip install accelerate
  3. Create script `debug_hang.py`:

```python
from accelerate import PartialState
import torch

print(f'is cuda available: {torch.cuda.is_available()}')
print(f'there are {torch.cuda.device_count()} number of cudas')

if PartialState().is_main_process:
    print("Pretending to write test file")

print(f"Waiting for everyone: is main? {PartialState().is_main_process}")
PartialState().wait_for_everyone()
print("Done waiting")
```

  4. Run the script: `accelerate launch debug_hang.py`

Result:
```shell
is cuda available: True
is cuda available: True
is cuda available: True
is cuda available: True
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
Pretending to write test file
Waiting for everyone: is main? True
Waiting for everyone: is main? False
Waiting for everyone: is main? False
Waiting for everyone: is main? False
```

The "Done waiting" never occurs.

After downgrading the NVIDIA driver to 535, the code no longer hangs (it runs to completion):

```shell
NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.5
```
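
For completeness, here is a minimal check that takes Accelerate out of the loop entirely (a sketch, assuming that `wait_for_everyone()` ultimately reaches a NCCL barrier in a multi-GPU launch; the filename and `torchrun` invocation are illustrative, not from the original report). If this also hangs on driver 555, the problem most likely sits at the NCCL/driver level rather than in Accelerate.

```python
# debug_barrier.py -- hypothetical plain torch.distributed equivalent of the
# repro above; launch with: torchrun --nproc_per_node=4 debug_barrier.py
import os

import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for each worker process
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL backend, the same backend a multi-GPU `accelerate launch` uses
dist.init_process_group(backend="nccl")

print(f"rank {dist.get_rank()}: before barrier")
dist.barrier()  # roughly what wait_for_everyone() boils down to
print(f"rank {dist.get_rank()}: after barrier")

dist.destroy_process_group()
```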

Expected behavior

The expected behavior is for the script to run to completion, with the "Done waiting" message printed once per process (i.e., once per GPU):

```shell
is cuda available: True
is cuda available: True
is cuda available: True
is cuda available: True
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
Pretending to write test file
Waiting for everyone: is main? True
Waiting for everyone: is main? False
Waiting for everyone: is main? False
Waiting for everyone: is main? False
Done waiting
Done waiting
Done waiting
Done waiting
```

muellerzr commented 1 month ago

Can you perhaps try manually disabling P2P? (`NCCL_P2P_DISABLE` iirc)
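
One way to try that without changing the launch command is to set the variable at the top of the script, before the process group is created (a sketch of a hypothetical variant of `debug_hang.py`, using the `NCCL_P2P_DISABLE` spelling that NCCL documents):

```python
import os

# Disable NCCL peer-to-peer (NVLink/PCIe P2P) transfers between GPUs.
# This must happen before the first PartialState() call, since NCCL reads
# the variable when the communicator is created.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

from accelerate import PartialState

state = PartialState()
print(f"rank {state.process_index}: waiting for everyone")
state.wait_for_everyone()
print(f"rank {state.process_index}: done waiting")
```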

ncchadwi commented 1 month ago

The code errors out with the following:

```shell
NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch debug_hang.py
...
nvmlInit_v2() failed: Driver/library version mismatch
```
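
That `nvmlInit_v2()` error usually means the user-space NVIDIA libraries and the loaded kernel module are out of sync (common right after a driver change until the node is rebooted or the old module is unloaded). A small diagnostic sketch, assuming a standard Linux driver install, to compare the two versions:

```python
# Hypothetical check: compare the NVIDIA kernel module version with the
# user-space driver version reported by nvidia-smi. If they differ, NVML
# calls fail with "Driver/library version mismatch" until the matching
# module is loaded (typically after a reboot).
import re
import subprocess

kernel_side = open("/proc/driver/nvidia/version").read()
smi_out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

kernel_ver = re.search(r"Kernel Module\s+([\d.]+)", kernel_side)
user_ver = re.search(r"Driver Version:\s*([\d.]+)", smi_out)

print("kernel module:", kernel_ver.group(1) if kernel_ver else kernel_side.strip())
print("user space   :", user_ver.group(1) if user_ver else "nvidia-smi gave no version")
```
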
github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.