huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

PartialState().wait_for_everyone() hangs using NVIDIA-SMI 555.42.06 #2942

Closed ncchadwi closed 2 weeks ago

ncchadwi commented 1 month ago

System Info

- `Accelerate` version: 0.32.1
- Platform: Linux-6.2.0-1014-aws-x86_64-with-glibc2.35
- `accelerate` bash location: /home/ubuntu/.venv/bin/accelerate
- Python version: 3.12.2
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 186.70 GB
- GPU type: NVIDIA A10G
- `Accelerate` default config:
        Not found

nvidia-smi

```shell
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   40C    P0             67W /  300W |    1360MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   42C    P0             68W /  300W |     456MiB /  23028MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   33C    P8             17W /  300W |      17MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
|  0%   39C    P0             60W /  300W |     283MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

Information

Tasks

Reproduction

  1. Spin up an instance (EC2 Ubuntu 22.04 Deep Learning AMI) with multiple GPUs, e.g. g5.24xlarge
  2. pip install accelerate
  3. Create script `debug_hang.py`:

```python
from accelerate import PartialState
import torch

print(f'is cuda available: {torch.cuda.is_available()}')
print(f'there are {torch.cuda.device_count()} number of cudas')

if PartialState().is_main_process:
    print("Pretending to write test file")

print(f"Waiting for everyone: is main? {PartialState().is_main_process}")
PartialState().wait_for_everyone()
print("Done waiting")
```

  4. Run the script: `accelerate launch debug_hang.py`

Result:
```shell
is cuda available: True
is cuda available: True
is cuda available: True
is cuda available: True
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
Pretending to write test file
Waiting for everyone: is main? True
Waiting for everyone: is main? False
Waiting for everyone: is main? False
Waiting for everyone: is main? False
```

The "Done waiting" never occurs.

After downgrading the NVIDIA driver to 535, the code no longer hangs (it runs to completion):

```shell
NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.5
```
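
For completeness, here is a minimal check that takes Accelerate out of the loop entirely (a sketch, assuming that `wait_for_everyone()` ultimately reaches a NCCL barrier in a multi-GPU launch; the filename and `torchrun` invocation are illustrative, not from the original report). If this also hangs on driver 555, the problem most likely sits at the NCCL/driver level rather than in Accelerate.

```python
# debug_barrier.py -- hypothetical plain torch.distributed equivalent of the
# repro above; launch with: torchrun --nproc_per_node=4 debug_barrier.py
import os

import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for each worker process
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL backend, the same backend a multi-GPU `accelerate launch` uses
dist.init_process_group(backend="nccl")

print(f"rank {dist.get_rank()}: before barrier")
dist.barrier()  # roughly what wait_for_everyone() boils down to
print(f"rank {dist.get_rank()}: after barrier")

dist.destroy_process_group()
```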

Expected behavior

The expected behavior is for the script to run to completion, with the "Done waiting" message printed once per process (i.e., once per GPU):

```shell
is cuda available: True
is cuda available: True
is cuda available: True
is cuda available: True
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
there are 4 number of cudas
Pretending to write test file
Waiting for everyone: is main? True
Waiting for everyone: is main? False
Waiting for everyone: is main? False
Waiting for everyone: is main? False
Done waiting
Done waiting
Done waiting
Done waiting
```

muellerzr commented 1 month ago

Can you perhaps try manually disabling P2P? (`NCCL_P2P_DISABLE` iirc)
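
One way to try that without changing the launch command is to set the variable at the top of the script, before the process group is created (a sketch of a hypothetical variant of `debug_hang.py`, using the `NCCL_P2P_DISABLE` spelling that NCCL documents):

```python
import os

# Disable NCCL peer-to-peer (NVLink/PCIe P2P) transfers between GPUs.
# This must happen before the first PartialState() call, since NCCL reads
# the variable when the communicator is created.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

from accelerate import PartialState

state = PartialState()
print(f"rank {state.process_index}: waiting for everyone")
state.wait_for_everyone()
print(f"rank {state.process_index}: done waiting")
```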

ncchadwi commented 1 month ago

The code errors out with the following:

```shell
NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch debug_hang.py
...
nvmlInit_v2() failed: Driver/library version mismatch
```
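
That `nvmlInit_v2()` error usually means the user-space NVIDIA libraries and the loaded kernel module are out of sync (common right after a driver change until the node is rebooted or the old module is unloaded). A small diagnostic sketch, assuming a standard Linux driver install, to compare the two versions:

```python
# Hypothetical check: compare the NVIDIA kernel module version with the
# user-space driver version reported by nvidia-smi. If they differ, NVML
# calls fail with "Driver/library version mismatch" until the matching
# module is loaded (typically after a reboot).
import re
import subprocess

kernel_side = open("/proc/driver/nvidia/version").read()
smi_out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

kernel_ver = re.search(r"Kernel Module\s+([\d.]+)", kernel_side)
user_ver = re.search(r"Driver Version:\s*([\d.]+)", smi_out)

print("kernel module:", kernel_ver.group(1) if kernel_ver else kernel_side.strip())
print("user space   :", user_ver.group(1) if user_ver else "nvidia-smi gave no version")
```
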
github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.