Closed: ncchadwi closed this issue 2 weeks ago.
Can you try manually disabling P2P? (`NCCL_P2P_DISABLE=1`, iirc)
The code errors out with the following:
```
NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch debug_hang.py
...
nvmlInit_v2() failed: Driver/library version mismatch
```
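The `Driver/library version mismatch` error typically means NVML's user-space library no longer matches the loaded kernel driver (common after a driver upgrade without a reboot). A minimal sketch to confirm this from Python, assuming the `nvidia-ml-py` package (imported as `pynvml`) is installed:

```python
# Sketch: probe NVML directly to surface the driver/library mismatch.
# Assumes the nvidia-ml-py package (imported as pynvml) is installed.
import pynvml

try:
    pynvml.nvmlInit()
    print("NVML OK, driver version:", pynvml.nvmlSystemGetDriverVersion())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    # On the broken setup this should print "Driver/library version mismatch".
    print("NVML init failed:", err)
```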
System Info
`nvidia-smi`
Information
Tasks
An officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
```python
import torch
from accelerate import PartialState

print(f"is cuda available: {torch.cuda.is_available()}")
print(f"there are {torch.cuda.device_count()} CUDA devices")

if PartialState().is_main_process:
    print("Pretending to write test file")

print(f"Waiting for everyone: is main? {PartialState().is_main_process}")
PartialState().wait_for_everyone()
print("Done waiting")
```
The "Done waiting" message never appears.
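One way to make the hang easier to debug (a sketch, not part of the original report): pass a short process-group timeout through accelerate's `InitProcessGroupKwargs` handler, so the stuck barrier should abort with an error rather than stalling indefinitely. The 60-second value below is an arbitrary debugging choice.

```python
# Sketch: fail fast instead of hanging forever at the barrier.
# InitProcessGroupKwargs is accelerate's kwargs handler; the 60 s timeout
# is an arbitrary debugging value, not from the original report.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=60))]
)
accelerator.wait_for_everyone()  # should error out after ~60 s if stuck
print("Done waiting")
```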
After downgrading the NVIDIA driver to 535, the code no longer hangs and runs to completion.
Expected behavior
The script should run to completion, with "Done waiting" printed once per GPU process (four times with `CUDA_VISIBLE_DEVICES=0,1,2,3`).
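For reference, since each of the four processes runs the whole script, the expected output (interleaving across ranks will vary) looks roughly like:

```
is cuda available: True            # printed by each of the 4 processes
there are 4 CUDA devices           # printed by each of the 4 processes
Pretending to write test file      # main process only
Waiting for everyone: is main? True
Waiting for everyone: is main? False   # x3, one per non-main rank
Done waiting                           # x4, one per rank
```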