pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

init_process_group hangs for multi-node, Pytorch > v1.3.1 and file init_method #67241

Open rmchurch opened 3 years ago

rmchurch commented 3 years ago

🐛 Bug

When using the NCCL backend and the file:// init_method for init_process_group, the call inevitably hangs on my system (ppc64le architecture) when running multi-node with more than 1 GPU per compute node. The file I point to is unique to each run and resides on a GPFS scratch filesystem. The hang only occurs for PyTorch versions >1.3.1 (I have tried 1.9.0, 1.7.0, and 1.5.0; they all exhibit the same behavior); in contrast, with v1.3.1 I am able to train using multiple GPUs on multiple compute nodes. I ran the multi-GPU, multi-node setup through the pdb debugger, and oddly, if I pause right after the initial store.add in the _store_based_barrier function and then continue, it runs through without problem. If I pause on the store.add line itself and then hit continue, it hangs. I'm not sure whether the issue is multiple writers to the file or something else, but it seems like a simple time.sleep(1) between the store.add and the get_world_size call would fix the problem. Or perhaps there is another setting that would work around it?
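
For reference, the pattern I'm describing boils down to something like the following sketch, using only the public FileStore API rather than the private _store_based_barrier; the file path and key name are illustrative, and the sleep is the workaround I'm suggesting:

import os
import time
import torch.distributed as dist

world_size = int(os.environ['SLURM_NTASKS'])
# Illustrative path on the same GPFS scratch filesystem; all ranks must see the same file.
store = dist.FileStore("/scratch/gpfs/dummy_barrier.txt", world_size)
store.add("barrier_key", 1)     # each rank increments the shared counter
time.sleep(1)                   # hypothesized workaround: give other writers time to land
while int(store.get("barrier_key")) < world_size:
    time.sleep(0.1)             # poll until all ranks have checked in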

To Reproduce

Run with SLURM setup:

salloc  --nodes=2 --ntasks-per-node=2 --ntasks-per-socket=1 --gpus-per-task=1 --cpus-per-task=16 --mem=200gb -t 00:30:00
srun python -m pdb main.py

main.py:

import os

import torch.distributed as dist

# SLURM provides the job ID, global rank/world size, and the local GPU index per task.
jobid = os.environ['SLURM_JOB_ID']
world_size = int(os.environ['SLURM_NTASKS'])
rank = int(os.environ['SLURM_PROCID'])
gpu = int(os.environ['SLURM_LOCALID'])

# Shared file on GPFS, unique per job, used for the file:// rendezvous.
dist_url = "file:///scratch/gpfs/dummy_splits." + jobid + ".txt"
dist.init_process_group(backend="nccl", init_method=dist_url,
                        world_size=world_size, rank=rank)

Expected behavior

init_process_group should complete without hanging.

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+gitdfbd030
Is debug build: False
CUDA used to build PyTorch: 11.4
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux 8.4 (Ootpa) (ppc64le)
GCC version: (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1)
Clang version: Could not collect
CMake version: version 3.18.2
Libc version: glibc-2.28

Python version: 3.8.3 (default, Jul  2 2020, 16:32:18)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-305.19.1.el8_4.ppc64le-ppc64le-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.4.100
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] numpydoc==1.1.0
[pip3] torch==1.9.0a0+gitdfbd030
[conda] nomkl                     3.0                           0
[conda] numpy                     1.18.5           py38h7130bb8_0
[conda] numpy-base                1.18.5           py38h2f8d375_0
[conda] numpydoc                  1.1.0                      py_0

Additional context

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

cbalioglu commented 3 years ago

Sounds like a deadlock caused by the file locking mechanism. Although we should definitely investigate why this no longer works with newer versions of PyTorch (this is a BC-breaking behavior), I would strongly suggest using the tcp:// initialization option when running a multi-node job. Unfortunately, file locking on Linux is inherently broken when used with distributed file systems, and the root cause can be any of many things, including changes in recent kernel versions.
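
For example, a minimal sketch (the hostname and port are placeholders; use your rank 0 node's hostname and any free port):

import os
import torch.distributed as dist

world_size = int(os.environ['SLURM_NTASKS'])
rank = int(os.environ['SLURM_PROCID'])

# "node0001" and 29500 are placeholders for rank 0's hostname and a free TCP port.
dist.init_process_group(backend="nccl",
                        init_method="tcp://node0001:29500",
                        world_size=world_size, rank=rank)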

rmchurch commented 3 years ago

A quick diff showed that one of the main differences is that v1.3.1 didn't have this _store_based_barrier function, but I haven't dug in to understand the code. I went ahead and tried the tcp:// init method as you suggested; I hadn't realized I could use just a master IP and port. I have that working successfully now with the following:

import os
import subprocess

# Resolve the first hostname in the SLURM allocation and use it as the rendezvous host.
cmd = 'scontrol show hostnames ' + os.getenv('SLURM_JOB_NODELIST')
stdout = subprocess.check_output(cmd.split())
host_name = stdout.decode().splitlines()[0]
port = 54000
dist_url = f'tcp://{host_name}:{port}'
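
The resulting dist_url then goes into the same init_process_group call as before (world_size and rank taken from the SLURM environment as in main.py):

import torch.distributed as dist

# world_size, rank, and dist_url are defined as in the snippets above.
dist.init_process_group(backend="nccl", init_method=dist_url,
                        world_size=world_size, rank=rank)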

I will leave the issue open, but feel free to close if you wish.

luoguohao commented 2 years ago

Yes, I also encounter this problem; it's very weird, and a sleep(1) makes things work. In my situation I can only use the shared-file mechanism for distributed training, so does anyone have good ideas?

cbalioglu commented 2 years ago

cc @kwen2501