Open rmchurch opened 3 years ago
Sounds like a deadlock caused by the file locking mechanism. Although we should definitely investigate why this is not working with newer versions of PyTorch (this is a BC-breaking behavior), I would strongly suggest using the tcp:// initialization option if you are running a multi-node job. Unfortunately, file locking on Linux is inherently broken when used with distributed file systems, and the root cause can be many things, including changes in recent kernel versions.
A quick diff showed that one of the main differences is that v1.3.1 didn't have this _store_based_barrier function, but I haven't dug in to understand the code. I went ahead and looked at the tcp:// init method as you suggested; I hadn't realized I could go off a master IP and port only. I have that working successfully now with the following:
```python
import os
import subprocess

# Resolve the SLURM node list and use the first node as the rendezvous host
cmd = 'scontrol show hostnames ' + os.getenv('SLURM_JOB_NODELIST')
stdout = subprocess.check_output(cmd.split())
host_name = stdout.decode().splitlines()[0]
port = 54000
dist_url = f'tcp://{host_name}:{port}'
```
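For completeness, a sketch of how that dist_url can then be passed to init_process_group, with the rank and world size pulled from the standard SLURM environment variables (these variable names are assumptions, not copied from the original script):

```python
import os
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])        # global rank assigned by SLURM
world_size = int(os.environ["SLURM_NTASKS"])  # total number of ranks across nodes

# TCP rendezvous: every rank connects to the first node in the allocation.
dist.init_process_group(
    backend="nccl",
    init_method=dist_url,   # e.g. "tcp://<first-node>:54000" from the snippet above
    rank=rank,
    world_size=world_size,
)
```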
I will leave the issue open, but feel free to close if you wish.
Yes, I also encounter this problem; it's very weird, and a sleep(1) makes things work. I can only use the shared-file mechanism for distributed training in my situation, so does anyone have good ideas?
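For anyone stuck with file-based rendezvous, one illustration (not a confirmed workaround for the hang described here) is to build the FileStore explicitly and hand it to init_process_group; the path, rank, and world size below are placeholders:

```python
import os
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])        # placeholder: global rank
world_size = int(os.environ["SLURM_NTASKS"])  # placeholder: total ranks

# Explicit file-backed store on shared storage (path is a placeholder).
store = dist.FileStore("/gpfs/scratch/ddp_store_file", world_size)

dist.init_process_group(
    backend="nccl",
    store=store,
    rank=rank,
    world_size=world_size,
)
```

Whether this changes anything for the barrier hang described in this issue is not confirmed in this thread.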
cc @kwen2501
🐛 Bug
When using the NCCL backend and the file:// init_method for init_process_group, it inevitably hangs on my system (ppc64le architecture) when running multi-node with more than 1 GPU per compute node. The file I point to is unique each run and resides on a GPFS scratch filesystem. The hanging only occurs for PyTorch versions >1.3.1 (I have tried 1.9.0, 1.7.0, and 1.5.0; they all exhibit the same behavior); in contrast, with v1.3.1 I am able to train using multiple GPUs on multiple compute nodes. I ran the multi-GPU, multi-node setup through the pdb debugger, and oddly, if I pause right after the initial store.add in the _store_based_barrier function and then continue, it runs through without problem. If I pause on the store.add line itself and then hit continue, it will hang. I'm not sure if the issue is multiple writers to the file or something else, but it seems like a simple time.sleep(1) between the store.add and the get_world_size would fix the problem. Or perhaps there is another settings-based solution?
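To make the timing question concrete, here is a heavily simplified paraphrase of the store-based barrier pattern described above (not the actual PyTorch implementation; the key name and loop details are illustrative), with the spot where the suggested time.sleep(1) would go marked:

```python
import time

def simplified_store_barrier(store, world_size, timeout_s=1800):
    """Illustrative sketch only; not torch.distributed's real _store_based_barrier."""
    key = "barrier_key"        # illustrative key name
    store.add(key, 1)          # announce this rank's arrival by incrementing a counter
    # (the time.sleep(1) suggested above would sit roughly here,
    #  between the add and the subsequent world-size check)
    start = time.time()
    while store.add(key, 0) < world_size:   # add(key, 0) reads the counter without changing it
        time.sleep(0.01)
        if time.time() - start > timeout_s:
            raise RuntimeError("timed out waiting for all ranks to reach the barrier")
```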
To Reproduce

Run with SLURM setup:
main.py:
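The original submission script and main.py are not reproduced above; the following is only a hypothetical minimal main.py in the same spirit (NCCL backend, file:// rendezvous on shared storage, ranks taken from SLURM variables), not the reporter's actual code:

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["SLURM_PROCID"])        # global rank from SLURM
    world_size = int(os.environ["SLURM_NTASKS"])  # total ranks across nodes
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

    torch.cuda.set_device(local_rank)

    # Unique shared file per run on the GPFS scratch filesystem (path is hypothetical).
    init_file = f"/gpfs/scratch/ddp_init_{os.environ['SLURM_JOB_ID']}"

    # This is the call that hangs on >1 node with >1 GPU per node for PyTorch > 1.3.1.
    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )

    # Minimal collective to confirm the process group works.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```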
Expected behavior
init_process_group to not hang
Environment
How you installed PyTorch (conda, pip, source): source, but have also used directly from conda

Additional context
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang