Open leonardozcm opened 1 year ago
I found that it gets stuck here: https://github.com/pytorch/pytorch/blob/main/torch/nn/parallel/distributed.py#L809
@leonardozcm What backend are you using when calling torch.distributed.init_process_group()? The recommended backend is "ccl", and judging by the error, you might have set backend="nccl", which is why it is looking for libc10_cuda.so. Could you share a snippet or reproducer if possible?
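A quick way to check which backend a script can actually use is sketched below. This is a minimal sketch, not code from the reporter: `pick_backend` is a hypothetical helper, and it assumes that the "ccl" backend is provided by the `oneccl_bindings_for_pytorch` module (the module name used by recent torch-ccl releases; older releases may use a different name).

```python
import importlib.util


def pick_backend(preferred="ccl"):
    """Return `preferred` if its support module is importable, else "gloo".

    Hypothetical helper for illustration only.
    """
    # Assumption: importing oneccl_bindings_for_pytorch is what registers
    # the "ccl" backend with torch.distributed.
    required_module = {"ccl": "oneccl_bindings_for_pytorch"}
    module = required_module.get(preferred)
    if module is not None and importlib.util.find_spec(module) is None:
        # The bindings are not installed, so fall back to gloo.
        return "gloo"
    return preferred


if __name__ == "__main__":
    print("backend:", pick_backend("ccl"))
```

Printing the selected backend on every node before calling `init_process_group` makes it easy to spot a node that silently ended up on a different backend.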
Torch/torch-ccl/ipex version: 1.13.0
Cluster nodes: 2
World_size: 2

All nodes have password-less connections set up, and mpirun works as the README describes:
I then tried to run it manually by starting training on both nodes:
This gets stuck at DDP(model):
This does not happen if I set the backend to gloo:
dist.init_process_group(backend='gloo')
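For anyone trying to reproduce the manual two-node launch, a minimal sketch of the setup might look like the following. It is not the reporter's script: `rendezvous_env` and `init_distributed` are hypothetical helpers, the address in the usage note is a placeholder, port 29500 is an arbitrary choice, and the `oneccl_bindings_for_pytorch` import is an assumption about how the "ccl" backend gets registered. The environment variable names are the ones read by `init_method="env://"`.

```python
import os


def rendezvous_env(rank, world_size, master_addr, master_port=29500):
    """Build the variables that init_method="env://" reads (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return {
        "MASTER_ADDR": master_addr,       # address of the rank 0 node
        "MASTER_PORT": str(master_port),  # any free port, identical on all nodes
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_distributed(backend, rank, world_size, master_addr):
    """Initialize the process group with the given backend (hypothetical helper)."""
    # torch is imported lazily so rendezvous_env can be used without it.
    import torch.distributed as dist

    os.environ.update(rendezvous_env(rank, world_size, master_addr))
    if backend == "ccl":
        # Assumption: this import registers the "ccl" backend.
        import oneccl_bindings_for_pytorch  # noqa: F401
    dist.init_process_group(backend=backend, init_method="env://")
    return dist
```

With this sketch, node 0 would call `init_distributed("ccl", 0, 2, "10.0.0.1")` and node 1 would call `init_distributed("ccl", 1, 2, "10.0.0.1")`; swapping `"ccl"` for `"gloo"` gives the configuration reported above as working.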