Open jenniew opened 1 year ago
Hi @jenniew, does your workload run on CPU? Could you please try the "gloo" backend first and see whether it works? Thanks!
Yes, my workload runs on CPU. I tried the "gloo" backend, and it works.
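A minimal sketch of the backend switch being discussed; the actual demo.py is not included in this issue, so the snippet below is an assumption, not the real script:

```python
# Assumed snippet, not taken from the issue's demo.py.
# Under torch.distributed.launch the env:// rendezvous is the same for both
# backends; only the backend string changes.
import torch.distributed as dist
# import oneccl_bindings_for_pytorch  # needed to register the "ccl" backend

dist.init_process_group(backend="gloo", init_method="env://")   # works per this thread
# dist.init_process_group(backend="ccl", init_method="env://")  # hangs per this thread
```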
I'm trying to use torch.distributed.launch to launch multi-node training with oneCCL. On each node, I installed oneCCL and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh.

The command on the 1st node is:

```
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
```

The command on the 2nd node is:

```
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py
```
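demo.py itself is not shown in this issue; a minimal sketch of what it might contain for the ccl run, assuming oneccl_bindings_for_pytorch is importable so that the "ccl" backend gets registered:

```python
# Illustrative sketch only; the real demo.py from this issue is not shown.
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend

def main():
    # torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE, so env:// rendezvous reads them from the environment.
    dist.init_process_group(backend="ccl", init_method="env://")
    rank, world = dist.get_rank(), dist.get_world_size()

    # Small CPU all_reduce as a smoke test of cross-node communication.
    t = torch.ones(2) * rank
    dist.all_reduce(t)
    print(f"rank {rank}/{world}: all_reduce -> {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```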
But on both nodes, it hung after these messages:

```
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
```
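The CCL_WARN line means oneCCL did not find MPI-launcher environment variables and fell back to ATL/OFI. For comparison, the oneccl_bindings_for_pytorch examples typically launch multi-node jobs through an MPI launcher rather than torch.distributed.launch; a rough sketch of that style is below. The hostnames, port, and mpirun flags are assumptions, not values from this issue, and this is not a confirmed fix for the hang:

```
# Assumed Intel MPI style launch (run from one node); node1/node2 and the
# port are placeholders.
export MASTER_ADDR=172.168.0.201
export MASTER_PORT=29500
mpirun -n 2 -ppn 1 -hosts node1,node2 -genv CCL_WORKER_COUNT 1 python demo.py
```

With an MPI launcher, torch.distributed.launch is no longer setting RANK/WORLD_SIZE, so demo.py would have to derive them itself (e.g. from the PMI_RANK/PMI_SIZE variables set by the launcher) before calling init_process_group.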
I'm wondering how to use torch.distributed.launch to run multi-node training with oneCCL. Is there any specific setting that needs to be done?