goswamig opened this issue 6 years ago
I have the same issue; neither Gloo nor NCCL works. The error messages are:
For NCCL:
=> creating model 'resnet18'
NCCL version 2.3.5+cuda9.0
Traceback (most recent call last):
File "main.py", line 340, in
For Gloo:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what(): [enforce fail at /home/cchen01/src-pytorch-distributed/Pytorch/third_party/gloo/gloo/transport/ibverbs/pair.cc:462] wc->status == IBV_WC_SUCCESS. 12 vs 0. Memory region send for slot 0: transport retry counter exceeded
Aborted (core dumped)
Hi, thanks for the help! The code runs, but there is no communication/synchronization among the processes. Is anything missing in the commit?
I think I have changed it a bit; can you take a look again? https://github.com/gautamkmr/examples/blob/master/imagenet/DistributedTraing.md
@gautamkmr thank you for asking the question, because I have the same issue. I don't have any background in parallel or distributed computing, and I will be using a cluster (HPC) with Slurm (sbatch) for my research. Do you know if that is similar to your setup? I read your modified script, but is it possible to access individual nodes in the cluster? Is there a way to know their IP addresses and port numbers?
@curry111 Do you mean accessing a cluster node from the training code, or in general?
hi, thanks for help! this code is running, but no communication/synchronization among processes. Is there anything missing in the commit?
How did you find that there is no communication/synchronization among the processes?
I am facing the same issue. Is there any way to know the IP address of a node in the HPC cluster, either from the training code or in general, so that I can set the os.environ['MASTER_ADDR'] and os.environ['MASTER_PORT'] variables?
@samra-irshad I have used a cluster to run distributed training across two nodes. I used the following (distributed_init_method is env:// as described on the PyTorch website):

import torch
import torch.distributed

def setup_env(distributed_init_method, local_rank):
    assert torch.cuda.is_available(), 'cuda not available'
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
    torch.distributed.init_process_group(
        backend='nccl',
        init_method=distributed_init_method,
    )
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    torch.cuda.set_device(local_rank)
    return rank, world_size
This function should be called at the start of your main program. Now, how do you get the IP addresses on the cluster? You can do the following for PBS (the Slurm version would be a bit different):
IFS=$'\n' read -d '' -r -a lines < ${PBS_NODEFILE}   # read the allocated node names into an array
echo $lines

########## THINGS TO CHANGE #################
MASTER=$lines   # first node in the list acts as the master
RANK=0          # this file launches the rank-0 node
#############################################
MPORT="6010"

echo "node : ${CURRENT_NODE%%.*} nnode: ${NNODES} rank: $RANK portno: ${MPORT}" &
ssh -q $lines \
    $(bash ./run_bigbatch.sh ${NNODES} $RANK $MASTER ${MPORT})
###################################################
qstat -f ${PBS_JOBID}
Now make a second file exactly the same as the one above; just change MASTER to the $MASTER echoed by this file, set RANK=1 in that file, and run it. In run_bigbatch.sh, just use torch.distributed.launch --nprocs_per_node=2 (since I had two GPUs on one node) --nnodes=$1 --node_rank=$2 --master_addr=$3 --master_port=$4 main.py --my arguments
Thanks that was helpful :)
Actually, the argument name is "nproc_per_node" (without the 's').
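Putting that together, a minimal run_bigbatch.sh might look like the sketch below. This is only an illustration reconstructed from the description above: the 2-GPU-per-node count comes from the earlier post, and the script name and anything after main.py are placeholders.

#!/bin/bash
# run_bigbatch.sh: launch one training process per GPU on this node.
# Arguments: <nnodes> <node_rank> <master_addr> <master_port>
NNODES=$1
NODE_RANK=$2
MASTER_ADDR=$3
MASTER_PORT=$4

python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    main.py   # followed by your own training arguments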
Has anyone managed to run the Imagenet distributed example on SLURM using multiple nodes?
You can try the ImageNet training example [imagenet.py].
Please check the official tutorials for detailed distributed-training documentation. The following setup function supports both Slurm and torch.distributed.launch:
import os
import subprocess

import torch
import torch.distributed as dist


def setup_distributed(backend="nccl", port=None):
    """Initialize the distributed training environment.

    Supports both Slurm and torch.distributed.launch;
    see torch.distributed.init_process_group() for more details.
    """
    num_gpus = torch.cuda.device_count()

    if "SLURM_JOB_ID" in os.environ:
        # Launched via Slurm: derive rank/world size from the Slurm environment.
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        node_list = os.environ["SLURM_NODELIST"]
        addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
        # specify master port
        if port is not None:
            os.environ["MASTER_PORT"] = str(port)
        elif "MASTER_PORT" not in os.environ:
            os.environ["MASTER_PORT"] = "29566"
        if "MASTER_ADDR" not in os.environ:
            os.environ["MASTER_ADDR"] = addr
        os.environ["WORLD_SIZE"] = str(world_size)
        os.environ["LOCAL_RANK"] = str(rank % num_gpus)
        os.environ["RANK"] = str(rank)
    else:
        # Launched via torch.distributed.launch: RANK/WORLD_SIZE are already set.
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(rank % num_gpus)
    dist.init_process_group(
        backend=backend,
        world_size=world_size,
        rank=rank,
    )
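As a usage sketch to go with the function above (the job name, GPU/CPU counts, and srun invocation are my own assumptions for a 2-node, 8-GPU-per-node job, not part of the tutorial), a Slurm batch script only needs to start one task per GPU so that SLURM_PROCID, SLURM_NTASKS, and SLURM_NODELIST are available to setup_distributed():

#!/bin/bash
#SBATCH --job-name=imagenet-ddp
#SBATCH --nodes=2               # two nodes
#SBATCH --ntasks-per-node=8     # one task (process) per GPU
#SBATCH --gres=gpu:8            # 8 GPUs per node
#SBATCH --cpus-per-task=4

# srun starts 16 processes in total; each gets its own SLURM_PROCID,
# which setup_distributed() above converts into RANK/LOCAL_RANK.
srun python main.py

Here main.py stands for your training script, which would call setup_distributed() near the top, before building the model and wrapping it in torch.nn.parallel.DistributedDataParallel.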
The script mentioned in https://github.com/pytorch/examples/tree/master/imagenet provides a good guideline for single-node training; however, it doesn't have good documentation on distributed training across multiple nodes.
I tried to use two machines with 8 GPUs each, with the commands below.
Machine-1 script:
On machine-2:
However, it fails with the error below.