pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
https://pytorch.org/examples
BSD 3-Clause "New" or "Revised" License

How to run distributed training on multiple nodes on ImageNet using a ResNet model #431

Open goswamig opened 6 years ago

goswamig commented 6 years ago

The script mentioned in https://github.com/pytorch/examples/tree/master/imagenet provides a good guideline for single-node training; however, it doesn't have good documentation on distributed training across multiple nodes.

I tried using two machines with 8 GPUs each, with the commands below.

Machine-1 script

HOST_PORT="tcp://Machine-1-ip:13333"

NODE=0
RANKS_PER_NODE=8

for i in $(seq 0 7); do
  LOCAL_RANK=$i
  DISTRIBUTED_RANK=$((RANKS_PER_NODE * NODE + LOCAL_RANK))
  NCCL_DEBUG=INFO NCCL_MIN_NRINGS=5 python /home/ubuntu/examples/imagenet/main.py  \
       --a resnet18 \
       /home/ubuntu/mini_imagenet \
       --dist-url $HOST_PORT        \
       --gpu $DISTRIBUTED_RANK \
       --dist-backend nccl \
       --world-size  16 &
  PIDS[$LOCAL_RANK]=$!
done

On machine-2

HOST_PORT="tcp://Machine-1-ip:13333"

NODE=1
RANKS_PER_NODE=8

for i in $(seq 0 7); do
  LOCAL_RANK=$i
  DISTRIBUTED_RANK=$((RANKS_PER_NODE * NODE + LOCAL_RANK))
  NCCL_DEBUG=INFO NCCL_MIN_NRINGS=5 python /home/ubuntu/examples/imagenet/main.py  \
       --a resnet18 \
       /home/ubuntu/mini_imagenet \
       --dist-url $HOST_PORT        \
       --gpu $DISTRIBUTED_RANK \
       --dist-backend nccl \
       --world-size  16 &
  PIDS[$LOCAL_RANK]=$!
done

However, it fails with the error below:

Traceback (most recent call last):
  File "/home/ubuntu/examples/imagenet/main.py", line 347, in <module>
    main()
  File "/home/ubuntu/examples/imagenet/main.py", line 96, in main
    world_size=args.world_size)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/__init__.py", line 94, in init_process_group
    group_name, rank)
RuntimeError: the MPI backend is not available; try to recompile the THD package with MPI support at /opt/conda/conda-bld/pytorch_1532579245307/work/torch/lib/THD/process_group/General.cpp:17
cchen01 commented 6 years ago

I have the same issue; even gloo and nccl don't work either. The error messages are:

For NCCL:

=> creating model 'resnet18'
NCCL version 2.3.5+cuda9.0
Traceback (most recent call last):
  File "main.py", line 340, in <module>
    main()
  File "main.py", line 180, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 228, in train
    loss.backward()
  File "/home/cchen01/packages-pytorch-distributed/pytorch/lib/python3.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/cchen01/packages-pytorch-distributed/pytorch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/cchen01/packages-pytorch-distributed/pytorch/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 341, in reduction_fn_nccl
    group=self.nccl_reduction_group_id)
  File "/home/cchen01/packages-pytorch-distributed/pytorch/lib/python3.7/site-packages/torch/distributed/__init__.py", line 306, in all_reduce_multigpu
    return torch._C._dist_all_reduce_multigpu(tensor_list, op, group)
RuntimeError: NCCL error in: /home/cchen01/src-pytorch-distributed/Pytorch/torch/lib/THD/base/data_channels/DataChannelNccl.cpp:322, unhandled system error

For gloo:

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /home/cchen01/src-pytorch-distributed/Pytorch/third_party/gloo/gloo/transport/ibverbs/pair.cc:462] wc->status == IBV_WC_SUCCESS. 12 vs 0. Memory region send for slot 0: transport retry counter exceeded
Aborted (core dumped)

goswamig commented 6 years ago

Try this out https://github.com/gautamkmr/examples/blob/master/imagenet/DistributedTraing.md with the fix https://github.com/gautamkmr/examples/commit/4f030db0f5e53e1920690fb6a33b0b7156b53fc0

cchen01 commented 6 years ago

Hi, thanks for the help! The code is running, but there is no communication/synchronization among the processes. Is there anything missing in the commit?

goswamig commented 6 years ago

I think I have changed it a bit; can you take a look again? https://github.com/gautamkmr/examples/blob/master/imagenet/DistributedTraing.md

BasselAli1 commented 5 years ago

@gautamkmr thank you for asking the question, because I have the same issue. I don't have any knowledge of parallel or distributed computing, and I will be using a cluster computer (HPC) for my research, via Slurm (sbatch). Do you know if that is similar to your issue? I read your modified script, but is it possible to access individual nodes in the cluster? Is there a way to know their IP addresses and port numbers?

goswamig commented 5 years ago

@curry111 Do you mean accessing a cluster node from the training code, or in general?

nnop commented 4 years ago

> Hi, thanks for the help! The code is running, but there is no communication/synchronization among the processes. Is there anything missing in the commit?

How did you determine that there was no communication/synchronization among the processes?

samra-irshad commented 4 years ago

I am facing the same issue. Is there any way to know the IP address of a node in the HPC cluster, from the training code or in general, so that I can set the os.environ['MASTER_ADDR'] and os.environ['MASTER_PORT'] variables?

vishwajitvishnu commented 4 years ago

@samra-irshad I have used a cluster to run distributed training on two nodes. I used the following (distributed_init_method is env://, as described on the PyTorch website):

import torch

def setup_env(distributed_init_method, local_rank):
    assert torch.cuda.is_available(), 'cuda not available'
    torch.distributed.init_process_group(
        backend='nccl',
        init_method=distributed_init_method,
    )
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()

    print("rank:", rank, " local_rank:", local_rank, " world_size:", world_size)

    torch.cuda.set_device(local_rank)
    return rank, world_size

This function should be called at the start of your main program. Now, how do you get the IP addresses on the cluster? You can do the following for PBS (the Slurm code might be a bit different):

IFS=$'\n' read -d '' -r -a lines < ${PBS_NODEFILE}

echo $lines

########## THINGS TO CHANGE #################
# Check the node name echoed above, then set MASTER and RANK accordingly.
MASTER=$lines
# MASTER="gpu_"   # in the second file, set MASTER to the master node echoed by the first file
RANK=0
# RANK=1          # in the second file
#############################################

MPORT="6010"

echo "node : ${CURRENT_NODE%%.*} nnode: ${NNODES} rank: $RANK portno: ${MPORT}" &

ssh -q ${lines%%.*}
ssh -q $lines $(bash ./run_bigbatch.sh ${NNODES} $RANK $MASTER ${MPORT})

###################################################

At the end we just check the job stats:

qstat -f ${PBS_JOBID}

Now make a second file exactly the same as above; just change MASTER to the $MASTER echoed by this file, set RANK=1 in that file, and run it. In run_bigbatch.sh, just use torch.distributed.launch --nprocs_per_node=2 (since I had two GPUs on one node) --nnodes=$1 --node_rank=$2 --master_addr=$3 --master_port=$4 main.py --my arguments.

guhur commented 3 years ago

> In run_bigbatch.sh, just use torch.distributed.launch --nprocs_per_node=2 (since I had two GPUs on one node) --nnodes=$1 --node_rank=$2 --master_addr=$3 --master_port=$4 main.py --my arguments.

Thanks, that was helpful :)

Actually, the argument name is "nproc_per_node" (without the 's').
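Putting the last two comments together, a minimal run_bigbatch.sh could look like the sketch below. This is only an illustration under the assumptions stated above (two GPUs per node, positional arguments for nnodes/node_rank/master_addr/master_port, and a main.py training script); adapt the names and arguments to your own setup.

#!/bin/bash
# run_bigbatch.sh -- hypothetical sketch based on the comments above.
# Usage: ./run_bigbatch.sh <nnodes> <node_rank> <master_addr> <master_port>
NNODES=$1
NODE_RANK=$2
MASTER_ADDR=$3
MASTER_PORT=$4

# Launch one process per GPU; torch.distributed.launch fills in the
# rank/world-size environment for each spawned process (and, by default,
# also passes a --local_rank argument to main.py).
python -m torch.distributed.launch \
    --nproc_per_node=2 \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    main.py  # ... plus your own training arguments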

mesllo commented 2 years ago

Has anyone managed to run the Imagenet distributed example on SLURM using multiple nodes?

BIGBALLON commented 2 years ago

You can try the ImageNet training example [imagenet.py].

For more details, please check the tutorial for a full walkthrough of distributed training.

Core function:

import os
import subprocess

import torch
import torch.distributed as dist


def setup_distributed(backend="nccl", port=None):
    """Initialize the distributed training environment.

    Supports both SLURM and torch.distributed.launch;
    see torch.distributed.init_process_group() for more details.
    """
    num_gpus = torch.cuda.device_count()

    if "SLURM_JOB_ID" in os.environ:
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        node_list = os.environ["SLURM_NODELIST"]
        addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
        # specify master port
        if port is not None:
            os.environ["MASTER_PORT"] = str(port)
        elif "MASTER_PORT" not in os.environ:
            os.environ["MASTER_PORT"] = "29566"
        if "MASTER_ADDR" not in os.environ:
            os.environ["MASTER_ADDR"] = addr
        os.environ["WORLD_SIZE"] = str(world_size)
        os.environ["LOCAL_RANK"] = str(rank % num_gpus)
        os.environ["RANK"] = str(rank)
    else:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(rank % num_gpus)

    dist.init_process_group(
        backend=backend,
        world_size=world_size,
        rank=rank,
    )
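For the SLURM question above, a minimal sbatch script along these lines should work together with setup_distributed(). This is only a sketch: the resource numbers, the train.py name, and its arguments are placeholders, and train.py is assumed to call setup_distributed() at startup. srun starts one task per GPU, and SLURM_PROCID, SLURM_NTASKS, and SLURM_NODELIST are exactly the variables the function reads.

#!/bin/bash
#SBATCH --job-name=imagenet-ddp
#SBATCH --nodes=2                 # two nodes, as in the original question
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --gres=gpu:8

# srun launches nodes * ntasks-per-node processes; each one calls
# setup_distributed(), which derives RANK/WORLD_SIZE/MASTER_ADDR from the
# SLURM environment and picks the first host in the allocation as the master.
srun python train.py --arch resnet18 /path/to/imagenet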