open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Some questions and thoughts about TensorFlow and Open MPI's usnic BTL #7567

Keepmoving-ZXY opened this issue 4 years ago

Keepmoving-ZXY commented 4 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a source RPM package.

Please describe the system on which you are running


Details of the problem

These days I am trying to train TensorFlow with the help of Open MPI. I run TensorFlow distributed training on two GPU servers; each has two NVIDIA V100 GPUs and supports usNIC. Open MPI is responsible for the communication during distributed training, and Open MPI's usnic BTL is enabled, so I let Open MPI use the usnic BTL for the byte transport. When TensorFlow finishes training, I find the performance (how many images TensorFlow can process per second) is much lower than TensorFlow's original performance under the same circumstances. I am very confused by this result, given that Open MPI is a high-performance communication library, so I took a careful look at Open MPI's usnic BTL source code. I now have some questions about it, which I list below after explaining how I start distributed training, along with some thoughts about them.

Run method:

There are two GPU servers; each has two NVIDIA V100 GPUs and supports usNIC, and only one usNIC physical port is usable. Since usNIC does not support communication between two processes on the same GPU server, I have Open MPI launch two training processes, each running on its own GPU server, so each training process can use two GPUs. The launch script is:

#!/bin/bash
set -x

TRAIN_BATCHSIZE=128
ALLREDUCE_ALG='xring'
TRAIN_MODEL='resnet50'
VARIABLE_UPDATE='distributed_all_reduce'

run=${TF_BTL:-0}
JOB_LIST='10.0.0.1:1,10.0.0.2:1'

if [ "${run}" -eq 1 ];
then
  /opt/openmpi/4.0.2/bin/mpirun -np 2 --host ${JOB_LIST} \
    -mca pml ob1 \
    -mca btl_tcp_if_include eno5 \
    -map-by node -mca btl usnic,vader,self \
    --allow-run-as-root \
    -mca btl_base_verbose 100 \
    -x TF_MODULE=${TRAIN_MODEL} \
    -x TF_BATCHSIZE=${TRAIN_BATCHSIZE} \
    -x TF_ALLREDUCE_ALG=${ALLREDUCE_ALG} \
    -x TF_VARIABLE_UPDATE=${VARIABLE_UPDATE} \
    -x LD_LIBRARY_PATH=/opt/cisco/libfabric/lib \
    sh worker.sh
else
  /opt/openmpi/4.0.2/bin/mpirun -np 2 --host ${JOB_LIST} \
    -mca pml ob1 \
    -mca btl_tcp_if_include eno5 \
    -map-by node -mca btl tcp,vader,self \
    --allow-run-as-root \
    -mca btl_base_verbose 100 \
    -x TF_MODULE=${TRAIN_MODEL} \
    -x TF_BATCHSIZE=${TRAIN_BATCHSIZE} \
    -x TF_ALLREDUCE_ALG=${ALLREDUCE_ALG} \
    -x TF_VARIABLE_UPDATE=${VARIABLE_UPDATE} \
    -x LD_LIBRARY_PATH=/opt/cisco/libfabric/lib \
    sh worker.sh
fi

The content of worker.sh is long, so I have appended it to the end of this post.

Question:

Question 1:

I find that the usnic BTL scans all possible usNIC devices and creates a data channel and a priority channel for each. I think the data channel and priority channel use 2 QPs, which leaves 4 QPs unused (I set the Transmit Queue Count to 6 and the Receive Queue Count to 6 in CIMC). Combined with my run method (see above), this means some QPs are never used.

I think the fact that Open MPI does not make full use of all the QPs in my run case can lead to low tx and rx performance on the usNIC, and even to some rx drops. Is that right?
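
For reference, the usnic BTL's queue-related defaults can be inspected with ompi_info; this is only a sketch, and the exact parameter names and verbosity level may differ between Open MPI versions:

# List every MCA parameter of the usnic BTL, including the send/receive queue depths
/opt/openmpi/4.0.2/bin/ompi_info --param btl usnic --level 9

# Confirm which BTL components this installation was built with
/opt/openmpi/4.0.2/bin/ompi_info | grep "MCA btl"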

Question 2:

I read some posts about Open MPI which say that Open MPI has a component that supports communication between two processes on the same server without any NIC involved. So I have a new idea about how to run distributed training on my GPU servers. Although usNIC does not support communication between two processes on the same server, Open MPI's intra-node component can solve this problem, and I think Open MPI can figure out which processes are on the same server. So another way to run distributed training on my GPU servers would be: let Open MPI launch four processes (two on one GPU server and two on the other), with each process using one GPU. In this way, 4 QPs will be used on each server's usNIC device, and the rx and tx performance of the usNIC will increase. Is that right?

Thank you.

rhc54 commented 4 years ago

@jsquyres I believe this is up your alley as it involves usNIC.

jsquyres commented 4 years ago

These days I am trying to train TensorFlow with the help of Open MPI. I run TensorFlow distributed training on two GPU servers; each has two NVIDIA V100 GPUs and supports usNIC. Open MPI is responsible for the communication during distributed training, and Open MPI's usnic BTL is enabled, so I let Open MPI use the usnic BTL for the byte transport. When TensorFlow finishes training, I find the performance (how many images TensorFlow can process per second) is much lower than TensorFlow's original performance under the same circumstances. I am very confused by this result, given that Open MPI is a high-performance communication library, so I took a careful look at Open MPI's usnic BTL source code. I now have some questions about it, which I list below after explaining how I start distributed training, along with some thoughts about them.

Question about this: "when TensorFlow finishes training, I find the performance is much lower than TensorFlow's original performance under the same circumstances"

Can you shed a little more light on the two cases that you're comparing? Are both cases TensorFlow training -- one with MPI/usnic and one with TCP sockets (no MPI)?

There are two GPU servers; each has two NVIDIA V100 GPUs and supports usNIC, and only one usNIC physical port is usable.

Are the VICs in the same NUMA locality as the GPUs?
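
A quick way to check this on each server (a sketch; it assumes eno5 is an interface backed by the VIC, which may not be the usNIC-enabled vNIC on your system):

# Show the PCI/NUMA topology between GPUs and NICs
nvidia-smi topo -m

# NUMA node of the NIC's PCI device (-1 means no NUMA affinity is reported)
cat /sys/class/net/eno5/device/numa_node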

Since usNIC does not support communication between two processes on the same GPU server, I have Open MPI launch two training processes, each running on its own GPU server, so each training process can use two GPUs.

That's correct that usNIC doesn't handle server-loopback communication. But the vader BTL does -- it's shared memory communication, intended for server-loopback communication. Hence, if you run 2 MPI processes on each server (4 processes total) -- MPI_COMM_WORLD ranks 0 and 1 on server A and MCW ranks 2 and 3 on server B:

- MCW ranks 0 and 1 will communicate with each other via the vader BTL (shared memory)
- MCW ranks 2 and 3 will communicate with each other via the vader BTL (shared memory)
- MCW ranks 0 and 2 will communicate with each other via the usnic BTL

(and so on for the other combinations of MCW ranks)

Question 1:

I find that the usnic BTL scans all possible usNIC devices and creates a data channel and a priority channel for each.

Correct.

I think the data channel and priority channel use 2 QPs, which leaves 4 QPs unused (I set the Transmit Queue Count to 6 and the Receive Queue Count to 6 in CIMC). Combined with my run method (see above), this means some QPs are never used. I think the fact that Open MPI does not make full use of all the QPs in my run case can lead to low tx and rx performance on the usNIC, and even to some rx drops. Is that right?

For tx, one work queue is sufficient to achieve line rate (barring PCI bottlenecks).

For rx, Open MPI makes a fairly deep receive queue to be able to service incoming requests, even when Open MPI does not dip into the software layer to check for completions. Drops are always possible, of course, if Open MPI does not poll the usNIC BTL for lengthy periods of time (just like verbs/IB). You can use the MCA param btl_usnic_rd_num to override the default depth of the receive queue.

Adding more tx/rx contexts typically tends to increase small message latency, because then the software has to check more than one set of queues per polling iteration.
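
For example, the receive queue depth could be raised on the mpirun command line like this; the value 8192 is only illustrative, and the useful range depends on the VIC's queue configuration:

/opt/openmpi/4.0.2/bin/mpirun -np 2 --host 10.0.0.1:1,10.0.0.2:1 \
  -mca pml ob1 \
  -mca btl usnic,vader,self \
  -mca btl_usnic_rd_num 8192 \
  sh worker.sh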

I read some posts about Open MPI which say that Open MPI has a component that supports communication between two processes on the same server without any NIC involved.

Correct: it's the (poorly-named) vader component, which uses shared memory for on-server communication between MPI processes.
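
A quick way to sanity-check the shared-memory path is to run two ranks on a single server with only vader and self enabled (a sketch; <your_mpi_program> stands in for any MPI test program):

# Both ranks on one server: all MPI traffic stays in shared memory, no NIC involved
/opt/openmpi/4.0.2/bin/mpirun -np 2 --host 10.0.0.1:2 \
  -mca btl vader,self \
  -mca btl_base_verbose 100 \
  <your_mpi_program>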

So I have a new idea about how to run distributed training on my GPU servers. Although usNIC does not support communication between two processes on the same server, Open MPI's intra-node component can solve this problem, and I think Open MPI can figure out which processes are on the same server.

Correct.

So another way to run distributed training on my GPU servers would be: let Open MPI launch four processes (two on one GPU server and two on the other), with each process using one GPU. In this way, 4 QPs will be used on each server's usNIC device, and the rx and tx performance of the usNIC will increase. Is that right?

I would agree that this is a good configuration.

It'll do the communication pattern I listed earlier in this post. In short: shared memory will be used for on-server communication and usNIC will be used for off-server communication.
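
A sketch of what that launch could look like, adapted from the script above; the ppr:2:node mapping and the per-rank GPU selection are assumptions, not taken from the original script:

/opt/openmpi/4.0.2/bin/mpirun -np 4 --host 10.0.0.1:2,10.0.0.2:2 \
  -mca pml ob1 \
  --map-by ppr:2:node \
  -mca btl usnic,vader,self \
  --allow-run-as-root \
  -x LD_LIBRARY_PATH=/opt/cisco/libfabric/lib \
  sh worker.sh

# Inside worker.sh, each rank can then select its own GPU, e.g.:
#   export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}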

Keepmoving-ZXY commented 4 years ago

The GPUs used during training and the VIC are located in the same NUMA node; I have emailed the other information to you.