Keepmoving-ZXY opened this issue 4 years ago
@jsquyres I believe this is up your alley as it involves usNIC.
These days I am trying to train TensorFlow models with the help of Open MPI. I run TensorFlow distributed training on two GPU servers; each of them has two NVIDIA V100 GPUs and supports usNIC. Open MPI is responsible for the communication during distributed training, and Open MPI's usnic BTL is enabled, so I let Open MPI use the usnic BTL for the byte transport. When TensorFlow finishes training, I find the performance (how many images TensorFlow can process per second) is much lower than TensorFlow's original performance under the same circumstances. I am very confused by this result, because Open MPI is a high-performance communication library, so I took a careful look at Open MPI's usnic BTL source code. I now have some questions about it, which I will list after explaining how I start the distributed training, along with some thoughts about each question.
Question about this: "when TensorFlow finishes training, I find the performance is much lower than TensorFlow's original performance under the same circumstances"
Can you shed a little more light on the two cases that you're comparing? Are both cases TensorFlow training -- one with MPI/usnic and one with TCP sockets (no MPI)?
There are two GPU servers; each GPU server has two NVIDIA V100 GPUs and supports usNIC, and only one usNIC physical port can be used.
Are the VICs in the same NUMA locality as the GPUs?
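(If it helps to verify this, locality can usually be checked with hwloc and nvidia-smi; the interface name eth0 below is only a placeholder for the usNIC-backed interface.)

```sh
# hwloc shows the NUMA layout, including PCI devices (the VIC and the V100s)
lstopo

# nvidia-smi shows each GPU's CPU and NUMA affinity
nvidia-smi topo -m

# sysfs reports which NUMA node a given interface's device sits on
cat /sys/class/net/eth0/device/numa_node
```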
Considering that usNIC does not support communication between two processes on the same GPU server, I let Open MPI launch two training processes, each running on its own GPU server, so that each training process can use two GPUs.
That's correct that usNIC doesn't handle server-loopback communication. But the vader BTL does -- it's shared memory communication, intended for server-loopback communication. Hence, if you run 2 MPI processes on each server (4 processes total) -- MPI_COMM_WORLD ranks 0 and 1 on server A and MCW ranks 2 and 3 on server B:
- MCW ranks 0 and 1 communicate over shared memory (vader)
- MCW ranks 0 and 2 communicate over usNIC
(and so on for the other combinations of MCW ranks)
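A minimal launch sketch for that layout, assuming the usnic, vader, and self BTLs are all built, and using placeholder host names serverA/serverB:

```sh
# 4 ranks total, 2 per server; on-server pairs use vader (shared memory),
# off-server pairs use usNIC
mpirun -np 4 --host serverA:2,serverB:2 \
       --mca btl usnic,vader,self \
       ./worker.sh
```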
Question 1: I find that the usnic BTL will scan all possible usNIC devices and create a data channel and a priority channel for each.
Correct.
I think the data channel and the priority channel will use 2 QPs, so 4 QPs remain unused (I set Transmit Queue Count to 6 and Receive Queue Count to 6 in CIMC). Combined with my run method (see above), this leaves some QPs unused. I think the fact that Open MPI does not make full use of all QPs in my case can lead to low tx and rx performance on the usNIC, and even some rx drops. Is that right?
For tx, one work queue is sufficient to achieve line rate (barring PCI bottlenecks).
For rx, Open MPI makes a fairly deep receive queue to be able to service incoming requests, even when Open MPI does not dip into the software layer to check for completions. Drops are always possible, of course, if Open MPI does not poll the usNIC BTL for lengthy periods of time (just like verbs/IB). You can use the MCA param btl_usnic_rd_num
to override the default depth of the receive queue.
Adding more tx/rx contexts typically tends to increase small message latency, because then the software has to check more than one set of queues per polling iteration.
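For reference, one way to inspect and override that parameter; the value 4096 below is only an illustrative number, not a recommendation:

```sh
# Show the current default and description of the usNIC receive queue depth
ompi_info --param btl usnic --level 9 | grep rd_num

# Override it at launch time (4096 is just an example value)
mpirun --mca btl usnic,vader,self --mca btl_usnic_rd_num 4096 ...
```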
I read some posts about Open MPI, and they say that Open MPI has a component that supports communication between two processes on the same server without any NIC involved.
Correct: it's the (poorly-named) vader
component, which uses shared memory for on-server communication between MPI processes.
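If you want to double-check that the component is available in your installation (an optional sanity check):

```sh
# List the BTL components built into this Open MPI installation;
# both vader and usnic should appear in the output
ompi_info | grep "MCA btl"
```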
So I have a new idea about how to run distributed training on my GPU servers. Although usNIC does not support communication between two processes on the same server, Open MPI's intra-node component can solve this problem, and I think Open MPI can figure out which processes are on the same server.
Correct.
So I think another way to run distributed training on my GPU servers is: let Open MPI launch four processes (two on one GPU server, and two on the other), with each process using one GPU. In this way, 4 QPs will be used on each server's usNIC device, and the rx and tx performance of usNIC should improve. Is that right?
I would agree that this is a good configuration.
It'll do the communication pattern I listed earlier in this post. In short: shared memory will be used for on-server communication and usNIC will be used for off-server communication.
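As a sketch of what that could look like on the command line -- the host names, the --bind-to choice, and the gpu_worker.sh wrapper are assumptions for illustration, not part of this issue:

```sh
# 4 ranks total, 2 per server, each bound to a NUMA domain; vader handles
# on-server traffic, usNIC handles off-server traffic
mpirun -np 4 --host serverA:2,serverB:2 \
       --bind-to numa \
       --mca btl usnic,vader,self \
       ./gpu_worker.sh
```

```sh
#!/bin/bash
# gpu_worker.sh -- hypothetical wrapper: give each local rank its own V100,
# then run the real training script
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec ./worker.sh "$@"
```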
The GPUs used during training and the VIC are located in the same NUMA node; I have emailed the other information to you.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a source RPM package.
Please describe the system on which you are running
Details of the problem
These days I am trying to train TensorFlow models with the help of Open MPI. I run TensorFlow distributed training on two GPU servers; each of them has two NVIDIA V100 GPUs and supports usNIC. Open MPI is responsible for the communication during distributed training, and Open MPI's usnic BTL is enabled, so I let Open MPI use the usnic BTL for the byte transport. When TensorFlow finishes training, I find the performance (how many images TensorFlow can process per second) is much lower than TensorFlow's original performance under the same circumstances. I am very confused by this result, because Open MPI is a high-performance communication library, so I took a careful look at Open MPI's usnic BTL source code. I now have some questions about it, which I will list after explaining how I start the distributed training, along with some thoughts about each question.

Run method:
There are two GPU servers; each GPU server has two NVIDIA V100 GPUs and supports usNIC, and only one usNIC physical port can be used. Considering that usNIC does not support communication between two processes on the same GPU server, I let Open MPI launch two training processes, each running on its own GPU server, so that each training process can use two GPUs. The launch script is: ... (the content of worker.sh is long, so I append it to the end of this post).

Questions:
Question 1: I find that the usnic BTL will scan all possible usNIC devices and create a data channel and a priority channel for each. I think the data channel and the priority channel will use 2 QPs, so 4 QPs remain unused (I set Transmit Queue Count to 6 and Receive Queue Count to 6 in CIMC). Combined with my run method (see above), this leaves some QPs unused. I think the fact that Open MPI does not make full use of all QPs in my case can lead to low tx and rx performance on the usNIC, and even some rx drops. Is that right?
Question 2: I read some posts about Open MPI, and they say that Open MPI has a component that supports communication between two processes on the same server without any NIC involved. So I have a new idea about how to run distributed training on my GPU servers. Although usNIC does not support communication between two processes on the same server, Open MPI's intra-node component can solve this problem, and I think Open MPI can figure out which processes are on the same server. So I think another way to run distributed training on my GPU servers is: let Open MPI launch four processes (two on one GPU server, and two on the other), with each process using one GPU. In this way, 4 QPs will be used on each server's usNIC device, and the rx and tx performance of usNIC should improve. Is that right?

Thank you.