open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Anomalous OpenMPI performance with ConnectX-5 Ex NICs #11616

Closed planetA closed 1 year ago

planetA commented 1 year ago

Hi,

I am running tests on a system with two 36-core nodes with 100G ConnectX-5 Ex NICs, and I observe what I consider to be a performance anomaly. This is a RoCE NIC, so I can use it in either TCP or RoCE mode.

Specifically, I run the FT.D benchmark from the NPB suite and tell OpenMPI (or rather UCX) to use the NIC either as a TCP NIC or as a RoCE NIC. The surprising thing is that in many cases TCP is faster than RoCE. Not by much (92 vs 99 seconds runtime), but my understanding is that this is not supposed to happen at all.

If I look at microbenchmarks (rdma-core, UCX, OSU MPI), the performance is what I would expect: RoCE is much faster.

Here are the command lines to run the benchmarks:

/opt/openmpi-4.1.4/bin/mpirun --mca coll_hcoll_enable 0 --map-by node -np 64 --hostfile ~/hostfile -x UCX_TLS=rc,self -x UCX_NET_DEVICES=mlx5_2:1  ~/NPB3.4.2/NPB3.4-MPI/bin/ft.D.x
/opt/openmpi-4.1.4/bin/mpirun --mca coll_hcoll_enable 0 --map-by node -np 64 --hostfile ~/hostfile -x UCX_TLS=tcp,self -x UCX_NET_DEVICES=ens800f0  ~/NPB3.4.2/NPB3.4-MPI/bin/ft.D.x

In this case I run OpenMPI 4.1.4, but I tried several different versions of OpenMPI and UCX, with and without OFED, and even different OSes (Ubuntu vs RHEL-based). The behavior is consistent: the exact numbers may change a bit, but TCP stays faster.

What am I missing? How can I debug this? Or is this the expected behavior?

I would be glad for any suggestions.

planetA commented 1 year ago

It seems the effect happens because the NPB benchmark does a lot of large-message communication, and with only two nodes a lot of that traffic stays within the same node. With TCP, this intra-node traffic goes over the loopback device, which is faster than loopback communication over the IB device.

In short, this seems to be an artifact of the measurement setup rather than an inadvertent misconfiguration.
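
To put a rough number on how much traffic stays on one node, here is a minimal sketch that counts same-node and remote-node peers for a single rank. It assumes the 64 ranks are placed round-robin across the two nodes (which is what `--map-by node` should do when each node has enough slots) and that the benchmark's communication is an all-to-all exchange with equally sized messages. The function name `peer_split` is just for illustration.

```python
def peer_split(rank: int, np_ranks: int, nodes: int) -> tuple[int, int]:
    """Return (same_node_peers, remote_node_peers) for one rank.

    Assumption: round-robin placement, i.e. rank r runs on node r % nodes.
    """
    same = sum(
        1
        for peer in range(np_ranks)
        if peer != rank and peer % nodes == rank % nodes
    )
    remote = np_ranks - 1 - same
    return same, remote


# Values taken from the mpirun command lines above: -np 64 on two nodes.
same, remote = peer_split(rank=0, np_ranks=64, nodes=2)
print(f"same-node peers: {same}, remote-node peers: {remote}")
# Share of an equal-sized all-to-all exchange that never leaves the node.
print(f"same-node share: {same / (same + remote):.0%}")
```

Under these assumptions each rank has 31 same-node peers and 32 remote ones, so about half of the exchanged data never leaves the node. With `UCX_TLS=rc,self` there is no shared-memory transport in the list, so that half has to loop back through the NIC.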