It seems the effect happens because the NPB benchmark does a lot of large-message communication, and with only two nodes a large share of that traffic stays between ranks on the same node. With TCP this intra-node traffic goes over the kernel loopback device, which is faster than loopback communication through the IB device.
In short, this seems to be an artifact of the measurement setup rather than an inadvertent misconfiguration.
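A sketch of one way to check this, assuming the UCX transport names rc_x/sm/self, a hostfile named hosts, a 64-rank run, and a binary named ft.D.x: keep shared memory in the UCX transport list so intra-node messages never touch the NIC, then compare the two inter-node transports again.

```
# Intra-node traffic over shared memory, inter-node traffic over RoCE
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_TLS=rc_x,sm,self ./ft.D.x

# Intra-node traffic over shared memory, inter-node traffic over TCP
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_TLS=tcp,sm,self ./ft.D.x
```

If RoCE comes out ahead once intra-node messages no longer loop back through the HCA, the explanation above holds.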
Hi,
I am running tests on a system with two 36-core nodes with 100G ConnectX-5 Ex NICs, and I observe what I consider to be a performance anomaly. This is a RoCE NIC, so I can use it in both TCP and RoCE modes.
Specifically, I run the FT.D benchmark from the NPB suite and tell OpenMPI (or rather UCX) to use the NIC either as a TCP NIC or as a RoCE NIC. The surprising thing is that in many cases TCP is faster than RoCE. Not by much (92 vs 99 seconds runtime), but my understanding is that this is not supposed to happen at all.
If I look at microbenchmarks (rdma-core, UCX, OSU MPI), the performance is what I would expect: RoCE is much faster.
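For reference, the microbenchmark runs are along the lines of the sketch below; the device name mlx5_0, the GID index, the hostnames, and the binary paths are placeholders for my actual setup.

```
# rdma-core / libibverbs ping-pong (GID index for RoCE may differ)
ibv_rc_pingpong -d mlx5_0 -g 3              # on node1
ibv_rc_pingpong -d mlx5_0 -g 3 node1        # on node2

# UCX tag-matching bandwidth between the two nodes
ucx_perftest -t tag_bw                      # on node1 (server)
ucx_perftest node1 -t tag_bw                # on node2 (client)

# OSU point-to-point bandwidth, one rank per node
# (swap UCX_TLS=rc_x,self for UCX_TLS=tcp,self to test the TCP path)
mpirun -np 2 --hostfile hosts --map-by node \
    --mca pml ucx -x UCX_TLS=rc_x,self ./osu_bw
```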
Here are the command lines I use to run the NPB benchmark, once over RoCE and once over TCP (the binary name, hostfile, rank count, and device/interface names below are simplified placeholders):
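```
# RoCE: point UCX at the ConnectX-5 verbs device and transports
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x,self ./ft.D.x

# TCP: use the same NIC as a plain Ethernet interface
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_NET_DEVICES=ens1f0 -x UCX_TLS=tcp,self ./ft.D.x
```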
In this case I run OpenMPI 4.1.4, but I have tried several different versions of OpenMPI, UCX, OFED/no-OFED, and even different OSes (Ubuntu vs RHEL-based). The behavior seems consistent: the exact numbers may change a bit, but TCP being faster is quite consistent.
What am I missing? How can I debug this? Or is this actually the expected behavior?
I would be glad for any suggestions.