It seems the effect happens because the NPB benchmark does a lot of large-message communication, and with only two nodes a large share of that traffic stays between ranks on the same node. With TCP this intra-node traffic goes over the kernel loopback device, which is faster than loopback communication through the IB device.
In short, this seems to be an artifact of the measurement setup rather than an inadvertent misconfiguration.
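A sketch of one way to check this, assuming the UCX transport names rc_x/sm/self, a hostfile named hosts, a 64-rank run, and a binary named ft.D.x: keep shared memory in the UCX transport list so intra-node messages never touch the NIC, then compare the two inter-node transports again.

```
# Intra-node traffic over shared memory, inter-node traffic over RoCE
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_TLS=rc_x,sm,self ./ft.D.x

# Intra-node traffic over shared memory, inter-node traffic over TCP
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_TLS=tcp,sm,self ./ft.D.x
```

If RoCE comes out ahead once intra-node messages no longer loop back through the HCA, the explanation above holds.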
Hi,
I am running tests on a system with two 36-core nodes with 100G ConnectX-5 Ex NICs, and I observe what I consider to be a performance anomaly. This is a RoCE NIC, so I can use it in both TCP and RoCE modes.
Specifically, I run the FT.D benchmark from the NPB suite and tell OpenMPI (or rather UCX) to use the NIC either as a TCP NIC or as a RoCE NIC. The surprising thing is that in many cases TCP is faster than RoCE. Not by much (92 vs 99 seconds runtime), but my understanding is that this is not supposed to happen at all.
If I look at microbenchmarks (rdma-core, UCX, OSU MPI), the performance is what I would expect: RoCE is much faster.
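For reference, the microbenchmark runs are along the lines of the sketch below; the device name mlx5_0, the GID index, the hostnames, and the binary paths are placeholders for my actual setup.

```
# rdma-core / libibverbs ping-pong (GID index for RoCE may differ)
ibv_rc_pingpong -d mlx5_0 -g 3              # on node1
ibv_rc_pingpong -d mlx5_0 -g 3 node1        # on node2

# UCX tag-matching bandwidth between the two nodes
ucx_perftest -t tag_bw                      # on node1 (server)
ucx_perftest node1 -t tag_bw                # on node2 (client)

# OSU point-to-point bandwidth, one rank per node
# (swap UCX_TLS=rc_x,self for UCX_TLS=tcp,self to test the TCP path)
mpirun -np 2 --hostfile hosts --map-by node \
    --mca pml ucx -x UCX_TLS=rc_x,self ./osu_bw
```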
Here are the command lines I use to run the NPB benchmark, once over RoCE and once over TCP (the binary name, hostfile, rank count, and device/interface names below are simplified placeholders):
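```
# RoCE: point UCX at the ConnectX-5 verbs device and transports
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_x,self ./ft.D.x

# TCP: use the same NIC as a plain Ethernet interface
mpirun -np 64 --hostfile hosts --mca pml ucx \
    -x UCX_NET_DEVICES=ens1f0 -x UCX_TLS=tcp,self ./ft.D.x
```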
In this case I run OpenMPI 4.1.4, but I have tried several different versions of OpenMPI, UCX, OFED/no-OFED, and even different OSes (Ubuntu vs RHEL-based). The behavior seems consistent: the exact numbers may change a bit, but TCP being faster is quite consistent.
What am I missing? How can I debug this? Or is this actually the expected behavior?
I would be glad for any suggestions.