open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_TYPE_INDEXED + MPI_SEND/RECV slow with older infiniband network? #12209

Open chhu opened 6 months ago

chhu commented 6 months ago

Related to #12202 but without CUDA. On our shared-memory system (2x EPYC) MPI_TYPE_INDEXED performs as expected, but as soon as our 40 Gbit/s InfiniBand network gets involved, performance drops by a factor of 2-5. This does not happen with the same Open MPI build when using linear buffers (plain arrays).

The raw bandwidth and latency of the IB link themselves are high and behave as expected.

I do not see this behavior on our big HPC system with 100G IB, even with the same Open MPI version. Is there something I can tune? How does Open MPI transmit indexed types: one request per block, or does it gather into a linear buffer first? A minimal sketch of the exchange pattern in question is below.
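(Simplified sketch of the pattern, not our actual code; the block lengths and displacements are made up for illustration.)

```c
/* Exchange a strided subset of a double array via MPI_Type_indexed +
 * MPI_Send/MPI_Recv -- the pattern discussed in this issue.
 * Block lengths and displacements are hypothetical. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { NBLOCKS = 1000 };
    int blocklens[NBLOCKS], displs[NBLOCKS];
    for (int i = 0; i < NBLOCKS; i++) {
        blocklens[i] = 8;        /* 8 doubles per block            */
        displs[i]    = 16 * i;   /* blocks spaced 16 doubles apart */
    }

    MPI_Datatype indexed;
    MPI_Type_indexed(NBLOCKS, blocklens, displs, MPI_DOUBLE, &indexed);
    MPI_Type_commit(&indexed);

    double buf[16 * NBLOCKS];
    for (int i = 0; i < 16 * NBLOCKS; i++) buf[i] = (double)i;

    if (rank == 0)
        MPI_Send(buf, 1, indexed, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1, indexed, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&indexed);
    MPI_Finalize();
    return 0;
}
```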

Thanks!

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Tested on the latest 3.1, 4.1, and 5.1 releases.

Please describe the system on which you are running

See #12202

brminich commented 5 months ago

Is the performance impact of using MPI_TYPE_INDEXED on the 100G IB HPC system negligible, or just smaller than on the 40G system? I'd expect it to be noticeable on any system, as UCX does not use certain protocols when the data is not contiguous.

chhu commented 5 months ago

The only thing I can say is that on the 100G IB system MPI_TYPE_INDEXED has no notable impact, while on the 40G system it has a major one. Are you suggesting one should avoid non-contiguous data exchange?

brminich commented 5 months ago

Yes, using non-contiguous data may imply some limitations on the MPI/UCX/network protocols.

chhu commented 5 months ago

Hmm, maybe it would be a nice feature to linearize the data into a temporary buffer before the exchange? Maybe let the user control this via a threshold setting? Something along the lines of the sketch below.
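(A rough sketch of doing this by hand on the application side with the standard MPI_Pack/MPI_Unpack calls; the threshold name and value are made up, not an existing Open MPI or UCX tunable.)

```c
/* Manually linearize an indexed datatype into a contiguous scratch
 * buffer above a size threshold and send that instead of the
 * non-contiguous type. Threshold is an arbitrary illustration value. */
#include <mpi.h>
#include <stdlib.h>

#define PACK_THRESHOLD_BYTES (64 * 1024)  /* hypothetical cutoff */

static void send_maybe_packed(const void *buf, MPI_Datatype indexed,
                              int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(indexed, &type_size);   /* payload bytes in the type */

    if (type_size < PACK_THRESHOLD_BYTES) {
        /* Small message: let the library handle the non-contiguous type. */
        MPI_Send(buf, 1, indexed, dest, tag, comm);
        return;
    }

    /* Large message: pack into a contiguous buffer first. */
    int pack_size, position = 0;
    MPI_Pack_size(1, indexed, comm, &pack_size);
    char *scratch = malloc(pack_size);
    MPI_Pack(buf, 1, indexed, scratch, pack_size, &position, comm);
    MPI_Send(scratch, position, MPI_PACKED, dest, tag, comm);
    free(scratch);
}
```

On the receive side the peer would receive into a scratch buffer as MPI_PACKED and MPI_Unpack into the indexed type; whether this actually beats letting the library handle the datatype is exactly what such a threshold would have to decide.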