openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Extremely low collective performance on UCX IB but not PSM3 when using HDR HCA / FDR switch combinations #8200

Open AlexNinaber opened 2 years ago

AlexNinaber commented 2 years ago

Describe the bug

While investigating poor StarCCM application performance, we ran IMB-MPI1 AlltoAll and noticed extremely low performance with UCX communication. This happens with both OpenMPI and IntelMPI, with any version of UCX, and with both the native RH7 IB stack and Mellanox OFED.

What's special is that the hosts have either EDR or HDR cards and the switch is FDR. We tried various cables (copper and fibre) and other FDR switches, with mixed results. All firmware is up to date. However, on kernel 5.17.5 with IntelMPI and PSM3 communication, performance is fine: latency is a bit worse but not by much, and at 8K it's what is expected.

We're a bit confused about what can be causing this.

Steps to Reproduce

mpirun -np 64 -ppn 32 -host n1,n2 ./IMB-MPI1 AlltoAll

Transport communication: MLX

Tried various UCX transports, all same problem.

Once the above hits all 64 processes involved, progress is jumpy and in general very slow.

Any version

Setup and versions

Additional information (depending on the issue)

[1651843038.963228] [burn1-006:62839:0] mpool.c:206 UCX DEBUG mpool pending-ops: allocated chunk 0xbab0f0 of 8217 bytes with 128 elements
[1651843038.963228] [burn1-006:62840:0] mpool.c:206 UCX DEBUG mpool pending-ops: allocated chunk 0xbab0f0 of 8217 bytes with 128 elements
[1651843038.963227] [burn1-006:62841:0] mpool.c:206 UCX DEBUG mpool pending-ops: allocated chunk 0xbab0f0 of 8217 bytes with 128 elements
[1651843038.963227] [burn1-006:62842:0] mpool.c:206 UCX DEBUG mpool pending-ops: allocated chunk 0xbab0f0 of 8217 bytes with 128 elements

yosefe commented 2 years ago

@AlexNinaber can you please try adding -x UCX_RC_RETRY_COUNT=0 to mpirun? If this causes the application to crash, it would mean the problem is packet drops and retransmissions with mixed port speeds.
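Note that -x VAR=VALUE is Open MPI's way of exporting a variable, while the -ppn flag in the reproducer suggests a Hydra-style launcher such as Intel MPI, where the equivalent is -genv VAR VALUE. A minimal sketch of that mapping follows; the helper name ucx_env_flag is hypothetical, and the last line only prints the resulting command instead of launching it.

```shell
#!/bin/sh
# Sketch (hypothetical helper): emit the launcher flag that exports one
# UCX environment variable to all ranks.
#   openmpi -> -x VAR=VALUE
#   intel   -> -genv VAR VALUE
ucx_env_flag() {
    flavour=$1
    var=$2
    value=$3
    case "$flavour" in
        openmpi) echo "-x $var=$value" ;;
        intel)   echo "-genv $var $value" ;;
        *)       echo "unknown launcher: $flavour" >&2; return 1 ;;
    esac
}

# Dry run: print the reproducer from the report with the flag added.
echo "mpirun -np 64 -ppn 32 -host n1,n2 $(ucx_env_flag intel UCX_RC_RETRY_COUNT 0) ./IMB-MPI1 AlltoAll"
```

With Open MPI the process-placement options differ from the reproducer above, so only the environment flag is generated here.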

AlexNinaber commented 2 years ago

@yosefe tried with UCX_RC_RETRY_COUNT, no change. By the way, the ports do not run at mixed speeds: IB just negotiates down to the lowest common speed, which is FDR in this case.
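One way to confirm the negotiated speed on each host is to read the Rate field that ibstat (from infiniband-diags) prints per port; a 4x FDR link is expected to report 56, versus 100 for EDR and 200 for HDR. A minimal sketch, assuming the usual "Rate: N" line format; the function name link_rates is hypothetical.

```shell
#!/bin/sh
# Sketch (hypothetical helper): read `ibstat` output on stdin and print
# the negotiated rate in Gb/s for every port, one value per line.
# Assumes each port section contains a line of the form "Rate: 56".
link_rates() {
    awk '$1 == "Rate:" { print $2 }'
}

# On a compute node:
#   ibstat | link_rates
```

If any port on an HDR or EDR card shows something other than 56 behind the FDR switch, the link did not train to the expected speed.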

yosefe commented 2 years ago

@AlexNinaber thanks for the feedback. The next step would be to take it up with NVIDIA networking support, since such a large delay is likely coming from the network level. The issue may still be related to HDR/EDR being downgraded to FDR.

AlexNinaber commented 2 years ago

@yosefe We figured it comes from the network level, however, the (unexpected) weirdness is that with PSM3 we see no such problem. Hence we were wondering if there's anything UCX can be doing weird on an EDR/HDR card if run in FDR mode.

yosefe commented 2 years ago

@AlexNinaber perhaps PSM3 is using a different transport such as UD. Can you try running UCX with "UCX_TLS=ud,self,sm" ?
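Before forcing UD, it may help to check which transports UCX actually detects on the node. The sketch below filters the output of ucx_info -d, assuming its per-device lines have the form "#      Transport: ud_verbs"; the function name list_transports is hypothetical.

```shell
#!/bin/sh
# Sketch (hypothetical helper): read `ucx_info -d` output on stdin and
# print the unique transport names, one per line.
# Assumes lines of the form "#      Transport: <name>".
list_transports() {
    awk '/Transport:/ { print $NF }' | sort -u
}

# On a compute node:
#   ucx_info -d | list_transports
```

If names starting with ud (for example ud_verbs or ud_mlx5) appear in the list, the variable can be passed to the reproducer with -x UCX_TLS=ud,self,sm under Open MPI or -genv UCX_TLS ud,self,sm under Intel MPI.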