openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Unexpectedly low MPI bandwidth between ConnectX-6 HCAs on two multi-socket Dell R7525 servers #7449

Closed. lfmeadow closed this issue 3 years ago

lfmeadow commented 3 years ago

Describe the bug

We have two 2-socket Dell servers, each with two AMD EPYC 7402 (24-core) processors and two ConnectX-6 cards. The HCAs sit on separate PCIe buses, one per processor socket, and each HCA is connected to the same IB switch. I ran an MPI PingPong (from the Intel IMB benchmarks) in all 4 combinations of mlx5_0:1 and mlx5_1:1 on the two servers, using UCX_NET_DEVICES to select the device and binding each MPI rank to the corresponding socket, e.g. for the case where both nodes use mlx5_0 (a cross-device sketch follows the command below):

ed-dlgpu-168c:src_c$ cat rankfile00
rank 0=ed-dlgpu-168c slot=0:0-23
rank 1=ed-dlgpu-1bb0 slot=0:0-23

mpirun --prefix /home/larry/sycl-with-cuda/ompi_install \
-rf rankfile00 --report-bindings \
-np 1 -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 -msglog 18:19 PingPong : \
-np 1 -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 -msglog 18:19 PingPong
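
For the cross-device cases (e.g. mlx5_0 on node 0 talking to mlx5_1 on node 1), a launch along the following lines would select a different device per rank and bind rank 1 to socket 1. This is a sketch, not the exact command from the run: rankfile01 is a hypothetical name, and it assumes mlx5_1 is attached to socket 1 as described above.

ed-dlgpu-168c:src_c$ cat rankfile01
rank 0=ed-dlgpu-168c slot=0:0-23
rank 1=ed-dlgpu-1bb0 slot=1:0-23

mpirun --prefix /home/larry/sycl-with-cuda/ompi_install \
 -rf rankfile01 --report-bindings \
 -np 1 -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 -msglog 18:19 PingPong : \
 -np 1 -x UCX_NET_DEVICES=mlx5_1:1 ./IMB-MPI1 -msglog 18:19 PingPong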

Here are two bandwidth tables, one for 256 KiB and one for 512 KiB (rows: device on node 0, columns: device on node 1):

256 KiB, Mbytes/sec
              mlx5_0     mlx5_1
mlx5_0       7225.23    8384.99
mlx5_1       8393.62   12001.65

512 KiB, Mbytes/sec
              mlx5_0     mlx5_1
mlx5_0       timeout    8402.31
mlx5_1       8439.10   15255.85

Since the cards are all connected to a switch and the MPI ranks are bound to the closest socket, I would expect all the bandwidths to be about the same. It seems like there are two problems:

  1. Low bandwidth when cards on different sockets communicate
  2. Low bandwidth, with a cliff above 256 KiB, when using card 0 on both nodes.

Perhaps this is some configuration problem with card 0.


yosefe commented 3 years ago

Can you pls try the following:

  1. Run the ib_read_bw benchmark on mlx5_0 (both sides) with a 512 KiB message size and report the result. It should look something like:
     Server: taskset -c 0-23 ib_read_bw -s $((512*1024)) -D 5 -d mlx5_0
     Client: taskset -c 0-23 ib_read_bw -s $((512*1024)) -D 5 -d mlx5_0 ed-dlgpu-168c

  2. Run IMB PingPong with the UCX_LOG_LEVEL=info environment variable and post the output.

  3. Run IMB PingPong with UCX_TLS=rc UCX_RNDV_THRESH=1k UCX_MAX_RNDV_RAILS=1 UCX_RNDV_SCHEME=get_zcopy to see if it improves the result (a combined sketch follows this list).

  4. Try binding each MPI rank to one specific core (0 instead of 0-23, 24 instead of 24-47) to see if that improves things.

  5. Test the osu_bw benchmark, since it uses a window of 64 outstanding send operations.
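
A minimal sketch of how suggestions 2 and 3 might be applied to the launch line from the original report. The rankfile, prefix, and binary are taken from that report; folding both suggestions into a single run is purely illustrative, and they could just as well be separate experiments:

mpirun --prefix /home/larry/sycl-with-cuda/ompi_install \
 -rf rankfile00 --report-bindings \
 -np 1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=info -x UCX_TLS=rc \
 -x UCX_RNDV_THRESH=1k -x UCX_MAX_RNDV_RAILS=1 -x UCX_RNDV_SCHEME=get_zcopy \
 ./IMB-MPI1 -msglog 18:19 PingPong : \
 -np 1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=info -x UCX_TLS=rc \
 -x UCX_RNDV_THRESH=1k -x UCX_MAX_RNDV_RAILS=1 -x UCX_RNDV_SCHEME=get_zcopy \
 ./IMB-MPI1 -msglog 18:19 PingPong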

lfmeadow commented 3 years ago

Will do. However, now I'm thinking it may be a firmware issue. Card 0 has a PSID DEL0000000010 and the firmware tools won't let me update the firmware. So our system guy is talking to Dell. I'll keep you informed. Thanks for the quick response.
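
For reference, one quick way to confirm a card's PSID and current firmware level is an mstflint query; the PCI address below is a placeholder, not taken from this system:

# Replace 81:00.0 with the ConnectX-6 PCI address reported by lspci;
# the output includes the FW Version and PSID fields
mstflint -d 81:00.0 query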

lfmeadow commented 3 years ago

Upgrading to FW 20.31.1014 made the problem go away. Sorry for the noise.