openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Unexpectedly low MPI bandwidth between ConnectX-6 HCAs on two multi-socket Dell R7525 servers #7449

Closed. lfmeadow closed this issue 3 years ago

lfmeadow commented 3 years ago

Describe the bug

We have two 2-socket Dell servers, each with two AMD EPYC 7402 (24-core) processors and two ConnectX-6 cards. The HCAs sit on separate PCIe buses, one per processor socket, and each HCA is connected to the same IB switch. I ran an MPI PingPong (from the Intel IMB benchmarks) in all 4 combinations of mlx5_0:1 and mlx5_1:1 on the two servers, using UCX_NET_DEVICES to select the device and binding each MPI rank to the corresponding socket, e.g. for the case where both nodes use mlx5_0 (a cross-device sketch follows the command below):

ed-dlgpu-168c:src_c$ cat rankfile00
rank 0=ed-dlgpu-168c slot=0:0-23
rank 1=ed-dlgpu-1bb0 slot=0:0-23

mpirun --prefix /home/larry/sycl-with-cuda/ompi_install \
-rf rankfile00 --report-bindings \
-np 1 -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 -msglog 18:19 PingPong : \
-np 1 -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 -msglog 18:19 PingPong
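
For the cross-device cases (e.g. mlx5_0 on node 0 talking to mlx5_1 on node 1), a launch along the following lines would select a different device per rank and bind rank 1 to socket 1. This is a sketch, not the exact command from the run: rankfile01 is a hypothetical name, and it assumes mlx5_1 is attached to socket 1 as described above.

ed-dlgpu-168c:src_c$ cat rankfile01
rank 0=ed-dlgpu-168c slot=0:0-23
rank 1=ed-dlgpu-1bb0 slot=1:0-23

mpirun --prefix /home/larry/sycl-with-cuda/ompi_install \
 -rf rankfile01 --report-bindings \
 -np 1 -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 -msglog 18:19 PingPong : \
 -np 1 -x UCX_NET_DEVICES=mlx5_1:1 ./IMB-MPI1 -msglog 18:19 PingPong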

Here are two bandwidth tables, one for 256 KiB and one for 512 KiB (rows: device on node 0, columns: device on node 1):

256 KiB, Mbytes/sec
              mlx5_0     mlx5_1
mlx5_0       7225.23    8384.99
mlx5_1       8393.62   12001.65

512 KiB, Mbytes/sec
              mlx5_0     mlx5_1
mlx5_0       timeout    8402.31
mlx5_1       8439.10   15255.85

Since the cards are all connected to a switch and the MPI ranks are bound to the closest socket, I would expect all the bandwidths to be about the same. It seems like there are two problems:

  1. Low bandwidth when cards on different sockets communicate
  2. Low bandwidth, with a cliff above 256 KiB, when using card 0 on both nodes.

Perhaps this is some configuration problem with card 0.


yosefe commented 3 years ago

Can you pls try the following:

  1. Run the ib_read_bw benchmark on mlx5_0 (both sides) with a 512 KiB message size and report the result. It should look something like:
     Server: taskset -c 0-23 ib_read_bw -s $((512*1024)) -D 5 -d mlx5_0
     Client: taskset -c 0-23 ib_read_bw -s $((512*1024)) -D 5 -d mlx5_0 ed-dlgpu-168c

  2. Run IMB PingPong with the UCX_LOG_LEVEL=info environment variable and post the output.

  3. Run IMB PingPong with UCX_TLS=rc UCX_RNDV_THRESH=1k UCX_MAX_RNDV_RAILS=1 UCX_RNDV_SCHEME=get_zcopy to see if it improves the result (a combined sketch follows this list).

  4. Try binding each MPI rank to one specific core (0 instead of 0-23, 24 instead of 24-47) to see if that improves things.

  5. Test the osu_bw benchmark, since it uses a window of 64 outstanding send operations.
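
A minimal sketch of how suggestions 2 and 3 might be applied to the launch line from the original report. The rankfile, prefix, and binary are taken from that report; folding both suggestions into a single run is purely illustrative, and they could just as well be separate experiments:

mpirun --prefix /home/larry/sycl-with-cuda/ompi_install \
 -rf rankfile00 --report-bindings \
 -np 1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=info -x UCX_TLS=rc \
 -x UCX_RNDV_THRESH=1k -x UCX_MAX_RNDV_RAILS=1 -x UCX_RNDV_SCHEME=get_zcopy \
 ./IMB-MPI1 -msglog 18:19 PingPong : \
 -np 1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=info -x UCX_TLS=rc \
 -x UCX_RNDV_THRESH=1k -x UCX_MAX_RNDV_RAILS=1 -x UCX_RNDV_SCHEME=get_zcopy \
 ./IMB-MPI1 -msglog 18:19 PingPong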

lfmeadow commented 3 years ago

Will do. However, now I'm thinking it may be a firmware issue. Card 0 has a PSID DEL0000000010 and the firmware tools won't let me update the firmware. So our system guy is talking to Dell. I'll keep you informed. Thanks for the quick response.
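
For reference, one quick way to confirm a card's PSID and current firmware level is an mstflint query; the PCI address below is a placeholder, not taken from this system:

# Replace 81:00.0 with the ConnectX-6 PCI address reported by lspci;
# the output includes the FW Version and PSID fields
mstflint -d 81:00.0 query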

lfmeadow commented 3 years ago

Upgrading to FW 20.31.1014 made the problem go away. Sorry for the noise.