openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Single network card is very fast, dual network card speed is super slow #10249

Open yangrudan opened 4 days ago

yangrudan commented 4 days ago

Describe the bug

My setup is the ucx_perftest tag_bw test with GDR on the machine. When the environment variables select a single network card, the measured speed is very fast. When they select both network cards, the speed is extremely slow. Both network cards have the optimal PCIe topology.

image

Steps to Reproduce

My commands:

UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1 ./ucx_perftest -t tag_bw
UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35

UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./ucx_perftest -t tag_bw 
UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35
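
To compare the configurations side by side, the client commands above could be wrapped in a small loop. This is only a sketch: the variable names SERVER_IP, PERFTEST and DRY_RUN are hypothetical and introduced here for illustration, and the IP address and ucx_perftest flags are copied from the commands in this report. By default it only prints the commands; for a real run the server side is assumed to be started with the same UCX_TLS and UCX_NET_DEVICES values before each client run.

```shell
#!/bin/sh
# Sketch: print (or run) the client-side tag_bw command for each NIC selection.
# SERVER_IP and PERFTEST default to the values used in this report;
# DRY_RUN is a hypothetical switch (1 = only print the commands).
SERVER_IP="${SERVER_IP:-173.22.3.35}"
PERFTEST="${PERFTEST:-./ucx_perftest}"
DRY_RUN="${DRY_RUN:-1}"

# Emit the client command line for one UCX_NET_DEVICES value.
client_cmd() {
    printf 'UCX_TLS=rc_v,cuda UCX_NET_DEVICES=%s %s -t tag_bw -s 4194304 -m cuda %s\n' \
        "$1" "$PERFTEST" "$SERVER_IP"
}

# Walk through each single NIC and then the dual-NIC selection.
run_all() {
    for devs in mlx5_0:1 mlx5_1:1 mlx5_0:1,mlx5_1:1; do
        echo "== UCX_NET_DEVICES=$devs =="
        if [ "$DRY_RUN" = "1" ]; then
            client_cmd "$devs"
        else
            # Assumes a matching server is already listening on SERVER_IP.
            UCX_TLS=rc_v,cuda UCX_NET_DEVICES="$devs" \
                "$PERFTEST" -t tag_bw -s 4194304 -m cuda "$SERVER_IP"
        fi
    done
}

run_all
```

Running it with DRY_RUN=0 executes the three client runs in sequence, which makes it easier to paste the three bandwidth results into one comment.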

Setup and versions

- UCX version 1.17
# Library version: 1.17.0
# Library path: /workspace/zccl-ucx/build/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17_kunlun_930', revision 551089e
# Configured with: --prefix=/workspace/zccl-ucx/build --enable-compiler-opt=0 --with-cuda=/usr/local/xpu --with-verbs --with-dm --with-rdmacm --enable-mt=yes --with-rc --with-mlx5-dv --with-go=no --enable-kunlun-gdr

brminich commented 3 days ago

What is the performance when you set UCX_NET_DEVICES=mlx5_1:1?

yangrudan commented 3 days ago

> What is the performance when you set UCX_NET_DEVICES=mlx5_1:1?

Both NICs, mlx5_0 and mlx5_1, have the best PCIe topology for my XPU. So with UCX_NET_DEVICES=mlx5_1:1 it is also fast, as shown below.

image

brminich commented 3 days ago

Can you try to profile it with Linux perf and check for hotspots?
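
One possible way to do this, as a rough sketch rather than a prescribed procedure: record call-graph samples for the slow dual-NIC client run with perf record, then summarize them with perf report. The variable names SERVER_IP, PERFTEST, PERF_DATA and RUN, and the output file name, are hypothetical; the ucx_perftest arguments are copied from the report above. By default the script only prints the perf command; setting RUN=1 executes it, assuming perf is installed and the server side is already running.

```shell
#!/bin/sh
# Sketch: profile the slow dual-NIC client run with Linux perf.
# PERF_DATA (output file) and RUN (1 = actually execute) are hypothetical names.
SERVER_IP="${SERVER_IP:-173.22.3.35}"
PERFTEST="${PERFTEST:-./ucx_perftest}"
PERF_DATA="${PERF_DATA:-perf_dual_nic.data}"
RUN="${RUN:-0}"

# Print the perf invocation for the dual-NIC client run.
perf_record_cmd() {
    printf 'perf record -g -o %s -- env UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 %s -t tag_bw -s 4194304 -m cuda %s\n' \
        "$PERF_DATA" "$PERFTEST" "$SERVER_IP"
}

perf_record_cmd

# Execute only on request and only if perf is available on this host.
if [ "$RUN" = "1" ] && command -v perf >/dev/null 2>&1; then
    # -g collects call graphs so hotspots can be traced back to their callers.
    perf record -g -o "$PERF_DATA" -- \
        env UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 \
        "$PERFTEST" -t tag_bw -s 4194304 -m cuda "$SERVER_IP"
    # Summarize samples by shared object and symbol; show the top of the report.
    perf report -i "$PERF_DATA" --stdio --sort dso,symbol | head -n 50
fi
```

Repeating the same recording with UCX_NET_DEVICES=mlx5_0:1 gives a fast baseline, so the two reports can be compared for symbols that only dominate in the dual-NIC case.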