openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Single network card is very fast, dual network card speed is super slow #10249

Open yangrudan opened 1 month ago

yangrudan commented 1 month ago

Describe the bug

I am running the ucx_perftest tag_bw test with GDR on this machine. When the environment variables select a single network card, the measured bandwidth is very high. When they select both network cards, the bandwidth is extremely low. Both network cards have the optimal PCIe topology.

image

Steps to Reproduce

My commands:

UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1 ./ucx_perftest -t tag_bw
UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35

UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./ucx_perftest -t tag_bw 
UCX_TLS=rc_v,cuda UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35
- UCX version 1.17
- # Library version: 1.17.0
# Library path: /workspace/zccl-ucx/build/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17_kunlun_930', revision 551089e
# Configured with: --prefix=/workspace/zccl-ucx/build --enable-compiler-opt=0 --with-cuda=/usr/local/xpu --with-verbs --with-dm --with-rdmacm --enable-mt=yes --with-rc --with-mlx5-dv --with-go=no --enable-kunlun-gdr
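The NIC selections discussed in this thread (each card alone, then both together) can be compared with identical client parameters. The following is a minimal sketch, not part of the original report: `build_client_cmd` is a hypothetical helper, and the server address and message size are copied from the commands above. It only prints the client command lines. The server side is assumed to be started with the matching `UCX_NET_DEVICES` value before each client run, as in the report.

```shell
#!/bin/sh
# Sketch: print the ucx_perftest client command for each NIC selection.
# build_client_cmd is a hypothetical helper, not part of UCX.
SERVER=173.22.3.35   # server address taken from the report
MSG_SIZE=4194304     # 4 MiB messages, as in the report

build_client_cmd() {
    # $1 = value for UCX_NET_DEVICES
    printf 'UCX_TLS=rc_v,cuda UCX_NET_DEVICES=%s ./ucx_perftest -t tag_bw -s %s -m cuda %s\n' \
        "$1" "$MSG_SIZE" "$SERVER"
}

# One line per configuration: mlx5_0 alone, mlx5_1 alone, both cards.
for devs in mlx5_0:1 mlx5_1:1 mlx5_0:1,mlx5_1:1; do
    build_client_cmd "$devs"
done
```

Keeping every parameter except `UCX_NET_DEVICES` fixed makes it easier to attribute a bandwidth difference to the device selection alone.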

Setup and versions

brminich commented 1 month ago

What is the performance when you set UCX_NET_DEVICES=mlx5_1:1?

yangrudan commented 1 month ago

> What is the performance when you set UCX_NET_DEVICES=mlx5_1:1?

Both mlx5_0 and mlx5_1 have the best PCIe topology for my XPU, so with UCX_NET_DEVICES=mlx5_1:1 it is also fast, as shown below.

image

brminich commented 1 month ago

Can you try profiling it with Linux perf and checking for hotspots?
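One possible way to follow this suggestion is sketched below. It is an illustration under assumptions, not a command sequence from the thread: `perf_record_cmd` and the `DRY_RUN` switch are hypothetical, the output file name is arbitrary, and the ucx_perftest arguments are copied from the dual-card client command in the report. By default the script only prints the commands. Setting `DRY_RUN=0` would actually run them, which requires `perf` to be installed and the server side to be listening.

```shell
#!/bin/sh
# Sketch: build (and optionally run) a perf session for the slow dual-NIC case.
# perf_record_cmd is a hypothetical helper; DRY_RUN defaults to 1 (print only).
DRY_RUN=${DRY_RUN:-1}

perf_record_cmd() {
    # $1 = UCX_NET_DEVICES value, $2 = perf output file
    printf 'perf record -g -o %s -- env UCX_TLS=rc_v,cuda UCX_NET_DEVICES=%s ./ucx_perftest -t tag_bw -s 4194304 -m cuda 173.22.3.35\n' \
        "$2" "$1"
}

record_cmd=$(perf_record_cmd mlx5_0:1,mlx5_1:1 perf_dual.data)
report_cmd='perf report -i perf_dual.data --stdio --sort dso,symbol'

# Always show what would be executed.
echo "$record_cmd"
echo "$report_cmd"

if [ "$DRY_RUN" = "0" ]; then
    eval "$record_cmd"   # record call graphs while the client runs
    eval "$report_cmd"   # list the hottest symbols per shared object
fi
```

Recording the single-card run the same way (for example with `mlx5_0:1` and a different output file) would give a baseline profile to compare the hotspots against.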