orliac opened this issue 1 month ago
Indeed that doesn't look right. I do not have immediate access to a system with H100s but I can try to do that tomorrow. In the meantime I was able to run that on a DGX-1 and the results I see are much more in line with what we would expect, although I had to reduce the chunk size to 100M due to the amount of memory in the V100s:
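Something along these lines, with all eight V100s and the smaller chunk size (exact flags from memory, so they may differ slightly):
python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --iter 10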
Based on the affinity reported by your system in the output of nvidia-smi topo -m, I suspect this is only a partition of a node, is that right? Are you able to get a full node allocation to test it as well? Could you also try disabling InfiniBand (UCX_TLS=^rc) and then, separately, NVLink (UCX_TLS=^cuda_ipc) to see whether the errors and the variability go away? Could you also report what cat /proc/cpuinfo shows (I want to confirm that only the cores with matching affinity are available to your node partition, but I'm not entirely sure /proc/cpuinfo alone will provide us with that information)?
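For instance, something along these lines should reveal the cores actually usable by the job, as opposed to /proc/cpuinfo, which lists every online CPU regardless of the allocation:
nproc                              # CPUs usable by this process (honors the affinity mask)
taskset -cp $$                     # affinity list of the current shell
numactl --show                     # allowed cpus/NUMA nodes, if numactl is available
grep -c ^processor /proc/cpuinfo   # all online CPUs the OS sees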
And finally, is the benchmark expected to saturate the available bandwidth between the GPUs?
That is a good question. I'm not sure we have done such testing in the past, but for the 2-GPU case we get 85-90% of the expected bandwidth, ~19.5 GiB/s, where ucx_perftest alone reports ~22.5 GiB/s (note that ucx_perftest reports GB/s, not GiB/s; I've done the conversion myself so we're comparing at the right scale):
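(For scale, the conversion factor is 2^30/10^9 ≈ 1.074, so 22.5 GiB/s corresponds to roughly 24.2 GB/s in ucx_perftest's units.)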
Also note that a DGX-1 doesn't have an NVSwitch connecting all GPUs, so when scaling to all of them we expect to be limited by the ConnectX-4 bandwidth.
Thanks @pentschev for the quick feedback.
No, I'm using a full node in exclusive mode for these tests. There are 8 NUMA nodes with 8 physical CPU cores each, and GPUs are connected in pairs to the same NUMA node. I can provide the full lstopo output if it's of interest.
Note that we may have an issue on our side with respect to affinity, as nothing actually forces Slurm to allocate CPU cores on the same NUMA node as the GPU when you request a single GPU. But in this case I'm using the full node, and at least the list of CPU cores is not empty, so I assume it works as expected on that side. Still to be checked, though.
export UCX_TLS=^rc changes nothing.
export UCX_TLS=^cuda_ipc allows me to run the 4-GPU case, but gives (as expected) lower bandwidths and does not help with the repeatability:
Yes, if you can access an H100 node and reproduce the case, that would be a great point of comparison. Thanks again.
I was able to reproduce the variability consistently on an H100 node, with both partial and full node allocations, so presumably this is not related to what I initially thought. I was also able to reproduce some errors preventing the run from succeeding with 4 GPUs; by the looks of it, on my end they occurred during the establishment of endpoints, where I've previously observed flakiness in Dask clusters when a large number of endpoints are created simultaneously, but I was unable to observe errors exactly like the ones you posted.
I do not have a lead yet as to what happens on H100s; my first guess is that it's related to suboptimal paths or to not assigning proper affinity to each process. I'll see if I can do more testing tomorrow or early next week.
To be honest, this is a benchmark that, to my knowledge, is not often used; I haven't touched it myself in probably 2 years. Would you mind briefly describing how you came across it and why you are interested in this one specifically?
Hi @pentschev, many thanks for your investigations. I was after some established benchmarks to validate my setup, which reads datasets (hundreds of GB to a few TB) stored in Zarr format that I want to process with a dask-gpu cluster using UCX and KvikIO. I just want to make sure I'm able to saturate the available bandwidths (reading from disk, plus intra- and inter-node communication). So far only nvbandwidth gives reasonable results; ucx_perftest also returns funny numbers (900 GB/s P2P) while the theoretical bandwidth is 318 GB/s:
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
...
[thread 0] 512 21162.373 21155.958 19833.996 901565.70 961656.26 47 50
[thread 0] 560 21162.373 21156.088 19947.318 901560.19 956193.02 47 50
[thread 0] 608 21162.353 21156.018 20042.742 901563.16 951640.58 47 50
[thread 0] 656 21162.373 21156.271 20124.220 901552.36 947787.63 47 50
[thread 0] 704 21162.373 21156.023 20194.570 901562.94 944485.90 47 50
[thread 0] 752 21162.373 21156.083 20255.943 901560.40 941624.21 47 49
...
GPU0 GPU1 NIC0 NIC1 NIC2 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV6 SYS SYS SYS 24-31 3 N/A
GPU1 NV6 X SYS SYS SYS 24-31 3 N/A
NIC0 SYS SYS X PIX SYS
NIC1 SYS SYS PIX X SYS
NIC2 SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_2
NIC1: mlx5_3
NIC2: mlx5_bond_0
nvbandwidth gives ~260 GB/s:
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1958.71 260.82 260.49 260.60
1 260.73 1955.88 260.56 260.57
2 260.87 260.70 1954.35 261.87
3 260.68 260.45 260.44 1953.96
UCX can be faster than a single link because it uses multiple rails for communication, meaning transfers may be split among various links (e.g., NVLink + InfiniBand) to achieve higher bandwidth than you would get from a single rail. This is something that can be controlled with UCX_MAX_RNDV_RAILS, for instance. Could you post the full command you used to run ucx_perftest above?
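For example, something like this shows the configured default and caps the rail count for a run (exact output may vary by UCX build):
ucx_info -c | grep MAX_RNDV_RAILS    # inspect the build's default rail settings
export UCX_MAX_RNDV_RAILS=1          # then re-run ucx_perftest limited to a single rail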
Here is the test (with fewer iterations, setting UCX_MAX_RNDV_RAILS to 1):
TEST=tag_bw
SIZE=20000000000
export UCX_MAX_RNDV_RAILS=1
env | grep UCX || echo "-W- no UCX"; echo
CUDA_VISIBLE_DEVICES=0 ucx_perftest -t $TEST -m cuda -s $SIZE -n 15 -p 9999 -c 24 & \
CUDA_VISIBLE_DEVICES=1 ucx_perftest localhost -t $TEST -m cuda -s $SIZE -n 15 -p 9999 -c 25
UCX_MAX_RNDV_RAILS=1
PS1=(UCX-PY-BENCH)
UCX_LOG_LEVEL=info
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
+----------------------------------------------------------------------------------------------------------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 20000000000 |
| Window size: 32 |
+----------------------------------------------------------------------------------------------------------+
Final: 15 1.022 21160.793 21160.793 901359.70 901359.70 47 47
[1730454147.495515] [kh010:2558782:0] libperf.c:2090 UCX DIAG UCT tests also copy one-byte value from host memory to cuda send memory, which may impact performance results
[1730454147.495526] [kh010:2558782:0] libperf.c:2097 UCX DIAG UCT tests also copy one-byte value from cuda recv memory to host memory, which may impact performance results
[1730454147.495525] [kh010:2558781:0] libperf.c:2090 UCX DIAG UCT tests also copy one-byte value from host memory to cuda send memory, which may impact performance results
[1730454147.495532] [kh010:2558781:0] libperf.c:2097 UCX DIAG UCT tests also copy one-byte value from cuda recv memory to host memory, which may impact performance results
[1730454149.556893] [kh010:2558782:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1730454149.556893] [kh010:2558781:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1730454151.618710] [kh010:2558782:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_MAX_RNDV_RAILS=1 UCX_LOG_LEVEL=info
[1730454151.618710] [kh010:2558781:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_MAX_RNDV_RAILS=1 UCX_LOG_LEVEL=info
[1730454151.625211] [kh010:2558782:0] ucp_worker.c:1888 UCX INFO perftest intra-node cfg#2 tag(rc_mlx5/mlx5_2:1 sysv/memory cma/memory rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_2:1 sysv/memory posix/memory)
[1730454151.625215] [kh010:2558781:0] ucp_worker.c:1888 UCX INFO perftest intra-node cfg#2 tag(rc_mlx5/mlx5_2:1 sysv/memory cma/memory rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_2:1 sysv/memory posix/memory)
[1730454167.014007] [kh010:2558781:0] ucp_worker.c:1888 UCX INFO perftest self cfg#3 tag(self/memory rc_mlx5/mlx5_2:1 cma/memory cuda_copy/cuda rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1) rma(self/memory rc_mlx5/mlx5_2:1)
[1730454167.014006] [kh010:2558782:0] ucp_worker.c:1888 UCX INFO perftest self cfg#3 tag(self/memory rc_mlx5/mlx5_2:1 cma/memory cuda_copy/cuda rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1) rma(self/memory rc_mlx5/mlx5_2:1)
Thanks for confirming those results on your end @orliac. After talking to some of the UCX developers, I've been informed we have an internal bug report (not publicly viewable) that may be the cause: ucx_perftest bandwidth tests report results on the client side as soon as the send operation completes, which may lead to incorrect results. In UCX-Py performance tests we do something different: we wait for both the send and the receive side to complete and report that result instead. Could you try running the following on your end?
python -m ucp.benchmarks.send_recv --n-bytes 20000000000 --object_type cupy --server-dev 0 --client-dev 1 --server-cpu-affinity 0 --client-cpu-affinity 1
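For context, the approach is roughly the following; this is only a minimal sketch (not the actual ucp.benchmarks.send_recv code), using host NumPy buffers and a much smaller message, but it shows that the clock only stops once both the send and the matching receive have completed:

import asyncio
import time

import numpy as np
import ucp

N_BYTES = 100_000_000  # much smaller than the 20 GB above, just for illustration
PORT = 13337           # arbitrary port chosen for this sketch


async def main():
    async def handler(ep):
        buf = np.empty(N_BYTES, dtype="u1")
        await ep.recv(buf)   # server waits for the whole message...
        await ep.send(buf)   # ...then echoes it back
        await ep.close()

    listener = ucp.create_listener(handler, PORT)
    ep = await ucp.create_endpoint(ucp.get_address(), PORT)

    msg = np.zeros(N_BYTES, dtype="u1")
    start = time.monotonic()
    await ep.send(msg)       # client sends...
    await ep.recv(msg)       # ...and only stops the clock once the echo has arrived
    elapsed = time.monotonic() - start

    # 2 * N_BYTES were moved in total, but the two directions never overlap,
    # so this is the per-direction ("unidirectional") bandwidth.
    print(f"~{2 * N_BYTES / elapsed / 2**30:.2f} GiB/s")

    await ep.close()
    listener.close()


asyncio.run(main())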
Running that produces results much more consistent with p2pBandwidthLatencyTest, at least on my end:
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 18.63 GiB
Object type | cupy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,enp90s0np0
================================================================================
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 270.79 GiB/s
Bandwidth (median) | 270.78 GiB/s
Latency (average) | 68785070 ns
Latency (median) | 68788472 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 270.92 GiB/s, 68752570ns
1 | 270.74 GiB/s, 68797729ns
2 | 270.84 GiB/s, 68772184ns
3 | 270.75 GiB/s, 68794847ns
4 | 270.87 GiB/s, 68766119ns
5 | 270.78 GiB/s, 68789240ns
6 | 270.74 GiB/s, 68797713ns
7 | 270.78 GiB/s, 68787704ns
8 | 270.84 GiB/s, 68773049ns
9 | 270.66 GiB/s, 68819548ns
Thanks for investigating @pentschev. Replicating on my side, I cannot reach the numbers you obtain (which look consistent with what is expected). I'm below half the expected throughput, but at least the numbers are consistent across iterations.
UCXPY_IFNAME=bond0 UCX_NET_DEVICES=mlx5_bond_0:1 \
python -m ucp.benchmarks.send_recv --n-bytes 20000000000 --object_type cupy \
--server-dev 0 --client-dev 1 \
--server-cpu-affinity 24 --client-cpu-affinity 25
[1731314604.795283] [kh007:2971561:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1731314605.182965] [kh007:2971561:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_bond_0:1 UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
Server Running at 10.91.54.7:34300
Client connecting to server at 10.91.54.7:34300
[1731314607.019350] [kh007:2971637:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1731314607.397984] [kh007:2971637:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_bond_0:1 UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1731314607.504726] [kh007:2971637:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1) stream(rc_mlx5/mlx5_bond_0:1) ka(rc_mlx5/mlx5_bond_0:1)
[1731314607.528434] [kh007:2971561:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1) stream(rc_mlx5/mlx5_bond_0:1) ka(rc_mlx5/mlx5_bond_0:1)
[1731314607.545081] [kh007:2971637:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1) stream(rc_mlx5/mlx5_bond_0:1) ka(rc_mlx5/mlx5_bond_0:1)
[1731314607.547429] [kh007:2971637:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) stream(rc_mlx5/mlx5_bond_0:1) ka(ud_mlx5/mlx5_bond_0:1)
[1731314607.548621] [kh007:2971561:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) stream(rc_mlx5/mlx5_bond_0:1) ka(ud_mlx5/mlx5_bond_0:1)
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 18.63 GiB
Object type | cupy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | mlx5_bond_0:1
================================================================================
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 123.09 GiB/s
Bandwidth (median) | 123.09 GiB/s
Latency (average) | 151322271 ns
Latency (median) | 151327768 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 123.07 GiB/s, 151343997ns
1 | 123.09 GiB/s, 151322250ns
2 | 123.07 GiB/s, 151346345ns
3 | 123.08 GiB/s, 151337466ns
4 | 123.12 GiB/s, 151282572ns
5 | 123.12 GiB/s, 151291173ns
6 | 123.10 GiB/s, 151313041ns
7 | 123.10 GiB/s, 151311778ns
8 | 123.08 GiB/s, 151340797ns
9 | 123.08 GiB/s, 151333286ns
Actually, what you're seeing is probably correct, given the implementation internals. The "Bidirectional" test in p2pBandwidthLatencyTest transfers in both directions simultaneously and aggregates the bandwidth of both. ucx_perftest, as well as ucp.benchmarks.send_recv, transfers in both directions but imposes a synchronization step in between (see how we await the receive and then await the send in the server, and vice-versa on the client), so that is analogous to the "Unidirectional" test (assuming transports are symmetric). Can you check what bandwidth you see for the "Unidirectional" test?
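As a minimal (hypothetical) sketch of the two patterns, assuming a ucp endpoint ep and pre-allocated send_buf/recv_buf:

import asyncio

async def one_direction_at_a_time(ep, send_buf, recv_buf):
    # What send_recv and ucx_perftest effectively measure: the receive must
    # finish before the send starts, so only one direction is active at a time.
    await ep.recv(recv_buf)
    await ep.send(send_buf)

async def both_directions_at_once(ep, send_buf, recv_buf):
    # What the "Bidirectional" p2pBandwidthLatencyTest figures correspond to:
    # both transfers are in flight simultaneously, so their bandwidths add up.
    await asyncio.gather(ep.recv(recv_buf), ep.send(send_buf))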
In my case, the GPUs have more NVLinks (NV18 vs NV6 on your end), hence the higher unidirectional bandwidth.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX SYS SYS SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX SYS SYS SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS PIX PIX SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS PIX PIX SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PIX SYS SYS 32-63,96-127 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PIX PIX SYS SYS 32-63,96-127 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS SYS PIX PIX 32-63,96-127 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS PIX PIX 32-63,96-127 1 N/A
And then with P2P I actually get higher bandwidth than UCX/UCX-Py, which I can't explain at the moment, but I think it might have to do with the way p2pBandwidthLatencyTest uses CUDA streams:
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 2494.89 363.11 376.08 375.21 377.03 376.37 376.43 376.31
1 375.29 2530.88 375.96 376.06 375.84 375.96 376.13 375.88
2 362.41 393.40 2516.48 376.12 376.25 376.04 376.49 376.95
3 362.12 375.27 393.70 2515.72 379.18 376.51 376.10 375.19
4 376.39 375.52 376.25 393.99 2513.57 375.95 376.54 376.53
5 375.47 375.91 376.18 374.39 378.68 2530.24 375.09 376.85
6 375.55 375.49 376.57 376.07 376.21 376.53 2517.75 375.43
7 376.96 376.22 376.09 376.82 375.69 376.64 376.41 2523.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 2577.32 738.73 739.77 741.27 739.83 741.98 741.48 739.96
1 743.14 2582.04 742.24 742.26 742.18 740.86 741.75 740.41
2 741.21 773.13 2571.55 741.08 739.45 740.92 742.42 742.46
3 741.82 742.04 768.73 2573.01 740.39 742.54 740.19 740.19
4 742.36 741.60 741.85 773.05 2574.00 742.37 742.21 742.49
5 742.79 742.26 741.62 741.80 740.99 2577.25 742.66 742.85
6 739.51 741.34 741.45 741.50 742.05 743.83 2580.58 740.02
7 741.04 741.15 741.40 742.59 740.64 740.52 740.55 2575.66
Thanks @pentschev for the feedback, and sorry for the slow reply.
On my side the numbers then make sense: the ucp.benchmarks.send_recv stats match the unidirectional P2P numbers from p2pBandwidthLatencyTest pretty well:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1213.12 122.00 122.27 121.61
1 129.23 1367.91 123.17 122.63
2 129.32 118.11 1366.12 123.31
3 129.47 129.29 129.06 1366.72
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1474.06 242.82 224.82 223.63
1 256.97 1566.02 244.26 245.83
2 256.62 234.86 1568.58 244.61
3 257.49 257.38 256.58 1561.33
Hi there, I'm facing an issue when trying to run the cudf_merge benchmark locally on a node that hosts 4 H100s:
I can run the benchmark over any pair of GPUs with no issue:
python -m ucp.benchmarks.cudf_merge --devs 0,1 --chunk-size 200_000_000 --iter 10
But it fails when running over all 4 devices:
python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 200_000_000 --iter 10
My environment:
Any idea?
Also, I'm surprised by the variability of the benchmark across the 10 successive iterations.
And finally, is the benchmark expected to saturate the available bandwidth between the GPUs?