orliac opened this issue 1 month ago
Indeed that doesn't look right. I do not have immediate access to a system with H100s but I can try to do that tomorrow. In the meantime I was able to run that on a DGX-1 and the results I see are much more in line with what we would expect, although I had to reduce the chunk size to 100M due to the amount of memory in the V100s:
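Something along these lines, with all eight V100s and the smaller chunk size (exact flags from memory, so they may differ slightly):
python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --iter 10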
Based on the affinity reported by your system in the output of nvidia-smi topo -m, I suspect this is only a partition of a node, is that right? Are you able to get a full node allocation to test it as well? Could you also try disabling InfiniBand (UCX_TLS=^rc) and then, separately, NVLink (UCX_TLS=^cuda_ipc) to see whether the errors and the variability go away? Could you also report what cat /proc/cpuinfo shows (I want to confirm that only the cores with matching affinity are available to your node partition, but I'm not entirely sure /proc/cpuinfo alone will provide us with that information)?
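For instance, something along these lines should reveal the cores actually usable by the job, as opposed to /proc/cpuinfo, which lists every online CPU regardless of the allocation:
nproc                              # CPUs usable by this process (honors the affinity mask)
taskset -cp $$                     # affinity list of the current shell
numactl --show                     # allowed cpus/NUMA nodes, if numactl is available
grep -c ^processor /proc/cpuinfo   # all online CPUs the OS sees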
And finally, is the benchmark expected to saturate the available bandwidth between the GPUs?
That is a good question. I'm not sure we have done such testing in the past, but for the 2-GPU case we get 85-90% of the expected bandwidth, ~19.5 GiB/s, where ucx_perftest alone reports ~22.5 GiB/s (note that ucx_perftest reports GB/s, not GiB/s; I've done the conversion myself so we're comparing at the right scale):
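(For scale, the conversion factor is 2^30/10^9 ≈ 1.074, so 22.5 GiB/s corresponds to roughly 24.2 GB/s in ucx_perftest's units.)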
Also note that a DGX-1 doesn't have an NVSwitch connecting all GPUs, so when scaling to all of them we expect to be limited by the ConnectX-4 bandwidth.
Thanks @pentschev for the quick feedback.
No, I'm using a full node in exclusive mode for these tests. There are 8 NUMA nodes with 8 physical CPU cores each, and GPUs are connected in pairs to the same NUMA node. I can provide the full lstopo output if it's of interest.
Note that we may have an issue on our side with respect to affinity, as nothing actually forces Slurm to allocate CPU cores on the same NUMA node as the GPU when you request a single GPU. But in this case I'm using the full node, and at least the list of CPU cores is not empty, so I assume it works as expected on that side. Still to be checked, though.
export UCX_TLS=^rc changes nothing.
export UCX_TLS=^cuda_ipc allows me to run the 4-GPU case, but gives (as expected) lower bandwidths and does not help with the repeatability:
Yes, if you can access an H100 node and reproduce the case, that would be a great point of comparison. Thanks again.
I was able to reproduce the variability consistently on an H100 node, with both partial and full node allocations, so presumably this is not related to what I initially thought. I was also able to reproduce some errors preventing the run from succeeding with 4 GPUs; by the looks of it, on my end they occurred during the establishment of endpoints, where I've previously observed flakiness in Dask clusters when a large number of endpoints are created simultaneously, but I was unable to observe errors exactly like the ones you posted.
I do not have a lead yet as to what happens on H100s; my first guess is that it's related to suboptimal paths or to not assigning proper affinity to each process. I'll see if I can do more testing tomorrow or early next week.
To be honest, this is a benchmark that, to my knowledge, is not often used; I haven't touched it myself in probably 2 years. Would you mind briefly describing how you came across it and why you are interested in this one specifically?
Hi @pentschev, many thanks for your investigations. I was after some established benchmarks to validate my setup, which reads datasets (hundreds of GB to a few TB) stored in Zarr format that I want to process with a dask-gpu cluster using UCX and KvikIO. I just want to make sure I'm able to saturate the available bandwidths (reading from disk, plus intra- and inter-node communication). So far only nvbandwidth gives reasonable results; ucx_perftest also returns funny numbers (900 GB/s P2P) while the theoretical bandwidth is 318 GB/s:
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
...
[thread 0] 512 21162.373 21155.958 19833.996 901565.70 961656.26 47 50
[thread 0] 560 21162.373 21156.088 19947.318 901560.19 956193.02 47 50
[thread 0] 608 21162.353 21156.018 20042.742 901563.16 951640.58 47 50
[thread 0] 656 21162.373 21156.271 20124.220 901552.36 947787.63 47 50
[thread 0] 704 21162.373 21156.023 20194.570 901562.94 944485.90 47 50
[thread 0] 752 21162.373 21156.083 20255.943 901560.40 941624.21 47 49
...
GPU0 GPU1 NIC0 NIC1 NIC2 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV6 SYS SYS SYS 24-31 3 N/A
GPU1 NV6 X SYS SYS SYS 24-31 3 N/A
NIC0 SYS SYS X PIX SYS
NIC1 SYS SYS PIX X SYS
NIC2 SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_2
NIC1: mlx5_3
NIC2: mlx5_bond_0
nvbandwidth gives ~260 GB/s:
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1958.71 260.82 260.49 260.60
1 260.73 1955.88 260.56 260.57
2 260.87 260.70 1954.35 261.87
3 260.68 260.45 260.44 1953.96
UCX can be faster than a single link because it uses multiple rails for communication, meaning transfers may be split among various links (e.g., NVLink + InfiniBand) to achieve higher bandwidth than you would get from a single rail. This is something that can be controlled with UCX_MAX_RNDV_RAILS, for instance. Could you post the full command you used to run ucx_perftest above?
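For example, something like this shows the configured default and caps the rail count for a run (exact output may vary by UCX build):
ucx_info -c | grep MAX_RNDV_RAILS    # inspect the build's default rail settings
export UCX_MAX_RNDV_RAILS=1          # then re-run ucx_perftest limited to a single rail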
Here is the test (with fewer iterations, setting UCX_MAX_RNDV_RAILS to 1):
TEST=tag_bw
SIZE=20000000000
export UCX_MAX_RNDV_RAILS=1
env | grep UCX || echo "-W- no UCX"; echo
CUDA_VISIBLE_DEVICES=0 ucx_perftest -t $TEST -m cuda -s $SIZE -n 15 -p 9999 -c 24 & \
CUDA_VISIBLE_DEVICES=1 ucx_perftest localhost -t $TEST -m cuda -s $SIZE -n 15 -p 9999 -c 25
UCX_MAX_RNDV_RAILS=1
PS1=(UCX-PY-BENCH)
UCX_LOG_LEVEL=info
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
+----------------------------------------------------------------------------------------------------------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 20000000000 |
| Window size: 32 |
+----------------------------------------------------------------------------------------------------------+
Final: 15 1.022 21160.793 21160.793 901359.70 901359.70 47 47
[1730454147.495515] [kh010:2558782:0] libperf.c:2090 UCX DIAG UCT tests also copy one-byte value from host memory to cuda send memory, which may impact performance results
[1730454147.495526] [kh010:2558782:0] libperf.c:2097 UCX DIAG UCT tests also copy one-byte value from cuda recv memory to host memory, which may impact performance results
[1730454147.495525] [kh010:2558781:0] libperf.c:2090 UCX DIAG UCT tests also copy one-byte value from host memory to cuda send memory, which may impact performance results
[1730454147.495532] [kh010:2558781:0] libperf.c:2097 UCX DIAG UCT tests also copy one-byte value from cuda recv memory to host memory, which may impact performance results
[1730454149.556893] [kh010:2558782:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1730454149.556893] [kh010:2558781:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1730454151.618710] [kh010:2558782:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_MAX_RNDV_RAILS=1 UCX_LOG_LEVEL=info
[1730454151.618710] [kh010:2558781:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_MAX_RNDV_RAILS=1 UCX_LOG_LEVEL=info
[1730454151.625211] [kh010:2558782:0] ucp_worker.c:1888 UCX INFO perftest intra-node cfg#2 tag(rc_mlx5/mlx5_2:1 sysv/memory cma/memory rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_2:1 sysv/memory posix/memory)
[1730454151.625215] [kh010:2558781:0] ucp_worker.c:1888 UCX INFO perftest intra-node cfg#2 tag(rc_mlx5/mlx5_2:1 sysv/memory cma/memory rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_2:1 sysv/memory posix/memory)
[1730454167.014007] [kh010:2558781:0] ucp_worker.c:1888 UCX INFO perftest self cfg#3 tag(self/memory rc_mlx5/mlx5_2:1 cma/memory cuda_copy/cuda rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1) rma(self/memory rc_mlx5/mlx5_2:1)
[1730454167.014006] [kh010:2558782:0] ucp_worker.c:1888 UCX INFO perftest self cfg#3 tag(self/memory rc_mlx5/mlx5_2:1 cma/memory cuda_copy/cuda rc_mlx5/mlx5_3:1 rc_mlx5/mlx5_bond_0:1 rc_mlx5/mlx5_bond_0:1) rma(self/memory rc_mlx5/mlx5_2:1)
Thanks for confirming those results on your end @orliac. After talking to some of the UCX developers, I've been informed we have an internal bug report (not publicly viewable) that may be the cause: ucx_perftest bandwidth tests report results on the client side as soon as the send operation completes, which may lead to incorrect results. In UCX-Py performance tests we do something different: we wait for both the send and the receive side to complete and report that result instead. Could you try running the following on your end?
python -m ucp.benchmarks.send_recv --n-bytes 20000000000 --object_type cupy --server-dev 0 --client-dev 1 --server-cpu-affinity 0 --client-cpu-affinity 1
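For context, the approach is roughly the following; this is only a minimal sketch (not the actual ucp.benchmarks.send_recv code), using host NumPy buffers and a much smaller message, but it shows that the clock only stops once both the send and the matching receive have completed:

import asyncio
import time

import numpy as np
import ucp

N_BYTES = 100_000_000  # much smaller than the 20 GB above, just for illustration
PORT = 13337           # arbitrary port chosen for this sketch


async def main():
    async def handler(ep):
        buf = np.empty(N_BYTES, dtype="u1")
        await ep.recv(buf)   # server waits for the whole message...
        await ep.send(buf)   # ...then echoes it back
        await ep.close()

    listener = ucp.create_listener(handler, PORT)
    ep = await ucp.create_endpoint(ucp.get_address(), PORT)

    msg = np.zeros(N_BYTES, dtype="u1")
    start = time.monotonic()
    await ep.send(msg)       # client sends...
    await ep.recv(msg)       # ...and only stops the clock once the echo has arrived
    elapsed = time.monotonic() - start

    # 2 * N_BYTES were moved in total, but the two directions never overlap,
    # so this is the per-direction ("unidirectional") bandwidth.
    print(f"~{2 * N_BYTES / elapsed / 2**30:.2f} GiB/s")

    await ep.close()
    listener.close()


asyncio.run(main())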
Running that produces results much more consistent with p2pBandwidthLatencyTest, at least on my end:
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 18.63 GiB
Object type | cupy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,enp90s0np0
================================================================================
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 270.79 GiB/s
Bandwidth (median) | 270.78 GiB/s
Latency (average) | 68785070 ns
Latency (median) | 68788472 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 270.92 GiB/s, 68752570ns
1 | 270.74 GiB/s, 68797729ns
2 | 270.84 GiB/s, 68772184ns
3 | 270.75 GiB/s, 68794847ns
4 | 270.87 GiB/s, 68766119ns
5 | 270.78 GiB/s, 68789240ns
6 | 270.74 GiB/s, 68797713ns
7 | 270.78 GiB/s, 68787704ns
8 | 270.84 GiB/s, 68773049ns
9 | 270.66 GiB/s, 68819548ns
Thanks for investigating @pentschev. Replicating on my side, I cannot reach the numbers you obtain (which look consistent with what is expected). I'm below half the expected throughput, but at least the numbers are consistent across iterations.
UCXPY_IFNAME=bond0 UCX_NET_DEVICES=mlx5_bond_0:1 \
python -m ucp.benchmarks.send_recv --n-bytes 20000000000 --object_type cupy \
--server-dev 0 --client-dev 1 \
--server-cpu-affinity 24 --client-cpu-affinity 25
[1731314604.795283] [kh007:2971561:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1731314605.182965] [kh007:2971561:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_bond_0:1 UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
Server Running at 10.91.54.7:34300
Client connecting to server at 10.91.54.7:34300
[1731314607.019350] [kh007:2971637:0] ucp_context.c:2190 UCX INFO Version 1.17.0 (loaded from /ssoft/spack/pinot-noir/kuma-h100/v1/spack/opt/spack/linux-rhel9-zen4/gcc-13.2.0/ucx-1.17.0-no2vdboyxq2falry3mus5kwwmpafamdy/lib/libucp.so.0)
[1731314607.397984] [kh007:2971637:0] parser.c:2314 UCX INFO UCX_* env variables: UCX_NET_DEVICES=mlx5_bond_0:1 UCX_LOG_LEVEL=info UCX_MEMTYPE_CACHE=n UCX_RNDV_THRESH=8192 UCX_RNDV_FRAG_MEM_TYPE=cuda UCX_MAX_RNDV_RAILS=1 UCX_PROTO_ENABLE=n
[1731314607.504726] [kh007:2971637:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1) stream(rc_mlx5/mlx5_bond_0:1) ka(rc_mlx5/mlx5_bond_0:1)
[1731314607.528434] [kh007:2971561:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#2 tag(rc_mlx5/mlx5_bond_0:1) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1) stream(rc_mlx5/mlx5_bond_0:1) ka(rc_mlx5/mlx5_bond_0:1)
[1731314607.545081] [kh007:2971637:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#4 tag(rc_mlx5/mlx5_bond_0:1) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1) stream(rc_mlx5/mlx5_bond_0:1) ka(rc_mlx5/mlx5_bond_0:1)
[1731314607.547429] [kh007:2971637:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#5 tag(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) stream(rc_mlx5/mlx5_bond_0:1) ka(ud_mlx5/mlx5_bond_0:1)
[1731314607.548621] [kh007:2971561:0] ucp_worker.c:1888 UCX INFO ucp_context_0 intra-node cfg#3 tag(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) rma(rc_mlx5/mlx5_bond_0:1) am(rc_mlx5/mlx5_bond_0:1 cuda_ipc/cuda) stream(rc_mlx5/mlx5_bond_0:1) ka(ud_mlx5/mlx5_bond_0:1)
Roundtrip benchmark
================================================================================
Iterations | 10
Bytes | 18.63 GiB
Object type | cupy
Reuse allocation | False
Transfer API | TAG
UCX_TLS | all
UCX_NET_DEVICES | mlx5_bond_0:1
================================================================================
Device(s) | 0, 1
================================================================================
Bandwidth (average) | 123.09 GiB/s
Bandwidth (median) | 123.09 GiB/s
Latency (average) | 151322271 ns
Latency (median) | 151327768 ns
================================================================================
Iterations | Bandwidth, Latency
--------------------------------------------------------------------------------
0 | 123.07 GiB/s, 151343997ns
1 | 123.09 GiB/s, 151322250ns
2 | 123.07 GiB/s, 151346345ns
3 | 123.08 GiB/s, 151337466ns
4 | 123.12 GiB/s, 151282572ns
5 | 123.12 GiB/s, 151291173ns
6 | 123.10 GiB/s, 151313041ns
7 | 123.10 GiB/s, 151311778ns
8 | 123.08 GiB/s, 151340797ns
9 | 123.08 GiB/s, 151333286ns
Actually, what you're seeing is probably correct, given the implementation internals. The "Bidirectional" test in p2pBandwidthLatencyTest transfers in both directions simultaneously and aggregates the bandwidth of both. ucx_perftest, as well as ucp.benchmarks.send_recv, transfers in both directions but imposes a synchronization step in between (see how we await the receive and then await the send in the server, and vice-versa on the client), so that is analogous to the "Unidirectional" test (assuming transports are symmetric). Can you check what bandwidth you see for the "Unidirectional" test?
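As a minimal (hypothetical) sketch of the two patterns, assuming a ucp endpoint ep and pre-allocated send_buf/recv_buf:

import asyncio

async def one_direction_at_a_time(ep, send_buf, recv_buf):
    # What send_recv and ucx_perftest effectively measure: the receive must
    # finish before the send starts, so only one direction is active at a time.
    await ep.recv(recv_buf)
    await ep.send(send_buf)

async def both_directions_at_once(ep, send_buf, recv_buf):
    # What the "Bidirectional" p2pBandwidthLatencyTest figures correspond to:
    # both transfers are in flight simultaneously, so their bandwidths add up.
    await asyncio.gather(ep.recv(recv_buf), ep.send(send_buf))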
In my case, the GPUs have more NVLinks (NV18 vs NV6 on your end), hence the higher unidirectional bandwidth.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX SYS SYS SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX SYS SYS SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS PIX PIX SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS PIX PIX SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PIX SYS SYS 32-63,96-127 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PIX PIX SYS SYS 32-63,96-127 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS SYS PIX PIX 32-63,96-127 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS PIX PIX 32-63,96-127 1 N/A
And then with P2P I actually get higher bandwidth than UCX/UCX-Py, which I can't explain at the moment, but I think it might have to do with the way p2pBandwidthLatencyTest uses CUDA streams:
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 2494.89 363.11 376.08 375.21 377.03 376.37 376.43 376.31
1 375.29 2530.88 375.96 376.06 375.84 375.96 376.13 375.88
2 362.41 393.40 2516.48 376.12 376.25 376.04 376.49 376.95
3 362.12 375.27 393.70 2515.72 379.18 376.51 376.10 375.19
4 376.39 375.52 376.25 393.99 2513.57 375.95 376.54 376.53
5 375.47 375.91 376.18 374.39 378.68 2530.24 375.09 376.85
6 375.55 375.49 376.57 376.07 376.21 376.53 2517.75 375.43
7 376.96 376.22 376.09 376.82 375.69 376.64 376.41 2523.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 2577.32 738.73 739.77 741.27 739.83 741.98 741.48 739.96
1 743.14 2582.04 742.24 742.26 742.18 740.86 741.75 740.41
2 741.21 773.13 2571.55 741.08 739.45 740.92 742.42 742.46
3 741.82 742.04 768.73 2573.01 740.39 742.54 740.19 740.19
4 742.36 741.60 741.85 773.05 2574.00 742.37 742.21 742.49
5 742.79 742.26 741.62 741.80 740.99 2577.25 742.66 742.85
6 739.51 741.34 741.45 741.50 742.05 743.83 2580.58 740.02
7 741.04 741.15 741.40 742.59 740.64 740.52 740.55 2575.66
Thanks @pentschev for the feedback, and sorry for the slow reply.
On my side the numbers then make sense: the ucp.benchmarks.send_recv stats match the unidirectional P2P numbers from p2pBandwidthLatencyTest pretty well:
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1213.12 122.00 122.27 121.61
1 129.23 1367.91 123.17 122.63
2 129.32 118.11 1366.12 123.31
3 129.47 129.29 129.06 1366.72
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1474.06 242.82 224.82 223.63
1 256.97 1566.02 244.26 245.83
2 256.62 234.86 1568.58 244.61
3 257.49 257.38 256.58 1561.33
Hi there, I'm facing an issue when trying to run the cudf_merge benchmark locally on a node that hosts 4 H100s:
I can run the benchmark over any pair of GPUs with no issue:
python -m ucp.benchmarks.cudf_merge --devs 0,1 --chunk-size 200_000_000 --iter 10
But it fails when running over all 4 devices:
python -m ucp.benchmarks.cudf_merge --devs 0,1,2,3 --chunk-size 200_000_000 --iter 10
My environment:
Any idea?
Also, I'm surprised by the variability of the benchmark across the 10 successive iterations.
And finally, is the benchmark expected to saturate the available bandwidth between the GPUs?