mmccarty opened 2 years ago
On DGX 11, this is so far slower than the IB on Azure results added in #69
(dask-benchmark) mroeschke@dgx11:~$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity NUMA Affinity
GPU0 X NV1 NV1 NV2 NV2 SYS SYS SYS PIX PHB SYS SYS 0-19,40-59 0
GPU1 NV1 X NV2 NV1 SYS NV2 SYS SYS PIX PHB SYS SYS 0-19,40-59 0
GPU2 NV1 NV2 X NV2 SYS SYS NV1 SYS PHB PIX SYS SYS 0-19,40-59 0
GPU3 NV2 NV1 NV2 X SYS SYS SYS NV1 PHB PIX SYS SYS 0-19,40-59 0
GPU4 NV2 SYS SYS SYS X NV1 NV1 NV2 SYS SYS PIX PHB 20-39,60-79 1
GPU5 SYS NV2 SYS SYS NV1 X NV2 NV1 SYS SYS PIX PHB 20-39,60-79 1
GPU6 SYS SYS NV1 SYS NV1 NV2 X NV2 SYS SYS PHB PIX 20-39,60-79 1
GPU7 SYS SYS SYS NV1 NV2 NV1 NV2 X SYS SYS PHB PIX 20-39,60-79 1
mlx5_0 PIX PIX PHB PHB SYS SYS SYS SYS X PHB SYS SYS
mlx5_1 PHB PHB PIX PIX SYS SYS SYS SYS PHB X SYS SYS
mlx5_2 SYS SYS SYS SYS PIX PIX PHB PHB SYS SYS X PHB
mlx5_3 SYS SYS SYS SYS PHB PHB PIX PIX SYS SYS PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
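From this matrix, GPUs 0-1 sit closest (PIX) to mlx5_0, GPUs 2-3 to mlx5_1, GPUs 4-5 to mlx5_2, and GPUs 6-7 to mlx5_3. As a sketch only (hypothetical invocation, assuming UCX's usual device:port naming), the UCX_NET_DEVICES variable can restrict a run to the HCA nearest a given set of GPUs when experimenting with one NUMA node at a time:

# Sketch: pin UCX to the HCA nearest GPUs 0-1 (PIX in the matrix above);
# dask-cuda can also be configured to pick the nearest device per GPU automatically.
UCX_NET_DEVICES=mlx5_0:1 python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1 --protocol ucx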
UCX
(dask-benchmark) mroeschke@dgx11:~$ python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
16.12 s | 1.48 GiB/s
15.64 s | 1.52 GiB/s
21.54 s | 1.11 GiB/s
================================================================================
Throughput | 1.34 GiB/s +/- 119.48 MiB/s
Bandwidth | 133.58 MiB/s +/- 12.66 MiB/s
Wall clock | 17.76 s +/- 2.67 s
================================================================================
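Note that the UCX run above reports TCP, InfiniBand, and NVLink as None, i.e. none of the UCX transports were explicitly requested. A re-run with them enabled might look like the following (assuming this dask-cuda version accepts the --enable-tcp-over-ucx/--enable-infiniband/--enable-nvlink flags that these report fields correspond to):

# Same merge benchmark, but with the UCX transports explicitly enabled.
python -m dask_cuda.benchmarks.local_cudf_merge \
    --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx \
    --enable-tcp-over-ucx --enable-infiniband --enable-nvlink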
TCP
(dask-benchmark) mroeschke@dgx11:~$ python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | tcp
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
59.60 s | 409.63 MiB/s
53.44 s | 456.87 MiB/s
52.81 s | 462.28 MiB/s
================================================================================
Throughput | 441.62 MiB/s +/- 14.13 MiB/s
Bandwidth | 32.21 MiB/s +/- 467.30 kiB/s
Wall clock | 55.28 s +/- 3.06 s
================================================================================
@jacobtomlinson were you thinking of having this be a separate guide in a similar format to the IB on Azure guide?
@pentschev This is not what I had expected given our conversation on #69. Could there be more to the DGX setup?
Additional DGX11 info
(dask-benchmark) mroeschke@dgx11:~$ ibv_devinfo
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 12.26.4012
node_guid: 506b:4b03:0028:d542
sys_image_guid: 506b:4b03:0028:d542
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 29
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 12.26.4012
node_guid: 506b:4b03:0028:d6de
sys_image_guid: 506b:4b03:0028:d6de
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 43
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.26.4012
node_guid: 506b:4b03:001b:6dbc
sys_image_guid: 506b:4b03:001b:6dbc
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 28
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 12.26.4012
node_guid: 506b:4b03:0035:ebc2
sys_image_guid: 506b:4b03:0035:ebc2
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 42
port_lmc: 0x00
link_layer: InfiniBand
(dask-benchmark) mroeschke@dgx11:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp1s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether d8:c4:97:64:f7:24 brd ff:ff:ff:ff:ff:ff
inet 10.33.227.161/24 brd 10.33.227.255 scope global dynamic enp1s0f0
valid_lft 6979sec preferred_lft 6979sec
inet6 fe80::dac4:97ff:fe64:f724/64 scope link
valid_lft forever preferred_lft forever
3: enp1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether d8:c4:97:64:f7:25 brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:1b:6d:bc brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:28:d5:42 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
6: ib2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:35:eb:c2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
7: ib3: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:28:d6:de brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
8: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:07:0a:9d:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:7ff:fe0a:9db7/64 scope link
valid_lft forever preferred_lft forever
172: vethcd38231@if171: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 22:93:9e:3a:56:da brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::2093:9eff:fe3a:56da/64 scope link
valid_lft forever preferred_lft forever
cc @pentschev in case you have thoughts on why this is slower.
@mroeschke it would be great to have a guide on configuring IB, but I think the main goal here is to compare performance between on-prem and cloud machines. The output of this would most likely be a blog post. In terms of narrative, I would imagine it would go something like:
My assumption is that TCP is the slowest and IB on-prem is the fastest, with IB on Azure somewhere in the middle. The results don't quite seem to be showing that right now, so we should dig into what is going on.
Sorry, I've been out for personal reasons and am now on PTO, so I didn't see this before. I know dgx11 is on MOFED 4.x, which we no longer support; I would suggest testing on a machine with MOFED >= 5.5.1.0.3.2. I'm also linking you to an internal discussion on that machine, as others were reporting issues with IB there as well.
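For reference, one way to confirm which MOFED version a node is running (assuming the standard ofed_info utility shipped with MOFED is installed):

# Prints the installed Mellanox OFED version string, e.g. MLNX_OFED_LINUX-5.x-...
ofed_info -s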
Thanks Peter! Hope you enjoy your time off!
Cloud benchmark comparing performance with on-prem (infra for merge benchmark, tpc-ds)