rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Cloud benchmark comparing performance with on-prem #17

Open mmccarty opened 2 years ago

mmccarty commented 2 years ago

Cloud benchmark comparing performance with on-prem (infra for merge benchmark, tpc-ds)

mroeschke commented 1 year ago

On DGX 11, this is so far slower than the IB on Azure results added in #69

(dask-benchmark) mroeschke@dgx11:~$ nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X  NV1 NV1 NV2 NV2 SYS SYS SYS PIX PHB SYS SYS 0-19,40-59  0
GPU1    NV1  X  NV2 NV1 SYS NV2 SYS SYS PIX PHB SYS SYS 0-19,40-59  0
GPU2    NV1 NV2  X  NV2 SYS SYS NV1 SYS PHB PIX SYS SYS 0-19,40-59  0
GPU3    NV2 NV1 NV2  X  SYS SYS SYS NV1 PHB PIX SYS SYS 0-19,40-59  0
GPU4    NV2 SYS SYS SYS  X  NV1 NV1 NV2 SYS SYS PIX PHB 20-39,60-79 1
GPU5    SYS NV2 SYS SYS NV1  X  NV2 NV1 SYS SYS PIX PHB 20-39,60-79 1
GPU6    SYS SYS NV1 SYS NV1 NV2  X  NV2 SYS SYS PHB PIX 20-39,60-79 1
GPU7    SYS SYS SYS NV1 NV2 NV1 NV2  X  SYS SYS PHB PIX 20-39,60-79 1
mlx5_0  PIX PIX PHB PHB SYS SYS SYS SYS  X  PHB SYS SYS
mlx5_1  PHB PHB PIX PIX SYS SYS SYS SYS PHB  X  SYS SYS
mlx5_2  SYS SYS SYS SYS PIX PIX PHB PHB SYS SYS  X  PHB
mlx5_3  SYS SYS SYS SYS PHB PHB PIX PIX SYS SYS PHB  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

UCX

(dask-benchmark) mroeschke@dgx11:~$ python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx

Merge benchmark
--------------------------------------------------------------------------------
Backend                   | dask
Merge type                | gpu
Rows-per-chunk            | 100000000
Base-chunks               | 8
Other-chunks              | 8
Broadcast                 | default
Protocol                  | ucx
Device(s)                 | 0,1,2,3,4,5,6,7
RMM Pool                  | True
Frac-match                | 0.3
TCP                       | None
InfiniBand                | None
NVLink                    | None
Worker thread(s)          | 1
Data processed            | 23.84 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
16.12 s                   | 1.48 GiB/s
15.64 s                   | 1.52 GiB/s
21.54 s                   | 1.11 GiB/s
================================================================================
Throughput                | 1.34 GiB/s +/- 119.48 MiB/s
Bandwidth                 | 133.58 MiB/s +/- 12.66 MiB/s
Wall clock                | 17.76 s +/- 2.67 s
================================================================================
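As a sanity check on the table above, the per-run throughput is just data processed divided by wall clock. A small Python sketch (values copied from the UCX run) reproduces the reported figures:

```python
# Sanity check: per-run throughput = data processed / wall clock.
# Values copied from the UCX benchmark output above.
data_gib = 23.84  # "Data processed" from the table

for wall_s in (16.12, 15.64, 21.54):
    # Prints 1.48, 1.52, 1.11 GiB/s, matching the reported column
    print(f"{wall_s:6.2f} s -> {data_gib / wall_s:.2f} GiB/s")
```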

TCP

(dask-benchmark) mroeschke@dgx11:~$ python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000

Merge benchmark
--------------------------------------------------------------------------------
Backend                   | dask
Merge type                | gpu
Rows-per-chunk            | 100000000
Base-chunks               | 8
Other-chunks              | 8
Broadcast                 | default
Protocol                  | tcp
Device(s)                 | 0,1,2,3,4,5,6,7
RMM Pool                  | True
Frac-match                | 0.3
Worker thread(s)          | 1
Data processed            | 23.84 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
59.60 s                   | 409.63 MiB/s
53.44 s                   | 456.87 MiB/s
52.81 s                   | 462.28 MiB/s
================================================================================
Throughput                | 441.62 MiB/s +/- 14.13 MiB/s
Bandwidth                 | 32.21 MiB/s +/- 467.30 kiB/s
Wall clock                | 55.28 s +/- 3.06 s
================================================================================
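Putting the two runs side by side (and converting to a common unit by hand), UCX comes out roughly 3x faster than TCP on this machine, which is a much smaller gap than NVLink/IB transports would be expected to provide:

```python
# Mean throughputs reported by the two runs above, in MiB/s.
ucx_mib_s = 1.34 * 1024  # UCX: 1.34 GiB/s
tcp_mib_s = 441.62       # TCP: 441.62 MiB/s

speedup = ucx_mib_s / tcp_mib_s
print(f"UCX is ~{speedup:.1f}x faster than TCP")  # ~3.1x
```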
mroeschke commented 1 year ago

@jacobtomlinson were you thinking of having this be a separate guide in a similar format as the IB on Azure guide?

mmccarty commented 1 year ago

@pentschev This is not what I had expected given our conversation on #69. Could there be more to the DGX setup?

mroeschke commented 1 year ago

Additional DGX11 info

(dask-benchmark) mroeschke@dgx11:~$ ibv_devinfo
hca_id: mlx5_1
    transport:          InfiniBand (0)
    fw_ver:             12.26.4012
    node_guid:          506b:4b03:0028:d542
    sys_image_guid:         506b:4b03:0028:d542
    vendor_id:          0x02c9
    vendor_part_id:         4115
    hw_ver:             0x0
    board_id:           MT_2180110032
    phys_port_cnt:          1
    Device ports:
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       29
            port_lmc:       0x00
            link_layer:     InfiniBand

hca_id: mlx5_3
    transport:          InfiniBand (0)
    fw_ver:             12.26.4012
    node_guid:          506b:4b03:0028:d6de
    sys_image_guid:         506b:4b03:0028:d6de
    vendor_id:          0x02c9
    vendor_part_id:         4115
    hw_ver:             0x0
    board_id:           MT_2180110032
    phys_port_cnt:          1
    Device ports:
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       43
            port_lmc:       0x00
            link_layer:     InfiniBand

hca_id: mlx5_0
    transport:          InfiniBand (0)
    fw_ver:             12.26.4012
    node_guid:          506b:4b03:001b:6dbc
    sys_image_guid:         506b:4b03:001b:6dbc
    vendor_id:          0x02c9
    vendor_part_id:         4115
    hw_ver:             0x0
    board_id:           MT_2180110032
    phys_port_cnt:          1
    Device ports:
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       28
            port_lmc:       0x00
            link_layer:     InfiniBand

hca_id: mlx5_2
    transport:          InfiniBand (0)
    fw_ver:             12.26.4012
    node_guid:          506b:4b03:0035:ebc2
    sys_image_guid:         506b:4b03:0035:ebc2
    vendor_id:          0x02c9
    vendor_part_id:         4115
    hw_ver:             0x0
    board_id:           MT_2180110032
    phys_port_cnt:          1
    Device ports:
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       42
            port_lmc:       0x00
            link_layer:     InfiniBand

(dask-benchmark) mroeschke@dgx11:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp1s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether d8:c4:97:64:f7:24 brd ff:ff:ff:ff:ff:ff
    inet 10.33.227.161/24 brd 10.33.227.255 scope global dynamic enp1s0f0
       valid_lft 6979sec preferred_lft 6979sec
    inet6 fe80::dac4:97ff:fe64:f724/64 scope link
       valid_lft forever preferred_lft forever
3: enp1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether d8:c4:97:64:f7:25 brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:1b:6d:bc brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:28:d5:42 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
6: ib2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:35:eb:c2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
7: ib3: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
    link/infiniband 20:00:11:17:fe:80:00:00:00:00:00:00:50:6b:4b:03:00:28:d6:de brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
8: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:07:0a:9d:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:7ff:fe0a:9db7/64 scope link
       valid_lft forever preferred_lft forever
172: vethcd38231@if171: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 22:93:9e:3a:56:da brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::2093:9eff:fe3a:56da/64 scope link
       valid_lft forever preferred_lft forever
jacobtomlinson commented 1 year ago

cc @pentschev in case you have thoughts on why this is slower.

@mroeschke it would be great to have a guide on configuring IB but I think the main goal here is to compare performance between on-prem machines and cloud machines. The output of this would most likely be a blog post. In terms of narrative I would imagine it would go something like:

My assumption is that TCP is the slowest and IB on-prem is the fastest, with IB on Azure somewhere in the middle. The results don't quite show that right now, so we should dig into what is going on.

pentschev commented 1 year ago

Sorry, I've been out for some personal reasons and am now on PTO, so I didn't see this before. I know dgx11 is on MOFED 4.x, which we no longer support; I would suggest testing on a machine with MOFED >= 5.5.1.0.3.2. I'm also linking you to an internal discussion about that machine, as others have reported issues with IB there as well.
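For anyone reproducing this, the installed MOFED version can be checked with `ofed_info` (a diagnostic fragment; the exact version string format varies by release):

```shell
# Print the installed Mellanox OFED version string,
# e.g. MLNX_OFED_LINUX-5.5-1.0.3.2
ofed_info -s
```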

mmccarty commented 1 year ago

Thanks Peter! Hope you enjoy your time off!