openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Please help understanding the performance of AmgX with GPUDirect communication #7369

Open mkre opened 2 years ago

mkre commented 2 years ago

Describe the bug

We are in the process of evaluating the performance of AmgX on our GPU cluster. AmgX has an optional setting to enable GPUDirect MPI communication. However, enabling it seems to cause a performance decline rather than an improvement compared to the vanilla implementation, which uses host staging. I added simple timing instrumentation (sketched below the links) to these two AmgX functions (which one is called depends on the AmgX GPUDirect setting):

  1. vanilla: https://github.com/NVIDIA/AMGX/blob/main/base/src/distributed/comms_visitors3.cu#L59
  2. GPUDirect: https://github.com/NVIDIA/AMGX/blob/main/base/src/distributed/comms_visitors3.cu#L108
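
The instrumentation is essentially just a wall-clock timer around each MPI_Isend. A minimal sketch of the idea (illustrative names only, not the actual AmgX code):

#include <mpi.h>
#include <chrono>
#include <cstdio>

// Measures only the time spent inside the non-blocking send call itself,
// similar in spirit to the "ISend ... B took ... us" lines below.
static void timed_isend(const void *buf, int bytes, int dest, int tag,
                        MPI_Comm comm, MPI_Request *req, const char *label)
{
    auto t0 = std::chrono::steady_clock::now();
    MPI_Isend(buf, bytes, MPI_BYTE, dest, tag, comm, req);
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("%s ISend %d B took %lld us\n", label, bytes, (long long)us);
}

In the timings below, the label is "Host" for the vanilla path (host staging buffer) and "GPUDirect" for the direct path (device buffer).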

Here are the timings of the first 50 invocations of both functions:

  1. vanilla
    Host ISend 35604 B took 283 us
    Host ISend 35604 B took 185 us
    Host ISend 35604 B took 17 us
    Host ISend 35604 B took 15 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 19 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 17 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 15 us
    Host ISend 35604 B took 14 us
    Host ISend 35604 B took 15 us
    Host ISend 35604 B took 184 us
    Host ISend 13436 B took 17 us
    Host ISend 5144 B took 12 us
    Host ISend 2016 B took 12 us
    Host ISend 804 B took 12 us
    Host ISend 360 B took 11 us
    Host ISend 260 B took 11 us
    Host ISend 196 B took 12 us
    Host ISend 84 B took 10 us
    Host ISend 35604 B took 18 us
    Host ISend 35604 B took 169 us
    Host ISend 35604 B took 109 us
    Host ISend 13436 B took 72 us
    Host ISend 5144 B took 28 us
    Host ISend 2016 B took 11 us
    Host ISend 804 B took 10 us
    Host ISend 360 B took 10 us
    Host ISend 260 B took 9 us
    Host ISend 196 B took 9 us
    Host ISend 84 B took 10 us
    Host ISend 20 B took 9 us
    Host ISend 84 B took 9 us
    Host ISend 84 B took 10 us
    Host ISend 84 B took 9 us
    Host ISend 84 B took 10 us
    Host ISend 196 B took 9 us
    Host ISend 196 B took 9 us
    Host ISend 196 B took 8 us
    Host ISend 196 B took 9 us
    Host ISend 260 B took 9 us
  2. GPUDirect
    GPUDirect ISend 35604 B took 18222 us
    GPUDirect ISend 35604 B took 13308 us
    GPUDirect ISend 35604 B took 12 us
    GPUDirect ISend 35604 B took 9 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 12 us
    GPUDirect ISend 35604 B took 9 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 9 us
    GPUDirect ISend 35604 B took 11 us
    GPUDirect ISend 35604 B took 10 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 8 us
    GPUDirect ISend 35604 B took 9 us
    GPUDirect ISend 35604 B took 13617 us
    GPUDirect ISend 13436 B took 6563 us
    GPUDirect ISend 5144 B took 7727 us
    GPUDirect ISend 2016 B took 6630 us
    GPUDirect ISend 804 B took 6635 us
    GPUDirect ISend 360 B took 6660 us
    GPUDirect ISend 260 B took 6568 us
    GPUDirect ISend 196 B took 6870 us
    GPUDirect ISend 80 B took 6772 us
    GPUDirect ISend 35604 B took 13203 us
    GPUDirect ISend 35604 B took 13060 us
    GPUDirect ISend 35604 B took 13108 us
    GPUDirect ISend 13436 B took 6560 us
    GPUDirect ISend 5144 B took 6630 us
    GPUDirect ISend 2016 B took 6625 us
    GPUDirect ISend 804 B took 6580 us
    GPUDirect ISend 360 B took 6536 us
    GPUDirect ISend 260 B took 6529 us
    GPUDirect ISend 196 B took 6539 us
    GPUDirect ISend 80 B took 6608 us
    GPUDirect ISend 16 B took 6535 us
    GPUDirect ISend 80 B took 6576 us
    GPUDirect ISend 80 B took 9 us
    GPUDirect ISend 80 B took 6 us
    GPUDirect ISend 80 B took 8 us
    GPUDirect ISend 196 B took 6739 us
    GPUDirect ISend 196 B took 7 us
    GPUDirect ISend 196 B took 6 us
    GPUDirect ISend 196 B took 7 us
    GPUDirect ISend 260 B took 6578 us

It is clear that some invocations of this function are significantly more expensive with GPUDirect. Specifically, the first invocation for a given buffer size is very expensive (several milliseconds). On the other hand, the fastest invocations for a given buffer size are faster with GPUDirect than with the vanilla path, as expected.

FWIW, I have checked the performance of our Open MPI + UCX stack using osu_bw and osu_latency and it is looking alright:

# OSU MPI-CUDA Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       2.51
2                       5.08
4                      10.13
8                      20.14
16                     40.35
32                     73.64
64                    147.13
128                   289.01
256                   569.73
512                  1077.44
1024                 1960.84
2048                 3190.99
4096                 4942.88
8192                 6507.32
16384                6545.94
32768                6622.82
65536                6581.28
131072               6568.10
262144              12188.64
524288              12251.55
1048576             12270.51
2097152             12277.70
4194304             12289.72
# OSU MPI-CUDA Latency Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.98
1                       2.54
2                       2.51
4                       2.51
8                       2.51
16                      2.52
32                      2.54
64                      2.73
128                     2.74
256                     2.80
512                     2.88
1024                    3.35
2048                    3.47
4096                    4.35
8192                    4.96
16384                   7.22
32768                   9.51
65536                  13.96
131072                 24.91
262144                 29.43
524288                 51.07
1048576                93.68
2097152               179.32
4194304               353.82

Is there any explanation or remedy for this behavior?

Steps to Reproduce

ucx_info -v
# UCT version=1.11.1 revision c58db6b
# configured with: --prefix=/u/ydfb4q/tpl/ucx/build/1.11.1/install --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --disable-static --with-verbs=/u/ydfb4q/tpl/ucx/mofed-4.6/usr --with-rdmacm=/u/ydfb4q/tpl/ucx/mofed-4.6/usr --with-knem=/u/ydfb4q/tpl/ucx/mofed-4.6/opt/knem-1.1.3.90mlnx1 --without-java --with-gdrcopy=/u/ydfb4q/.gradle/caches/cda/tpls/gdrcopy-2.1-linux-x86_64 --with-cuda=/u/ydfb4q/.gradle/caches/cda/tpls/cuda_toolkit-11.0.2-full-linux-x86_64

Setup and versions

2 similar nodes, each with the following setup:

> cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)
> ofed_info -s
MLNX_OFED_LINUX-4.6-1.0.1.1:
> nvidia-smi
Fri Sep 10 05:23:44 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:15:00.0 Off |                    0 |
| N/A   31C    P0    40W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:16:00.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   33C    P0    39W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:B3:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |     32MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
> nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     PIX     NODE    NODE    0-19,40-59      0
GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     PIX     NODE    NODE    0-19,40-59      0
GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     NODE    NODE    NODE    0-19,40-59      0
GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     NODE    NODE    NODE    0-19,40-59      0
GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     SYS     SYS     SYS     20-39,60-79     1
GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     SYS     SYS     SYS     20-39,60-79     1
GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     SYS     SYS     SYS     20-39,60-79     1
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      SYS     SYS     SYS     20-39,60-79     1
mlx5_0  PIX     PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE
mlx5_1  NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      PIX
mlx5_2  NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
> lsmod|grep -P '(nv_peer|gdrdrv|nvidia)'
gdrdrv                 17982  0
nv_peer_mem            13163  0
nvidia_drm             48854  0
nvidia_modeset       1221742  1 nvidia_drm
nvidia_uvm            983887  2
nvidia              34081992  823 nv_peer_mem,gdrdrv,nvidia_modeset,nvidia_uvm
drm_kms_helper        179394  2 mgag200,nvidia_drm
drm                   429744  5 ttm,drm_kms_helper,mgag200,nvidia_drm
ib_core               300520  11 rdma_cm,ib_cm,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib

Additional information (depending on the issue)

mkre commented 2 years ago

Anyone got an idea about this? Pinging @bureddy...

bureddy commented 2 years ago

@mkre The first transfer with GPUDirect RDMA is expected to have high overhead because it involves registering the CUDA memory with the IB HCA. Is it possible to reuse the buffers in the application?
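
To illustrate the point, here is a minimal sketch (assuming a CUDA-aware MPI over UCX; this is not AmgX code): if the application allocates its device send buffers once and reuses them, every send after the first can be served from UCX's memory registration cache, so the registration cost is paid only on the first transfer from each buffer.

#include <mpi.h>
#include <cuda_runtime.h>

// Device send buffer allocated once at setup and reused for every send.
// Repeated sends from the same region can hit the registration cache
// instead of triggering a new registration.
struct SendBuffer {
    void  *dev_ptr;
    size_t capacity;
};

static SendBuffer make_send_buffer(size_t max_bytes)
{
    SendBuffer b;
    b.capacity = max_bytes;
    cudaMalloc(&b.dev_ptr, max_bytes);   // registered on demand at first use
    return b;
}

static void isend_from(const SendBuffer &b, size_t bytes, int dest, int tag,
                       MPI_Comm comm, MPI_Request *req)
{
    // bytes must not exceed b.capacity; the same device region is reused.
    MPI_Isend(b.dev_ptr, (int)bytes, MPI_BYTE, dest, tag, comm, req);
}

That pattern matches the timings above: repeated sends of the same size are fast, while the multi-millisecond outliers are consistent with first-time registrations of new device regions.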

mkre commented 2 years ago

@bureddy I guess that question is one for the AmgX devs to answer. Do you know of anyone who has seen performance benefits from using GPUDirect with AmgX? Should I raise my question on the AmgX issue tracker, or do you know anyone working on AmgX whom you could ping here? (Might be a long shot, but you do work for the same company now...)