For building ucx-py, I have to use module load hwloc. Perhaps that's also necessary for running it in your case. EDIT: I see you are using the environment I built, so yes, definitely try that.
module load hwloc doesn't seem to work. I think there may be build issues with the UCX binary. For now, @pentschev and I rebuilt the UCX binary and UCX-Py based on the instructions here:
https://ucx-py.readthedocs.io/en/latest/install.html#ucx-ofed
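For anyone reproducing this, the linked instructions boil down to roughly the following (a sketch only; $INSTALL_DIR and $CUDA_HOME are placeholders for the local install and CUDA directories):
# Sketch of a source build of UCX with IB verbs + CUDA support, per the page above
$ git clone https://github.com/openucx/ucx
$ cd ucx && ./autogen.sh
$ ./configure --prefix=$INSTALL_DIR --enable-mt --with-cuda=$CUDA_HOME \
      --with-verbs --with-rc --with-ud --with-dc --with-mlx5-dv --with-cm
$ make -j install
# UCX-Py is then built/installed against this UCX as described on the same page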
We are now going through a set of benchmark tests to confirm proper IB setup. For example, we can test at a high level with a dask-cuda benchmark of merging dataframes. We can also test point-to-point comms with ucx-py.
With the merge benchmark we are not seeing good performance; we expect ~11 GB/s for comms over IB:
(rapids-env) (rapids-env) CUPY_CACHE_DIR=$MEMBERWORK/gen119/cupy-cache python local_cudf_merge.py --scheduler-addr ucx://10.134.13.13:8786 --no-rmm-pool
Merge benchmark
-------------------------------
data-processed | 192.00 MB
===============================
Wall-clock | Throughput
-------------------------------
1.31 s | 146.35 MB/s
1.16 s | 165.25 MB/s
670.40 ms | 286.40 MB/s
For point-to-point IB benchmark comms on Summit we observe ~11 GB/s (though there is a suspicion that something is incorrect about this test).
CUPY_CACHE_DIR=$MEMBERWORK/gen119/cupy-cache UCX_NET_DEVICES=mlx5_3:1 UCX_TLS=tcp,sockcm,cuda_copy,rc UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 5 --object_type rmm --reuse-alloc --n-bytes 1GB --client-only --server-address 10.41.13.13 --port 59696 --n-iter 100
Roundtrip benchmark
--------------------------
n_iter | 100
n_bytes | 1000.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 5
Average | 10.99 GB/s
I should also mention that @MattBBaker joined us for the IB testing session. Thanks @MattBBaker!
I've also been running UCX-Py benchmarks on a DGX-1 and I see around 10GB/s for 100MB transfers:
# DGX-1
$ UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 1.43 GB/s
--------------------------
Iterations
--------------------------
000 |164.12 MB/s
001 | 10.09 GB/s
002 | 10.06 GB/s
003 | 9.96 GB/s
004 | 10.07 GB/s
005 | 10.05 GB/s
006 | 10.00 GB/s
007 | 9.91 GB/s
008 | 9.95 GB/s
009 | 9.96 GB/s
However, I see only 300MB/s using mlx5_N:1 devices and 2-3GB/s for ib0 on ORNL's Summit:
$ UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 314.09 MB/s
--------------------------
Iterations
--------------------------
000 |315.63 MB/s
001 |313.87 MB/s
002 |314.63 MB/s
003 |313.69 MB/s
004 |311.87 MB/s
005 |314.88 MB/s
006 |313.43 MB/s
007 |313.83 MB/s
008 |315.14 MB/s
009 |314.02 MB/s
$ UCX_NET_DEVICES=ib0 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 3.40 GB/s
--------------------------
Iterations
--------------------------
000 | 3.28 GB/s
001 | 3.42 GB/s
002 | 3.42 GB/s
003 | 3.40 GB/s
004 | 3.41 GB/s
005 | 3.41 GB/s
006 | 3.41 GB/s
007 | 3.41 GB/s
008 | 3.42 GB/s
009 | 3.42 GB/s
Pinging @shamisp
Can you try regular rc and not rc_x ?
Originally we were trying with rc and not rc_x and saw similar results.
@pentschev what is the difference in https://github.com/rapidsai/ucx-py/issues/616#issuecomment-700770501 between the 10 GBs and 300 MBs command lines?
Sorry @yosefe , I forgot to write that's on Summit, I updated the comment now. The first report is on a DGX-1, the remaining two are both on Summit.
What's really triggering me now are the errors below:
[1601415622.654729] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_1: failed to create registration cache: Unsupported operation
[1601415622.661220] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_3: failed to create registration cache: Unsupported operation
[1601415622.666450] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_0: failed to create registration cache: Unsupported operation
[1601415622.670426] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_2: failed to create registration cache: Unsupported operation
[1601415622.673312] [h13n02:130948:0] knem_md.c:374 UCX DEBUG Could not create registration cache: Unsupported operation
[1601415625.195927] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_1: failed to create registration cache: Unsupported operation
[1601415625.201908] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_3: failed to create registration cache: Unsupported operation
[1601415625.207023] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_0: failed to create registration cache: Unsupported operation
[1601415625.211004] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_2: failed to create registration cache: Unsupported operation
[1601415625.213830] [h13n02:130962:0] knem_md.c:374 UCX DEBUG Could not create registration cache: Unsupported operation
[1601415627.302291] [h13n02:130962:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x200083800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415627.804828] [h13n02:130962:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x200083800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415627.548121] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415628.049413] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415628.548910] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415629.048608] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415629.549037] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
Those errors don't happen on a DGX-1, and we see 10GB/s bandwidth there. They also don't happen on Summit for host memory, where I can also measure 10GB/s. It only happens for CUDA memory, and then we see very low bandwidth; if you look closely at the timestamps, they're 500ms apart from each other, which probably means there's some sort of timeout happening and that's what makes it so slow.
Any ideas of what could be the reason for those errors?
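In the meantime, one way to check what a given UCX build actually picked up (just a suggestion, it may not reveal anything new):
# Show the configure line of the installed UCX
$ ucx_info -v
# List transports and memory domains; CUDA/gdrcopy entries should show up here if support was built in
$ ucx_info -d | grep -i -e cuda -e gdr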
Hi all,
The next plot is what I got by running the local-send-recv.py benchmark on a Summit node. It uses ucx-py 0.16. Note that performance approaches 50GB/s as message size increases, which I would say is expected behavior.
These results follow the same trend reported in slide 46 of:
Pritchard et al., "Getting It Right with Open MPI: Best Practices for Deployment and Tuning of Open MPI", Exascale Computing Project Annual Meeting 2020, 2020-02-03 (Houston, Texas, United States)
Feel free to review the compilation flags I used for UCX in
/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx/ucx.sh
and the LSF script with UCX env. variables I used to execute the local-send-recv.py benchmark in
/gpfs/alpine/world-shared/stf011/benjha/ucx/lsf/launch_test.lsf
/gpfs/alpine/world-shared/stf011/benjha/ucx/lsf/run_test.sh
I didn't observe any intra-node issues with ucx-py either, @benjha . However, we're mainly worried about internode performance over IB.
For those without access to @benjha's files, the test was executed with the following config:
export UCX_RNDV_SCHEME='get_zcopy'
export UCX_NET_DEVICES='mlx5_0:1,mlx5_3:1'
export UCX_MAX_RNDV_RAILS=2
export UCX_TLS='rc_x,sm,cuda_copy,cuda_ipc,gdr_copy'
export UCX_RNDV_THRESH=1
#export UCX_TCP_TX_SEG_SIZE='10M'
UCX was built with:
./configure CC=gcc CXX=g++ \
CXXFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O3" \
CFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O3" \
--prefix=$INSTALL_DIR \
--enable-compiler-opt=3 --enable-optimizations \
--enable-mt --with-mcpu=powerpc64le \
--enable-debug \
--with-cuda=$CUDA_DIR --with-knem=$KNEM_DIR \
--with-verbs='/usr' \
--with-gdrcopy=$OLCF_GDRCOPY_ROOT \
--with-rc --with-ud --with-dc --with-mlx5-dv \
--with-cm --with-dm
And the test was between GPUs 0,1:
python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes $MSG_SIZE --port 12345
As @jglaser notes, I think we either want internode testing or testing just IB between GPUs: UCX_TLS=rc,cuda_copy
The important part in the flags above is that you have UCX_TLS=...,cuda_ipc, which means intranode transfers will go over NVLink. However, removing cuda_ipc exposes the issue with IB, and if we have IB issues intranode we'll also have issues internode. Debugging is infinitely easier within a single node, so I'm trying to resolve that before we scale multi-node.
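In other words, a single-node run like the sketch below (same as the commands above, just without cuda_ipc in UCX_TLS) already exercises the IB path:
# Intranode, but GPU-to-GPU transfers are forced over the HCA instead of NVLink
$ UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy,tcp,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm \
      python local-send-recv.py --server-dev 0 --client-dev 1 \
      --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345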
According to the message [1601415627.302291] [h13n02:130962:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x200083800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1), we only support memory type 0x1 (reg_mem_types), which is UCS_MEMORY_TYPE_HOST, as per https://github.com/openucx/ucx/blob/6b295583dab5e1673de2edc21d343bca14cbaf93/src/ucs/memory/memory_type.h#L26-L33 . We try to register memory type UCS_MEMORY_TYPE_CUDA, and that fails. On our DGX-1 build, reg_mem_types is 0x17, meaning CUDA is also supported since the 0x2 bit is set as well.
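Spelling out the bitmask from that header (HOST is bit 0x1, CUDA is bit 0x2):
# reg_mem_types is a bitmask of UCS_BIT(memory_type)
$ echo $(( 0x17 & 0x2 ))   # DGX-1 build: prints 2, CUDA registration supported
$ echo $(( 0x01 & 0x2 ))   # Summit build: prints 0, host memory only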
I think our Summit build is really missing some configuration that gets picked up automatically when building on DGX-1 and we probably need to specify something else at build time.
@hoopoepg and I spent a couple of hours looking into various configurations on Summit and trying to replicate the environment that we know works on a DGX-1, without success. We tried enabling/disabling knem and gdrcopy, as well as checking that the required CUDA capabilities are correctly built into UCX's Summit builds, but in all cases we still get the same registration errors as before, with low performance. We also verified that the libraries we're using on Summit (RMM, CuPy) link dynamically to CUDA, since @hoopoepg pointed out that static linking may cause issues when UCX tries to intercept cudaMalloc calls. Finally, I tried building a CUDA 10.1 environment on a DGX-1 (to match the CUDA version Summit uses) and I'm still able to get 10GB/s on the DGX with that build.
Right now we don't have a solution for the errors above and low performance. Also pinging @Akshay-Venkatesh and @spotluri in case they have ideas.
cc @bureddy in case you have any ideas as well.
I just tried running with UCX_IB_GPU_DIRECT_RDMA=no on the DGX-1 and I now see the same issues as on Summit:
$ UCX_IB_GPU_DIRECT_RDMA=no UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type cupy --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | cupy
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 624.59 MB/s
--------------------------
Iterations
--------------------------
000 |636.15 MB/s
001 |638.04 MB/s
002 |600.98 MB/s
003 |631.51 MB/s
004 |595.57 MB/s
005 |628.88 MB/s
006 |631.80 MB/s
007 |624.86 MB/s
008 |629.48 MB/s
009 |631.78 MB/s
And looking at UCX debug info I see the same errors that I see on Summit:
[1601469008.461828] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fac80dd0000..0x7fac80e55000 on mlx5_0 lkey 0xfbc6e rkey 0xfbc6e access 0xf flags 0xe4
[1601469010.202777] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x7f923789b000..0x7f9237920000 on mlx5_0 lkey 0xffa1f rkey 0xffa1f access 0xf flags 0xe4
[1601469010.210819] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x7f9226000000..0x7f9226600000 on mlx5_0 lkey 0x187993 rkey 0x187993 access 0xf flags 0xe4
[1601469010.209081] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fac67800000..0x7fac67e00000 on mlx5_0 lkey 0x189096 rkey 0x189096 access 0xf flags 0xe4
[1601469010.218336] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fa9d8400000..0x7fa9daa00000 on mlx5_0 lkey 0x9426c rkey 0x9426c access 0xf flags 0xe4
[1601469010.234924] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registerxf/0x3f
[1601469010.223760] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fa9d2a00000..0x7fa9d3e00000 on mlx5_0 lkey 0x7f7d3 rkey 0x7f7d3 access 0xf flags 0xe4
[1601469010.240738] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x7f8f92a00000..0x7f8f93e00000 on mlx5_0 lkey 0x7dfbb rkey 0x7dfbb access 0xf flags 0xe4
[1601469011.701783] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_tyCE deactivate iface 0x559c85738e70 force=0 acount=1 aifaces=3
[1601469010.238758] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fabf9200000..0x7fabf9c00000 on mlx5_0 lkey 0xcd096 rkey 0xcd096 access 0xf flags 0xe4
[1601469011.704771] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x5589cb400000..0x5589cbe00000 on mlx5_0 lkey 0x586f3 rkey 0x586f3 access 0xf flags 0xe4
[1601469011.723128] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x5589cc000000..0x5589cca00000 on mlx5_0 lkey 0xd295b rkey 0xd295b access 0xf flags 0xe4
[1601469011.876672] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x559cb1200000..0x559cb1c00000 on mlx5_0 lkey 0x82a6d rkey 0x82a6d access 0xf flags 0xe4
[1601469012.029100] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.346932] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.176187] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.504545] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.676050] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.835303] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.985642] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.168962] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.324069] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.499957] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.825191] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.649949] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.974739] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.302293] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.150769] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.629782] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100 arm iface 0x559c8578e110 returned Device is busy
[1601469014.478059] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.804649] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
It looks like this is some configuration issue on Summit's compute nodes. I remember we had this kind of problem on our DGX-1s in the past; it was resolved at the system level by our devops team with some software updates and configuration changes.
@jglaser @benjha is this something you can check with Summit admins?
@Akshay-Venkatesh I remember you helped our devops team figure out the right configuration; have you ever tested GPUDirect RDMA on Summit?
I've tested GPUDirect RDMA in the past and I just ran again to double check. Seems like performance is as expected:
$ date
Wed Sep 30 10:03:59 EDT 2020
$ ucx_info -v
# UCT version=1.10.0 revision 8e96fc6
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=$UCX_HOME --enable-mt --with-cuda=/sw/summit/cuda/11.0.2 --with-gdrcopy=/sw/summit/gdrcopy/2.0
$ ompi_info | grep Configure
Configured architecture: powerpc64le-unknown-linux-gnu
Configure command line: '--prefix=$OMPI_HOME' '--enable-oshmem' '--enable-orterun-prefix-by-default' '--with-cuda=/sw/summit/cuda/11.0.2' '--with-ucx=$UCX_HOME' '--with-ucx-libdir=$UCX_HOME/lib' '--enable-mca-no-build=btl-uct' '--with-pmix=internal'
$ mpirun -np 2 --npernode 1 --oversubscribe --host e03n16,h31n13 --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=y -x UCX_TLS=rc_x,mm,cuda_copy,gdr_copy,cuda_ipc $PWD/get_local_rank_ompi_hca mpi/pt2pt/osu_latency D D
local rank 0: using hca mlx5_0:1,mlx5_3:1
local rank 0: using hca mlx5_0:1,mlx5_3:1
# OSU MPI-CUDA Latency Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 2.21
1 3.18
2 3.18
4 3.17
8 3.16
16 3.16
32 3.20
64 3.23
128 3.34
256 3.36
512 3.43
1024 3.63
2048 4.77
4096 4.87
8192 6.48
16384 8.48
32768 11.15
65536 13.50
131072 16.61
262144 24.01
524288 36.59
1048576 60.69
2097152 110.05
4194304 267.14
$ mpirun -np 2 --npernode 1 --oversubscribe --host e03n16,h31n13 --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=y -x UCX_TLS=rc_x,mm,cuda_copy,gdr_copy,cuda_ipc $PWD/get_local_rank_ompi_hca mpi/pt2pt/osu_bw D D
local rank 0: using hca mlx5_0:1,mlx5_3:1
local rank 0: using hca mlx5_0:1,mlx5_3:1
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 1.34
2 2.60
4 5.33
8 10.66
16 20.74
32 42.43
64 81.09
128 161.11
256 319.54
512 598.04
1024 1111.55
2048 1941.04
4096 3279.54
8192 5083.34
16384 5221.31
32768 13783.94
65536 18016.02
131072 20156.46
262144 21512.21
524288 22104.22
1048576 22373.62
2097152 22469.21
4194304 18302.49
The OpenMPI build used to get these results doesn't use the wakeup feature, so that may change things, but I'm not sure whether UCX-Py uses wakeup or not.
UCX-Py uses the wakeup feature by default, but I tried disabling it and running in non-blocking mode to see if that would change anything and I still see the same errors.
The registration errors that we see come from the UCX layer, though, not from UCX-Py. It may still be the case that we're misconfiguring something, but I don't see any hints as to what's causing it, except for what I wrote in https://github.com/rapidsai/ucx-py/issues/616#issuecomment-701362602 . If anyone knows how we could identify what's causing this, or has suggestions for something we should be doing differently on Summit that we don't need to do on a DGX-1, that's very welcome.
To reiterate what @pentschev said, we have successfully tested UCX-Py and very large workloads on many systems. When we have seen errors in the past, they have generally pointed to system configuration issues, but we don't know how to identify them easily. For example, in the past we found some machines without nv_peer_mem (this one was a bit obvious). Are there MLNX configuration issues we can easily check? Would someone have time to review both systems with us?
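A few things that should be quick for admins (or us, where permissions allow) to look at; these are only suggestions, not an exhaustive list:
# Is the GPUDirect RDMA kernel module loaded?
$ lsmod | grep nv_peer_mem
# Which MLNX OFED release is installed?
$ ofed_info -s
# Are the HCA ports active and at the expected rate?
$ ibstat | grep -e State -e Rate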
We just discussed this offline with @yosefe and he pointed out that we need to unset UCX_MEM_MMAP_HOOK_MODE. This is set by default on Summit or by some of its modules. Doing that resolves the UCX-Py issues:
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 0
Average | 6.85 GB/s
--------------------------
Iterations
--------------------------
000 | 1.97 GB/s
001 | 9.45 GB/s
002 | 9.45 GB/s
003 | 9.44 GB/s
004 | 9.44 GB/s
005 | 9.46 GB/s
006 | 9.46 GB/s
007 | 9.47 GB/s
008 | 9.47 GB/s
009 | 9.46 GB/s
@jglaser @benjha can you try that as well and see how it performs?
Can anyone list the pieces needed, so we can verify with HPC ops whether they are set up?
It seems that unsetting UCX_MEM_MMAP_HOOK_MODE was everything I needed. Could you try your current scripts with that and see if they perform better? The setup seems to be correct; GPUDirect RDMA worked when I unset that variable.
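For completeness, on my side the change is just making sure the variable is absent from the environment the benchmark runs in, e.g.:
$ unset UCX_MEM_MMAP_HOOK_MODE
$ env | grep UCX_MEM_MMAP_HOOK_MODE   # should print nothing
$ python local-send-recv.py --server-dev 0 --client-dev 1 \
      --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345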
Result of the local-send-recv.py benchmark with the following flags:
export UCX_RNDV_SCHEME='get_zcopy'
export UCX_NET_DEVICES='mlx5_0:1,mlx5_3:1'
export UCX_MAX_RNDV_RAILS=2
export UCX_TLS='rc_x,sm,cuda_copy'
export UCX_RNDV_THRESH=1
Server Running at 10.41.21.51:60474
Client connecting to server at 10.41.21.51:60474
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 294.34 MB/s
--------------------------
Iterations
--------------------------
000 |317.24 MB/s
001 |286.21 MB/s
002 |281.35 MB/s
003 |287.58 MB/s
004 |282.21 MB/s
005 |291.50 MB/s
006 |296.12 MB/s
007 |303.41 MB/s
008 |298.58 MB/s
009 |302.97 MB/s
Adding unset UCX_MEM_MMAP_HOOK_MODE to the above env. variables, as @pentschev suggested, results in:
Server Running at 10.41.21.53:37109
Client connecting to server at 10.41.21.53:37109
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 9.69 GB/s
--------------------------
Iterations
--------------------------
000 | 8.07 GB/s
001 | 10.93 GB/s
002 | 9.81 GB/s
003 | 9.81 GB/s
004 | 9.78 GB/s
005 | 9.81 GB/s
006 | 9.80 GB/s
007 | 9.81 GB/s
008 | 9.81 GB/s
009 | 9.78 GB/s
Btw, all my runs got this error:
Traceback (most recent call last):
File "/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx-py/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx-py/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/gpfs/alpine/stf011/world-shared/benjha/ucx/ucx/ucx-py/benchmarks/local-send-recv.py", line 55, in server
devices=[args.server_dev],
File "/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx-py/lib/python3.7/site-packages/rmm/rmm.py", line 77, in reinitialize
log_file_name=log_file_name,
File "rmm/_lib/memory_resource.pyx", line 305, in rmm._lib.memory_resource._initialize
File "rmm/_lib/memory_resource.pyx", line 365, in rmm._lib.memory_resource._initialize
File "rmm/_lib/memory_resource.pyx", line 64, in rmm._lib.memory_resource.PoolMemoryResource.__cinit__
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
@benjha could you give us more information on how you're setting things up when you see the OOM errors? I think the OOM is not directly related to the issue we're discussing here, so to avoid ending up with an endless thread, I would suggest starting a new issue in this repo to discuss that.
I can confirm @benjha's errors. With --rmm-pool-size=8G on the 16GB V100s I get
Exception: MemoryError('std::bad_alloc: RMM failure at: ../include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded')
for TPCx-BB queries that ran fine previously (but slowly).
Without that argument, I see
Exception: MemoryError('std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory')
On a positive note, without UCX_MEM_MMAP_HOOK_MODE, UCX_RNDV_SCHEME=auto seems to be working (issue #615).
Environment variables for the workers
UCX_TLS=rc_x,sm,cuda_copy,cuda_ipc,gdr_copy
UCX_MAX_RNDV_RAILS=2
UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1
UCX_MEMTYPE_CACHE=y
command line for the workers
UCX_RNDV_SCHEME=auto jsrun -n 36 -a 1 -g 6 -c 42 -b rs -D UCX_MEM_MMAP_HOOK_MODE --smpiargs="-disable_gpu_hooks" dask-cuda-worker --scheduler-file my-scheduler-ucx.json --memory-limit 160GB --enable-infiniband --enable-nvlink --death-timeout 60 --interface ib0 --nthreads 1 --local-directory /mnt/bb/$USER
I haven't tested the 32GB GPUs yet.
Can you try to use a pool size that's very close to the total amount of GPU memory? Those are 16GB GPUs, so I'd recommend 15GB, or 14GB if 15GB is still too much. The cuda_ipc transport can't unregister memory, which prevents such buffers from being released; that's why we need the pool to be used for all allocations of the application.
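Something along these lines (a sketch; the scheduler file and other flags mirror the worker command above):
# Give each worker an RMM pool close to the full 16GB of the V100
$ dask-cuda-worker --scheduler-file my-scheduler-ucx.json \
      --rmm-pool-size=15GB --enable-infiniband --enable-nvlink \
      --interface ib0 --nthreads 1 --local-directory /mnt/bb/$USER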
No luck yet with either of these pool sizes. Will try on the 32 GB GPUs as soon as I get access.
UCX_RNDV_SCHEME=auto jsrun -n 36 -a 1 -g 6 -c 42 -b rs -D UCX_MEM_MMAP_HOOK_MODE --smpiargs="-disable_gpu_hooks" dask-cuda-worker --scheduler-file my-scheduler-ucx.json --memory-limit 160GB --enable-infiniband --enable-nvlink --death-timeout 60 --interface ib0 --nthreads 1 --local-directory /mnt/bb/$USER
When doing the RAPIDS performance evaluation, we found that in some cases fat workers (e.g. 1 worker with 6 GPUs, 1 worker per node) worked better than thin workers (1 worker per GPU, 6 workers per node); in particular, CuPy's SVD performed better with fat workers and cuDF worked better with thin workers.
It might be something worth exploring with BSQL, @jglaser.
When doing the RAPIDS performance evaluation, we found that in some cases fat workers (e.g. 1 worker with 6 GPUs, 1 worker per node)
How do you address other GPUs then? CuPy, for example, is always going to address GPU 0, which is fine if you have multiple workers each addressing a different GPU, since each worker is always working on GPU 0 relative to the CUDA_VISIBLE_DEVICES ordering; but if you have a single process addressing multiple GPUs, then CuPy won't be able to automatically do work on all GPUs.
jsrun allows the isolation of resources as you describe. On the other hand, I thought Dask distributed the load across GPUs of the same worker, is this the way it works with CuPy? Anyway, for some reason I ended up using 1 GPU per worker...
jsrun allows the isolation of resources as you describe.
That's correct, but when you isolate resources via jsrun, you'll be effectively creating a worker per resource, in that case a resource being a GPU.
On the other hand, I thought Dask distributed the load across GPUs of the same worker, is this the way it works with CuPy?
Mainline Dask will do no addressing of GPUs at all, so libraries such as CuPy and cuDF will run by default on GPU 0, meaning all other GPUs are idle. On the other hand, Dask-CUDA was specifically written to support a one-process(worker)-per-GPU model, in which we set CUDA_VISIBLE_DEVICES for each worker in a round-robin fashion; that means every worker sees a different GPU when it addresses GPU 0. You can, of course, address GPUs other than 0 with CuPy, etc., but that's not handled by Dask today in any scenario, and there's no plan to do that in the future that I know of.
Anyway, for some reason I ended up using 1 GPU per worker...
As I mentioned above, this is the only case supported by Dask-CUDA today, so it's natural that you'd end up using it. However, if you are certain you used a single Dask worker with multiple GPUs, I'd be interested in knowing how it was done; it's not technically impossible, but likely very challenging.
Here's a datapoint with 4MB message size and UCX master (ucx_perftest)... It does look like the bandwidth went up to 10GB/s (CUDA) and 13GB/s (unified memory), without having to modify the rendezvous scheme.
cuda
(rapids-env) bash-4.2$ UCX_TLS=rc_x,sm,cuda_copy,gdr_copy,cuda_ipc jsrun -D UCX_MEM_MMAP_HOOK_MODE -n 2 -a 1 -g 6 -c 42 -b packed:smt:1 --smpiargs="-disable_gpu_hooks" ucx_perftest -m cuda -t tag_bw -s "4194304" -n 10 -T 1
Warning: PAMI CUDA HOOK disabled
Warning: PAMI CUDA HOOK disabled
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
Final: 10 0.000 378.990 378.990 10554.36 10554.36 2639 2639
cuda-managed
(rapids-env) bash-4.2$ UCX_TLS=rc_x,sm,cuda_copy,gdr_copy,cuda_ipc jsrun -D UCX_MEM_MMAP_HOOK_MODE -n 2 -a 1 -g 6 -c 42 -b packed:smt:1 --smpiargs="-disable_gpu_hooks" ucx_perftest -m cuda-managed -t tag_bw -s "4194304" -n 10 -T 1
Warning: PAMI CUDA HOOK disabled
Warning: PAMI CUDA HOOK disabled
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
Final: 10 0.000 303.507 303.507 13179.27 13179.27 3295 3295
I have yet to run the benchmark again.. hopefully I won't see the OOM errors on the 32GB GPUs.
I'm happy to see we're doing better.
I remember it was very challenging for folks to get memory utilization right for TPCx-BB, and indeed adding UCX to the workflow changes the requirements a bit, but it shouldn't double the memory utilization or anything of that sort. Keep in mind that we can't use managed memory with CUDA IPC, so we lose that ability and increase the perceived memory utilization. It's also important to use --device-memory-limit in various TPCx-BB queries to enable dask-cuda spilling to system memory; I remember reading comments from @beckernick that the optimal value was around 50% of the GPU memory for that parameter.
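As a sketch (numbers are illustrative for a 32GB GPU; the flags mirror the worker command earlier in the thread):
# Leave the RMM pool a bit under physical memory and start spilling to host at ~50%
$ dask-cuda-worker --scheduler-file my-scheduler-ucx.json \
      --rmm-pool-size=30GB --device-memory-limit=16GB \
      --enable-infiniband --enable-nvlink --interface ib0 --nthreads 1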
On a side note, what exactly is the limitation of managed memory with regard to IPC/NVLink?
It's a CUDA IPC limitation in itself, see https://github.com/rapidsai/ucx-py/issues/409 for some discussion.
I think we can close this now. @jglaser, are you OK with that?
On Summit, the nodes have the following configuration: each node has 6 GPUs and 4 MLNX devices. I'm not sure what the optimal pairing of GPU and MLNX device should be. Normally, I would rely on hwloc to figure this out; however, on Summit I get errors like the following (when using --net-devices='auto' with dask-cuda).
Still, I can set up the worker manually with something like:
GPU 0
GPU 1
And so on. What should --net-devices and --interface be set to for each of the six GPUs?
cc @MattBBaker in case he has thoughts
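Purely as a hypothetical illustration (SCHEDULER_IP is a placeholder, and the right GPU-to-HCA pairing still needs to be confirmed, e.g. with nvidia-smi topo -m), a manual per-GPU setup might look something like:
# Hypothetical manual mapping, one worker per GPU, each pinned to a nearby HCA
$ CUDA_VISIBLE_DEVICES=0 dask-cuda-worker ucx://SCHEDULER_IP:8786 \
      --enable-infiniband --enable-nvlink --net-devices mlx5_0:1 --interface ib0 &
$ CUDA_VISIBLE_DEVICES=1 dask-cuda-worker ucx://SCHEDULER_IP:8786 \
      --enable-infiniband --enable-nvlink --net-devices mlx5_0:1 --interface ib0 &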