For building ucx-py, I have to use module load hwloc. Perhaps that's also necessary for running it in your case. EDIT: I see you are using the environment I built, so yes, definitely try that.
module load hwloc doesn't seem to work. I think there may be build issues with the UCX binary. For now, @pentschev and I rebuilt the UCX binary and UCX-Py based on the instructions here:
https://ucx-py.readthedocs.io/en/latest/install.html#ucx-ofed
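For anyone reproducing this, the linked instructions boil down to roughly the following (a sketch only; $INSTALL_DIR and $CUDA_HOME are placeholders for the local install and CUDA directories):
# Sketch of a source build of UCX with IB verbs + CUDA support, per the page above
$ git clone https://github.com/openucx/ucx
$ cd ucx && ./autogen.sh
$ ./configure --prefix=$INSTALL_DIR --enable-mt --with-cuda=$CUDA_HOME \
      --with-verbs --with-rc --with-ud --with-dc --with-mlx5-dv --with-cm
$ make -j install
# UCX-Py is then built/installed against this UCX as described on the same page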
We are now going through a set of benchmark tests to confirm proper IB setup. For example, we can test at a high level with a dask-cuda benchmark of merging dataframes. We can also test point-to-point comms with ucx-py.
With the merge benchmark we are not seeing good performance; we expect ~11 GB/s for comms over IB:
(rapids-env) (rapids-env) CUPY_CACHE_DIR=$MEMBERWORK/gen119/cupy-cache python local_cudf_merge.py --scheduler-addr ucx://10.134.13.13:8786 --no-rmm-pool
Merge benchmark
-------------------------------
data-processed | 192.00 MB
===============================
Wall-clock | Throughput
-------------------------------
1.31 s | 146.35 MB/s
1.16 s | 165.25 MB/s
670.40 ms | 286.40 MB/s
For point-to-point IB benchmark comms on Summit we observe ~11 GB/s (though there is a suspicion that something is incorrect about this test).
CUPY_CACHE_DIR=$MEMBERWORK/gen119/cupy-cache UCX_NET_DEVICES=mlx5_3:1 UCX_TLS=tcp,sockcm,cuda_copy,rc UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 5 --object_type rmm --reuse-alloc --n-bytes 1GB --client-only --server-address 10.41.13.13 --port 59696 --n-iter 100
Roundtrip benchmark
--------------------------
n_iter | 100
n_bytes | 1000.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 5
Average | 10.99 GB/s
I should also mention that @MattBBaker joined us for the IB testing session. Thanks @MattBBaker!
I've also been running UCX-Py benchmarks on a DGX-1 and I see around 10GB/s for 100MB transfers:
# DGX-1
$ UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 1.43 GB/s
--------------------------
Iterations
--------------------------
000 |164.12 MB/s
001 | 10.09 GB/s
002 | 10.06 GB/s
003 | 9.96 GB/s
004 | 10.07 GB/s
005 | 10.05 GB/s
006 | 10.00 GB/s
007 | 9.91 GB/s
008 | 9.95 GB/s
009 | 9.96 GB/s
However, I see only 300MB/s using mlx5_N:1 devices and 2-3GB/s for ib0 on ORNL's Summit:
$ UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 314.09 MB/s
--------------------------
Iterations
--------------------------
000 |315.63 MB/s
001 |313.87 MB/s
002 |314.63 MB/s
003 |313.69 MB/s
004 |311.87 MB/s
005 |314.88 MB/s
006 |313.43 MB/s
007 |313.83 MB/s
008 |315.14 MB/s
009 |314.02 MB/s
$ UCX_NET_DEVICES=ib0 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 3.40 GB/s
--------------------------
Iterations
--------------------------
000 | 3.28 GB/s
001 | 3.42 GB/s
002 | 3.42 GB/s
003 | 3.40 GB/s
004 | 3.41 GB/s
005 | 3.41 GB/s
006 | 3.41 GB/s
007 | 3.41 GB/s
008 | 3.42 GB/s
009 | 3.42 GB/s
Pinging @shamisp
Can you try regular rc and not rc_x ?
Originally we were trying with rc and not rc_x and saw similar results.
@pentschev what is the difference in https://github.com/rapidsai/ucx-py/issues/616#issuecomment-700770501 between the 10 GBs and 300 MBs command lines?
Sorry @yosefe , I forgot to write that's on Summit, I updated the comment now. The first report is on a DGX-1, the remaining two are both on Summit.
What's really triggering me now are the errors below:
[1601415622.654729] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_1: failed to create registration cache: Unsupported operation
[1601415622.661220] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_3: failed to create registration cache: Unsupported operation
[1601415622.666450] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_0: failed to create registration cache: Unsupported operation
[1601415622.670426] [h13n02:130948:0] ib_md.c:1149 UCX DEBUG mlx5_2: failed to create registration cache: Unsupported operation
[1601415622.673312] [h13n02:130948:0] knem_md.c:374 UCX DEBUG Could not create registration cache: Unsupported operation
[1601415625.195927] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_1: failed to create registration cache: Unsupported operation
[1601415625.201908] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_3: failed to create registration cache: Unsupported operation
[1601415625.207023] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_0: failed to create registration cache: Unsupported operation
[1601415625.211004] [h13n02:130962:0] ib_md.c:1149 UCX DEBUG mlx5_2: failed to create registration cache: Unsupported operation
[1601415625.213830] [h13n02:130962:0] knem_md.c:374 UCX DEBUG Could not create registration cache: Unsupported operation
[1601415627.302291] [h13n02:130962:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x200083800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415627.804828] [h13n02:130962:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x200083800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415627.548121] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415628.049413] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415628.548910] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415629.048608] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
[1601415629.549037] [h13n02:130948:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x2000a3800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1)
Those errors don't happen on a DGX-1, and we see 10GB/s bandwidth there. They also don't happen on Summit for host memory, where I can also measure 10GB/s. It only happens for CUDA memory, and then we see very low bandwidth; if you look closely at the timestamps, they're 500ms apart from each other, which probably means there's some sort of timeout happening and that's what makes it so slow.
Any ideas of what could be the reason for those errors?
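In the meantime, one way to check what a given UCX build actually picked up (just a suggestion, it may not reveal anything new):
# Show the configure line of the installed UCX
$ ucx_info -v
# List transports and memory domains; CUDA/gdrcopy entries should show up here if support was built in
$ ucx_info -d | grep -i -e cuda -e gdr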
Hi all,
The next plot is what I got by running the local-send-recv.py benchmark on a Summit node. It uses ucx-py 0.16. Note that performance approaches 50GB/s as message size increases, which I would say is expected behavior.
These results follow the same trend reported in slide 46 of:
Pritchard et al., "Getting It Right with Open MPI: Best Practices for Deployment and Tuning of Open MPI", Exascale Computing Project Annual Meeting 2020, 2020-02-03 (Houston, Texas, United States)
Feel free to review the compilation flags I used for UCX in
/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx/ucx.sh
and the LSF script with UCX env. variables I used to execute the local-send-recv.py benchmark in
/gpfs/alpine/world-shared/stf011/benjha/ucx/lsf/launch_test.lsf
/gpfs/alpine/world-shared/stf011/benjha/ucx/lsf/run_test.sh
I didn't observe any intra-node issues with ucx-py either, @benjha . However, we're mainly worried about internode performance over IB.
For those without access to @benjha's files, the test was executed with the following config:
export UCX_RNDV_SCHEME='get_zcopy'
export UCX_NET_DEVICES='mlx5_0:1,mlx5_3:1'
export UCX_MAX_RNDV_RAILS=2
export UCX_TLS='rc_x,sm,cuda_copy,cuda_ipc,gdr_copy'
export UCX_RNDV_THRESH=1
#export UCX_TCP_TX_SEG_SIZE='10M'
UCX was built with:
./configure CC=gcc CXX=g++ \
CXXFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O3" \
CFLAGS="-m64 -mcpu=powerpc64 -mtune=powerpc64 -O3" \
--prefix=$INSTALL_DIR \
--enable-compiler-opt=3 --enable-optimizations \
--enable-mt --with-mcpu=powerpc64le \
--enable-debug \
--with-cuda=$CUDA_DIR --with-knem=$KNEM_DIR \
--with-verbs='/usr' \
--with-gdrcopy=$OLCF_GDRCOPY_ROOT \
--with-rc --with-ud --with-dc --with-mlx5-dv \
--with-cm --with-dm
And the test was between GPUs 0,1:
python local-send-recv.py --server-dev 0 --client-dev 1 --object_type rmm --reuse-alloc --n-bytes $MSG_SIZE --port 12345
As @jglaser notes, I think we either want internode testing or testing just IB between GPUs: UCX_TLS=rc,cuda_copy
The important part in the flags above is that you have UCX_TLS=...,cuda_ipc, which means intranode transfers will go over NVLink. However, removing cuda_ipc exposes the issue with IB, and if we have IB issues intranode we'll also have issues internode. Debugging is infinitely easier within a single node, so I'm trying to resolve that before we scale multi-node.
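In other words, a single-node run like the sketch below (same as the commands above, just without cuda_ipc in UCX_TLS) already exercises the IB path:
# Intranode, but GPU-to-GPU transfers are forced over the HCA instead of NVLink
$ UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy,tcp,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm \
      python local-send-recv.py --server-dev 0 --client-dev 1 \
      --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345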
According to the message [1601415627.302291] [h13n02:130962:0] ucp_mm.c:137 UCX DEBUG failed to register address 0x200083800000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x1), we only support memory type 0x1 (reg_mem_types), which is UCS_MEMORY_TYPE_HOST, as per https://github.com/openucx/ucx/blob/6b295583dab5e1673de2edc21d343bca14cbaf93/src/ucs/memory/memory_type.h#L26-L33 . We try to register memory type UCS_MEMORY_TYPE_CUDA, and that fails. On our DGX-1 build, reg_mem_types is 0x17, meaning CUDA is also supported since the 0x2 bit is set as well.
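Spelling out the bitmask from that header (HOST is bit 0x1, CUDA is bit 0x2):
# reg_mem_types is a bitmask of UCS_BIT(memory_type)
$ echo $(( 0x17 & 0x2 ))   # DGX-1 build: prints 2, CUDA registration supported
$ echo $(( 0x01 & 0x2 ))   # Summit build: prints 0, host memory only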
I think our Summit build is really missing some configuration that gets picked up automatically when building on DGX-1 and we probably need to specify something else at build time.
@hoopoepg and I spent a couple of hours looking into various configurations on Summit and trying to replicate the environment that we know works on a DGX-1, without success. We tried enabling/disabling knem and gdrcopy, as well as checking that the required CUDA capabilities are correctly built into UCX's Summit builds, but in all cases we still get the same registration errors as before, with low performance. We also verified that the libraries we're using on Summit (RMM, CuPy) link dynamically to CUDA, since @hoopoepg pointed out that static linking may cause issues when UCX tries to intercept cudaMalloc calls. Finally, I tried building a CUDA 10.1 environment on a DGX-1 (to match the CUDA version Summit uses) and I'm still able to get 10GB/s on the DGX with that build.
Right now we don't have a solution for the errors above and low performance. Also pinging @Akshay-Venkatesh and @spotluri in case they have ideas.
cc @bureddy in case you have any ideas as well.
I just tried running with UCX_IB_GPU_DIRECT_RDMA=no on the DGX-1 and I now see the same issues as on Summit:
$ UCX_IB_GPU_DIRECT_RDMA=no UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=tcp,sockcm,cuda_copy,rc_x UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 --client-dev 1 --object_type cupy --reuse-alloc --n-bytes 100MB --port 12345
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | cupy
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 624.59 MB/s
--------------------------
Iterations
--------------------------
000 |636.15 MB/s
001 |638.04 MB/s
002 |600.98 MB/s
003 |631.51 MB/s
004 |595.57 MB/s
005 |628.88 MB/s
006 |631.80 MB/s
007 |624.86 MB/s
008 |629.48 MB/s
009 |631.78 MB/s
And looking at UCX debug info I see the same errors that I see on Summit:
[1601469008.461828] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fac80dd0000..0x7fac80e55000 on mlx5_0 lkey 0xfbc6e rkey 0xfbc6e access 0xf flags 0xe4
[1601469010.202777] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x7f923789b000..0x7f9237920000 on mlx5_0 lkey 0xffa1f rkey 0xffa1f access 0xf flags 0xe4
[1601469010.210819] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x7f9226000000..0x7f9226600000 on mlx5_0 lkey 0x187993 rkey 0x187993 access 0xf flags 0xe4
[1601469010.209081] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fac67800000..0x7fac67e00000 on mlx5_0 lkey 0x189096 rkey 0x189096 access 0xf flags 0xe4
[1601469010.218336] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fa9d8400000..0x7fa9daa00000 on mlx5_0 lkey 0x9426c rkey 0x9426c access 0xf flags 0xe4
[1601469010.234924] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registerxf/0x3f
[1601469010.223760] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fa9d2a00000..0x7fa9d3e00000 on mlx5_0 lkey 0x7f7d3 rkey 0x7f7d3 access 0xf flags 0xe4
[1601469010.240738] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x7f8f92a00000..0x7f8f93e00000 on mlx5_0 lkey 0x7dfbb rkey 0x7dfbb access 0xf flags 0xe4
[1601469011.701783] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_tyCE deactivate iface 0x559c85738e70 force=0 acount=1 aifaces=3
[1601469010.238758] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x7fabf9200000..0x7fabf9c00000 on mlx5_0 lkey 0xcd096 rkey 0xcd096 access 0xf flags 0xe4
[1601469011.704771] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x5589cb400000..0x5589cbe00000 on mlx5_0 lkey 0x586f3 rkey 0x586f3 access 0xf flags 0xe4
[1601469011.723128] [dgx13:29646:0] ib_md.c:719 UCX DEBUG registered memory 0x5589cc000000..0x5589cca00000 on mlx5_0 lkey 0xd295b rkey 0xd295b access 0xf flags 0xe4
[1601469011.876672] [dgx13:29555:0] ib_md.c:719 UCX DEBUG registered memory 0x559cb1200000..0x559cb1c00000 on mlx5_0 lkey 0x82a6d rkey 0x82a6d access 0xf flags 0xe4
[1601469012.029100] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.346932] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.176187] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.504545] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.676050] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.835303] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469012.985642] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.168962] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.324069] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.499957] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.825191] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.649949] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469013.974739] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.302293] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.150769] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.629782] [dgx13:29646:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7f8f4e000000 mem_type bit 0x2 length 100 arm iface 0x559c8578e110 returned Device is busy
[1601469014.478059] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
[1601469014.804649] [dgx13:29555:0] ucp_mm.c:140 UCX DEBUG failed to register address 0x7fa988000000 mem_type bit 0x2 length 100000000 on md[2]=mlx5_0: Unsupported operation (md reg_mem_types 0x15)
It looks like this is some configuration issue on Summit's compute nodes. I remember we had this kind of problem on our DGX-1s in the past; it was resolved at the system level by our devops team with some software updates and configuration changes.
@jglaser @benjha is this something you can check with Summit admins?
@Akshay-Venkatesh I remember you helped our devops team figure out the right configuration; have you ever tested GPUDirect RDMA on Summit?
I've tested GPUDirect RDMA in the past and I just ran again to double check. Seems like performance is as expected:
$ date
Wed Sep 30 10:03:59 EDT 2020
$ ucx_info -v
# UCT version=1.10.0 revision 8e96fc6
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=$UCX_HOME --enable-mt --with-cuda=/sw/summit/cuda/11.0.2 --with-gdrcopy=/sw/summit/gdrcopy/2.0
$ ompi_info | grep Configure
Configured architecture: powerpc64le-unknown-linux-gnu
Configure command line: '--prefix=$OMPI_HOME' '--enable-oshmem' '--enable-orterun-prefix-by-default' '--with-cuda=/sw/summit/cuda/11.0.2' '--with-ucx=$UCX_HOME' '--with-ucx-libdir=$UCX_HOME/lib' '--enable-mca-no-build=btl-uct' '--with-pmix=internal'
$ mpirun -np 2 --npernode 1 --oversubscribe --host e03n16,h31n13 --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=y -x UCX_TLS=rc_x,mm,cuda_copy,gdr_copy,cuda_ipc $PWD/get_local_rank_ompi_hca mpi/pt2pt/osu_latency D D
local rank 0: using hca mlx5_0:1,mlx5_3:1
local rank 0: using hca mlx5_0:1,mlx5_3:1
# OSU MPI-CUDA Latency Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 2.21
1 3.18
2 3.18
4 3.17
8 3.16
16 3.16
32 3.20
64 3.23
128 3.34
256 3.36
512 3.43
1024 3.63
2048 4.77
4096 4.87
8192 6.48
16384 8.48
32768 11.15
65536 13.50
131072 16.61
262144 24.01
524288 36.59
1048576 60.69
2097152 110.05
4194304 267.14
$ mpirun -np 2 --npernode 1 --oversubscribe --host e03n16,h31n13 --mca btl ^openib,smcuda -x LD_LIBRARY_PATH -x UCX_MEMTYPE_CACHE=y -x UCX_TLS=rc_x,mm,cuda_copy,gdr_copy,cuda_ipc $PWD/get_local_rank_ompi_hca mpi/pt2pt/osu_bw D D
local rank 0: using hca mlx5_0:1,mlx5_3:1
local rank 0: using hca mlx5_0:1,mlx5_3:1
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
1 1.34
2 2.60
4 5.33
8 10.66
16 20.74
32 42.43
64 81.09
128 161.11
256 319.54
512 598.04
1024 1111.55
2048 1941.04
4096 3279.54
8192 5083.34
16384 5221.31
32768 13783.94
65536 18016.02
131072 20156.46
262144 21512.21
524288 22104.22
1048576 22373.62
2097152 22469.21
4194304 18302.49
The OpenMPI build used to get these results doesn't use the wakeup feature, so that may change things, but I'm not sure whether UCX-Py uses wakeup or not.
UCX-Py uses the wakeup feature by default, but I tried disabling it and running in non-blocking mode to see if that would change anything and I still see the same errors.
The registration errors that we see come from the UCX layer, though, not from UCX-Py. It may still be the case that we're misconfiguring something, but I don't see any hints as to what's causing it, except for what I wrote in https://github.com/rapidsai/ucx-py/issues/616#issuecomment-701362602 . If anyone knows how we could identify what's causing this, or has suggestions for something we should be doing differently on Summit that we don't need to do on a DGX-1, that's very welcome.
To reiterate what @pentschev said, we have successfully tested UCX-Py and very large workloads on many systems. When we have seen errors in the past, they have generally pointed to system configuration issues, but we don't know how to identify them easily. For example, in the past we found some machines without nv_peer_mem (this one was a bit obvious). Are there MLNX configuration issues we can easily check? Would someone have time to review both systems with us?
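A few things that should be quick for admins (or us, where permissions allow) to look at; these are only suggestions, not an exhaustive list:
# Is the GPUDirect RDMA kernel module loaded?
$ lsmod | grep nv_peer_mem
# Which MLNX OFED release is installed?
$ ofed_info -s
# Are the HCA ports active and at the expected rate?
$ ibstat | grep -e State -e Rate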
We just discussed this offline with @yosefe and he pointed out that we need to unset UCX_MEM_MMAP_HOOK_MODE. This is set by default on Summit or by some of its modules. Doing that resolves the UCX-Py issues:
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 0
Average | 6.85 GB/s
--------------------------
Iterations
--------------------------
000 | 1.97 GB/s
001 | 9.45 GB/s
002 | 9.45 GB/s
003 | 9.44 GB/s
004 | 9.44 GB/s
005 | 9.46 GB/s
006 | 9.46 GB/s
007 | 9.47 GB/s
008 | 9.47 GB/s
009 | 9.46 GB/s
@jglaser @benjha can you try that as well and see how it performs?
Can anyone list the pieces needed, so we can verify with HPC ops whether they are set up?
It seems that unsetting UCX_MEM_MMAP_HOOK_MODE was everything I needed. Could you try your current scripts with that and see if they perform better? The setup seems to be correct; GPUDirect RDMA worked when I unset that variable.
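For completeness, on my side the change is just making sure the variable is absent from the environment the benchmark runs in, e.g.:
$ unset UCX_MEM_MMAP_HOOK_MODE
$ env | grep UCX_MEM_MMAP_HOOK_MODE   # should print nothing
$ python local-send-recv.py --server-dev 0 --client-dev 1 \
      --object_type rmm --reuse-alloc --n-bytes 100MB --port 12345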
Result of the local-send-recv.py benchmark with the following flags:
export UCX_RNDV_SCHEME='get_zcopy'
export UCX_NET_DEVICES='mlx5_0:1,mlx5_3:1'
export UCX_MAX_RNDV_RAILS=2
export UCX_TLS='rc_x,sm,cuda_copy'
export UCX_RNDV_THRESH=1
Server Running at 10.41.21.51:60474
Client connecting to server at 10.41.21.51:60474
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 294.34 MB/s
--------------------------
Iterations
--------------------------
000 |317.24 MB/s
001 |286.21 MB/s
002 |281.35 MB/s
003 |287.58 MB/s
004 |282.21 MB/s
005 |291.50 MB/s
006 |296.12 MB/s
007 |303.41 MB/s
008 |298.58 MB/s
009 |302.97 MB/s
Adding unset UCX_MEM_MMAP_HOOK_MODE to the above env. variables, as @pentschev suggested, results in:
Server Running at 10.41.21.53:37109
Client connecting to server at 10.41.21.53:37109
Roundtrip benchmark
--------------------------
n_iter | 10
n_bytes | 100.00 MB
object | rmm
reuse alloc | True
==========================
Device(s) | 0, 1
Average | 9.69 GB/s
--------------------------
Iterations
--------------------------
000 | 8.07 GB/s
001 | 10.93 GB/s
002 | 9.81 GB/s
003 | 9.81 GB/s
004 | 9.78 GB/s
005 | 9.81 GB/s
006 | 9.80 GB/s
007 | 9.81 GB/s
008 | 9.81 GB/s
009 | 9.78 GB/s
Btw, all my runs got this error:
Traceback (most recent call last):
File "/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx-py/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx-py/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/gpfs/alpine/stf011/world-shared/benjha/ucx/ucx/ucx-py/benchmarks/local-send-recv.py", line 55, in server
devices=[args.server_dev],
File "/gpfs/alpine/world-shared/stf011/benjha/ucx/ucx-py/lib/python3.7/site-packages/rmm/rmm.py", line 77, in reinitialize
log_file_name=log_file_name,
File "rmm/_lib/memory_resource.pyx", line 305, in rmm._lib.memory_resource._initialize
File "rmm/_lib/memory_resource.pyx", line 365, in rmm._lib.memory_resource._initialize
File "rmm/_lib/memory_resource.pyx", line 64, in rmm._lib.memory_resource.PoolMemoryResource.__cinit__
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
@benjha could you give us more information on how you're setting things up when you see the OOM errors? I think the OOM is not directly related to the issue we're discussing here, so to avoid ending up with an endless thread, I would suggest starting a new issue in this repo to discuss that.
I can confirm @benjha's errors. With --rmm-pool-size=8G on the 16GB V100s I get
Exception: MemoryError('std::bad_alloc: RMM failure at: ../include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded')
for TPCx-BB queries that ran fine previously (but slowly).
Without that argument, I see
Exception: MemoryError('std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory')
On a positive note, without UCX_MEM_MMAP_HOOK_MODE, UCX_RNDV_SCHEME=auto seems to be working (issue #615).
Environment variables for the workers
UCX_TLS=rc_x,sm,cuda_copy,cuda_ipc,gdr_copy
UCX_MAX_RNDV_RAILS=2
UCX_NET_DEVICES=mlx5_0:1,mlx5_3:1
UCX_MEMTYPE_CACHE=y
command line for the workers
UCX_RNDV_SCHEME=auto jsrun -n 36 -a 1 -g 6 -c 42 -b rs -D UCX_MEM_MMAP_HOOK_MODE --smpiargs="-disable_gpu_hooks" dask-cuda-worker --scheduler-file my-scheduler-ucx.json --memory-limit 160GB --enable-infiniband --enable-nvlink --death-timeout 60 --interface ib0 --nthreads 1 --local-directory /mnt/bb/$USER
I haven't tested the 32GB GPUs yet.
Can you try to use a pool size that's very close to the total amount of GPU memory? Those are 16GB GPUs, so I'd recommend 15GB, or 14GB if 15GB is still too much. The cuda_ipc transport can't unregister memory, which prevents such buffers from being released; that's why we need the pool to be used for all allocations of the application.
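Something along these lines (a sketch; the scheduler file and other flags mirror the worker command above):
# Give each worker an RMM pool close to the full 16GB of the V100
$ dask-cuda-worker --scheduler-file my-scheduler-ucx.json \
      --rmm-pool-size=15GB --enable-infiniband --enable-nvlink \
      --interface ib0 --nthreads 1 --local-directory /mnt/bb/$USER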
No luck yet with either of these pool sizes. Will try on the 32 GB GPUs as soon as I get access.
UCX_RNDV_SCHEME=auto jsrun -n 36 -a 1 -g 6 -c 42 -b rs -D UCX_MEM_MMAP_HOOK_MODE --smpiargs="-disable_gpu_hooks" dask-cuda-worker --scheduler-file my-scheduler-ucx.json --memory-limit 160GB --enable-infiniband --enable-nvlink --death-timeout 60 --interface ib0 --nthreads 1 --local-directory /mnt/bb/$USER
When doing the RAPIDS performance evaluation, we found that in some cases fat workers (e.g. 1 worker with 6 GPUs, 1 worker per node) worked better than thin workers (1 worker per GPU, 6 workers per node); in particular, CuPy's SVD performed better with fat workers and cuDF worked better with thin workers.
It might be something worth exploring with BSQL, @jglaser.
When doing the RAPIDS performance evaluation, we found that in some cases fat workers (e.g. 1 worker with 6 GPUs, 1 worker per node)
How do you address other GPUs then? CuPy, for example, is always going to address GPU 0, which is fine if you have multiple workers each addressing a different GPU, since each worker is always working on GPU 0 relative to the CUDA_VISIBLE_DEVICES ordering; but if you have a single process addressing multiple GPUs, then CuPy won't be able to automatically do work on all GPUs.
jsrun allows the isolation of resources as you describe. On the other hand, I thought Dask distributed the load across GPUs of the same worker, is this the way it works with CuPy? Anyway, for some reason I ended up using 1 GPU per worker...
jsrun allows the isolation of resources as you describe.
That's correct, but when you isolate resources via jsrun, you'll be effectively creating a worker per resource, in that case a resource being a GPU.
On the other hand, I thought Dask distributed the load across GPUs of the same worker, is this the way it works with CuPy?
Mainline Dask will do no addressing of GPUs at all, so libraries such as CuPy and cuDF will run by default on GPU 0, meaning all other GPUs are idle. On the other hand, Dask-CUDA was specifically written to support a one-process(worker)-per-GPU model, in which we set CUDA_VISIBLE_DEVICES for each worker in a round-robin fashion; that means every worker sees a different GPU when it addresses GPU 0. You can, of course, address GPUs other than 0 with CuPy, etc., but that's not handled by Dask today in any scenario, and there's no plan to do that in the future that I know of.
Anyway, for some reason I ended up using 1 GPU per worker...
As I mentioned above, this is the only case supported by Dask-CUDA today, so it's natural that you'd end up using it. However, if you are certain you used a single Dask worker with multiple GPUs, I'd be interested in knowing how it was done; it's not technically impossible, but likely very challenging.
Here's a datapoint with 4MB message size and UCX master (ucx_perftest)... It does look like the bandwidth went up to 10GB/s (CUDA) and 13GB/s (unified memory), without having to modify the rendezvous scheme.
cuda
(rapids-env) bash-4.2$ UCX_TLS=rc_x,sm,cuda_copy,gdr_copy,cuda_ipc jsrun -D UCX_MEM_MMAP_HOOK_MODE -n 2 -a 1 -g 6 -c 42 -b packed:smt:1 --smpiargs="-disable_gpu_hooks" ucx_perftest -m cuda -t tag_bw -s "4194304" -n 10 -T 1
Warning: PAMI CUDA HOOK disabled
Warning: PAMI CUDA HOOK disabled
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
Final: 10 0.000 378.990 378.990 10554.36 10554.36 2639 2639
cuda-managed
(rapids-env) bash-4.2$ UCX_TLS=rc_x,sm,cuda_copy,gdr_copy,cuda_ipc jsrun -D UCX_MEM_MMAP_HOOK_MODE -n 2 -a 1 -g 6 -c 42 -b packed:smt:1 --smpiargs="-disable_gpu_hooks" ucx_perftest -m cuda-managed -t tag_bw -s "4194304" -n 10 -T 1
Warning: PAMI CUDA HOOK disabled
Warning: PAMI CUDA HOOK disabled
+--------------+--------------+-----------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | typical | average | overall | average | overall | average | overall |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
Final: 10 0.000 303.507 303.507 13179.27 13179.27 3295 3295
I have yet to run the benchmark again.. hopefully I won't see the OOM errors on the 32GB GPUs.
I'm happy to see we're doing better.
I remember it was very challenging for folks to get memory utilization right for TPCx-BB, and indeed adding UCX to the workflow changes the requirements a bit, but it shouldn't double the memory utilization or anything of that sort. Keep in mind that we can't use managed memory with CUDA IPC, so we lose that ability and increase the perceived memory utilization. It's also important to use --device-memory-limit in various TPCx-BB queries to enable dask-cuda spilling to system memory; I remember reading comments from @beckernick that the optimal value was around 50% of the GPU memory for that parameter.
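As a sketch (numbers are illustrative for a 32GB GPU; the flags mirror the worker command earlier in the thread):
# Leave the RMM pool a bit under physical memory and start spilling to host at ~50%
$ dask-cuda-worker --scheduler-file my-scheduler-ucx.json \
      --rmm-pool-size=30GB --device-memory-limit=16GB \
      --enable-infiniband --enable-nvlink --interface ib0 --nthreads 1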
On a side note, what exactly is the limitation of managed memory with regard to IPC/NVLink?
It's a CUDA IPC limitation in itself, see https://github.com/rapidsai/ucx-py/issues/409 for some discussion.
I think we can close this now. @jglaser, are you OK with that?
On Summit, the nodes have the following configuration: each node has 6 GPUs and 4 MLNX devices. I'm not sure what the optimal pairing of GPU and MLNX device should be. Normally, I would rely on hwloc to figure this out; however, on Summit I get errors like the following (when using --net-devices='auto' with dask-cuda).
Still, I can set up the worker manually with something like:
GPU 0
GPU 1
And so on. What should --net-devices and --interface be set to for each of the six GPUs?
cc @MattBBaker in case he has thoughts
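Purely as a hypothetical illustration (SCHEDULER_IP is a placeholder, and the right GPU-to-HCA pairing still needs to be confirmed, e.g. with nvidia-smi topo -m), a manual per-GPU setup might look something like:
# Hypothetical manual mapping, one worker per GPU, each pinned to a nearby HCA
$ CUDA_VISIBLE_DEVICES=0 dask-cuda-worker ucx://SCHEDULER_IP:8786 \
      --enable-infiniband --enable-nvlink --net-devices mlx5_0:1 --interface ib0 &
$ CUDA_VISIBLE_DEVICES=1 dask-cuda-worker ucx://SCHEDULER_IP:8786 \
      --enable-infiniband --enable-nvlink --net-devices mlx5_0:1 --interface ib0 &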