open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

High latency sending from device memory on Summit #11501

devreal opened this issue 1 year ago (status: Open)

devreal commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI main branch (0ba5d47dc49ca333f99fb9e90e213191acad667b)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed with CUDA support (11.0.3 - the default on Summit) and UCX 1.14.0.

Loaded modules:

$ module list

Currently Loaded Modules:
  1) lsf-tools/2.0   3) darshan-runtime/3.4.0-lite   5) DefApps     7) nsight-compute/2021.2.1      9) cuda/11.0.3
  2) hsi/5.0.2.p5    4) xalt/1.2.1                   6) gcc/9.3.0   8) nsight-systems/2021.3.1.54  10) gdrcopy/2.3

UCX config summary:

configure: =========================================================
configure: UCX build configuration:
configure:         Build prefix:   /ccs/home/jschuchart/opt-summit/ucx-1.14.0
configure:    Configuration dir:   ${prefix}/etc/ucx
configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:           C compiler:   gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure:         C++ compiler:   g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure:         Multi-thread:   disabled
configure:         NUMA support:   enabled
configure:            MPI tests:   disabled
configure:          VFS support:   no
configure:        Devel headers:   no
configure: io_demo CUDA support:   no
configure:             Bindings:   < >
configure:          UCS modules:   < >
configure:          UCT modules:   < cuda ib rdmacm cma knem >
configure:         CUDA modules:   < gdrcopy >
configure:         ROCM modules:   < >
configure:           IB modules:   < >
configure:          UCM modules:   < cuda >
configure:         Perf modules:   < cuda >
configure: =========================================================

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 7d25bd021b57d4e3cea40d23bd3662180c269827 3rd-party/openpmix (v1.1.3-3825-g7d25bd02)
 4725d89abe53c52343eeb49c90986c4d407d6392 3rd-party/prrte (psrvr-v2.0.0rc1-4609-g4725d89abe)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)

Please describe the system on which you are running


Details of the problem

I am seeing high latencies in the osu_latency benchmark (OSU benchmarks 7.0.1) when using pml/ucx:

$ UCX_IB_GPU_DIRECT_RDMA=yes OMPI_MCA_pml=ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
accelerator device: cuda
# OSU MPI-CUDA Latency Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       2.73
1                     437.98
2                     441.63
4                     433.29
8                     438.72
16                    433.87
32                    439.61
64                    437.90
128                   432.96
256                   443.43
512                   431.55
1024                  439.35
2048                  436.77
4096                  440.73
8192                  441.46

If I disable UCX_IB_GPU_DIRECT_RDMA, the latency is slightly lower but still far too high:

$ UCX_IB_GPU_DIRECT_RDMA=no OMPI_MCA_coll=^han,cuda OMPI_MCA_pml=ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
accelerator device: cuda
# OSU MPI-CUDA Latency Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       2.70
1                     370.96
2                     362.72
4                     369.19
8                     363.51
16                    371.36
32                    362.82
64                    368.86
128                   363.66
256                   369.75
512                   365.19
1024                  369.61
2048                  365.51
4096                  372.03
8192                  368.54

With pml/ob1 and btl/uct the latency is reasonable, but I get a segfault on messages above 4 KB:

$ UCX_IB_GPU_DIRECT_RDMA=yes OMPI_MCA_coll=^han,cuda OMPI_MCA_pml=^ucx jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
accelerator device: cuda
# OSU MPI-CUDA Latency Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       2.23
1                      29.48
2                      29.44
4                      29.46
8                      29.42
16                     29.43
32                     29.50
64                     29.54
128                    29.87
256                    29.92
512                    30.32
1024                   30.63
2048                   31.47
4096                   32.78
[e28n13:249326:0:249326] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x200053200000)
==== backtrace (tid: 249326) ====
 0 0x00000000000ada48 __memcpy_power7()  :0
 1 0x00000000000297e4 uct_rc_verbs_ep_am_bcopy()  /ccs/home/jschuchart/src/ucx/ucx-1.14.0/build/src/uct/ib/../../../../src/uct/ib/rc/verbs/rc_verbs_ep.c:317
 2 0x00000000000e2fd8 uct_ep_am_bcopy()  /ccs/home/jschuchart/opt-summit/ucx-1.14.0/include/uct/api/uct.h:3020
 3 0x00000000000e2fd8 mca_btl_uct_send_frag()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/mca/btl/uct/../../../../../opal/mca/btl/uct/btl_uct_am.c:212
 4 0x00000000000d9d30 mca_btl_uct_component_progress_pending()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/mca/btl/uct/../../../../../opal/mca/btl/uct/btl_uct_component.c:600
 5 0x00000000000d9fc4 mca_btl_uct_component_progress()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/mca/btl/uct/../../../../../opal/mca/btl/uct/btl_uct_component.c:649
 6 0x000000000003289c opal_progress()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/opal/../../opal/runtime/opal_progress.c:224
 7 0x0000000000383410 ompi_request_wait_completion()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/ompi/mca/pml/ob1/../../../../../ompi/request/request.h:492
 8 0x0000000000385de4 mca_pml_ob1_send()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/ompi/mca/pml/ob1/../../../../../ompi/mca/pml/ob1/pml_ob1_isend.c:327
 9 0x0000000000189f98 PMPI_Send()  /ccs/home/jschuchart/src/openmpi/ompi-clean/build/ompi/mpi/c/../../../../ompi/mpi/c/send.c:93
10 0x00000000100032d8 main()  /autofs/nccs-svm1_home1/jschuchart/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/../../../../c/mpi/pt2pt/osu_latency.c:168
11 0x0000000000024078 .annobin_libc_start.c()  libc-start.c:0
12 0x0000000000024264 __libc_start_main()  ???:0
=================================
ERROR:  One or more process (first noticed rank 0) terminated with signal 11

I repeated the experiment with CUDA 11.5.2 (the newest module available on Summit). The latencies are similar, but with pml/ucx I also get this error:

[1679092339.660667] [d35n17:308900:0]  cuda_ipc_iface.c:135  UCX  ERROR nvmlInit_v2() failed: Driver/library version mismatch
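A driver/library version mismatch like this can usually be confirmed directly on the compute node. The commands below are standard NVIDIA diagnostics on Linux, not anything Summit-specific; the idea is that the kernel-module driver version must match (or exceed) what the loaded CUDA module's libraries expect:

```shell
# Driver version as reported by the user-space tools
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Driver version of the loaded kernel module; if these two disagree,
# nvmlInit_v2() fails exactly as in the UCX error above.
cat /proc/driver/nvidia/version
```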

What am I doing wrong here? Anything I should set to get latencies that are closer to those I get with host memory?

nysal commented 1 year ago

I think this is similar to an issue we reported here: https://github.com/openucx/ucx/issues/8761. You can try setting UCX_MEM_CUDA_HOOK_MODE=reloc as a workaround. The bistro CUDA hook is missing some functionality on PowerPC.
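For reference, applying the workaround to the first invocation from the report would look something like this (the jsrun flags and benchmark path are taken verbatim from above; the only addition is the UCX_MEM_CUDA_HOOK_MODE variable):

```shell
# UCX_MEM_CUDA_HOOK_MODE=reloc tells UCX to intercept CUDA allocation calls
# via relocation-table patching instead of the binary-instrumentation
# ("bistro") hook, which is incomplete on ppc64le.
UCX_MEM_CUDA_HOOK_MODE=reloc \
UCX_IB_GPU_DIRECT_RDMA=yes \
OMPI_MCA_pml=ucx \
jsrun -n 2 -a 1 -c 1 -g 1 -r 1 -b none \
    ~/src/osu-micro-benchmarks-7.0.1/build-main/c/mpi/pt2pt/osu_latency -d cuda -m $((2*4096)) D D
```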

devreal commented 1 year ago

Thanks @nysal. I can confirm that setting UCX_MEM_CUDA_HOOK_MODE=reloc yields good latencies. I suggest keeping this issue open until https://github.com/openucx/ucx/issues/8761 has been resolved, in case others come across this.