Open thomasgillis opened 2 years ago
Maybe there is some issue with xpmem on this system.
Can you try UCX_TLS=^xpmem as a workaround?
In addition, the UCX version is pretty old. I would recommend trying the latest v1.11 release: https://github.com/openucx/ucx/releases/tag/v1.11.2
Maybe there is some system security or limits config that prevents xpmem from attaching peer process memory?
@yosefe thanks a lot for the very quick answer! I have just submitted with UCX_TLS=^xpmem.
I would love to update to the latest version, but unfortunately I am limited in what I can and cannot do on the machine. I will request the update from the sys-admins, but they are still testing it, so I am unsure it's going to happen anytime soon (unless CRAY is reading this thread and could consider updating their UCX?)
In the meantime, let's understand how UCX_TLS affects the error message: it's unexpected that when setting UCX_TLS=dc_mlx5 or UCX_TLS=ud_mlx5 we would still see errors from xpmem. Is this really the case?
maybe there is some system security or limits config that prevents xpmem from attaching peer process memory?
@yosefe can it be a problem trying to do xpmem_attach() for a 0-size region?
PMPI_Win_create(294)......: MPI_Win_create(base=0x3abb880, size=0, disp_unit=2, info=0x9c000000, MPI_COMM_WORLD, win=0xee88a8) failed
it seems that RCACHE reports that this is a 4096-length region (it aligns by system page size):
mm_xpmem.c:149 UCX ERROR failed to attach xpmem apid 0x4a000121fb offset 0x3aef000 length 4096: Cannot allocate memory
If I'm not mistaken, we've already fixed the page-table granularity issue.
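To illustrate how a 0-size region can turn into a 4096-byte one, here is a minimal sketch of page alignment. It assumes a 4096-byte page and a simple "round the start down, round the end up" rule; `page_aligned_region` is a hypothetical helper, not UCX's actual rcache code, and the address is just the `base` value from the MPI_Win_create line above.

```python
PAGE_SIZE = 4096  # assumed system page size, matching the length in the log


def page_aligned_region(addr, size, page_size=PAGE_SIZE):
    """Return (start, length) of the page-aligned span covering addr..addr+size."""
    mask = ~(page_size - 1)
    start = addr & mask                          # round the start down to a page
    end = (addr + size + page_size - 1) & mask   # round the end up to a page
    return start, end - start


# A 0-byte region at an unaligned address still covers one full page.
start, length = page_aligned_region(0x3abb880, 0)
print(hex(start), length)  # 0x3abb000 4096
```

Under these assumptions the zero-length window is no longer zero-length by the time it reaches the attach call, which is consistent with the `length 4096` in the error.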
@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them
In the meantime, let's understand how UCX_TLS affects the error message: it's unexpected that when setting UCX_TLS=dc_mlx5 or UCX_TLS=ud_mlx5 we would still see errors from xpmem. Is this really the case?
Yes, it's what I get when using export UCX_TLS=dc_mlx5,sm,self.
EDIT: I have submitted using export UCX_TLS=^xpmem,dc_mlx5,sm,self, let's see.
Is the list exclusive (i.e. a TLS not in the list cannot be used)? When setting only export UCX_TLS=^xpmem, I got a nice error in the Init and the hook part.
@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them
As a hack I can allocate a bit of memory and not use it. What is the minimum memory I need to dedicate to the window so that it works? 4096 bits?
@yosefe I get this weird error message with export UCX_TLS=^xpmem,dc_mlx5,sm,self:
UCX WARN transport '^xpmem' is not available, please use one or more of: cma, dc, dc_mlx5, dc_x, ib, mm, posix, rc, rc_mlx5, rc_v, rc_verbs, rc_x, self, shm, sm, sysv, tcp, ud, ud_mlx5, ud_v, ud_verbs, ud_x, xpmem
What did I do wrong?
It is not allowed to mix "^" (exclusion) with explicitly listed transports. Try UCX_TLS=dc,self,posix,sysv,cma: according to your output you have cma/posix/sysv, which are SHM transports that could be used instead of xpmem.
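As a quick sanity check before submitting a job, the requested names can be compared with the list printed in the warning. This is only a sketch: `unknown_transports` is a hypothetical helper that mimics the literal name matching suggested by the warning text, not UCX's real parser, and the set of names is copied from the UCX WARN line above.

```python
# Transport names copied from the UCX WARN line quoted in this thread.
AVAILABLE = {
    "cma", "dc", "dc_mlx5", "dc_x", "ib", "mm", "posix", "rc", "rc_mlx5",
    "rc_v", "rc_verbs", "rc_x", "self", "shm", "sm", "sysv", "tcp", "ud",
    "ud_mlx5", "ud_v", "ud_verbs", "ud_x", "xpmem",
}


def unknown_transports(ucx_tls, available=AVAILABLE):
    """Return the comma-separated entries that are not known transport names."""
    entries = [e.strip() for e in ucx_tls.split(",") if e.strip()]
    return [e for e in entries if e not in available]


print(unknown_transports("^xpmem,dc_mlx5,sm,self"))   # ['^xpmem']
print(unknown_transports("dc,self,posix,sysv,cma"))   # []
```

The first call flags `^xpmem` as an unknown name, which matches the warning; the second list contains only names that the warning reports as available.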
@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them
As a hack I can allocate a bit of memory and not use it. What is the minimum memory I need to dedicate to the window so that it works? 4096 bits?
yes, 4096 bytes could be ok. but is there any chance to check UCX v1.11 as proposed by @yosefe?
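A minimal sketch of the padding hack discussed here, assuming that one system page is enough (the 4096 bytes above are one page on this machine). `padded_window_size` is a hypothetical helper; the returned value would be used as the window size, backed by a real buffer of that many bytes that is simply never touched.

```python
import mmap


def padded_window_size(requested_bytes, min_bytes=mmap.PAGESIZE):
    """Return a window size that is never smaller than one page."""
    if requested_bytes < 0:
        raise ValueError("window size cannot be negative")
    # Ranks that would pass size=0 get a small dummy allocation instead.
    return max(requested_bytes, min_bytes)


print(padded_window_size(0) == mmap.PAGESIZE)  # True
```

Ranks that already expose at least one page keep their requested size unchanged.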
I will try the 4096 thingy, thanks! I requested an update of the UCX, I will keep you posted
@thomasgillis btw, it is interesting why you didn't get the error from UCT MD when trying to register 0-byte buffer - https://github.com/openucx/ucx/blob/6ee161f0e720551f9affafc9b05acddb9cd55355/src/uct/base/uct_md.c#L437 BTW, do you use MPICH/CH4 UCX netmod?
yes, I think so (but the Cray modules are such a maze that I am not entirely sure):
MPICH Version: 3.4a2
MPICH Release date: unreleased development copy
MPICH Device: ch4:ucx
MPICH configure: --prefix=/workspace/install-cray-ucx --without-mpe --enable-fortran=all --enable-shared --enable-sharedlibs=gcc --enable-debuginfo --enable-yield=sched_yield --enable-mpit-pvars=nem,cray_rma_stat,cray_gni_stat,cray_coll_stat,cray_mpiio_stat,cray_misc_stat --enable-g=mem --with-device=ch4:ucx --with-ucx-include=/workspace/pebuildenv/opt/ucx/1.8.0/include --with-ucx-lib=/workspace/pebuildenv/opt/ucx/1.8.0/lib --with-namepublisher=file --with-shared-memory=sysv --with-pmiext=pmi_cray_ext.h --with-pmi=cray --with-weak-pmiext=cray --disable-allowport --with-pm=gforker --with-file-system=ufs+lustre+cray+gpfs+nfs --disable-cxx --enable-threads=runtime --disable-long-double --enable-fast=O2
MPICH CC: /opt/gcc/9.1.0/bin/gcc -I/opt/cray/dvs/2.12_4.0.91-7.0.1.0_88.1__gb115a1b9/include -I/opt/cray/pe/gtl/0.0.1/include -D_CRAY_UCX -D_CRAY_CH4 -DHAVE_LUSTRE_COMP_LAYOUT_SUPPORT -O2
MPICH CXX: no
MPICH F77: ftn -O2
MPICH FC: ftn -em -Wl,--as-needed -O2
MPICH Custom Information:
@yosefe @dmitrygx
It looks like the runs with UCX_TLS=dc,self,posix,sysv,cma could complete normally. So I presume that xpmem is the cause of the issue (it's not open-source, right?)
For the rest, I will request the update of the packages asap but at least now I can run.
If something else shows up I will reopen the issue. Thanks a lot for your (very prompt) help, much appreciated!
xpmem is actually open source, and we maintain a clone of it at http://github.com/openucx/xpmem. Even though we have a workaround, I think it's better to keep this issue open, since we don't expect a 0-size buffer to be passed to xpmem.
The version of XPMEM used by Cray is not exactly the same version as open source, which is based on some older revision of Cray version. @yosefe is the issue that XPMEM does not handle 0 size mapping ?
If it helps, the Cray version is xpmem/2.2.40-7.0.1.0_3.1__g1d7a24d.shasta
@thomasgillis It does not really help :( it is proprietary code, versioning, etc. We have no idea what it really maps to.
@shamisp @thomasgillis I'd start checking if taking the latest UCX helps to fix @thomasgillis's issue. is it possible to download UCX by following the instruction (see below)?
1. git clone https://github.com/openucx/ucx.git
2. ./autogen.sh && ./contrib/configure-devel --enable-debug --with-xpmem=yes --prefix=$PWD/install/ && make clean && make -j install
3. run your application by adding `-genv LD_LIBRARY_PATH=<ucx_repo_path>/install/lib/:$LD_LIBRARY_PATH` to `mpiexec`
So, let's see if we can find more information by running the devel version of UCX. If it helps, then you could switch to UCX release mode by reconfiguring it using ./contrib/configure-release instead of ./contrib/configure-devel --enable-debug.
Thanks!
@dmitrygx okay will try. What is the best way to know what version of UCX is taken by MPICH? is there any verbose variable I can use?
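One way to check which UCX the dynamic loader picks up is to load libucp and ask it for its version string. The sketch below assumes the shared library is named `libucp.so.0` and uses `ucp_get_version_string()` from the UCP API; `loaded_ucx_version` is a hypothetical helper. It only reports what the loader resolves in the current environment, so it should be run with the same LD_LIBRARY_PATH as the MPI job, and it does not prove which UCX MPICH was built against. Running `ucx_info -v` in that same environment is another option.

```python
import ctypes


def loaded_ucx_version(libname="libucp.so.0"):
    """Return the UCX version string of the resolved libucp, or None."""
    try:
        lib = ctypes.CDLL(libname)  # dlopen honours LD_LIBRARY_PATH
        func = lib.ucp_get_version_string
    except (OSError, AttributeError):
        return None  # library not found, or symbol missing
    func.restype = ctypes.c_char_p
    func.argtypes = []
    return func().decode()


print(loaded_ucx_version())
```

If this prints the system version rather than the freshly built one, the LD_LIBRARY_PATH passed to `mpiexec` is probably not taking effect.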
For what it's worth, I encountered a very similar error
...
---- Real-time Memory Report at c_bands before calling an iterative solver
872 MiB given to the printing process from OS
704 MiB allocation reported by mallinfo(arena+hblkhd)
215916 MiB available memory on the node where the printing process lives
------------------
[1655138239.652422] [ip-0A238060:22247:0] mm_xpmem.c:143 UCX ERROR failed to attach xpmem apid 0x27000056e7 offset 0x20fe8000 length 16384: Cannot allocate memory
[1655138239.652958] [ip-0A238060:22247:0] ucp_rkey.c:267 UCX ERROR failed to unpack remote key from remote md[7]: Input/output error
while running the Quantum ESPRESSO quantum mechanical simulation engine compiled with openmpi 4.1.0 and ucx version 1.10.0 revision 96422ce. The error occurs only for a very niche set of inputs (>95% of my calculations run fine), but for the affected inputs the issue seems to be fairly reproducible.
The backtrace is
==== backtrace (tid: 22247) ====
0 0x0000000000053513 ucs_debug_print_backtrace() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucs/debug/debug.c:656
1 0x00000000000415b6 ucp_rndv_do_rkey_ptr() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/rndv/rndv.c:1156
2 0x0000000000041b89 ucp_rndv_receive() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/rndv/rndv.c:1279
3 0x000000000004e942 ucp_tag_rndv_process_rts() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/tag/tag_rndv.c:45
4 0x0000000000014685 uct_iface_invoke_am() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/base/uct_iface.h:663
5 0x0000000000014685 uct_mm_iface_process_recv() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/sm/mm/base/mm_iface.c:233
6 0x0000000000014685 uct_mm_iface_poll_fifo() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/sm/mm/base/mm_iface.c:282
7 0x0000000000014685 uct_mm_iface_progress() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/sm/mm/base/mm_iface.c:335
8 0x000000000002f1aa ucs_callbackq_dispatch() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucs/datastruct/callbackq.h:211
9 0x000000000002f1aa uct_worker_progress() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/api/uct.h:2435
10 0x000000000002f1aa ucp_worker_progress() /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/core/ucp_worker.c:2405
11 0x00000000000182f4 hmca_bcol_ucx_p2p_progress_fast() bcol_ucx_p2p_component.c:0
12 0x000000000005d6e4 hmca_bcol_ucx_p2p_alltoall_pairwise_progress() ???:0
13 0x0000000000059a18 hmca_bcol_ucx_p2p_alltoall_tuned_progress() ???:0
14 0x000000000006a7e8 hmca_coll_ml_alltoall() ???:0
15 0x0000000000008830 mca_coll_hcoll_alltoall() /tmp/azhpc-images-centos-hpc-20210416/centos/centos-7.x/centos-7.6-hpc/openmpi-4.1.0/ompi/mca/coll/hcoll/coll_hcoll_ops.c:317
16 0x00000000000643d6 MPI_Alltoall() ???:0
17 0x000000000004669d pmpi_alltoall_() ???:0
18 0x000000000099dd3d __fft_scatter_2d_MOD_fft_scatter() ???:0
19 0x0000000000946ecc __fft_parallel_2d_MOD_tg_cft3s() ???:0
20 0x000000000094369e fwfft_y_() ???:0
21 0x0000000000657501 vloc_psi_k_() ???:0
22 0x00000000006199fd h_psi__() ???:0
23 0x000000000061a0a5 h_psi_() ???:0
24 0x000000000086a8a6 pcegterg_() ???:0
25 0x000000000056d649 diag_bands_() ???:0
26 0x000000000057177c c_bands_() ???:0
27 0x000000000040f3f5 electrons_scf_() ???:0
28 0x0000000000419494 electrons_() ???:0
29 0x00000000004c5501 run_pwscf_() ???:0
30 0x0000000000408bc9 MAIN__() pwscf.f90:0
31 0x000000000040891d main() ???:0
32 0x0000000000022495 __libc_start_main() ???:0
33 0x0000000000408946 _start() ???:0
I don't know enough about this topic to understand exactly what is going on here and whether this is the same issue as the one in the original post but I would appreciate any guidance on how to avoid this issue going forward.
EDIT: I have an issue with the combination of UCX + MPICH but I am unsure if it's on the MPICH side or the UCX side. Thanks for your help!
Describe the bug
Depending on the TLS chosen (dc_mlx5 or ud_mlx5) I get different runs, all failing with the same error.
I couldn't get all the needed runs to finish with the same setting.
Steps to Reproduce
version: UCT version=1.9.0 revision 1d0a420
configuration:
Setup and versions
OFED-internal-5.1-2.5.8.0.60:
Additional information (depending on the issue)
ucx_info -d