openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

MPICH + UCX 1.9 - MPI_Win_create fails with win_allgather Input/output error #7725

Open thomasgillis opened 2 years ago

thomasgillis commented 2 years ago

EDIT: I have an issue with the combination of UCX + MPICH but I am unsure if it's on the MPICH side or the UCX side. Thanks for your help!

Describe the bug

Depending on the chosen TLS (dc_mlx5 or ud_mlx5), the runs behave differently, but all of them fail with the same error:

mm_xpmem.c:149  UCX  ERROR failed to attach xpmem apid 0x4a000121fb offset 0x3aef000 length 4096: Cannot allocate memory
ucp_rkey.c:270  UCX  ERROR failed to unpack remote key from remote md[7]: Input/output error
pgtable.c:638  UCX  WARN  failed to remove pgtable region0x5b0160 [0x0..0x3715000]

and

MPICH ERROR [Rank 1152] [job id 316661.0] [Mon Nov 29 17:39:18 2021] [nid001077] - Abort(68848399) (rank 1152 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
PMPI_Win_create(294)......: MPI_Win_create(base=0x3abb880, size=0, disp_unit=2, info=0x9c000000, MPI_COMM_WORLD, win=0xee88a8) failed
MPID_Win_create(86).......:
MPIDIG_mpi_win_create(858):
win_allgather(114)........:  ucx function returned with failed status(ucx_win.c 114 win_allgather Input/output error)

I couldn't get all the needed runs to finish with the same setting.

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

yosefe commented 2 years ago

Maybe there is some issue with xpmem on this system. Can you try UCX_TLS=^xpmem as a workaround?

yosefe commented 2 years ago

In addition, the UCX version is pretty old. I would recommend trying the latest v1.11 release: https://github.com/openucx/ucx/releases/tag/v1.11.2. Also, maybe there is some system security or limits config that prevents xpmem from attaching peer process memory?

thomasgillis commented 2 years ago

@yosefe thanks a lot for the very quick answer! I have just submitted with UCX_TLS=^xpmem.

I would love to update to the latest version, but unfortunately I am limited in what I can and cannot do on the machine. I will request the update from the sysadmins, but they are still testing it, so I am unsure it will happen anytime soon (unless Cray is reading this thread and could consider updating their UCX?).

yosefe commented 2 years ago

In the meantime, let's understand how UCX_TLS affects the error message: it's unexpected that when setting UCX_TLS=dc_mlx5 or UCX_TLS=ud_mlx5 we would still see errors from xpmem. Is this really the case?

dmitrygx commented 2 years ago

maybe there is some system security or limits config that prevents xpmem from attaching peer process memory?

@yosefe could the problem be that xpmem_attach() is called for a 0-size region?

PMPI_Win_create(294)......: MPI_Win_create(base=0x3abb880, size=0, disp_unit=2, info=0x9c000000, MPI_COMM_WORLD, win=0xee88a8) failed

it seems that RCACHE reports that this is a 4096-length region (it aligns by system page size):

mm_xpmem.c:149  UCX  ERROR failed to attach xpmem apid 0x4a000121fb offset 0x3aef000 length 4096: Cannot allocate memory

If I'm not mistaken, we've already fixed the PageTable granularity issue.
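
To illustrate the page-alignment point above, here is a minimal arithmetic sketch. It only assumes a 4096-byte page size and reuses the base address and size from the MPI_Win_create error stack; it does not reproduce UCX's actual registration-cache code, and it is not an attempt to recompute the offset shown in the xpmem error line.

```shell
# Illustrative arithmetic only; assumes a 4096-byte system page.
page_size=4096
base=$((0x3abb880))   # base address from the MPI_Win_create error stack
size=0                # zero-byte window

# Round the start down and the end up to page boundaries.
start=$(( base & ~(page_size - 1) ))
end=$(( (base + size + page_size - 1) & ~(page_size - 1) ))

printf 'aligned region: 0x%x..0x%x length %d\n' "$start" "$end" $(( end - start ))
```

Under these assumptions a 0-byte window whose base is not page-aligned still expands to a one-page (4096-byte) region, which would be consistent with the length reported in the log.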

yosefe commented 2 years ago

@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them

thomasgillis commented 2 years ago

In the meantime, let's understand how UCX_TLS affects the error message: it's unexpected that when setting UCX_TLS=dc_mlx5 or UCX_TLS=ud_mlx5 we would still see errors from xpmem. Is this really the case?

Yes, it's what I get when using export UCX_TLS=dc_mlx5,sm,self.

EDIT: I have submitted using export UCX_TLS=^xpmem,dc_mlx5,sm,self, let's see. Is the list exclusive (i.e., a TLS not in the list cannot be used)? When setting only export UCX_TLS=^xpmem, I got a nice error in the Init and the hook part.

thomasgillis commented 2 years ago

@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them

As a hack I can allocate a bit of memory and not use it. What is the minimum memory I need to dedicate to the window so that it works? 4096 bits?

thomasgillis commented 2 years ago

@yosefe I get this weird error message with export UCX_TLS=^xpmem,dc_mlx5,sm,self:

UCX  WARN  transport '^xpmem' is not available, please use one or more of: cma, dc, dc_mlx5, dc_x, ib, mm, posix, rc, rc_mlx5, rc_v, rc_verbs, rc_x, self, shm, sm, sysv, tcp, ud, ud_mlx5, ud_v, ud_verbs, ud_x, xpmem

What did I do wrong?

dmitrygx commented 2 years ago

What did I do wrong?

It is not allowed to mix "^" with desired transports. Please set UCX_TLS=dc,self,posix,sysv,cma. According to your output, you have cma/posix/sysv, which are SHM transports that can be used instead of xpmem.
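
As a sketch of the suggestion above: the snippet sets the explicit allow-list from this thread and, if the `ucx_info` tool happens to be on the PATH, lists the transports UCX detects so their names can be compared against the list. Whether a "^" deny-list is accepted at all appears to depend on the UCX version (an assumption based on the warning quoted above), so only the explicit list is used here.

```shell
# Explicit allow-list from the thread: DC over InfiniBand, loopback, and the
# shared-memory transports other than xpmem.
export UCX_TLS=dc,self,posix,sysv,cma

# Optional sanity check: print the transport names UCX detects on this node.
# Skipped silently when ucx_info is not installed.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | grep -i 'Transport:' | sort -u
fi
```

The variable has to be exported in the environment the MPI ranks inherit (for example in the job script before the launcher is called).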

dmitrygx commented 2 years ago

@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them

As a hack I can allocate a bit of memory and not use it. What is the minimum memory I need to dedicate to the window so that it works? 4096 bits?

yes, 4096 bytes could be ok. but is there any chance to check UCX v1.11 as proposed by @yosefe?

thomasgillis commented 2 years ago

@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them

As a hack I can allocate a bit of memory and not use it. What is the minimum memory I need to dedicate to the window so that it works? 4096 bits?

yes, 4096 bytes could be ok. but is there any chance to check UCX v1.11 as proposed by @yosefe?

I will try the 4096 thingy, thanks! I requested an update of the UCX, I will keep you posted

dmitrygx commented 2 years ago

@dmitrygx good catch! i guess we should just ignore 0-size regions and create some kind of dummy rkey for them

As a hack I can allocate a bit of memory and not use it. What is the minimum memory I need to dedicate to the window so that it works? 4096 bits?

yes, 4096 bytes could be ok. but is there any chance to check UCX v1.11 as proposed by @yosefe?

I will try the 4096 thingy, thanks! I requested an update of the UCX, I will keep you posted

@thomasgillis btw, it is interesting that you didn't get an error from UCT MD when trying to register a 0-byte buffer - https://github.com/openucx/ucx/blob/6ee161f0e720551f9affafc9b05acddb9cd55355/src/uct/base/uct_md.c#L437 Also, do you use the MPICH/CH4 UCX netmod?

thomasgillis commented 2 years ago

yes, I think so (but the Cray modules are such a maze that I am not entirely sure):

MPICH Version:      3.4a2
MPICH Release date: unreleased development copy
MPICH Device:       ch4:ucx
MPICH configure:    --prefix=/workspace/install-cray-ucx --without-mpe --enable-fortran=all --enable-shared --enable-sharedlibs=gcc --enable-debuginfo --enable-yield=sched_yield --enable-mpit-pvars=nem,cray_rma_stat,cray_gni_stat,cray_coll_stat,cray_mpiio_stat,cray_misc_stat --enable-g=mem --with-device=ch4:ucx --with-ucx-include=/workspace/pebuildenv/opt/ucx/1.8.0/include --with-ucx-lib=/workspace/pebuildenv/opt/ucx/1.8.0/lib --with-namepublisher=file --with-shared-memory=sysv --with-pmiext=pmi_cray_ext.h --with-pmi=cray --with-weak-pmiext=cray --disable-allowport --with-pm=gforker --with-file-system=ufs+lustre+cray+gpfs+nfs --disable-cxx --enable-threads=runtime --disable-long-double --enable-fast=O2
MPICH CC:   /opt/gcc/9.1.0/bin/gcc -I/opt/cray/dvs/2.12_4.0.91-7.0.1.0_88.1__gb115a1b9/include -I/opt/cray/pe/gtl/0.0.1/include  -D_CRAY_UCX -D_CRAY_CH4 -DHAVE_LUSTRE_COMP_LAYOUT_SUPPORT   -O2
MPICH CXX:  no
MPICH F77:  ftn   -O2
MPICH FC:   ftn -em -Wl,--as-needed  -O2
MPICH Custom Information:

thomasgillis commented 2 years ago

@yosefe @dmitrygx It looks like the runs with UCX_TLS=dc,self,posix,sysv,cma complete normally, so I presume that xpmem is the cause of the issue (it's not open source, right?). For the rest, I will request the update of the packages asap, but at least now I can run.

If something else shows up I will reopen the issue. Thanks a lot for your (very prompt) help, much appreciated!

yosefe commented 2 years ago

xpmem is actually open source, and we maintain a clone of it at http://github.com/openucx/xpmem. Even though we have a workaround, I think it's better to keep this issue open, since we don't expect a 0-size buffer to be passed to xpmem.

shamisp commented 2 years ago

The version of XPMEM used by Cray is not exactly the same as the open-source version, which is based on an older revision of the Cray version. @yosefe is the issue that XPMEM does not handle 0-size mappings?

thomasgillis commented 2 years ago

The version of XPMEM used by Cray is not exactly the same as the open-source version, which is based on an older revision of the Cray version. @yosefe is the issue that XPMEM does not handle 0-size mappings?

If it helps, the Cray version is xpmem/2.2.40-7.0.1.0_3.1__g1d7a24d.shasta

shamisp commented 2 years ago

@thomasgillis It does not really help :( it is proprietary code, versioning, etc. We have no idea what it really maps to.

dmitrygx commented 2 years ago

@shamisp @thomasgillis I'd start by checking whether the latest UCX fixes @thomasgillis's issue. Is it possible to download and build UCX by following the instructions below?

1. git clone https://github.com/openucx/ucx.git && cd ucx
2. ./autogen.sh && ./contrib/configure-devel --enable-debug --with-xpmem=yes --prefix=$PWD/install/ && make clean && make -j install
3. run your application by adding `-genv LD_LIBRARY_PATH=<ucx_repo_path>/install/lib/:$LD_LIBRARY_PATH` to `mpiexec`

So, let's see if we find more information by running the devel version of UCX. If it helps, you can then switch to UCX release mode by reconfiguring with ./contrib/configure-release instead of ./contrib/configure-devel --enable-debug. Thanks!

thomasgillis commented 2 years ago

@dmitrygx okay will try. What is the best way to know what version of UCX is taken by MPICH? is there any verbose variable I can use?
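
One possible way to check this, sketched under assumptions: `ucx_info -v` prints the version of whichever UCX installation comes first on the PATH, and `ldd` shows which UCX shared libraries the dynamic loader would resolve for a binary with the current LD_LIBRARY_PATH. The helper name `show_ucx_libs` and the `./my_app` path are hypothetical, and libraries that are loaded later as plugins would not show up in the `ldd` output.

```shell
# Hypothetical helper: list the UCX libraries (libucp/libuct/libucs/libucm)
# that the dynamic loader resolves for the given binary.
show_ucx_libs() {
    ldd "$1" 2>/dev/null | grep -E 'libuc[pstm]\.so' \
        || echo "no UCX libraries resolved directly for $1"
}

# Version of the UCX tools found first on PATH (skipped if not installed).
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -v
fi

# ./my_app is a placeholder for the MPI application binary.
if [ -x ./my_app ]; then
    show_ucx_libs ./my_app
fi
```

Running the helper once with and once without the extra LD_LIBRARY_PATH entry should show whether the locally built UCX is actually picked up.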

ltalirz commented 2 years ago

For what it's worth, I encountered a very similar error

...
---- Real-time Memory Report at c_bands before calling an iterative solver
           872 MiB given to the printing process from OS
           704 MiB allocation reported by mallinfo(arena+hblkhd)
        215916 MiB available memory on the node where the printing process lives
------------------
[1655138239.652422] [ip-0A238060:22247:0]       mm_xpmem.c:143  UCX  ERROR failed to attach xpmem apid 0x27000056e7 offset 0x20fe8000 length 16384: Cannot allocate memory
[1655138239.652958] [ip-0A238060:22247:0]       ucp_rkey.c:267  UCX  ERROR failed to unpack remote key from remote md[7]: Input/output error

while running the Quantum ESPRESSO quantum mechanical simulation engine compiled with openmpi 4.1.0 and ucx version 1.10.0 revision 96422ce. The error occurs only for a very niche set of inputs (>95% of my calculations run fine), but for the affected inputs the issue seems to occur fairly reproducibly.

The backtrace is

==== backtrace (tid:  22247) ====
 0 0x0000000000053513 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucs/debug/debug.c:656
 1 0x00000000000415b6 ucp_rndv_do_rkey_ptr()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/rndv/rndv.c:1156
 2 0x0000000000041b89 ucp_rndv_receive()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/rndv/rndv.c:1279
 3 0x000000000004e942 ucp_tag_rndv_process_rts()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/tag/tag_rndv.c:45
 4 0x0000000000014685 uct_iface_invoke_am()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/base/uct_iface.h:663
 5 0x0000000000014685 uct_mm_iface_process_recv()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/sm/mm/base/mm_iface.c:233
 6 0x0000000000014685 uct_mm_iface_poll_fifo()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/sm/mm/base/mm_iface.c:282
 7 0x0000000000014685 uct_mm_iface_progress()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/sm/mm/base/mm_iface.c:335
 8 0x000000000002f1aa ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucs/datastruct/callbackq.h:211
 9 0x000000000002f1aa uct_worker_progress()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/uct/api/uct.h:2435
10 0x000000000002f1aa ucp_worker_progress()  /build-result/src/hpcx-v2.8.3-gcc-MLNX_OFED_LINUX-5.2-2.2.3.0-redhat7.6-x86_64/ucx-96422ce/src/ucp/core/ucp_worker.c:2405
11 0x00000000000182f4 hmca_bcol_ucx_p2p_progress_fast()  bcol_ucx_p2p_component.c:0
12 0x000000000005d6e4 hmca_bcol_ucx_p2p_alltoall_pairwise_progress()  ???:0
13 0x0000000000059a18 hmca_bcol_ucx_p2p_alltoall_tuned_progress()  ???:0
14 0x000000000006a7e8 hmca_coll_ml_alltoall()  ???:0
15 0x0000000000008830 mca_coll_hcoll_alltoall()  /tmp/azhpc-images-centos-hpc-20210416/centos/centos-7.x/centos-7.6-hpc/openmpi-4.1.0/ompi/mca/coll/hcoll/coll_hcoll_ops.c:317
16 0x00000000000643d6 MPI_Alltoall()  ???:0
17 0x000000000004669d pmpi_alltoall_()  ???:0
18 0x000000000099dd3d __fft_scatter_2d_MOD_fft_scatter()  ???:0
19 0x0000000000946ecc __fft_parallel_2d_MOD_tg_cft3s()  ???:0
20 0x000000000094369e fwfft_y_()  ???:0
21 0x0000000000657501 vloc_psi_k_()  ???:0
22 0x00000000006199fd h_psi__()  ???:0
23 0x000000000061a0a5 h_psi_()  ???:0
24 0x000000000086a8a6 pcegterg_()  ???:0
25 0x000000000056d649 diag_bands_()  ???:0
26 0x000000000057177c c_bands_()  ???:0
27 0x000000000040f3f5 electrons_scf_()  ???:0
28 0x0000000000419494 electrons_()  ???:0
29 0x00000000004c5501 run_pwscf_()  ???:0
30 0x0000000000408bc9 MAIN__()  pwscf.f90:0
31 0x000000000040891d main()  ???:0
32 0x0000000000022495 __libc_start_main()  ???:0
33 0x0000000000408946 _start()  ???:0

I don't know enough about this topic to understand exactly what is going on here, or whether this is the same issue as the one in the original post, but I would appreciate any guidance on how to avoid this issue going forward.