Closed atb299 closed 2 years ago
Hi. I had a similar experience recently, and this GitHub issue came up in a quick web search. An engineer at HP helped me resolve my problem by suggesting that I set the following environment variable:
export UCX_IB_REG_METHODS=direct
Maybe this will be helpful to you, if you haven't already resolved your issue.
From the manpages:
UCX_IB_REG_METHODS This is a UCX ENV variable. It specifies which memory registration method is used by UCX. By default, the rcache user-space memory registration cache method is used to provide the best performance. However, in certain cases at high scale, when a large amount of memory is being registered with the device, the rcache method may run out of resources. In this case, an error similar to "UCX ERROR ibv_exp_reg_mr(address=0xnn, length=nn, access=0xf) failed: Cannot allocate memory" may occur. To work around this limitation, it may be necessary to request the direct memory registration method by setting this variable: export UCX_IB_REG_METHODS=direct.
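For anyone applying this in a batch job, here is a minimal sketch of how the setting might sit in a job script. It assumes a bash job script and that the variable must be exported before the MPI launcher starts; the launch line is a placeholder comment, not a tested command, so substitute whatever launcher and arguments you normally use.

```shell
#!/bin/bash
# Sketch of a job-script fragment (assumption: bash; launcher line is a placeholder).

# Request direct memory registration instead of the default rcache method.
export UCX_IB_REG_METHODS=direct

# Confirm that child processes will inherit the setting before launching the job.
if [ "$(printenv UCX_IB_REG_METHODS)" = "direct" ]; then
    echo "UCX_IB_REG_METHODS=direct is exported"
else
    echo "UCX_IB_REG_METHODS is not set as expected" >&2
fi

# Placeholder: start the MPI job here with your usual launcher and arguments.
```

The check uses printenv rather than a plain shell variable test because printenv only sees exported variables, which is what the launched processes will inherit.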
Thanks Nick, the same solution has been suggested to me via the Archer2 helpdesk. I'm still in the process of testing it, but queueing time is currently measured in days!
I think there is a problem with the XIOS build. I keep encountering these errors after a couple of hundred timesteps. NEMO looks to be working OK, so I think it is either the communication between NEMO and XIOS or XIOS itself which is at fault:
[1613044995.347146] [nid001310:82375:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xadba4680, length=26400, access=0xf) failed: Cannot allocate memory
[1613044995.347220] [nid001310:82375:0] ucp_mm.c:137 UCX ERROR failed to register address 0xadba4680 mem_type bit 0x1 length 26400 on md[5]=mlx5_1: Input/output error (md reg_mem_types 0x15)
[1613044995.347226] [nid001310:82375:0] ucp_request.c:269 UCX ERROR failed to register user buffer datatype 0x8 address 0xadba4680 len 26400: Input/output error
MPICH ERROR [Rank 859] [job id 101004.0] [Thu Feb 11 12:03:15 2021] [unknown] [nid001310] - Abort(404362511) (rank 859 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(160)......: MPI_Isend(buf=0xadba4680, count=3300, MPI_DOUBLE_PRECISION, dest=942, tag=4, comm=0x84000004, request=0x7ffdff97dd7c) failed
MPID_Isend(416)......:
MPID_isend_unsafe(92):
MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)

aborting job:
Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(160)......: MPI_Isend(buf=0xadba4680, count=3300, MPI_DOUBLE_PRECISION, dest=942, tag=4, comm=0x84000004, request=0x7ffdff97dd7c) failed
MPID_Isend(416)......:
MPID_isend_unsafe(92):
MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)
[1613044995.365879] [nid001310:82375:0] mm_xpmem.c:82 UCX WARN remote segment id 200014181 apid 10000141c7 is not released, refcount 1
[1613044995.365889] [nid001310:82375:0] mm_xpmem.c:82 UCX WARN remote segment id 200014180 apid f000141c7 is not released, refcount 1
[1613044995.365891] [nid001310:82375:0] mm_xpmem.c:82 UCX WARN remote segment id 200014183 apid 12000141c7 is not released, refcount 1
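When a job produces a lot of output like this, it can help to see which nodes report the registration failure. The following is a rough sketch, assuming each UCX message sits on its own line in the job log (as the UCX logger normally writes it) and uses the "[timestamp] [node:pid:thread]" prefix seen above. The function name count_reg_failures is made up for illustration, and the sample input is the first error line from this log.

```shell
#!/bin/bash
# Sketch: count UCX memory-registration failures per node.
# count_reg_failures is a hypothetical helper name; it reads log text on stdin.

count_reg_failures() {
    # Keep only ibv_reg_mr / ibv_exp_reg_mr failures caused by ENOMEM,
    # extract the node name from the "[node:pid:thread]" field,
    # then count occurrences per node.
    grep 'UCX ERROR ibv_.*reg_mr(.*failed: Cannot allocate memory' \
        | sed -E 's/^\[[0-9.]+\] \[([^:]+):.*/\1/' \
        | sort | uniq -c
}

# Sample run on one line taken from the log in this thread.
sample='[1613044995.347146] [nid001310:82375:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xadba4680, length=26400, access=0xf) failed: Cannot allocate memory'
result=$(printf '%s\n' "$sample" | count_reg_failures)
echo "$result"
```

On a real log you would run something like `count_reg_failures < your_job_output_file`, where the file name depends on how your scheduler names job output.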