jdha / ORCHESTRA

Southern Ocean 1/12 NEMO configuration
GNU General Public License v3.0

UCX ERROR #10

Closed atb299 closed 2 years ago

atb299 commented 3 years ago

I think there is a problem with the XIOS build. I keep encountering these errors after a couple of hundred timesteps. NEMO looks to be working OK, so I think it is either the communication between NEMO and XIOS or XIOS itself which is at fault:

[1613044995.347146] [nid001310:82375:0] ib_md.c:325 UCX ERROR ibv_reg_mr(address=0xadba4680, length=26400, access=0xf) failed: Cannot allocate memory
[1613044995.347220] [nid001310:82375:0] ucp_mm.c:137 UCX ERROR failed to register address 0xadba4680 mem_type bit 0x1 length 26400 on md[5]=mlx5_1: Input/output error (md reg_mem_types 0x15)
[1613044995.347226] [nid001310:82375:0] ucp_request.c:269 UCX ERROR failed to register user buffer datatype 0x8 address 0xadba4680 len 26400: Input/output error
MPICH ERROR [Rank 859] [job id 101004.0] [Thu Feb 11 12:03:15 2021] [unknown] [nid001310] - Abort(404362511) (rank 859 in comm 0): Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(160)......: MPI_Isend(buf=0xadba4680, count=3300, MPI_DOUBLE_PRECISION, dest=942, tag=4, comm=0x84000004, request=0x7ffdff97dd7c) failed
MPID_Isend(416)......:
MPID_isend_unsafe(92):
MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)

aborting job:
Fatal error in PMPI_Isend: Other MPI error, error stack:
PMPI_Isend(160)......: MPI_Isend(buf=0xadba4680, count=3300, MPI_DOUBLE_PRECISION, dest=942, tag=4, comm=0x84000004, request=0x7ffdff97dd7c) failed
MPID_Isend(416)......:
MPID_isend_unsafe(92):
MPIDI_UCX_send(95)...: returned failed request in UCX netmod(ucx_send.h 95 MPIDI_UCX_send Input/output error)
[1613044995.365879] [nid001310:82375:0] mm_xpmem.c:82 UCX WARN remote segment id 200014181 apid 10000141c7 is not released, refcount 1
[1613044995.365889] [nid001310:82375:0] mm_xpmem.c:82 UCX WARN remote segment id 200014180 apid f000141c7 is not released, refcount 1
[1613044995.365891] [nid001310:82375:0] mm_xpmem.c:82 UCX WARN remote segment id 200014183 apid 12000141c7 is not released, refcount 1

nenb commented 3 years ago

Hi. I had a similar experience recently, and this GitHub issue came up in a quick web search. An engineer at HP helped me resolve my problem by suggesting that I set the following environment variable:

 export UCX_IB_REG_METHODS=direct

Maybe this will be helpful to you, if you haven't already resolved your issue.

nenb commented 3 years ago

From the manpages:

UCX_IB_REG_METHODS This is a UCX ENV variable. It specifies which memory registration method is used by UCX. By default, the rcache user-space memory registration cache method is used to provide the best performance. However, in certain cases at high scale, when a large amount of memory is being registered with the device, the rcache method may run out of resources. In this case, an error similar to "UCX ERROR ibv_exp_reg_mr(address=0xnn, length=nn, access=0xf) failed: Cannot allocate memory" may occur. To work around this limitation, it may be necessary to request the direct memory registration method by setting this variable: export UCX_IB_REG_METHODS=direct.
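For reference, a minimal sketch of where the setting could go in a batch script. It assumes a POSIX-style shell job script and that the variable has to be exported before the MPI launch line so that the launched processes inherit it. The launch command itself is left out because it depends on the site and the run setup.

```shell
# Ask UCX for the direct memory registration method instead of the default rcache.
# Assumption: this sits in the job script above the MPI launch line.
export UCX_IB_REG_METHODS=direct

# Sanity check: a child process should see the exported value.
sh -c 'echo "UCX_IB_REG_METHODS seen by child process: ${UCX_IB_REG_METHODS:-unset}"'
```

If the check prints "unset", the export probably happened in a different shell from the one that launches the job.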

atb299 commented 3 years ago

Thanks Nick, the same solution has been suggested to me via the Archer2 helpdesk. I'm still in the process of testing it, but queueing time is currently measured in days!