ghex-org / GHEX

Generic exascale-ready library for halo-exchange operations on a variety of grids/meshes

hardware tag matching fails with UCX backend #116

Open angainor opened 3 years ago

angainor commented 3 years ago

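Running ghexbench with UCX_DC_MLX5_TM_ENABLE=y (hardware tag matching on the DC transport) produces a UCX error followed by a segfault. The same setting works with the MPI backend when using Open MPI on InfiniBand networks. Could the problem be in how we create the UCX context or the worker? The error and backtrace are below.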

[1615309041.680074] [b2237:256544:0] rc_mlx5_common.c:827  UCX  ERROR ibv_exp_create_srq(device=mlx5_0) failed: Cannot allocate memory

==== backtrace (tid: 110170) ====
 0 0x0000000000052e95 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucs/debug/debug.c:656
 1 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:832
 2 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:844
 3 0x00000000000246bd ucp_worker_get_address()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/core/ucp_worker.c:2241
 4 0x00000000004327a8 gridtools::ghex::tl::ucx::worker_t::worker_t()  ???:0
 5 0x000000000042b646 cartex::runtime::impl::init()  ???:0
 6 0x000000000041da99 cartex::runtime::exchange()  ???:0
 7 0x000000000040afe5 main()  ???:0
 8 0x0000000000022545 __libc_start_main()  ???:0
 9 0x000000000040ca8d _start()  ???:0
=================================