Open angainor opened 3 years ago
Running `ghexbench` with `UCX_DC_MLX5_TM_ENABLE=y` causes an error and a segfault. The same setting works with the MPI backend when using OpenMPI on IB networks. Could this be related to how we create the worker / UCX context?
[1615309041.680074] [b2237:256544:0] rc_mlx5_common.c:827 UCX ERROR ibv_exp_create_srq(device=mlx5_0) failed: Cannot allocate memory
==== backtrace (tid: 110170) ====
 0 0x0000000000052e95 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucs/debug/debug.c:656
 1 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:832
 2 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:844
 3 0x00000000000246bd ucp_worker_get_address()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/core/ucp_worker.c:2241
 4 0x00000000004327a8 gridtools::ghex::tl::ucx::worker_t::worker_t()  ???:0
 5 0x000000000042b646 cartex::runtime::impl::init()  ???:0
 6 0x000000000041da99 cartex::runtime::exchange()  ???:0
 7 0x000000000040afe5 main()  ???:0
 8 0x0000000000022545 __libc_start_main()  ???:0
 9 0x000000000040ca8d _start()  ???:0
=================================