openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil)) : UCX 1.9.0 or higher #7283

Open KOBAYASHI-Hiro opened 3 years ago

KOBAYASHI-Hiro commented 3 years ago

Describe the bug

"Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))" occurs in 20-30% of jobs when running mpirun with UCX 1.9.0 or higher. There are no issues with UCX 1.7.0.

==== backtrace (tid: 100539) ====
 0 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2b19bd0b4444]
 1 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x2476c) [0x2b19bd0b476c]
 2 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x249db) [0x2b19bd0b49db]
 3 /usr/lib64/libpthread.so.0(+0xf630) [0x2b19bb75a630]
 4 /usr/lib64/libstdc++.so.6(_ZSt14__convert_to_vIdEvPKcRT_RSt12_Ios_IostateRKP15__locale_struct+0x30) [0x2b19bafa7970]
 5 /usr/lib64/libstdc++.so.6(_ZNKSt7num_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE6do_getES3_S3_RSt8ios_baseRSt12_Ios_IostateRd+0xf1) [0x2b19bafbc4b1]
 6 /usr/lib64/libstdc++.so.6(_ZNSi10_M_extractIdEERSiRT_+0x75) [0x2b19bafac285]
 7 ./hoge-cuda-mpi() [0x42b2b0]
 8 ./hoge-cuda-mpi() [0x41c9fa]
 9 ./hoge-cuda-mpi() [0x40600c]
10 /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b19bb989555]
11 ./hoge-cuda-mpi() [0x409a08]

If you run the 'ucx_info -d' command on the node after the symptom occurs, gdr_copy, cuda_cpy, and cuda_ipc have disappeared from the list of memory domains.
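The check described above could be scripted roughly as follows. This is only a sketch: it assumes that 'ucx_info -d' prints each memory domain on a line of the form "# Memory domain: <name>", and the helper names (memory_domains, missing_cuda_domains) are hypothetical, not part of UCX.

```python
import re

# Memory domains that the report says vanish after the crash.
EXPECTED_CUDA_MDS = {"gdr_copy", "cuda_cpy", "cuda_ipc"}

# Assumed line shape in `ucx_info -d` output: "# Memory domain: <name>"
MD_LINE = re.compile(r"^#?[ \t]*Memory domain:[ \t]*(\S+)", re.MULTILINE)


def memory_domains(ucx_info_output):
    """Return the set of memory-domain names found in `ucx_info -d` text."""
    return set(MD_LINE.findall(ucx_info_output))


def missing_cuda_domains(ucx_info_output):
    """Return the CUDA-related memory domains that are absent from the text."""
    return EXPECTED_CUDA_MDS - memory_domains(ucx_info_output)


# Hand-written sample text, not real output from the affected node.
sample_after_crash = "# Memory domain: posix\n# Memory domain: mlx5_0\n"
print(sorted(missing_cuda_domains(sample_after_crash)))
```

On a real node the text would come from running 'ucx_info -d' and capturing its standard output, once on a healthy node and once after the symptom appears.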

UCX LOG
[node36:112909:0] ib_md.c:379 UCX DEBUG ibv_reg_mr(address=0x2ab6ae000000, length=268435456, access=0xf) failed: Input/output error
[node36:112909:0] rcache.c:873 UCX DEBUG failed to register region 0x6eef640 [0x2ab6ae000000..0x2ab6be000000]: Input/output error
[node36:112909:0] ucp_mm.c:149 UCX DIAG failed to register address 0x2ab6ae000000 mem_type bit 0x2 length 268435456 on md[7]=mlx5_0 : Input/output error (md reg_mem_types 0x3)
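The bit masks in the DIAG line can be decoded with a few lines of Python. This is a sketch under one labeled assumption: that the bit positions follow the order of the ucs_memory_type_t enum in UCX (host, cuda, cuda-managed, rocm, rocm-managed). If that holds, the failing 256 MiB buffer is CUDA memory, and mlx5_0 had advertised that it can register both host and CUDA memory.

```python
import re

# Assumption: bit N corresponds to the N-th entry of UCX's ucs_memory_type_t enum.
MEM_TYPES = ["host", "cuda", "cuda-managed", "rocm", "rocm-managed"]


def decode_mem_type_mask(mask):
    """Return the memory-type names whose bits are set in mask."""
    return [name for bit, name in enumerate(MEM_TYPES) if mask & (1 << bit)]


# DIAG message copied from the UCX log in this report.
DIAG = ("failed to register address 0x2ab6ae000000 mem_type bit 0x2 "
        "length 268435456 on md[7]=mlx5_0 : Input/output error "
        "(md reg_mem_types 0x3)")

match = re.search(
    r"mem_type bit (0x[0-9a-f]+) length (\d+).*reg_mem_types (0x[0-9a-f]+)", DIAG)
buffer_types = decode_mem_type_mask(int(match.group(1), 16))
md_types = decode_mem_type_mask(int(match.group(3), 16))
length_mib = int(match.group(2)) // (1024 * 1024)

# The rcache line reports the same region as an address range.
region_bytes = 0x2ab6be000000 - 0x2ab6ae000000

print(buffer_types, md_types, length_mib, region_bytes)
```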

Steps to Reproduce

mpirun -np 4 -npernode 4 ./hoge-cuda-mpi
UCX: 1.9.0 or 1.10.1 or 1.11
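Because the crash is intermittent, it may help to measure the failure rate over repeated runs when comparing UCX versions. The sketch below is illustrative only: failure_rate is a hypothetical helper, and it assumes that mpirun exits with a non-zero status when a rank is killed by signal 11.

```python
import subprocess


def failure_rate(cmd, runs=20):
    """Run cmd repeatedly and return the fraction of runs that exit non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        # Assumption: a crashed rank makes the launcher return a non-zero status.
        if result.returncode != 0:
            failures += 1
    return failures / runs


# Example call with the command from this report (not executed here):
# rate = failure_rate(["mpirun", "-np", "4", "-npernode", "4", "./hoge-cuda-mpi"])
# print(f"{rate:.0%} of runs failed")
```

Running the same loop against each UCX build gives a like-for-like comparison with the 20-30% figure quoted above.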

Setup and versions

CentOS Linux release 7.9.2009 (Core)
Kernel: 3.10.0-1160.15.2.el7.x86_64
OFED: 5.1-2.5.8
rdma-core-51mlnx1-1.51258.x86_64
libibverbs-51mlnx1-1.51258.x86_64
OpenMPI 4.0.3
GCC 8.3.1
GPU: Tesla V100 32GB PCIe
CUDA: 11.2.1 or 10.2
CUDA Driver: 460.32.03 or 465.19.01
nvidia-peer-memory: 1.1-0

$ ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b030096952c
    System image GUID: 0x98039b030096952c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 314
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b030096952c
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b0300968f14
    System image GUID: 0x98039b0300968f14
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 309
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300968f14
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b0300968fa0
    System image GUID: 0x98039b0300968fa0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 333
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300968fa0
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b0300968f24
    System image GUID: 0x98039b0300968f24
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 316
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300968f24
        Link layer: InfiniBand

$ ibv_devinfo -vv
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:952c
    sys_image_guid: 9803:9b03:0096:952c
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36 PCI_WRITE_END_PADDING Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC

cq moderation caps:
    max_cq_count:   65535
    max_cq_period:  4095 us

maximum available device memory:    262144Bytes

    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         14
        port_lid:       314
        port_lmc:       0x00
        link_layer:     InfiniBand
        max_msg_sz:     0x40000000
        port_cap_flags:     0x2251e848
        port_cap_flags2:    0x0032
        max_vl_num:     4 (3)
        bad_pkey_cntr:      0x0
        qkey_viol_cntr:     0x0
        sm_sl:          0
        pkey_tbl_len:       128
        gid_tbl_len:        8
        subnet_timeout:     18
        init_type_reply:    0
        active_width:       2X (16)
        active_speed:       50.0 Gbps (64)
        phys_state:     LINK_UP (5)
        GID[  0]:       fe80:0000:0000:0000:9803:9b03:0096:952c

hca_id: mlx5_1
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:8f14
    sys_image_guid: 9803:9b03:0096:8f14
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36 PCI_WRITE_END_PADDING Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC

cq moderation caps:
    max_cq_count:   65535
    max_cq_period:  4095 us

maximum available device memory:    262144Bytes

    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         14
        port_lid:       309
        port_lmc:       0x00
        link_layer:     InfiniBand
        max_msg_sz:     0x40000000
        port_cap_flags:     0x2251e848
        port_cap_flags2:    0x0032
        max_vl_num:     4 (3)
        bad_pkey_cntr:      0x0
        qkey_viol_cntr:     0x0
        sm_sl:          0
        pkey_tbl_len:       128
        gid_tbl_len:        8
        subnet_timeout:     18
        init_type_reply:    0
        active_width:       2X (16)
        active_speed:       50.0 Gbps (64)
        phys_state:     LINK_UP (5)
        GID[  0]:       fe80:0000:0000:0000:9803:9b03:0096:8f14

hca_id: mlx5_2
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:8fa0
    sys_image_guid: 9803:9b03:0096:8fa0
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36 PCI_WRITE_END_PADDING Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC

cq moderation caps:
    max_cq_count:   65535
    max_cq_period:  4095 us

maximum available device memory:    262144Bytes

    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         14
        port_lid:       333
        port_lmc:       0x00
        link_layer:     InfiniBand
        max_msg_sz:     0x40000000
        port_cap_flags:     0x2251e848
        port_cap_flags2:    0x0032
        max_vl_num:     4 (3)
        bad_pkey_cntr:      0x0
        qkey_viol_cntr:     0x0
        sm_sl:          0
        pkey_tbl_len:       128
        gid_tbl_len:        8
        subnet_timeout:     18
        init_type_reply:    0
        active_width:       2X (16)
        active_speed:       50.0 Gbps (64)
        phys_state:     LINK_UP (5)
        GID[  0]:       fe80:0000:0000:0000:9803:9b03:0096:8fa0

hca_id: mlx5_3
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:8f24
    sys_image_guid: 9803:9b03:0096:8f24
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36 BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36 PCI_WRITE_END_PADDING Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC

cq moderation caps:
    max_cq_count:   65535
    max_cq_period:  4095 us

maximum available device memory:    262144Bytes

    port:   1
        state:          PORT_ACTIVE (4)
        max_mtu:        4096 (5)
        active_mtu:     4096 (5)
        sm_lid:         14
        port_lid:       316
        port_lmc:       0x00
        link_layer:     InfiniBand
        max_msg_sz:     0x40000000
        port_cap_flags:     0x2251e848
        port_cap_flags2:    0x0032
        max_vl_num:     4 (3)
        bad_pkey_cntr:      0x0
        qkey_viol_cntr:     0x0
        sm_sl:          0
        pkey_tbl_len:       128
        gid_tbl_len:        8
        subnet_timeout:     18
        init_type_reply:    0
        active_width:       2X (16)
        active_speed:       50.0 Gbps (64)
        phys_state:     LINK_UP (5)
        GID[  0]:       fe80:0000:0000:0000:9803:9b03:0096:8f24

yosefe commented 3 years ago

@KOBAYASHI-Hiro

  1. Is the program using GPU memory?
  2. The segfault seems to come from _ZSt14__convert_to_vIdEvPKcRT_RSt12_Ios_IostateRKP15__locale_struct, which is not UCX code.
  3. Does the problem happen with UCX v1.11.1?
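The observation in point 2 can be reproduced mechanically from the backtrace. The sketch below assumes that the frames above the libpthread entry belong to the signal handler (here the UCX error handler in libucs), so the frame right after libpthread is where the fault occurred. The function name faulting_frame and the trimmed sample are illustrative only.

```python
import os
import re

# One frame: "<index> <module>(<symbol+offset>) [<address>]"
FRAME = re.compile(r"^\s*(\d+)\s+(\S+?)\((.*?)\)\s+\[(0x[0-9a-f]+)\]")

# Frames 2-5 and 7, copied from the backtrace in the original report.
BACKTRACE = """\
 2 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x249db) [0x2b19bd0b49db]
 3 /usr/lib64/libpthread.so.0(+0xf630) [0x2b19bb75a630]
 4 /usr/lib64/libstdc++.so.6(_ZSt14__convert_to_vIdEvPKcRT_RSt12_Ios_IostateRKP15__locale_struct+0x30) [0x2b19bafa7970]
 5 /usr/lib64/libstdc++.so.6(_ZNKSt7num_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE6do_getES3_S3_RSt8ios_baseRSt12_Ios_IostateRd+0xf1) [0x2b19bafbc4b1]
 7 ./hoge-cuda-mpi() [0x42b2b0]
"""


def faulting_frame(backtrace):
    """Return (module, symbol) of the first frame after the libpthread frame."""
    frames = [m.groups() for line in backtrace.splitlines()
              if (m := FRAME.match(line))]
    for position, (_, module, _, _) in enumerate(frames):
        # Assumption: the libpthread frame is the signal trampoline.
        if "libpthread" in module and position + 1 < len(frames):
            _, next_module, symbol, _ = frames[position + 1]
            return os.path.basename(next_module), symbol
    return None


print(faulting_frame(BACKTRACE))
```

For this backtrace the selected module is libstdc++.so.6 rather than a UCX library, which matches the point made above.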

thangckt commented 10 months ago

I get a signal 11 error when using UCX 1.15 + OpenMPI 4.1.x + Python 3.11.

==== backtrace (tid:  22978) ====
 0 0x00000000001ed0ae _PyEval_EvalFrameDefault()  ???:0
 1 0x000000000020f121 _PyFunction_Vectorcall()  ???:0
 2 0x00000000001ff1b6 object_vacall()  :0
 3 0x000000000022f12a PyObject_CallMethodObjArgs()  ???:0
 4 0x00000000001266f6 PyImport_ImportModuleLevelObject.cold()  :0
 5 0x00000000001f2c75 _PyEval_EvalFrameDefault()  ???:0
 6 0x000000000022fc74 _PyObject_VectorcallTstate.lto_priv.14()  :0
 7 0x00000000001f0a4a _PyEval_EvalFrameDefault()  ???:0
 8 0x00000000002a4d36 _PyEval_Vector()  :0
 9 0x00000000002a43ef PyEval_EvalCode()  ???:0
10 0x00000000002c2f2a run_eval_code_obj()  :0
11 0x00000000002bf343 run_mod()  :0
12 0x00000000002d4300 pyrun_file()  :0
13 0x00000000002d3c5e _PyRun_SimpleFileObject()  ???:0
14 0x00000000002d3a44 _PyRun_AnyFileObject()  ???:0
15 0x00000000002cdbdf Py_RunMain()  ???:0
16 0x0000000000292f97 Py_BytesMain()  ???:0
17 0x000000371281ed1d __libc_start_main()  ???:0
18 0x0000000000292e3d _start()  ???:0
=================================
 0 0x00000000001ed0ae _PyEval_EvalFrameDefault()  ???:0
 1 0x000000000020f121 _PyFunction_Vectorcall()  ???:0
 2 0x00000000001ff1b6 object_vacall()  :0
 3 0x000000000022f12a PyObject_CallMethodObjArgs()  ???:0
 4 0x00000000001266f6 PyImport_ImportModuleLevelObject.cold()  :0
 5 0x00000000001f2c75 _PyEval_EvalFrameDefault()  ???:0
 6 0x000000000022fc74 _PyObject_VectorcallTstate.lto_priv.14()  :0
 7 0x00000000001f0a4a _PyEval_EvalFrameDefault()  ???:0
 8 0x00000000002a4d36 _PyEval_Vector()  :0
 9 0x00000000002a43ef PyEval_EvalCode()  ???:0
10 0x00000000002c2f2a run_eval_code_obj()  :0
11 0x00000000002bf343 run_mod()  :0
12 0x00000000002d4300 pyrun_file()  :0
13 0x00000000002d3c5e _PyRun_SimpleFileObject()  ???:0
14 0x00000000002d3a44 _PyRun_AnyFileObject()  ???:0
15 0x00000000002cdbdf Py_RunMain()  ???:0
16 0x0000000000292f97 Py_BytesMain()  ???:0
17 0x000000371281ed1d __libc_start_main()  ???:0
18 0x0000000000292e3d _start()  ???:0
=================================
[com010:22978] *** Process received signal ***
[com010:22978] Signal: Segmentation fault (11)
[com010:22978] Signal code:  (-6)
[com010:22978] Failing at address: 0x2b8000059c2
[com010:22978] [ 0] /lib64/libpthread.so.0(+0x3712c0f7e0)[0x2b40e8f4d7e0]
[com010:22978] [ 1] /home1/p001cao/app/miniconda3/envs/py11gpaw_ucx/bin/python3.1(_PyEval_EvalFrameDefault+0xeee)[0x2b40e894a0ae]
[com010:22978] [com010:22976] *** Process received signal ***
[com010:22976] Signal: Segmentation fault (11)
[com010:22976] Signal code:  (-6)
[com010:22976] Failing at address: 0x2b8000059c0

The same code runs without any error using OpenIB + OpenMPI.

Can anyone help me?

Thank you so much.