Open KOBAYASHI-Hiro opened 3 years ago
@KOBAYASHI-Hiro
_ZSt14__convert_to_vIdEvPKcRT_RSt12_Ios_IostateRKP15__locale_struct
which is not UCX code

I face a signal 11 error when using UCX 1.15 + OpenMPI 4.1.x + Python 3.11:
==== backtrace (tid: 22978) ====
0 0x00000000001ed0ae _PyEval_EvalFrameDefault() ???:0
1 0x000000000020f121 _PyFunction_Vectorcall() ???:0
2 0x00000000001ff1b6 object_vacall() :0
3 0x000000000022f12a PyObject_CallMethodObjArgs() ???:0
4 0x00000000001266f6 PyImport_ImportModuleLevelObject.cold() :0
5 0x00000000001f2c75 _PyEval_EvalFrameDefault() ???:0
6 0x000000000022fc74 _PyObject_VectorcallTstate.lto_priv.14() :0
7 0x00000000001f0a4a _PyEval_EvalFrameDefault() ???:0
8 0x00000000002a4d36 _PyEval_Vector() :0
9 0x00000000002a43ef PyEval_EvalCode() ???:0
10 0x00000000002c2f2a run_eval_code_obj() :0
11 0x00000000002bf343 run_mod() :0
12 0x00000000002d4300 pyrun_file() :0
13 0x00000000002d3c5e _PyRun_SimpleFileObject() ???:0
14 0x00000000002d3a44 _PyRun_AnyFileObject() ???:0
15 0x00000000002cdbdf Py_RunMain() ???:0
16 0x0000000000292f97 Py_BytesMain() ???:0
17 0x000000371281ed1d __libc_start_main() ???:0
18 0x0000000000292e3d _start() ???:0
=================================
0 0x00000000001ed0ae _PyEval_EvalFrameDefault() ???:0
1 0x000000000020f121 _PyFunction_Vectorcall() ???:0
2 0x00000000001ff1b6 object_vacall() :0
3 0x000000000022f12a PyObject_CallMethodObjArgs() ???:0
4 0x00000000001266f6 PyImport_ImportModuleLevelObject.cold() :0
5 0x00000000001f2c75 _PyEval_EvalFrameDefault() ???:0
6 0x000000000022fc74 _PyObject_VectorcallTstate.lto_priv.14() :0
7 0x00000000001f0a4a _PyEval_EvalFrameDefault() ???:0
8 0x00000000002a4d36 _PyEval_Vector() :0
9 0x00000000002a43ef PyEval_EvalCode() ???:0
10 0x00000000002c2f2a run_eval_code_obj() :0
11 0x00000000002bf343 run_mod() :0
12 0x00000000002d4300 pyrun_file() :0
13 0x00000000002d3c5e _PyRun_SimpleFileObject() ???:0
14 0x00000000002d3a44 _PyRun_AnyFileObject() ???:0
15 0x00000000002cdbdf Py_RunMain() ???:0
16 0x0000000000292f97 Py_BytesMain() ???:0
17 0x000000371281ed1d __libc_start_main() ???:0
18 0x0000000000292e3d _start() ???:0
=================================
[com010:22978] *** Process received signal ***
[com010:22978] Signal: Segmentation fault (11)
[com010:22978] Signal code: (-6)
[com010:22978] Failing at address: 0x2b8000059c2
[com010:22978] [ 0] /lib64/libpthread.so.0(+0x3712c0f7e0)[0x2b40e8f4d7e0]
[com010:22978] [ 1] /home1/p001cao/app/miniconda3/envs/py11gpaw_ucx/bin/python3.1(_PyEval_EvalFrameDefault+0xeee)[0x2b40e894a0ae]
[com010:22978] [com010:22976] *** Process received signal ***
[com010:22976] Signal: Segmentation fault (11)
[com010:22976] Signal code: (-6)
[com010:22976] Failing at address: 0x2b8000059c0
The same code runs without any error using OpenIB + OpenMPI.
Can anyone help me?
Thank you so much.
Describe the bug
"Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))" occurs in 20-30% of jobs when running mpirun with UCX 1.9.0 or later. The problem does not occur with UCX 1.7.0.
==== backtrace (tid: 100539) ====
0 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2b19bd0b4444]
1 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x2476c) [0x2b19bd0b476c]
2 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x249db) [0x2b19bd0b49db]
3 /usr/lib64/libpthread.so.0(+0xf630) [0x2b19bb75a630]
4 /usr/lib64/libstdc++.so.6(_ZSt14__convert_to_vIdEvPKcRT_RSt12_Ios_IostateRKP15__locale_struct+0x30) [0x2b19bafa7970]
5 /usr/lib64/libstdc++.so.6(_ZNKSt7num_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE6do_getES3_S3_RSt8ios_baseRSt12_Ios_IostateRd+0xf1) [0x2b19bafbc4b1]
6 /usr/lib64/libstdc++.so.6(_ZNSi10_M_extractIdEERSiRT_+0x75) [0x2b19bafac285]
7 ./hoge-cuda-mpi() [0x42b2b0]
8 ./hoge-cuda-mpi() [0x41c9fa]
9 ./hoge-cuda-mpi() [0x40600c]
10 /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b19bb989555]
11 ./hoge-cuda-mpi() [0x409a08]
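To make it easier to see which library the fault actually happened in, here is a minimal sketch that parses the first frames of the backtrace above. It rests on an assumption that is not stated in the report: the frames before the libpthread entry belong to the signal handler (here the UCX error handler), so the first frame after it is where the fault was raised. The helper name `faulting_frame` is made up for this illustration.

```python
import os
import re

# First eight frames copied from the backtrace in this report.
FRAMES = """\
0 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2b19bd0b4444]
1 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x2476c) [0x2b19bd0b476c]
2 /hoge/apps/rhel79/ucx/1.9.0-cuda10.2/lib/libucs.so.0(+0x249db) [0x2b19bd0b49db]
3 /usr/lib64/libpthread.so.0(+0xf630) [0x2b19bb75a630]
4 /usr/lib64/libstdc++.so.6(_ZSt14__convert_to_vIdEvPKcRT_RSt12_Ios_IostateRKP15__locale_struct+0x30) [0x2b19bafa7970]
5 /usr/lib64/libstdc++.so.6(_ZNKSt7num_getIcSt19istreambuf_iteratorIcSt11char_traitsIcEEE6do_getES3_S3_RSt8ios_baseRSt12_Ios_IostateRd+0xf1) [0x2b19bafbc4b1]
6 /usr/lib64/libstdc++.so.6(_ZNSi10_M_extractIdEERSiRT_+0x75) [0x2b19bafac285]
7 ./hoge-cuda-mpi() [0x42b2b0]
"""

# One frame per line: "<index> <path>(<symbol+offset>) [<address>]"
FRAME_RE = re.compile(r"^(\d+) (\S+?)\((.*?)\) \[(0x[0-9a-f]+)\]$", re.MULTILINE)


def faulting_frame(backtrace):
    """Return (index, library) of the first frame after the libpthread frame.

    Assumption: everything up to and including the libpthread frame is
    signal-handling machinery, not the code that faulted.
    """
    frames = [
        (int(index), os.path.basename(path))
        for index, path, _symbol, _address in FRAME_RE.findall(backtrace)
    ]
    for pos, (_index, library) in enumerate(frames):
        if library.startswith("libpthread") and pos + 1 < len(frames):
            return frames[pos + 1]
    return None


print(faulting_frame(FRAMES))
```

Under that assumption the result points at frame 4 in libstdc++, with the application's own frames (7 onward) as the callers.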
If you run the 'ucx_info -d' command on the node after the symptom occurs, gdr_copy, cuda_cpy, and cuda_ipc no longer appear among the memory domains.
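The check described above could be scripted roughly as follows. This is only a sketch: it assumes that 'ucx_info -d' prints each memory domain on a line of the form "# Memory domain: <name>", and the sample text is a hypothetical, heavily abbreviated output rather than anything captured from the affected node. The function names are invented for the example.

```python
import re

# Memory domains that the report says disappear after the crash.
EXPECTED_CUDA_DOMAINS = {"gdr_copy", "cuda_cpy", "cuda_ipc"}


def memory_domains(ucx_info_output):
    """Collect the names found on '# Memory domain: <name>' lines."""
    pattern = re.compile(r"^#\s*Memory domain:\s*(\S+)", re.MULTILINE)
    return set(pattern.findall(ucx_info_output))


def missing_cuda_domains(ucx_info_output):
    """Return the expected CUDA-related domains that are not listed."""
    return EXPECTED_CUDA_DOMAINS - memory_domains(ucx_info_output)


# Hypothetical, abbreviated output for illustration only.
sample = """\
#
# Memory domain: posix
#     Component: posix
#
# Memory domain: mlx5_0
#     Component: ib
"""

print(sorted(missing_cuda_domains(sample)))
```

On a real node the text would come from running 'ucx_info -d' (for example through `subprocess.run(["ucx_info", "-d"], capture_output=True, text=True).stdout`); an empty result would mean all three domains are still present.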
UCX LOG
[node36:112909:0] ib_md.c:379 UCX DEBUG ibv_reg_mr(address=0x2ab6ae000000, length=268435456, access=0xf) failed: Input/output error
[node36:112909:0] rcache.c:873 UCX DEBUG failed to register region 0x6eef640 [0x2ab6ae000000..0x2ab6be000000]: Input/output error
[node36:112909:0] ucp_mm.c:149 UCX DIAG failed to register address 0x2ab6ae000000 mem_type bit 0x2 length 268435456 on md[7]=mlx5_0 : Input/output error (md reg_mem_types 0x3)
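As a quick sanity check on the numbers in this log, the sketch below pulls the address and length out of the ibv_reg_mr line and compares them with the region range in the rcache line. The log text is copied from this report; the regular expressions are just one way to read it and are not part of UCX.

```python
import re

# The first two log lines from this report.
LOG = (
    "[node36:112909:0] ib_md.c:379 UCX DEBUG ibv_reg_mr(address=0x2ab6ae000000, "
    "length=268435456, access=0xf) failed: Input/output error\n"
    "[node36:112909:0] rcache.c:873 UCX DEBUG failed to register region 0x6eef640 "
    "[0x2ab6ae000000..0x2ab6be000000]: Input/output error\n"
)

# Address and length passed to ibv_reg_mr.
reg = re.search(r"ibv_reg_mr\(address=(0x[0-9a-f]+), length=(\d+)", LOG)
address, length = int(reg.group(1), 16), int(reg.group(2))

# Start and end of the region reported by the registration cache.
region = re.search(r"\[(0x[0-9a-f]+)\.\.(0x[0-9a-f]+)\]", LOG)
start, end = int(region.group(1), 16), int(region.group(2), 16)

print(length // 2**20, "MiB")                  # size of the failed registration
print(start == address, end - start == length)  # both lines describe the same buffer
```

So both messages refer to the same 256 MiB buffer, and the registration of that whole buffer on mlx5_0 is what fails with "Input/output error".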
Steps to Reproduce
mpirun -np 4 -npernode 4 ./hoge-cuda-mpi
UCX: 1.9.0, 1.10.1, or 1.11
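Since the failure only shows up in a fraction of jobs, a small loop can help estimate how often it happens when comparing UCX versions. The following is a sketch, not a tested reproducer: `failure_rate` is a made-up helper, and it assumes that mpirun exits with a non-zero status when a rank dies from the segmentation fault. A stand-in command is used so the example runs anywhere.

```python
import subprocess
import sys


def failure_rate(cmd, runs):
    """Run cmd `runs` times and return the fraction of non-zero exit statuses."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures / runs


# Stand-in command that always succeeds. For the real job, replace it with
# ["mpirun", "-np", "4", "-npernode", "4", "./hoge-cuda-mpi"].
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
print(failure_rate(ok_cmd, runs=3))
```

With the real command and a few dozen runs, a value between 0.2 and 0.3 would match the rate described in this report.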
Setup and versions
CentOS Linux release 7.9.2009 (Core)
Kernel: 3.10.0-1160.15.2.el7.x86_64
OFED: 5.1-2.5.8
rdma-core-51mlnx1-1.51258.x86_64
libibverbs-51mlnx1-1.51258.x86_64
OpenMPI 4.0.3
GCC 8.3.1
GPU: Tesla V100 32GB PCIe
CUDA: 11.2.1 or 10.2
CUDA Driver: 460.32.03 or 465.19.01
nvidia-peer-memory: 1.1-0
$ ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b030096952c
    System image GUID: 0x98039b030096952c
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 314
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b030096952c
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b0300968f14
    System image GUID: 0x98039b0300968f14
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 309
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300968f14
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b0300968fa0
    System image GUID: 0x98039b0300968fa0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 333
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300968fa0
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x98039b0300968f24
    System image GUID: 0x98039b0300968f24
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 316
        LMC: 0
        SM lid: 14
        Capability mask: 0x2651e848
        Port GUID: 0x98039b0300968f24
        Link layer: InfiniBand
$ ibv_devinfo -vv
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:952c
    sys_image_guid: 9803:9b03:0096:952c
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36
        BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING
        Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36
        PCI_WRITE_END_PADDING
        Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC
hca_id: mlx5_1
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:8f14
    sys_image_guid: 9803:9b03:0096:8f14
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36
        BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING
        Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36
        PCI_WRITE_END_PADDING
        Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC
hca_id: mlx5_2
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:8fa0
    sys_image_guid: 9803:9b03:0096:8fa0
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36
        BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING
        Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36
        PCI_WRITE_END_PADDING
        Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC
hca_id: mlx5_3
    transport: InfiniBand (0)
    fw_ver: 20.29.1016
    node_guid: 9803:9b03:0096:8f24
    sys_image_guid: 9803:9b03:0096:8f24
    vendor_id: 0x02c9
    vendor_part_id: 4123
    hw_ver: 0x0
    board_id: MT_0000000222
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffffffffff000
    max_qp: 262144
    max_qp_wr: 32768
    device_cap_flags: 0xe97e1c36
        BAD_PKEY_CNTR BAD_QKEY_CNTR AUTO_PATH_MIG CHANGE_PHY_PORT PORT_ACTIVE_EVENT SYS_IMAGE_GUID RC_RNR_NAK_GEN MEM_WINDOW UD_IP_CSUM XRC MEM_MGT_EXTENSIONS MEM_WINDOW_TYPE_2B MANAGED_FLOW_STEERING
        Unknown flags: 0xC8480000
    max_sge: 30
    max_sge_rd: 30
    max_cq: 16777216
    max_cqe: 4194303
    max_mr: 16777216
    max_pd: 16777216
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4194304
    max_qp_init_rd_atom: 16
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 16777216
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 2097152
    max_mcast_qp_attach: 240
    max_total_mcast_qp_attach: 503316480
    max_ah: 2147483647
    max_fmr: 0
    max_srq: 8388608
    max_srq_wr: 32767
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 16
    general_odp_caps: ODP_SUPPORT ODP_SUPPORT_IMPLICIT
    rc_odp_caps: SUPPORT_SEND SUPPORT_RECV SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    uc_odp_caps: NO SUPPORT
    ud_odp_caps: SUPPORT_SEND
    xrc_odp_caps: SUPPORT_SEND SUPPORT_WRITE SUPPORT_READ SUPPORT_SRQ
    completion timestamp_mask: 0x7fffffffffffffff
    hca_core_clock: 156250kHZ
    device_cap_flags_ex: 0x30000051E97E1C36
        PCI_WRITE_END_PADDING
        Unknown flags: 0x3000004100000000
    tso_caps:
        max_tso: 0
    rss_caps:
        max_rwq_indirection_tables: 0
        max_rwq_indirection_table_size: 0
        rx_hash_function: 0x0
        rx_hash_fields_mask: 0x0
    max_wq_type_rq: 0
    packet_pacing_caps:
        qp_rate_limit_min: 0kbps
        qp_rate_limit_max: 0kbps
    max_rndv_hdr_size: 64
    max_num_tags: 127
    max_ops: 32768
    max_sge: 1
    flags: IBV_TM_CAP_RC