openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

ib_device.c:475 UCX WARN IB Async event on mlx5_0: GID table change on port 1 #7613

Open oleotiger opened 2 years ago

oleotiger commented 2 years ago

Describe the bug

When I run the OSU benchmarks built with HPC-X, I get this warning: [1635835013.823013] [node181:6471 :async] ib_device.c:475 UCX WARN IB Async event on mlx5_0: GID table change on port 1

I found issue 1845, where someone said this is not a bug. Why does this warning appear with HPC-X? By the way, I'm using RoCE, and the Mellanox network card is running in Ethernet mode. Nobody restarted OpenSM.

Steps to Reproduce

Setup and versions

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         12.28.2006
        node_guid:                      9803:9b03:0087:f102
        sys_image_guid:                 9803:9b03:0087:f102
        vendor_id:                      0x02c9
        vendor_part_id:                 4115
        hw_ver:                         0x0
        board_id:                       MT_2180110032
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffffffffff000
        max_qp:                         262144
        max_qp_wr:                      32768
        device_cap_flags:               0xed721c36
                                        BAD_PKEY_CNTR
                                        BAD_QKEY_CNTR
                                        AUTO_PATH_MIG
                                        CHANGE_PHY_PORT
                                        PORT_ACTIVE_EVENT
                                        SYS_IMAGE_GUID
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
                                        RAW_IP_CSUM
                                        MANAGED_FLOW_STEERING
                                        Unknown flags: 0xC8400000
        max_sge:                        30
        max_sge_rd:                     30
        max_cq:                         16777216
        max_cqe:                        4194303
        max_mr:                         16777216
        max_pd:                         16777216
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                4194304
        max_qp_init_rd_atom:            16
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         16777216
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  2097152
        max_mcast_qp_attach:            240
        max_total_mcast_qp_attach:      503316480
        max_ah:                         2147483647
        max_fmr:                        0
        max_srq:                        8388608
        max_srq_wr:                     32767
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             16
        general_odp_caps:
                                        ODP_SUPPORT
                                        ODP_SUPPORT_IMPLICIT
        rc_odp_caps:
                                        SUPPORT_SEND
                                        SUPPORT_RECV
                                        SUPPORT_WRITE
                                        SUPPORT_READ
                                        SUPPORT_SRQ
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        SUPPORT_SEND
        xrc_odp_caps:
                                        SUPPORT_SEND
                                        SUPPORT_WRITE
                                        SUPPORT_READ
                                        SUPPORT_SRQ
        completion timestamp_mask:                      0x7fffffffffffffff
        hca_core_clock:                 156250kHZ
        raw packet caps:
                                        C-VLAN stripping offload
                                        Scatter FCS offload
                                        IP csum offload
                                        Delay drop
        device_cap_flags_ex:            0x20000055ED721C36
                                        RAW_SCATTER_FCS
                                        PCI_WRITE_END_PADDING
                                        Unknown flags: 0x2000004100000000
        tso_caps:
                max_tso:                        262144
                supported_qp:
                                        SUPPORT_RAW_PACKET
        rss_caps:
                max_rwq_indirection_tables:                     65536
                max_rwq_indirection_table_size:                 2048
                rx_hash_function:                               0x1
                rx_hash_fields_mask:                            0x800000FF
                supported_qp:
                                        SUPPORT_RAW_PACKET
        max_wq_type_rq:                 8388608
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported

        cq moderation caps:
                max_cq_count:   65535
                max_cq_period:  4095 us

        num_comp_vectors:               63
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x04010000
                        port_cap_flags2:        0x0000
                        max_vl_num:             invalid value (0)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            256
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           25.0 Gbps (32)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:9a03:9bff:fe87:f102, RoCE v1
                        GID[  1]:               fe80::9a03:9bff:fe87:f102, RoCE v2
                        GID[  4]:               fe80:0000:0000:0000:9a03:9bff:fe87:f102, RoCE v1
                        GID[  5]:               fe80::9a03:9bff:fe87:f102, RoCE v2
                        GID[  6]:               0000:0000:0000:0000:0000:ffff:a001:01b5, RoCE v1
                        GID[  7]:               ::ffff:160.1.1.181, RoCE v2
                        GID[  8]:               0000:0000:0000:0000:0000:ffff:a101:01b5, RoCE v1
                        GID[  9]:               ::ffff:161.1.1.181, RoCE v2

Additional information (depending on the issue)

yosefe commented 2 years ago

@oleotiger if there is a change in the IP configuration of the RoCE network device, it may be causing a GID_CHANGE event. Some changes may cause running applications using UCX to stop working, in case the IP/GID it was using is affected.
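To illustrate why an IP change shows up as a GID change: on RoCE ports, the IPv4-based GID entries in the listing above are the IPv4-mapped IPv6 form of the interface address (GID[6] and GID[7] both encode 160.1.1.181). The following is a minimal sketch of that mapping; the helper names `ipv4_to_roce_gid` and `gid_to_ipv4` are made up for this example and are not part of UCX or rdma-core.

```python
import ipaddress


def ipv4_to_roce_gid(ipv4: str) -> str:
    """Return the IPv4-mapped IPv6 address (::ffff:a.b.c.d) in expanded hex form.

    Hypothetical helper: it mirrors how the IPv4-based GIDs in the
    ibv_devinfo listing relate to the interface address.
    """
    v4 = ipaddress.IPv4Address(ipv4)
    mapped = ipaddress.IPv6Address((0xFFFF << 32) | int(v4))
    return mapped.exploded


def gid_to_ipv4(gid: str):
    """Return the embedded IPv4 address as a string, or None if the GID is not IPv4-mapped."""
    v4 = ipaddress.IPv6Address(gid).ipv4_mapped
    return None if v4 is None else str(v4)


if __name__ == "__main__":
    for addr in ("160.1.1.181", "161.1.1.181"):
        print(addr, "->", ipv4_to_roce_gid(addr))
```

If an address on the interface is added, removed, or renewed, the corresponding entry is rewritten, which is the kind of change that would plausibly raise the event.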

oleotiger commented 2 years ago

> @oleotiger if there is a change in the IP configuration of the RoCE network device, it may be causing a GID_CHANGE event. Some changes may cause running applications using UCX to stop working, in case the IP/GID it was using is affected.

I'm sure that the IP configuration of the RoCE network device did not change while the application was running. Could anything else cause UCX to print the warning?

yosefe commented 2 years ago

> I'm sure that the IP configuration of the RoCE network device did not change while the application was running. Could anything else cause UCX to print the warning?

Can you try to capture the dmesg log from the time period when the UCX application was running?
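To line up the warning with `dmesg -T` output, it helps to turn the bracketed number in the UCX log line into a wall-clock time. The sketch below assumes that number is seconds since the Unix epoch and that the line follows the layout of the warning quoted in this issue; the regular expression and the `parse_ucx_log_line` name are illustrative, not a UCX interface.

```python
import re
from datetime import datetime, timezone

# Layout assumed from the warning quoted in this issue:
# [<epoch seconds>] [<host>:<pid> :<thread>] <file>:<line> UCX <LEVEL> <message>
UCX_LINE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:(?P<thread>[^\]]+)\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+(?P<level>\w+)\s+(?P<msg>.*)"
)


def parse_ucx_log_line(line: str):
    """Split a UCX log line into its fields and add a UTC datetime, or return None."""
    match = UCX_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: the timestamp is seconds since the Unix epoch.
    fields["when"] = datetime.fromtimestamp(float(fields["ts"]), tz=timezone.utc)
    return fields


if __name__ == "__main__":
    sample = ("[1635835013.823013] [node181:6471 :async] ib_device.c:475 UCX WARN "
              "IB Async event on mlx5_0: GID table change on port 1")
    info = parse_ucx_log_line(sample)
    print(info["when"].isoformat(), info["host"], info["level"], info["msg"])
```

The resulting time is in UTC, so convert it to the node's local time zone before comparing it with `dmesg -T`.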

oleotiger commented 2 years ago

First I cleared the dmesg log with dmesg -C. Then I started the UCX application, and the warning ib_device.c:482 UCX WARN IB Async event on mlx5_0: GID table change on port 1 was printed.

dmesg -T shows an empty log.

yosefe commented 2 years ago

@jgunthorpe why could such an event be generated without any network change?

jgunthorpe commented 2 years ago

Most likely there is a configuration change, for instance an IPv6 temporary privacy address rollover. Monitor for changes with iproute to confirm.
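Besides watching addresses with iproute2 (for example `ip monitor address`), the GID table itself can be snapshotted from sysfs before and after a run and then compared. This is a rough sketch, assuming the usual `/sys/class/infiniband/<device>/ports/<port>/gids/<index>` layout; `read_gid_table` and `diff_gid_tables` are hypothetical helper names.

```python
from pathlib import Path

# Unused GID slots read back as all zeros.
EMPTY_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def read_gid_table(gid_dir):
    """Return {index: gid} for the populated entries under a sysfs gids directory.

    A missing directory yields an empty table instead of an error.
    """
    table = {}
    root = Path(gid_dir)
    if not root.is_dir():
        return table
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            gid = entry.read_text().strip()
        except OSError:
            continue  # skip entries the kernel refuses to read
        if gid and gid != EMPTY_GID:
            table[int(entry.name)] = gid
    return table


def diff_gid_tables(before, after):
    """Return a sorted list of (index, old_gid, new_gid) for entries that differ."""
    changes = []
    for index in sorted(set(before) | set(after)):
        old, new = before.get(index), after.get(index)
        if old != new:
            changes.append((index, old, new))
    return changes
```

For the device in this issue, one would call `read_gid_table("/sys/class/infiniband/mlx5_0/ports/1/gids")` once before starting the benchmark and once after the warning appears, then pass both results to `diff_gid_tables`. An empty diff would suggest the table content ended up unchanged even though an event was raised.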

oleotiger commented 2 years ago

I don't think there was an IPv6 temporary privacy address rollover while UCX was running.

I disabled IPv6 by setting net.ipv6.conf.all.disable_ipv6=1, but the warning didn't disappear.

yosefe commented 2 years ago

Well, we can reduce the log level of these warnings to DIAG (hidden by default).

oleotiger commented 2 years ago

We can hide the message by setting the log level, but I want to find the real reason UCX prints these warnings. Are there any other ideas or ways to locate the problem?
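For reference, hiding the message through the log level would look roughly like the sketch below, which sets the `UCX_LOG_LEVEL` environment variable for a child process. The `ucx_env` helper is made up for this example, and `error` hides every UCX warning, not only this one.

```python
import os


def ucx_env(log_level="error", base=None):
    """Return a copy of the environment with UCX_LOG_LEVEL set.

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["UCX_LOG_LEVEL"] = log_level
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the benchmark. With Open MPI, the variable also has to reach the remote ranks, for example through `mpirun -x UCX_LOG_LEVEL`.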

yosefe commented 2 years ago

According to the previous comments, there is no obvious reason for these warnings. Please open a case for Nvidia networking support.

oleotiger commented 2 years ago

OK, I have opened a case in the Mellanox community. This issue can be closed. Thank you for your reply.

bmb commented 2 years ago

Also seeing this issue on a fresh CentOS Stream 8 host with the same UCX version installed. Warnings are printed to the terminal only when an MPI application is running. I'm using the system-provided Open MPI 4.1.1. Another odd thing is that MPI is run with --mca btl self,vader, so it is not even using the Mellanox device in this system.