sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
722 stars 1.38k forks source link

[EVPN] MAC move from remote to Local does not work #12363

Closed dgsudharsan closed 1 year ago

dgsudharsan commented 1 year ago

Description

When a MAC that's learnt remotely, is learnt local the move doesn't work. The issue is due to a race condition. On detecting mac move, fdbsyncd tries to delete remote mac first and then replace it with local mac. However frr will try to reinstall the remote mac as the local vtep cannot delete remote mac. This results in remote mac being reinstalled after the local mac is updated which results mac pointing to remote vtep even after the move.

Logs from fdbsyncd:

Oct 12 02:40:03.850516 sonic INFO swss#orchagent: message repeated 56 times: [ :- set: setting attribute 0x10000004 status: SAI_STATUS_SUCCESS]
Oct 12 02:40:03.850516 sonic INFO swss#orchagent: :- update: FDB event:2, MAC: 00:01:00:00:00:01 , BVID: 0x26000000000b0d ,                    bridge port ID: 0x3a000000000b0e.
Oct 12 02:40:03.850516 sonic INFO swss#orchagent: :- update: Received MOVE event for bvid=0x26000000000b0d mac=00:01:00:00:00:01 port=0x3a000000000b0e
Oct 12 02:40:03.850516 sonic INFO swss#orchagent: :- storeFdbEntryState: FdbOrch notification: mac 00:01:00:00:00:01 was inserted in port Ethernet64 into bv_id 0x26000000000b0d
Oct 12 02:40:03.850516 sonic INFO swss#orchagent: :- storeFdbEntryState: m_entries size=1 mac=00:01:00:00:00:01 port=0x3a000000000b0e
Oct 12 02:40:03.850827 sonic INFO swss#fdbsyncd: :- processStateFdb: FDBSYNCD STATE FDB updates key=Vlan11:00:01:00:00:00:01, operation=SET
Oct 12 02:40:03.850827 sonic INFO swss#fdbsyncd: :- processStateFdb:  FDBSYNCD STATE FDB updates : FvFiels port, FvValues: Ethernet64
Oct 12 02:40:03.850827 sonic INFO swss#fdbsyncd: :- processStateFdb:  FDBSYNCD STATE FDB updates : FvFiels type, FvValues: dynamic
Oct 12 02:40:03.856634 sonic INFO swss#fdbsyncd: :- macDelVxlanEntry: Success cmd: bridge fdb del 00:01:00:00:00:01 dev VTEP1-11 dst 2.2.2.2 vlan 11, res=, ret=0
Oct 12 02:40:03.856634 sonic INFO swss#fdbsyncd: :- updateLocalMac: Local learn event deleting from VXLAN table DEL_KEY Vlan11:00:01:00:00:00:01
Oct 12 02:40:03.856634 sonic INFO swss#fdbsyncd: :- macDelVxlan: DEL_KEY Vlan11:00:01:00:00:00:01 vtep:2.2.2.2 type:dynamic
Oct 12 02:40:03.856634 sonic NOTICE swss#fdbsyncd: :- macDelVxlanDB: VXLAN_FDB_TABLE: DEL_KEY Vlan11:00:01:00:00:00:01 vtep:2.2.2.2 type:dynamic
Oct 12 02:40:03.857513 sonic INFO swss#orchagent: :- removeFdbEntry: FdbOrch RemoveFDBEntry: mac=00:01:00:00:00:01 bv_id=0x26000000000b0dorigin 4
Oct 12 02:40:03.857513 sonic INFO swss#orchagent: :- removeFdbEntry: FdbOrch RemoveFDBEntry: mac=00:01:00:00:00:01 fdb origin is different; found_origin:1 delete_origin:4
Oct 12 02:40:03.861031 sonic INFO swss#fdbsyncd: :- updateLocalMac: cmd: bridge fdb replace 00:01:00:00:00:01 dev Ethernet64 master dynamic vlan 11, res=, ret=0
Oct 12 02:40:03.861031 sonic INFO swss#fdbsyncd: :- onMsgNbr: Tunnel IP 2.2.2.2 Int33686018
Oct 12 02:40:03.861125 sonic INFO swss#fdbsyncd: :- onMsgNbr: Tunnel IP 2.2.2.2 Int33686018
Oct 12 02:40:03.861125 sonic INFO swss#fdbsyncd: :- macAddVxlan: VXLAN_FDB_TABLE: ADD_KEY Vlan11:00:01:00:00:00:01 vtep:2.2.2.2 type:dynamic
Oct 12 02:40:03.861865 sonic INFO swss#orchagent: :- addFdbEntry: mac=00:01:00:00:00:01 bv_id=0x26000000000b0d port_name=Port_SRC_VTEP_1.1.1.1 type=dynamic origin=4 remote_ip=2.2.2.2
Oct 12 02:40:03.861865 sonic INFO swss#orchagent: :- addFdbEntry: MAC-Update FDB 00:01:00:00:00:01 in Vlan11 on from-Ethernet64:to-Port_SRC_VTEP_1.1.1.1 from-dynamic:to-dynamic origin-1-to-4
Oct 12 02:40:03.862440 qa-eth-vt03-1-4600ca1 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set TYPE, key:fdb entry mac [00:01:00:00:00:01] bv_id b00000026 (VLAN), val:STATIC
Oct 12 02:40:03.864124 qa-eth-vt03-1-4600ca1 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set ALLOW_MAC_MOVE, key:fdb entry mac [00:01:00:00:00:01] bv_id b00000026 (VLAN), val:true
Oct 12 02:40:03.865344 qa-eth-vt03-1-4600ca1 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set BRIDGE_PORT_ID, key:fdb entry mac [00:01:00:00:00:01] bv_id b00000026 (VLAN), val:BRIDGE_PORT,(0:0),205,0000,0
Oct 12 02:40:03.866570 qa-eth-vt03-1-4600ca1 NOTICE syncd#SDK: [SAI_UTILS.NOTICE] mlnx_sai_utils.c[1833]- set_dispatch_attrib_handler: Set ENDPOINT_IP, key:fdb entry mac [00:01:00:00:00:01] bv_id b00000026 (VLAN), val:2.2.2.2
Oct 12 02:40:03.867593 sonic INFO swss#fdbsyncd: :- processStateFdb: FDBSYNCD STATE FDB updates key=Vlan11:00:01:00:00:00:01, operation=DEL
Oct 12 02:40:03.867619 sonic INFO swss#fdbsyncd: :- macCheckSrcDB: DEL_KEY Vlan11:00:01:00:00:00:01
Oct 12 02:40:03.870828 sonic INFO swss#fdbsyncd: :- updateLocalMac: cmd: bridge fdb del 00:01:00:00:00:01 dev Ethernet64 master dynamic vlan 11, res=, ret=0
Oct 12 02:40:03.870828 sonic INFO swss#fdbsyncd: :- macRefreshStateDB: Refreshing Vlan:11 MAC route MAC:00:01:00:00:00:01 Key Vlan11:00:01:00:00:00:01
Oct 12 02:40:05.055843 sonic INFO swss#orchagent: :- set: setting attribute 0x10000004 status: SAI_STATUS_SUCCESS

Steps to reproduce the issue:

  1. Using EVPN learn mac in VTEP2 and advertise it as remote mac to VTEP1
  2. Try to learn the mac locally again in VTEP1
  3. Check show vxlan remotemac tables

Describe the results you received:

The mac is not learnt locally. It still points remote

Describe the results you expected:

The mac should move to local. It shouldn't point to remote.

Output of show version:

SONiC Software Version: SONiC.202205.42-ea51d9514_Internal
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: ea51d9514
Build date: Fri Oct  7 05:45:56 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

Platform: x86_64-mlnx_msn4600c-r0
HwSKU: ACS-MSN4600C
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2140X00042
Model Number: MSN4600-CS2FO_QP
Hardware Revision: A1
Uptime: 04:16:10 up  6:39,  2 users,  load average: 0.53, 0.55, 0.56
Date: Wed 12 Oct 2022 04:16:10

Docker images:
REPOSITORY                                         TAG                            IMAGE ID       SIZE
docker-syncd-mlnx                                  202205.42-ea51d9514_Internal   97dd12cebe1e   859MB
docker-syncd-mlnx                                  latest                         97dd12cebe1e   859MB
docker-orchagent                                   202205.42-ea51d9514_Internal   347f0cdc723f   478MB
docker-orchagent                                   latest                         347f0cdc723f   478MB
docker-fpm-frr                                     202205.42-ea51d9514_Internal   ddadceae2d69   488MB
docker-fpm-frr                                     latest                         ddadceae2d69   488MB
docker-teamd                                       202205.42-ea51d9514_Internal   28f79f968d3c   459MB
docker-teamd                                       latest                         28f79f968d3c   459MB
docker-platform-monitor                            202205.42-ea51d9514_Internal   629c9ea03cf2   861MB
docker-platform-monitor                            latest                         629c9ea03cf2   861MB
docker-macsec                                      latest                         a7ea8b95281f   461MB
docker-snmp                                        202205.42-ea51d9514_Internal   0e96a62d07ee   488MB
docker-snmp                                        latest                         0e96a62d07ee   488MB
docker-dhcp-relay                                  latest                         8cef09a39edf   452MB
docker-lldp                                        202205.42-ea51d9514_Internal   337146c6b971   485MB
docker-lldp                                        latest                         337146c6b971   485MB
docker-mux                                         202205.42-ea51d9514_Internal   464339799d55   492MB
docker-mux                                         latest                         464339799d55   492MB
docker-sonic-telemetry                             202205.42-ea51d9514_Internal   7fc604d28c7c   523MB
docker-sonic-telemetry                             latest                         7fc604d28c7c   523MB
docker-database                                    202205.42-ea51d9514_Internal   98a7bdcfd7e8   443MB
docker-database                                    latest                         98a7bdcfd7e8   443MB
docker-router-advertiser                           202205.42-ea51d9514_Internal   f05c810acb38   443MB
docker-router-advertiser                           latest                         f05c810acb38   443MB
docker-nat                                         202205.42-ea51d9514_Internal   272fda2cdf1a   430MB
docker-nat                                         latest                         272fda2cdf1a   430MB
docker-sflow                                       202205.42-ea51d9514_Internal   5723c8d63918   428MB
docker-sflow                                       latest                         5723c8d63918   428MB
docker-sonic-mgmt-framework                        202205.42-ea51d9514_Internal   0fd3a3d91b98   557MB
docker-sonic-mgmt-framework                        latest                         0fd3a3d91b98   557MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.3.1-202205                   4e8b9199b984   643MB

Output of show techsupport:

VTEP 1: qa-eth-vt03-1-4600ca1
VTEP 2: qa-eth-vt03-2-3700v
MAC used: 00:01:00:00:00:01

First learnt locally in VTEP2. VTEP1 learns it as remote mac throug EVPN.
Then traffic with same mac is sent again on VTEP1 local port. However mac move doesn't happen.

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_qa-eth-vt03-1-4600ca1_20221012_033236.tar.gz sonic_dump_qa-eth-vt03-2-3700v_20221012_033232.tar.gz

dgsudharsan commented 1 year ago

@adyeung @srj102 @prsunny FYI. I believe the issue is due to the ordering of events in fpmsyncd. Should replace be done first followed by del?

dgsudharsan commented 1 year ago

Debug from frr show that bridge fdb delete command deletes both fdb entries in the kernel which might be the issue. Deleting the mac without dst results in readd

bridge fdb show | grep 00:01:00:00:00:01 00:01:00:00:00:01 dev VTEP2-11 vlan 11 extern_learn master Bridge 00:01:00:00:00:01 dev VTEP2-11 dst 1.1.1.1 self extern_learn

2022/10/12 04:13:49.672998 ZEBRA: [QFR1P-4MVVD] Rx RTM_DELNEIGH AF_BRIDGE IF 84 st 0x2 fl 0x12 MAC 00:01:00:00:00:01 dst 1.1.1.1 nhg 0 2022/10/12 04:13:49.673030 ZEBRA: [KMXEB-K771Y] netlink_parse_info: netlink-listen (NS 0) type RTM_DELNEIGH(29), len=40, seq=0, pid=0 2022/10/12 04:13:49.673033 ZEBRA: [QFR1P-4MVVD] Rx RTM_DELNEIGH AF_BRIDGE IF 84 st 0xc0 fl 0x2 MAC 00:01:00:00:00:01 nhg 0 2022/10/12 04:13:49.673047 ZEBRA: [QHNV9-0JJF7] dpDel remote MAC 00:01:00:00:00:01 intf VTEP2-11(84) VNI 770011 - readd 2022/10/12 04:13:49.673180 ZEBRA: [MMX22-H2MY5] Tx RTM_NEWNEIGH family bridge IF VTEP2-11(84) VLAN 11 MAC 00:01:00:00:00:01 dst 1.1.1.1 nhg 0 rem

adyeung commented 1 year ago

@kishorekunal01 has the RC, he will post a fix

dgsudharsan commented 1 year ago

Fixed in https://github.com/sonic-net/sonic-swss/pull/2521