sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[TH3][202012] WARMReboot test in test_advanced_reboot.py failed due to MAC events reported by SAI to SONIC during warmboot #8558

Closed gechiang closed 2 years ago

gechiang commented 3 years ago

Description

While testing advanced reboot on TH3, we observed Orchagent crash during the warm-reboot portion of the test. Closer examination of the syslog showed that the crash was caused by BRCM SAI returning a set of bridge ports that belong to VLAN 1000 instead of the default VLAN (1). The meta checker detected that the reference count of such a bridge port was > 0 and rejected the attempt to remove these non-default bridge ports. The issue is easily reproduced by running "test_advanced_reboot.py" on a TH3 T0 DUT, where an Orchagent core is observed.
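To illustrate the failure mode described above, here is a toy sketch of a reference-count check. It is not the sairedis meta checker; the class `ToyMetaChecker` and the object names are hypothetical and only model the idea that an object still referenced by another object (here, a VLAN member pointing at a bridge port) cannot be removed.

```python
class ToyMetaChecker:
    """Hypothetical stand-in for a reference-counting validator (not sairedis code)."""

    def __init__(self):
        self.refcount = {}  # object name -> number of objects referencing it

    def create(self, oid, references=()):
        # Register a new object and bump the count of everything it points at.
        self.refcount[oid] = 0
        for ref in references:
            self.refcount[ref] += 1

    def remove(self, oid):
        # Refuse to remove an object that something else still references.
        # (Simplification: this toy does not track which objects the removed
        # object itself referenced, so it never decrements counts.)
        count = self.refcount[oid]
        if count > 0:
            raise RuntimeError(
                f"object {oid} reference count is {count}, can't remove"
            )
        del self.refcount[oid]


checker = ToyMetaChecker()
checker.create("bridge_port_A")                 # unreferenced bridge port
checker.create("bridge_port_B")                 # bridge port used by a VLAN member
checker.create("vlan1000_member", references=["bridge_port_B"])

checker.remove("bridge_port_A")                 # succeeds: nothing references it
try:
    checker.remove("bridge_port_B")             # rejected: still referenced
except RuntimeError as err:
    print(err)
```

In this model the second removal is rejected for the same kind of reason as in the syslog: the bridge port is still in use by a non-default VLAN.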

Here is the portion of the syslog that shows the issue:

...
Aug 23 18:28:05.723987 str2-z9332f-05 NOTICE swss#portmgrd: :- doTask: Configure Ethernet24 MTU to 9100
Aug 23 18:28:05.724304 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17
Aug 23 18:28:05.727381 str2-z9332f-05 NOTICE swss#portsyncd: message repeated 4 times: [ :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17]
Aug 23 18:28:05.727381 str2-z9332f-05 NOTICE swss#orchagent: :- removeDefaultVlanMembers: Remove 0 VLAN members from default VLAN
Aug 23 18:28:05.728018 str2-z9332f-05 INFO kernel: [   29.844864] Bridge: port 24(Ethernet2) entered blocking state
Aug 23 18:28:05.728030 str2-z9332f-05 INFO kernel: [   29.844866] Bridge: port 24(Ethernet2) entered disabled state
Aug 23 18:28:05.728032 str2-z9332f-05 INFO kernel: [   29.845297] device Ethernet2 entered promiscuous mode
Aug 23 18:28:05.728034 str2-z9332f-05 INFO kernel: [   29.845575] Bridge: port 24(Ethernet2) entered blocking state
Aug 23 18:28:05.728037 str2-z9332f-05 INFO kernel: [   29.845578] Bridge: port 24(Ethernet2) entered forwarding state
Aug 23 18:28:05.731057 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17
Aug 23 18:28:05.737032 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17
Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: :- meta_generic_validation_remove: object 0x3a000000000d1b reference count is 1, can't remove
Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: :- removeDefaultBridgePorts: Failed to remove bridge port, rv:-17
Aug 23 18:28:05.737302 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet24 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:48 master:0
Aug 23 18:28:05.739157 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet24(ok) to state db
Aug 23 18:28:05.739157 str2-z9332f-05 INFO swss#/supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Aug 23 18:28:05.739157 str2-z9332f-05 INFO swss#/supervisord: orchagent   what():  PortsOrch initialization failure
...
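When triaging a longer syslog, the rejected removals can be pulled out with a small script. The sketch below is an assumption-laden helper, not part of SONiC: the function name `find_blocked_removals` is hypothetical, and the regular expression is written only against the error line format seen in the excerpt above.

```python
import re

# Matches the orchagent error line exactly as it appears in the excerpt above.
REMOVE_ERR = re.compile(
    r"meta_generic_validation_remove: object (0x[0-9a-fA-F]+) "
    r"reference count is (\d+), can't remove"
)


def find_blocked_removals(lines):
    """Return (oid, refcount) for each removal rejected by the meta checker."""
    hits = []
    for line in lines:
        match = REMOVE_ERR.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


# Two lines copied from the syslog excerpt above.
sample = [
    "Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: :- "
    "meta_generic_validation_remove: object 0x3a000000000d1b reference count is 1, can't remove",
    "Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: :- "
    "removeDefaultBridgePorts: Failed to remove bridge port, rv:-17",
]

print(find_blocked_removals(sample))
```

For a real log, the same function could be fed the lines of the attached syslog file instead of `sample`.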

The corresponding SAIREDIS record shows where the query occurred and that a list of non-default VLAN bridge ports was returned as part of the query, which proves this is a SAI issue:

...
2021-08-23.18:28:05.723020|G|SAI_STATUS_SUCCESS|SAI_PORT_ATTR_HW_LANE_LIST=8:249,250,251,252,253,254,255,256
2021-08-23.18:28:05.723130|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_1Q_BRIDGE_ID=oid:0x0|SAI_SWITCH_ATTR_DEFAULT_VLAN_ID=oid:0x0
2021-08-23.18:28:05.724258|G|SAI_STATUS_SUCCESS|SAI_SWITCH_ATTR_DEFAULT_1Q_BRIDGE_ID=oid:0x39000000000068|SAI_SWITCH_ATTR_DEFAULT_VLAN_ID=oid:0x26000000000067
2021-08-23.18:28:05.724336|g|SAI_OBJECT_TYPE_VLAN:oid:0x26000000000067|SAI_VLAN_ATTR_MEMBER_LIST=82:oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0
2021-08-23.18:28:05.727253|G|SAI_STATUS_SUCCESS|SAI_VLAN_ATTR_MEMBER_LIST=0:null
2021-08-23.18:28:05.727346|g|SAI_OBJECT_TYPE_BRIDGE:oid:0x39000000000068|SAI_BRIDGE_ATTR_PORT_LIST=83:oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0,oid:0x0
2021-08-23.18:28:05.734559|G|SAI_STATUS_SUCCESS|SAI_BRIDGE_ATTR_PORT_LIST=64:oid:0x3a000000000d19,oid:0x3a000000000d1b,oid:0x3a000000000d1d,oid:0x3a000000000d1f,oid:0x3a000000000d21,oid:0x3a000000000d23,oid:0x3a000000000d25,oid:0x3a000000000d27,oid:0x3a000000000d29,oid:0x3a000000000d2b,oid:0x3a000000000d2d,oid:0x3a000000000d2f,oid:0x3a000000000d31,oid:0x3a000000000d33,oid:0x3a000000000d35,oid:0x3a000000000d37,oid:0x3a000000000cf1,oid:0x3a000000001003,oid:0x3a000000001019,oid:0x3a00000000102f,oid:0x3a000000001035,oid:0x3a000000000cf3,oid:0x3a000000000cf5,oid:0x3a000000000d03,oid:0x3a000000000fff,oid:0x3a000000001001,oid:0x3a000000001005,oid:0x3a000000001007,oid:0x3a000000001009,oid:0x3a00000000100b,oid:0x3a00000000100d,oid:0x3a00000000100f,oid:0x3a000000001011,oid:0x3a000000001013,oid:0x3a000000001015,oid:0x3a000000001017,oid:0x3a00000000101b,oid:0x3a00000000101d,oid:0x3a00000000101f,oid:0x3a000000001021,oid:0x3a000000001023,oid:0x3a000000001025,oid:0x3a000000001027,oid:0x3a000000001029,oid:0x3a00000000102b,oid:0x3a00000000102d,oid:0x3a000000001031,oid:0x3a000000001033,oid:0x3a000000000cf7,oid:0x3a000000000cf9,oid:0x3a000000000cfb,oid:0x3a000000000cfd,oid:0x3a000000000cff,oid:0x3a000000000d01,oid:0x3a000000000d05,oid:0x3a000000000d07,oid:0x3a000000000d09,oid:0x3a000000000d0b,oid:0x3a000000000d0d,oid:0x3a000000000d0f,oid:0x3a000000000d11,oid:0x3a000000000d13,oid:0x3a000000000d15,oid:0x3a000000000d17
2021-08-23.18:28:05.734702|g|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000d19|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT
2021-08-23.18:28:05.735220|G|SAI_STATUS_SUCCESS|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT
2021-08-23.18:28:05.735251|r|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000d19
2021-08-23.18:28:05.735839|g|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000d1b|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT
2021-08-23.18:28:05.736434|G|SAI_STATUS_SUCCESS|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT
2021-08-23.18:44:24.220963|#|recording on: /var/log/swss/sairedis.rec
2021-08-23.18:44:24.221261|#|logrotate on: /var/log/swss/sairedis.rec
2021-08-23.18:44:24.221837|a|INIT_VIEW
2021-08-23.18:44:24.222826|A|SAI_STATUS_SUCCESS
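The record above can be cross-checked programmatically. The sketch below assumes, based only on this excerpt, that each line is `timestamp|op|fields...`, that `G` lines carry responses, and that `r` lines are remove requests; the function name `summarize` is hypothetical. It lists which bridge ports were returned by the `SAI_BRIDGE_ATTR_PORT_LIST` query and which of them were removed before the record stops.

```python
def strip_oid(token):
    """Turn 'oid:0x3a...' into '0x3a...'."""
    return token[len("oid:"):] if token.startswith("oid:") else token


def summarize(lines):
    """Return (returned, removed, pending) bridge port OIDs from record lines."""
    returned, removed = [], []
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) < 2:
            continue  # not a record line
        op, fields = parts[1], parts[2:]
        if op == "G":
            # Response line: look for the bridge port list attribute.
            for field in fields:
                name, _, value = field.partition("=")
                if name == "SAI_BRIDGE_ATTR_PORT_LIST":
                    _count, _, oids = value.partition(":")
                    if oids != "null":
                        returned = [strip_oid(t) for t in oids.split(",")]
        elif op == "r" and fields:
            # Remove request: keep only bridge port objects.
            obj_type, _, oid = fields[0].partition(":oid:")
            if obj_type == "SAI_OBJECT_TYPE_BRIDGE_PORT":
                removed.append(oid)
    pending = [oid for oid in returned if oid not in removed]
    return returned, removed, pending


# Lines modeled on the excerpt above; the port list is shortened to three
# entries (and its count adjusted) purely for illustration.
sample = [
    "2021-08-23.18:28:05.734559|G|SAI_STATUS_SUCCESS|SAI_BRIDGE_ATTR_PORT_LIST="
    "3:oid:0x3a000000000d19,oid:0x3a000000000d1b,oid:0x3a000000000d1d",
    "2021-08-23.18:28:05.735251|r|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000d19",
    "2021-08-23.18:28:05.735839|g|SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000d1b"
    "|SAI_BRIDGE_PORT_ATTR_TYPE=SAI_BRIDGE_PORT_TYPE_PORT",
]

returned, removed, pending = summarize(sample)
print("removed:", removed)
print("pending:", pending)
```

In the excerpt, only the first bridge port has a remove record; the second is the object named in the syslog error, and no further removals appear before the recording restarts.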

Steps to reproduce the issue:

  1. Run the test_advanced_reboot.py test case on a TH3 setup

Describe the results you received:

Orchagent crashed in the warm reboot portion of this test.

Describe the results you expected:

No crash.

Output of show version:

(paste your output here)

Output of show techsupport:

admin@str2-z9332f-05:~$ show vers

SONiC Software Version: SONiC.20201231.18
Distribution: Debian 10.10
Kernel: 4.19.0-12-2-amd64
Build commit: 67ec0a56e3
Build date: Wed Aug 18 14:13:16 UTC 2021
Built by: AzDevOps@sonic-int-build-workers-0002UW

Platform: x86_64-dellemc_z9332f_d1508-r0
HwSKU: DellEMC-Z9332f-M-O16C64
ASIC: broadcom
ASIC Count: 1
Serial Number: TH04CN21CET0004K0123
Uptime: 20:26:01 up  1:12,  1 user,  load average: 0.44, 0.70, 0.82
...

Additional information you deem important (e.g. issue happens only occasionally):

BRCM CSP CS00012205357 has been filed to track this issue. Attachment: syslog.txt

gechiang commented 3 years ago

Updated the issue title to better describe the cause of the failure.

gechiang commented 2 years ago

This crash is no longer seen with warmboot on TH3 when running the latest 202012 image with BRCM SAI 4.3.5.2 and above.