While running the advanced reboot test on TH3, we observed orchagent crash during the warm-reboot portion of the test. On closer examination of the syslog, we found that the crash was caused by BRCM SAI returning a set of bridge ports that belong to VLAN 1000 instead of the default VLAN (1). Because these bridge ports are still referenced, the meta checker found that their reference count was greater than 0 and rejected the attempt to remove these non-default bridge ports.
This issue is easily reproduced by executing "test_advanced_reboot.py" on a TH3 T0 DUT, where an orchagent core is observed.
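The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not sairedis or orchagent code: `ToyMetaDb`, `create`, `add_reference`, and `remove` are hypothetical names, and the status value -17 is taken from the `rv:-17` in the syslog (it appears to correspond to `SAI_STATUS_OBJECT_IN_USE`). It shows why a bridge port that is still referenced by a VLAN 1000 member cannot be removed.

```python
# Toy model of a reference-count guard; all names here are hypothetical.
STATUS_SUCCESS = 0
STATUS_OBJECT_IN_USE = -17  # value taken from "rv:-17" in the syslog


class ToyMetaDb:
    """Tracks how many other objects reference each object ID."""

    def __init__(self):
        self.refcount = {}

    def create(self, oid):
        self.refcount[oid] = 0

    def add_reference(self, oid):
        # e.g. a VLAN member that points at a bridge port
        self.refcount[oid] += 1

    def remove(self, oid):
        # Refuse to remove an object that something else still references.
        if self.refcount[oid] > 0:
            return STATUS_OBJECT_IN_USE
        del self.refcount[oid]
        return STATUS_SUCCESS


db = ToyMetaDb()
unused_port = 0x3A000000000D1A    # hypothetical bridge port with no references
vlan1000_port = 0x3A000000000D1B  # object ID from the syslog excerpt

db.create(unused_port)
db.create(vlan1000_port)
db.add_reference(vlan1000_port)   # a VLAN 1000 member still uses this port

status_unused = db.remove(unused_port)
status_in_use = db.remove(vlan1000_port)
print(status_unused, status_in_use)
```

In this model the unreferenced port is removed normally, while the port still used by VLAN 1000 is refused, which mirrors the "reference count is 1, can't remove" error in the log.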
Here is the portion of the syslog that shows the issue:
...
Aug 23 18:28:05.723987 str2-z9332f-05 NOTICE swss#portmgrd: :- doTask: Configure Ethernet24 MTU to 9100
Aug 23 18:28:05.724304 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17
Aug 23 18:28:05.727381 str2-z9332f-05 NOTICE swss#portsyncd: message repeated 4 times: [ :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17]
Aug 23 18:28:05.727381 str2-z9332f-05 NOTICE swss#orchagent: :- removeDefaultVlanMembers: Remove 0 VLAN members from default VLAN
Aug 23 18:28:05.728018 str2-z9332f-05 INFO kernel: [ 29.844864] Bridge: port 24(Ethernet2) entered blocking state
Aug 23 18:28:05.728030 str2-z9332f-05 INFO kernel: [ 29.844866] Bridge: port 24(Ethernet2) entered disabled state
Aug 23 18:28:05.728032 str2-z9332f-05 INFO kernel: [ 29.845297] device Ethernet2 entered promiscuous mode
Aug 23 18:28:05.728034 str2-z9332f-05 INFO kernel: [ 29.845575] Bridge: port 24(Ethernet2) entered blocking state
Aug 23 18:28:05.728037 str2-z9332f-05 INFO kernel: [ 29.845578] Bridge: port 24(Ethernet2) entered forwarding state
Aug 23 18:28:05.731057 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17
Aug 23 18:28:05.737032 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet2 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:37 master:17
Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: :- meta_generic_validation_remove: object 0x3a000000000d1b reference count is 1, can't remove
Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: :- removeDefaultBridgePorts: Failed to remove bridge port, rv:-17
Aug 23 18:28:05.737302 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet24 admin:1 oper:1 addr:c8:f7:50:ed:15:41 ifindex:48 master:0
Aug 23 18:28:05.739157 str2-z9332f-05 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet24(ok) to state db
Aug 23 18:28:05.739157 str2-z9332f-05 INFO swss#/supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Aug 23 18:28:05.739157 str2-z9332f-05 INFO swss#/supervisord: orchagent what(): PortsOrch initialization failure
...
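For triage, the failure signature in the excerpt above can be pulled out of a syslog with a short script. The following is a minimal sketch, assuming the message format is exactly as shown in the excerpt; `find_refcount_failures` is a hypothetical helper, not part of any SONiC tool.

```python
import re

# Pattern for the meta checker error seen in the syslog excerpt.
REFCOUNT_ERR = re.compile(
    r"meta_generic_validation_remove: object (0x[0-9a-fA-F]+) "
    r"reference count is (\d+), can't remove"
)


def find_refcount_failures(lines):
    """Return (object_id, reference_count) for each matching syslog line."""
    failures = []
    for line in lines:
        match = REFCOUNT_ERR.search(line)
        if match:
            failures.append((match.group(1), int(match.group(2))))
    return failures


# Three lines copied from the excerpt; only the second one should match.
sample = [
    "Aug 23 18:28:05.727381 str2-z9332f-05 NOTICE swss#orchagent: "
    ":- removeDefaultVlanMembers: Remove 0 VLAN members from default VLAN",
    "Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: "
    ":- meta_generic_validation_remove: object 0x3a000000000d1b "
    "reference count is 1, can't remove",
    "Aug 23 18:28:05.737143 str2-z9332f-05 ERR swss#orchagent: "
    ":- removeDefaultBridgePorts: Failed to remove bridge port, rv:-17",
]

failures = find_refcount_failures(sample)
print(failures)
```

Because the helper only iterates over lines, it could also be given an open file object for a full syslog instead of the sample list.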
The corresponding SAIREDIS record shows where the query occurred and that a list of non-default VLAN bridge ports was returned by the query, which proves this is a SAI issue:
Steps to reproduce the issue:
Describe the results you received:
Orchagent crashed in the warm reboot portion of this test.
Describe the results you expected:
No crash.
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
BRCM CSP CS00012205357 has been filed to track this issue. Attached: syslog.txt