sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Subinterface creation on Broadcom switches cause multiple container shutdown #18237

Open rlebedys opened 6 months ago

rlebedys commented 6 months ago

Description

Creating a subinterface on Broadcom-based switches (Trident 3) causes multiple containers to exit.

Steps to reproduce the issue:

  1. Execute the command `config subinterface add EthernetXX.20 20`

Describe the results you received:

Multiple containers (swss, syncd and others) exit and the switch becomes unstable. The containers are in a crash loop.

Describe the results you expected:

The subinterface is created on port EthernetXX.

Output of show version:

SONiC Software Version: SONiC.202311.480461-bacd21577
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: bacd21577
Build date: Sun Feb 18 12:27:37 UTC 2024
Built by: AzDevOps@vmss-soni0033YT

Platform: x86_64-accton_as7326_56x-r0
HwSKU: Accton-AS7326-56X
ASIC: broadcom
ASIC Count: 1

Additional information you deem important (e.g. issue happens only occasionally):

Broadcom SAI version:

:~# bcmcmd "bcmsai ver"
bcmsai ver
BRCM SAI ver: [10.1.6.0], OCP SAI ver: [1.13.2], SDK ver: [sdk-6.5.29], CANCUN ver: [06.04.01]
drivshell>

Attaching logs right after execution of config subinterface add command. subinterface_add_logs.txt

adyeung commented 6 months ago

I am not able to open the log; please upload the techsupport output.

rlebedys commented 6 months ago

@adyeung, I am adding the logs to the comment.

logs

```
Feb 19 13:59:29.504648 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet72 admin:1 oper:1 addr:80:a2:35:26:1b:5e ifindex:313 master:0
Feb 19 13:59:29.505178 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet72(ok:up) to state db
Feb 19 13:59:29.505178 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet72.20 admin:0 oper:0 addr:80:a2:35:26:1b:5e ifindex:315 master:0 type:vlan
Feb 19 13:59:29.505662 gs1-leaf71 WARNING pmon#xcvrd[30]: message repeated 2 times: [ $$$ Ethernet76 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '40000,100000', 'supported_fecs': 'none,rs', 'host_tx_ready': 'true', 'speed': '40000', 'fec': 'N/A'}]
Feb 19 13:59:29.505662 gs1-leaf71 WARNING pmon#xcvrd[30]: $$$ Ethernet72 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_speeds': '40000,100000', 'supported_fecs': 'none,rs', 'host_tx_ready': 'true', 'speed': '40000', 'fec': 'N/A'}
Feb 19 13:59:29.506135 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: Cannot find Ethernet72.20 in port table
Feb 19 13:59:29.506342 gs1-leaf71 INFO systemd-udevd[88249]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 19 13:59:29.506810 gs1-leaf71 INFO systemd-udevd[88249]: Using default interface naming scheme 'v247'.
Feb 19 13:59:29.508691 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet72.20 admin:1 oper:1 addr:80:a2:35:26:1b:5e ifindex:315 master:0 type:vlan
Feb 19 13:59:29.508824 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: Cannot find Ethernet72.20 in port table
Feb 19 13:59:29.509359 gs1-leaf71 NOTICE swss#orchagent: :- doTask: Removed pending neighbor DEL operation for Ethernet72:169.254.0.1 after SET operation
Feb 19 13:59:29.510046 gs1-leaf71 WARNING pmon#xcvrd[30]: $$$ Ethernet72.20 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok'}
Feb 19 13:59:29.510046 gs1-leaf71 WARNING pmon#xcvrd[30]: *** Ethernet72.20STATE_DBPORT_TABLE handle_port_update_event() fvp {'index': '-1', 'key': 'Ethernet72.20', 'asic_id': 0, 'op': 'SET'}
Feb 19 13:59:29.510307 gs1-leaf71 ERR pmon#xcvrd[30]: Exception occured at CmisManagerTask thread due to KeyError(None)
Feb 19 13:59:29.510752 gs1-leaf71 DEBUG bgp#bgpcfgd: Received message : '('Ethernet72.20', 'SET', (('vrf', ''),))'
Feb 19 13:59:29.511034 gs1-leaf71 NOTICE swss#orchagent: :- addSubPort: Sub interface Ethernet72.20 inherits mtu size 9100 from parent port Ethernet72
Feb 19 13:59:29.511773 gs1-leaf71 ERR pmon#xcvrd[30]: Traceback (most recent call last):
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1523, in run
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: self.task_worker()
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1228, in task_worker
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: self.port_dict[lport]['host_tx_ready'] = self.get_host_tx_status(lport)
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1100, in get_host_tx_status
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: state_port_tbl = self.xcvr_table_helper.get_state_port_tbl(asic_index)
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 2426, in get_state_port_tbl
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: return self.state_port_tbl[asic_id]
Feb 19 13:59:29.511983 gs1-leaf71 ERR pmon#xcvrd[30]: KeyError: None
Feb 19 13:59:29.516009 gs1-leaf71 ERR pmon#xcvrd[30]: Xcvrd: exception found at child thread CmisManagerTask due to KeyError(None)
Feb 19 13:59:29.516009 gs1-leaf71 ERR pmon#xcvrd[30]: Exiting main loop as child thread raised exception!
Feb 19 13:59:29.516009 gs1-leaf71 NOTICE swss#orchagent: :- setHostIntfsStripTag: Set SAI_HOSTIF_VLAN_TAG_KEEP to host interface: Ethernet72
Feb 19 13:59:29.516009 gs1-leaf71 INFO syncd#syncd: [none] SAI_API_PORT:_brcm_sai_link_event_cb:1558 Port 127 link down event cause: LOCAL
Feb 19 13:59:29.516009 gs1-leaf71 INFO syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_sub_router_intf_l2_config:1812 Creating vlan
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_VLAN:_brcm_sai_vlan_create_internal_vfi:4546 MC-GRP create failed with error Feature unavailable (0xfffffff0).
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_sub_router_intf_l2_config:1852 internal vfi create failed with error -2.
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_xgs_create_sub_port_router_interface:3940 Sub-Port RIF L2 Config failed with error -2.
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_xgs_create_sub_port_router_interface:4001 SubPort Router Interface Create Failed for port:123 lag:no vlan:20 vpnid:20 vp:0x0 vfp_entry_id:0 l3_intf_id:0 rv:-2
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_xgs_create_router_interface:5176 Error in create router interface failed with error -2.
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_create_router_interface:493 pd router intf create failed with error -2.
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_create_router_interface:522 Router Interface Create Failed rv:-2
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_router_interface_create_err_cleanup:7140 RIF Create failed: rif_id:0 type:4 vrf:0 port-lag-id:123 lag:no vlan:20 virtual:no
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID: oid:0x300000000003a
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_SRC_MAC_ADDRESS: 80:A2:35:26:1B:5E
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_TYPE: SAI_ROUTER_INTERFACE_TYPE_SUB_PORT
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_PORT_ID: oid:0x1000000000038
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_OUTER_VLAN_ID: 20
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_ADMIN_V4_STATE: true
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_ADMIN_V6_STATE: true
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_MTU: 9100
Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- processQuadEvent: attr: SAI_ROUTER_INTERFACE_ATTR_NAT_ZONE_ID: 0
Feb 19 13:59:29.516009 gs1-leaf71 ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED
Feb 19 13:59:29.516009 gs1-leaf71 ERR swss#orchagent: :- addRouterIntfs: Failed to create router interface Ethernet72.20, rv:-2
Feb 19 13:59:29.516009 gs1-leaf71 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_ROUTER_INTERFACE, status: SAI_STATUS_NOT_SUPPORTED
```

Also attaching the techsupport dump archive that was taken when containers exited after subinterface creation. sonic_dump_61W5SR3-mgmt_20240313_090936.tar.gz
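
When triaging a capture like the one pasted above, it can help to keep only the ERR entries and group them by `container#process`, which makes the failure chain (xcvrd, then syncd, then orchagent) easier to see. The following is a minimal sketch and not part of the original report. It assumes the syslog layout visible in the pasted logs (`<timestamp> <host> <severity> <container>#<process>: <message>`); the function name `errors_by_source` is hypothetical.

```python
import re
from collections import OrderedDict

# Assumed layout, taken from the logs pasted in this thread:
#   Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: <message>
# An optional leading year ("2024 Aug 30 ...") is also accepted.
LINE_RE = re.compile(
    r"^(?:\d{4} )?(?P<ts>\w{3} +\d+ [\d:.]+) (?P<host>\S+) "
    r"(?P<sev>[A-Z]+) (?P<src>[^\s:]+): (?P<msg>.*)$"
)

def errors_by_source(lines):
    """Group ERR messages by source (container#process), in first-seen order."""
    grouped = OrderedDict()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("sev") == "ERR":
            grouped.setdefault(m.group("src"), []).append(m.group("msg"))
    return grouped

# A few entries copied from the logs above.
sample = [
    "Feb 19 13:59:29.506135 gs1-leaf71 NOTICE swss#portsyncd: :- onMsg: Cannot find Ethernet72.20 in port table",
    "Feb 19 13:59:29.510307 gs1-leaf71 ERR pmon#xcvrd[30]: Exception occured at CmisManagerTask thread due to KeyError(None)",
    "Feb 19 13:59:29.516009 gs1-leaf71 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED",
    "Feb 19 13:59:29.516009 gs1-leaf71 ERR swss#orchagent: :- create: create status: SAI_STATUS_NOT_SUPPORTED",
]

for source, messages in errors_by_source(sample).items():
    print(source, len(messages))
```

In practice the lines would be read from a saved syslog file instead of the inline `sample` list.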

adyeung commented 6 months ago

The problem is specific to DellEMC-S5248f-P-25G. It appears the community Dell td3-s5248f-25g.config.bcm is missing the SOC parameter flow_init_mode=1, which is needed for VFI MGID creation. Besides that, other parameters are also needed for VLAN VFI to work on TD3 for subinterface creation.

Requesting Dell contributor @aravindmani-1 to help follow up and update the file.
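
To check whether a given config.bcm already carries the setting mentioned above, a small script along these lines could be used. This is a sketch under stated assumptions, not a tool from the SONiC tree: it assumes the file is plain `key=value` lines with `#` comments, and the helper names `parse_config_bcm` and `missing_settings` are hypothetical. Only `flow_init_mode=1` is checked, because the other required parameters are not named in this thread.

```python
def parse_config_bcm(text):
    """Parse key=value lines of a config.bcm file into a dict.

    Blank lines, '#' comment lines, and lines without '=' are skipped.
    Whitespace around '=' is tolerated, so 'flow_init_mode = 1' and
    'flow_init_mode=1' are treated the same.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

def missing_settings(text, required):
    """Return the required key/value pairs that are absent or set differently."""
    settings = parse_config_bcm(text)
    return {k: v for k, v in required.items() if settings.get(k) != v}

# Only this parameter is named in the thread; the "other parameters"
# are not listed there, so they are deliberately not guessed here.
REQUIRED = {"flow_init_mode": "1"}

# Illustrative excerpt without the parameter.
sample = """\
mem_cache_enable=0
lpm_scaling_enable=0
bcm_num_cos=10
"""
print(missing_settings(sample, REQUIRED))
```

An empty result means every listed parameter is present with the expected value; it does not by itself prove that subinterface creation will succeed.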

rlebedys commented 6 months ago

Thanks for the update. I noticed the same issue on Accton-AS7326-56X and Accton-AS7726-32X; however, I no longer have access to them, so I can't collect any specific information.

rlebedys commented 3 months ago

@adyeung @aravindmani-1 is this fix going to get merged to the master?

aravindmani-1 commented 3 months ago

> @adyeung @aravindmani-1 is this fix going to get merged to the master?

Yes. This will be merged into the master branch. @prgeor, could you please help merge this PR? https://github.com/sonic-net/sonic-buildimage/pull/18505

tomvil commented 2 months ago

The same happens with accton_as7326_56x switches. Are there any updates regarding the Accton platform?

SONiC Software Version: SONiC.202405.0-dirty-20240620.233504
SONiC OS Version: 12
Distribution: Debian 12.5
Kernel: 6.1.0-11-2-amd64
Build commit: 926d03322
Build date: Thu Jun 20 22:58:12 UTC 2024

Platform: x86_64-accton_as7326_56x-r0
HwSKU: Accton-AS7326-56X
ASIC: broadcom
ASIC Count: 1

NerijusRazvodovskis commented 2 months ago

Hey @adyeung,

Have you perhaps had a chance to take a look at the accton_as7326_56x switches? It seems they are facing the same issue as the Dell ones.

adyeung commented 2 months ago

@jostar-yang please help update the config.bcm files from Accton

tomvil commented 2 months ago

@jostar-yang have you had the opportunity to review this issue?

NerijusRazvodovskis commented 2 months ago

@jostar-yang Hello, any update regarding this?

rlebedys commented 1 month ago

@aravindmani-1 any news about this?

tomvil commented 1 week ago

@rlebedys did you test @aravindmani-1's fix? Does it work for you? I've just tested it on an s5248f and, still, as soon as I add a subinterface, containers start to crash.

The error:

2024 Aug 30 08:09:11.250333 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_xgs_create_router_interface_common_config:3205 L3 intf create failed with error -2.
2024 Aug 30 08:09:11.250333 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_xgs_create_router_interface_common_config:3278 RIF common config create failed rv:-2
2024 Aug 30 08:09:11.250333 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_xgs_create_sub_port_router_interface:3947 Sub-Port RIF common Config failed with error -2.
2024 Aug 30 08:09:11.250333 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_xgs_create_sub_port_router_interface:4001 SubPort Router Interface Create Failed for port:49 lag:no vlan:666 vpnid:32768 vp:0xb0000001 vfp_entry_id:0 l3_intf_id:0 rv:-2
2024 Aug 30 08:09:11.250333 leaf1 INFO syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:_brcm_sai_sub_router_intf_l2_unconfig:1964 destroy vlan
2024 Aug 30 08:09:11.250537 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_xgs_create_router_interface:5176 Error in create router interface failed with error -2.
2024 Aug 30 08:09:11.250590 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_create_router_interface:493 pd router intf create failed with error -2.
2024 Aug 30 08:09:11.250862 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_create_router_interface:522 Router Interface Create Failed rv:-2
2024 Aug 30 08:09:11.250909 leaf1 ERR syncd#syncd: [none] SAI_API_ROUTER_INTERFACE:brcm_sai_router_interface_create_err_cleanup:7140 RIF Create failed: rif_id:0 type:4 vrf:0 port-lag-id:49 lag:no vlan:666 virtual:no
2024 Aug 30 08:09:11.250958 leaf1 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_NOT_SUPPORTED

SONiC Software Version: SONiC.202405.0-dirty-20240830.091822
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-11-2-amd64
Build commit: 249c20bdf
Build date: Fri Aug 30 06:58:39 UTC 2024

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Hardware Revision: N/A
Uptime: 08:06:06 up 12 min,  1 user,  load average: 2.89, 2.14, 1.36
Date: Fri 30 Aug 2024 08:06:06

Docker images:
REPOSITORY                    TAG                              IMAGE ID       SIZE
docker-dhcp-relay             latest                           934cdc88b25f   324MB
docker-dhcp-server            latest                           cdf709f8a11d   338MB
docker-fpm-frr                202405.0-dirty-20240830.091822   1b46e9d04a15   375MB
docker-fpm-frr                latest                           1b46e9d04a15   375MB
docker-macsec                 latest                           3a77d124e235   346MB
docker-lldp                   202405.0-dirty-20240830.091822   7904ddbfb954   360MB
docker-lldp                   latest                           7904ddbfb954   360MB
docker-mux                    202405.0-dirty-20240830.091822   f355662bc7b3   366MB
docker-mux                    latest                           f355662bc7b3   366MB
docker-snmp                   202405.0-dirty-20240830.091822   81a0b637c93e   354MB
docker-snmp                   latest                           81a0b637c93e   354MB
docker-sonic-gnmi             202405.0-dirty-20240830.091822   e4a31bbc8cd4   399MB
docker-sonic-gnmi             latest                           e4a31bbc8cd4   399MB
docker-sonic-mgmt-framework   202405.0-dirty-20240830.091822   6d5e68ff3033   401MB
docker-sonic-mgmt-framework   latest                           6d5e68ff3033   401MB
docker-teamd                  202405.0-dirty-20240830.091822   3f52352e3264   343MB
docker-teamd                  latest                           3f52352e3264   343MB
docker-platform-monitor       202405.0-dirty-20240830.091822   c3c08f5f6d41   440MB
docker-platform-monitor       latest                           c3c08f5f6d41   440MB
docker-sflow                  202405.0-dirty-20240830.091822   9da2da5cca1c   344MB
docker-sflow                  latest                           9da2da5cca1c   344MB
docker-router-advertiser      202405.0-dirty-20240830.091822   c932382d33d1   315MB
docker-router-advertiser      latest                           c932382d33d1   315MB
docker-orchagent              202405.0-dirty-20240830.091822   31fa919519aa   356MB
docker-orchagent              latest                           31fa919519aa   356MB
docker-nat                    202405.0-dirty-20240830.091822   85a7be8ce26d   346MB
docker-nat                    latest                           85a7be8ce26d   346MB
docker-iccpd                  202405.0-dirty-20240830.091822   d44f59428033   344MB
docker-iccpd                  latest                           d44f59428033   344MB
docker-database               202405.0-dirty-20240830.091822   59cefa77b041   323MB
docker-database               latest                           59cefa77b041   323MB
docker-eventd                 202405.0-dirty-20240830.091822   bbe4d9b78786   314MB
docker-eventd                 latest                           bbe4d9b78786   314MB
docker-syncd-brcm             202405.0-dirty-20240830.091822   3f34d16e8e42   717MB
docker-syncd-brcm             latest                           3f34d16e8e42   717MB
docker-gbsyncd-broncos        202405.0-dirty-20240830.091822   6ac692db5646   354MB
docker-gbsyncd-broncos        latest                           6ac692db5646   354MB
docker-gbsyncd-credo          202405.0-dirty-20240830.091822   3f63e3eb401e   327MB
docker-gbsyncd-credo          latest                           3f63e3eb401e   327MB

The fix is applied:

# cat /usr/share/sonic/device/x86_64-dellemc_s5248f_c3538-r0/DellEMC-S5248f-P-25G/td3-s5248f-25g.config.bcm 
...
mem_cache_enable=0
lpm_scaling_enable=0
bcm_num_cos=10
default_cpu_tx_queue=9
host_as_route_disable=1
sai_eapp_config_file=/etc/broadcom/eapps_cfg.json
sai_fast_convergence_support=1
flow_init_mode=1
sai_load_hw_config=/usr/lib/cancun/
...

aravindmani-1 commented 1 week ago

@tomvil could you please share the complete steps that you tried? Did you try restarting the switch after applying the NPU configs? The logs you shared show SAI API "not supported" errors.

tomvil commented 1 week ago

@aravindmani-1 I have built the image (202405 branch) with your commit from pull request https://github.com/sonic-net/sonic-buildimage/pull/18505. I can see that the configuration is present in td3-s5248f-25g.config.bcm. And yes, I have tried restarting the switch.

Is there anything else I can check for you?

SAI version on my switch:

# bcmcmd "bcmsai ver"
bcmsai ver
BRCM SAI ver: [10.1.37.0], OCP SAI ver: [1.13.2], SDK ver: [sdk-6.5.29], CANCUN ver: [06.04.01]

aravindmani-1 commented 1 week ago

Could you share the complete steps that you used to reproduce the issue (starting from the commands used)?

tomvil commented 1 week ago

@aravindmani-1 here's how I reproduce the issue every time:

  1. Install a fresh image (built from the 202405 branch + your commit).
  2. Wait for the containers to become stable.
  3. Add a subinterface with the command `config subinterface add Ethernet0.666 666`.
  4. Wait a few seconds; the containers will start to go down/flap.
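
Steps 2 and 4 above rely on watching container state by eye. A rough sketch of how that check could be scripted is shown here, assuming Docker is reachable on the switch and that `docker ps -a --format` output is acceptable as a data source. The helper names (`parse_docker_ps`, `not_running`, `snapshot`) are hypothetical, and the status strings in `sample` are illustrative, not taken from the affected switch.

```python
import subprocess

def parse_docker_ps(output):
    """Map container name -> status from 'name<TAB>status' lines."""
    status = {}
    for line in output.splitlines():
        if "\t" in line:
            name, state = line.split("\t", 1)
            status[name.strip()] = state.strip()
    return status

def not_running(status):
    """Return sorted names of containers whose status does not start with 'Up'."""
    return sorted(n for n, s in status.items() if not s.startswith("Up"))

def snapshot():
    """Query Docker for all containers, including exited ones."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_docker_ps(out)

# Illustrative output only; swss and syncd are the containers named in the report.
sample = "swss\tExited (1) 3 seconds ago\nsyncd\tExited (1) 2 seconds ago\ndatabase\tUp 12 minutes\n"
print(not_running(parse_docker_ps(sample)))
```

Calling `not_running(snapshot())` once before step 3 and again a few seconds after it would show which containers went down as a result of the command.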

aravindmani-1 commented 1 week ago

@tomvil can you upload the "show techsupport" logs? When you hit the issue, please collect the logs from the last hour using the techsupport options.