sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
742 stars 1.43k forks source link

[202405] Inserting/Removing SFP causes attribute update timeout and crash on switches Broadcom ASIC #20784

Open rlebedys opened 1 week ago

rlebedys commented 1 week ago

Description

When SFP or QSFP module gets installed or removed it triggers some attribute updates (neighbor changes, nexthop group updates, etc) and sometimes it fails to complete within 30 seconds causing a complete switch crash.

Steps to reproduce the issue:

  1. Install or remove SFP or QSFP module

Describe the results you received:

Timeout on attribute update leading to a crash.

Describe the results you expected:

No crash.

Output of show version:

SONiC Software Version: SONiC.202405.693177-9f2570e20
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: 9f2570e20
Build date: Tue Nov 12 13:51:21 UTC 2024
Built by: azureuser@6a003250c000001

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Serial Number: 61P7SR3
Model Number: 006Y6V
Hardware Revision: N/A
Uptime: 08:52:10 up 40 min,  4 users,  load average: 2.23, 1.74, 1.60
Date: Wed 13 Nov 2024 08:52:10

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        202405.693177-9f2570e20   d178132b5ea3   352MB
docker-gbsyncd-broncos        latest                    d178132b5ea3   352MB
docker-syncd-brcm             202405.693177-9f2570e20   f5481acf8307   743MB
docker-syncd-brcm             latest                    f5481acf8307   743MB
docker-gbsyncd-credo          202405.693177-9f2570e20   7ceeb78f9794   326MB
docker-gbsyncd-credo          latest                    7ceeb78f9794   326MB
docker-dhcp-relay             latest                    be11710c38db   324MB
docker-macsec                 latest                    f78e1d2c072f   345MB
docker-platform-monitor       202405.693177-9f2570e20   3788669cb340   440MB
docker-platform-monitor       latest                    3788669cb340   440MB
docker-eventd                 202405.693177-9f2570e20   bbe316074740   314MB
docker-eventd                 latest                    bbe316074740   314MB
docker-teamd                  202405.693177-9f2570e20   5ca3acaf107f   343MB
docker-teamd                  latest                    5ca3acaf107f   343MB
docker-orchagent              202405.693177-9f2570e20   c0868abbf0a8   356MB
docker-orchagent              latest                    c0868abbf0a8   356MB
docker-sflow                  202405.693177-9f2570e20   59d16de99e37   344MB
docker-sflow                  latest                    59d16de99e37   344MB
docker-nat                    202405.693177-9f2570e20   d3ba602a2785   345MB
docker-nat                    latest                    d3ba602a2785   345MB
docker-fpm-frr                202405.693177-9f2570e20   8a8a19cec4c5   375MB
docker-fpm-frr                latest                    8a8a19cec4c5   375MB
docker-snmp                   202405.693177-9f2570e20   dee8f6d55ce0   354MB
docker-snmp                   latest                    dee8f6d55ce0   354MB
docker-lldp                   202405.693177-9f2570e20   06b5b4b426f4   360MB
docker-lldp                   latest                    06b5b4b426f4   360MB
docker-mux                    202405.693177-9f2570e20   e3a6d8b89b9e   366MB
docker-mux                    latest                    e3a6d8b89b9e   366MB
docker-sonic-gnmi             202405.693177-9f2570e20   e927ae9f4f44   399MB
docker-sonic-gnmi             latest                    e927ae9f4f44   399MB
docker-database               202405.693177-9f2570e20   17e57af3b395   323MB
docker-database               latest                    17e57af3b395   323MB
docker-router-advertiser      202405.693177-9f2570e20   285d10cd21b6   315MB
docker-router-advertiser      latest                    285d10cd21b6   315MB
docker-sonic-mgmt-framework   202405.693177-9f2570e20   0d930e176ef3   401MB
docker-sonic-mgmt-framework   latest                    0d930e176ef3   401MB

Additional information you deem important (e.g. issue happens only occasionally):

Logs during the issue:

2024 Nov 13 08:49:40.223140 gs1-leaf63 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000010 status:2
2024 Nov 13 08:49:40.223140 gs1-leaf63 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet55 oper state set from up to down
2024 Nov 13 08:49:40.225627 gs1-leaf63 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet55 admin:1 oper:0 addr:c4:5a:b1:40:97:05 ifindex:86 master:0
2024 Nov 13 08:49:40.226906 gs1-leaf63 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet55(ok:down) to state db
2024 Nov 13 08:49:40.227832 gs1-leaf63 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet55
2024 Nov 13 08:49:40.227832 gs1-leaf63 NOTICE swss#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:if-state":{"ifname":"Ethernet55","status":"down","timestamp":"2024-11-13T08:49:40.227362Z"}}
2024 Nov 13 08:49:40.229162 gs1-leaf63 WARNING pmon#xcvrd: $$$ Ethernet55 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'down', 'admin_status': 'up', 'mtu': '9100', 'supported_
speeds': '40000,100000', 'supported_fecs': 'none,rs', 'host_tx_ready': 'true', 'speed': '40000', 'fec': 'N/A'}
2024 Nov 13 08:49:40.229828 gs1-leaf63 NOTICE syncd#syncd: [none] SAI_API_NEXT_HOP_GROUP:brcm_sai_xgs_nexthop_group_member_remove:665 ecmp 200001 nhid 16(0) wt 0 ol-if 100067 ul-if 0
2024 Nov 13 08:49:40.231312 gs1-leaf63 NOTICE syncd#syncd: [none] SAI_API_NEXT_HOP_GROUP:brcm_sai_xgs_nexthop_group_member_remove:665 ecmp 200002 nhid 16(0) wt 0 ol-if 100067 ul-if 0
2024 Nov 13 08:49:40.233297 gs1-leaf63 NOTICE syncd#syncd: [none] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 14 if_id 100065
2024 Nov 13 08:49:40.421660 gs1-leaf63 NOTICE pmon#xcvrd: Ethernet0: Got SFP removed event
2024 Nov 13 08:49:40.439436 gs1-leaf63 NOTICE pmon#xcvrd: Ethernet0: received plug out and update port sfp status table.
2024 Nov 13 08:49:40.439728 gs1-leaf63 WARNING pmon#xcvrd: $$$ Ethernet0 handle_port_update_event() : op=SET DB:STATE_DB Table:PORT_TABLE fvp {'state': 'ok', 'netdev_oper_status': 'up', 'admin_status': 'up', 'mtu': '9100', 'supported_spe
eds': '1000,10000,25000', 'supported_fecs': 'none,rs,fc', 'host_tx_ready': 'true', 'speed': '10000', 'fec': 'N/A', 'NPU_SI_SETTINGS_SYNC_STATUS': 'NPU_SI_SETTINGS_DEFAULT'}
2024 Nov 13 08:49:40.507176 gs1-leaf63 WARNING pmon#xcvrd: $$$ Ethernet0 handle_port_update_event() : op=DEL DB:STATE_DB Table:TRANSCEIVER_INFO fvp {}
2024 Nov 13 08:49:40.507470 gs1-leaf63 WARNING pmon#xcvrd: *** ('Ethernet0', 'STATE_DB', 'TRANSCEIVER_INFO') handle_port_update_event() fvp {'index': '-1', 'port_name': 'Ethernet0', 'asic_id': 0, 'op': 'DEL'}
2024 Nov 13 08:49:40.946406 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:41.785773 gs1-leaf63 ERR ntpd[8434]: IO: bind(20) AF_INET 172.16.31.163#123 flags 0x9 failed: Cannot assign requested address
2024 Nov 13 08:49:41.785923 gs1-leaf63 ERR ntpd[8434]: IO: unable to create socket on eth0 (73) for 172.16.31.163#123
2024 Nov 13 08:49:41.785989 gs1-leaf63 INFO ntpd[8434]: IO: failed to init interface for address 172.16.31.163
2024 Nov 13 08:49:41.946571 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 1713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:42.946716 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 2713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:43.946845 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 3713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:44.946999 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 4713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:45.947154 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 5713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:46.947200 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 6713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:47.947335 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 7713 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:48.947457 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 8714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:49.947578 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 9714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:50.947714 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 10714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:51.947842 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 11714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:52.947972 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 12714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:53.948113 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 13714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:54.948289 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 14714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:55.948364 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 15714 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:56.948517 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 16715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:57.948623 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 17715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:58.948738 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 18715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:49:59.948978 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 19715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:00.949069 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 20715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:01.949200 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 21715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:02.949328 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 22715 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:03.949487 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 23716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:04.949562 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 24716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:05.949711 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 25716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:06.949816 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 26716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:07.949992 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 27716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:08.950116 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 28716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:09.950233 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 29716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:10.950413 gs1-leaf63 NOTICE syncd#syncd: :- threadFunction: time span 30716 ms for 'remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81'
2024 Nov 13 08:50:10.950685 gs1-leaf63 ERR syncd#syncd: :- threadFunction: time span WD exceeded 30716 ms for remove:SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81
2024 Nov 13 08:50:10.950825 gs1-leaf63 ERR syncd#syncd: :- logEventData: op: remove, key: SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000a81
2024 Nov 13 08:50:18.757039 gs1-leaf63 INFO systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
2024 Nov 13 08:50:18.767258 gs1-leaf63 INFO systemd[1]: Starting logrotate.service - Rotate log files...
2024 Nov 13 08:50:18.772469 gs1-leaf63 INFO systemd[1]: sysstat-collect.service: Deactivated successfully.
2024 Nov 13 08:50:18.772970 gs1-leaf63 INFO systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
2024 Nov 13 08:50:18.959161 gs1-leaf63 INFO systemd[1]: logrotate.service: Deactivated successfully.
2024 Nov 13 08:50:18.964231 gs1-leaf63 INFO systemd[1]: Finished logrotate.service - Rotate log files.
2024 Nov 13 08:50:37.686601 gs1-leaf63 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
2024 Nov 13 08:50:40.291515 gs1-leaf63 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
2024 Nov 13 08:50:40.291515 gs1-leaf63 ERR swss#orchagent: :- wait: failed to get response for getresponse
2024 Nov 13 08:50:40.291515 gs1-leaf63 ERR swss#orchagent: :- remove: remove status: SAI_STATUS_FAILURE
2024 Nov 13 08:50:40.291515 gs1-leaf63 ERR swss#orchagent: :- removeNeighbor: Failed to remove next hop fe80::920a:84ff:fe0a:1c38 on Ethernet55, rv:-1
2024 Nov 13 08:50:40.291515 gs1-leaf63 ERR swss#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP, status: SAI_STATUS_FAILURE
2024 Nov 13 08:50:40.291559 gs1-leaf63 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
prgeor commented 6 days ago

@rlebedys can you please attach the coredump file as well?

prgeor commented 6 days ago

@prsunny can you help triage why OA timeout on select?

rlebedys commented 5 days ago

@prgeor, I am attaching a orchagent core dump of the incident.

orchagent.1731487900.165.core.gz

rlebedys commented 1 day ago

@prgeor @prsunny did you have a chance to check this issue? If it helps I managed to replicate it on 202311 branch. This issue did not reproduce on 202305.