sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Inserting SFP module causes a crash #20331

Open rlebedys opened 2 months ago

rlebedys commented 2 months ago

Description

When an SFP module is inserted into a Dell S5248f-P-25G switch, a crash occurs that restarts all the containers and ports. The ports are configured as 10G and the inserted SFP module is 25G (FTLF8536P5BCL-HP, FINISAR CORP.). Later, when the modules are swapped back to 10G ones, it crashes again.

Steps to reproduce the issue:

  1. Insert a 25G SFP module (FTLF8536P5BCL-HP) into a port

Describe the results you received:

The switch crashed and restarted all the containers, causing ports to flap.

Describe the results you expected:

No crash.

Output of show version:

SONiC Software Version: SONiC.master_2023_06_05_fpga_fix.0-dirty-20230605.130933
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: e1cb774b7
Build date: Mon Jun  5 11:45:26 UTC 2023
Built by: tadas@TadasXPS

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Serial Number: 61N6SR3
Model Number: 006Y6V
Hardware Revision: N/A
Uptime: 08:51:42 up 466 days, 19:21,  1 user,  load average: 2.91, 2.74, 2.42
Date: Mon 23 Sep 2024 08:51:42

Docker images:
REPOSITORY                    TAG                                                  IMAGE ID       SIZE
sonic_exporter_fixed          latest                                               08275fefa674   164MB
docker-teamd                  latest                                               2fac41bcdc59   316MB
docker-teamd                  master_2023_06_05_fpga_fix.0-dirty-20230605.130933   2fac41bcdc59   316MB
docker-orchagent              latest                                               16756a86ac8f   328MB
docker-orchagent              master_2023_06_05_fpga_fix.0-dirty-20230605.130933   16756a86ac8f   328MB
docker-fpm-frr                latest                                               0197dcc8fb03   346MB
docker-fpm-frr                master_2023_06_05_fpga_fix.0-dirty-20230605.130933   0197dcc8fb03   346MB
docker-sflow                  latest                                               46b7eefe118c   317MB
docker-sflow                  master_2023_06_05_fpga_fix.0-dirty-20230605.130933   46b7eefe118c   317MB
docker-nat                    latest                                               6d93ce909958   319MB
docker-nat                    master_2023_06_05_fpga_fix.0-dirty-20230605.130933   6d93ce909958   319MB
docker-macsec                 latest                                               b0c1234bb9ec   318MB
docker-syncd-brcm             latest                                               0cd6663272b5   672MB
docker-syncd-brcm             master_2023_06_05_fpga_fix.0-dirty-20230605.130933   0cd6663272b5   672MB
docker-gbsyncd-broncos        latest                                               95df83515e3f   348MB
docker-gbsyncd-broncos        master_2023_06_05_fpga_fix.0-dirty-20230605.130933   95df83515e3f   348MB
docker-gbsyncd-credo          latest                                               0870c36f0977   321MB
docker-gbsyncd-credo          master_2023_06_05_fpga_fix.0-dirty-20230605.130933   0870c36f0977   321MB
docker-dhcp-relay             latest                                               30552f488478   306MB
docker-eventd                 latest                                               7349c946faa0   298MB
docker-eventd                 master_2023_06_05_fpga_fix.0-dirty-20230605.130933   7349c946faa0   298MB
docker-snmp                   latest                                               49681ad89f42   338MB
docker-snmp                   master_2023_06_05_fpga_fix.0-dirty-20230605.130933   49681ad89f42   338MB
docker-platform-monitor       latest                                               aa444cdc421d   420MB
docker-platform-monitor       master_2023_06_05_fpga_fix.0-dirty-20230605.130933   aa444cdc421d   420MB
docker-sonic-telemetry        latest                                               3e4c7d45b82d   597MB
docker-sonic-telemetry        master_2023_06_05_fpga_fix.0-dirty-20230605.130933   3e4c7d45b82d   597MB
docker-router-advertiser      latest                                               11732d20d93d   299MB
docker-router-advertiser      master_2023_06_05_fpga_fix.0-dirty-20230605.130933   11732d20d93d   299MB
docker-sonic-p4rt             latest                                               71d0b47ee486   870MB
docker-sonic-p4rt             master_2023_06_05_fpga_fix.0-dirty-20230605.130933   71d0b47ee486   870MB
docker-mux                    latest                                               cb3a35917325   347MB
docker-mux                    master_2023_06_05_fpga_fix.0-dirty-20230605.130933   cb3a35917325   347MB
docker-lldp                   latest                                               7f9b1935015c   341MB
docker-lldp                   master_2023_06_05_fpga_fix.0-dirty-20230605.130933   7f9b1935015c   341MB
docker-database               latest                                               00ab18aa58a2   299MB
docker-database               master_2023_06_05_fpga_fix.0-dirty-20230605.130933   00ab18aa58a2   299MB
docker-sonic-mgmt-framework   latest                                               d3db5c000aa2   414MB
docker-sonic-mgmt-framework   master_2023_06_05_fpga_fix.0-dirty-20230605.130933   d3db5c000aa2   414MB

Output of show techsupport:

Relevant logs:

Aug 28 13:13:01.490380 gs1-leaf41 NOTICE pmon#xcvrd[29]: Ethernet9: received plug in and update port sfp status table.
Aug 28 13:13:01.524683 gs1-leaf41 WARNING pmon#xcvrd[29]: $$$ Ethernet9 handle_port_update_event() : op=SET DB:STATE_DB Table:TRANSCEIVER_INFO fvp {'cable_type': 'Length OM3(10m)', 'manufacturer': 'FINISAR CORP.   ', 'application_advertisement': 'N/A', 'ext_rateselect_compliance': 'Unknown', 'cable_length': '7.0', 'vendor_rev': 'B   ', 'model': 'FTLF8536P5BCL-HP', 'vendor_date': '2024-04-18 16', 'connector': 'LC', 'nominal_bit_rate': '255', 'specification_compliance': "{'10G Ethernet Compliance': 'Unknown', 'Infiniband Compliance': 'Unknown', 'ESCON Compliance': 'Unknown', 'SONET Compliance Codes': 'Unknown', 'Ethernet Compliance': 'Unknown', 'Fibre Channel Link Length': 'Unknown', 'Fibre Channel Transmitter Technology': 'Unknown', 'SFP+CableTechnology': 'Unknown', 'Fibre Channel Transmission Media': 'Unknown', 'Fibre Channel Speed': 'Unknown'}", 'serial': 'MY841611NM      ', 'dom_capability': 'N/A', 'type': 'SFP/SFP+/SFP28', 'encoding': '64B/66B', 'vendor_oui': '00-90-65', 'ext_identifier': 'GBIC/SFP defined by two-wire interface ID', 'is_replaceable': 'False'}
Aug 28 13:13:01.527853 gs1-leaf41 WARNING pmon#xcvrd[29]: *** Ethernet9STATE_DBTRANSCEIVER_INFO handle_port_update_event() fvp {'cable_type': 'Length OM3(10m)', 'manufacturer': 'FINISAR CORP.   ', 'application_advertisement': 'N/A', 'ext_rateselect_compliance': 'Unknown', 'cable_length': '7.0', 'vendor_rev': 'B   ', 'model': 'FTLF8536P5BCL-HP', 'vendor_date': '2024-04-18 16', 'connector': 'LC', 'nominal_bit_rate': '255', 'specification_compliance': "{'10G Ethernet Compliance': 'Unknown', 'Infiniband Compliance': 'Unknown', 'ESCON Compliance': 'Unknown', 'SONET Compliance Codes': 'Unknown', 'Ethernet Compliance': 'Unknown', 'Fibre Channel Link Length': 'Unknown', 'Fibre Channel Transmitter Technology': 'Unknown', 'SFP+CableTechnology': 'Unknown', 'Fibre Channel Transmission Media': 'Unknown', 'Fibre Channel Speed': 'Unknown'}", 'serial': 'MY841611NM      ', 'dom_capability': 'N/A', 'type': 'SFP/SFP+/SFP28', 'encoding': '64B/66B', 'vendor_oui': '00-90-65', 'ext_identifier': 'GBIC/SFP defined by two-wire interface ID', 'is_replaceable': 'False', 'index': '-1', 'key': 'Ethernet9', 'asic_id': 0, 'op': 'SET'}
Aug 28 13:13:01.535522 gs1-leaf41 NOTICE pmon#xcvrd[29]: CMIS: Ethernet9: skipping CMIS state machine for flat memory xcvr
Aug 28 13:13:01.767840 gs1-leaf41 ERR syncd#syncd: [none] SAI_API_PORT:brcm_sai_create_port_serdes:8842 Port lane count 4 is different from supported lane count 1
Aug 28 13:13:01.768018 gs1-leaf41 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_CREATE failed in syncd mode: SAI_STATUS_INVALID_ATTRIBUTE_MAX
Aug 28 13:13:01.768143 gs1-leaf41 ERR syncd#syncd: :- processQuadEvent: attr: SAI_PORT_SERDES_ATTR_PORT_ID: oid:0x100000000001b
Aug 28 13:13:01.768224 gs1-leaf41 ERR syncd#syncd: :- processQuadEvent: attr: SAI_PORT_SERDES_ATTR_PREEMPHASIS: 4:1198600,1198600,1198600,1198600
Aug 28 13:13:01.768419 gs1-leaf41 ERR swss#orchagent: :- create: create status: SAI_STATUS_INVALID_ATTRIBUTE_MAX
Aug 28 13:13:01.768620 gs1-leaf41 ERR swss#orchagent: :- setPortSerdesAttribute: Failed to create port serdes for port 0x100000000001b
Aug 28 13:13:01.768765 gs1-leaf41 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_PORT, status: SAI_STATUS_INVALID_ATTRIBUTE_MAX
Aug 28 13:13:01.768927 gs1-leaf41 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
Aug 28 13:13:01.769383 gs1-leaf41 NOTICE syncd#syncd: :- processNotifySyncd: Invoking SAI failure dump
Aug 28 13:13:01.776486 gs1-leaf41 NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded

Additional information you deem important (e.g. issue happens only occasionally):

The issue seems to happen when more 10G ports with 10G SFP modules are present. It doesn't happen every time, but every one of the 14 switches crashed at least once while we installed around 200 SFP modules on them.

jelmeronline commented 1 month ago

Also happens with 10/25/100/200 DAC cables.

rlebedys commented 1 month ago

We tested this on a newer release, 202405, and it still reproduces, although the error is different now:

2024 Oct 16 08:11:42.071417 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 821 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:43.071532 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 1821 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:44.071692 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 2821 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:45.071779 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 3821 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:46.072073 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 4822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:47.072052 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 5822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:48.072232 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 6822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:49.072347 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 7822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:50.072501 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 8822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:51.072617 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 9822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:52.072836 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 10822 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:53.073379 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 11823 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:54.073557 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 12823 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:55.073662 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 13823 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:56.073808 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 14823 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:57.073960 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 15824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:58.074121 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 16824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:11:59.074181 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 17824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:00.074346 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 18824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:01.074484 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 19824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:02.074589 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 20824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:03.074722 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 21824 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:04.074885 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 22825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:05.075043 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 23825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:06.075183 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 24825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:07.075297 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 25825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:08.075443 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 26825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:09.075538 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 27825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:10.075654 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 28825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:11.075782 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 29825 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:12.076007 gs1-leaf68 NOTICE syncd#syncd: :- threadFunction: time span 30826 ms for 'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}'
2024 Oct 16 08:12:12.076084 gs1-leaf68 ERR syncd#syncd: :- threadFunction: time span WD exceeded 30826 ms for create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}
2024 Oct 16 08:12:12.076129 gs1-leaf68 ERR syncd#syncd: :- logEventData: op: create, key: SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}
2024 Oct 16 08:12:12.076174 gs1-leaf68 ERR syncd#syncd: :- logEventData: fv: SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS: 00:62:0B:5E:2B:52
2024 Oct 16 08:12:41.251540 gs1-leaf68 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
2024 Oct 16 08:12:41.251540 gs1-leaf68 ERR swss#orchagent: :- wait: failed to get response for getresponse
2024 Oct 16 08:12:41.251540 gs1-leaf68 ERR swss#orchagent: :- create: create status: SAI_STATUS_FAILURE
2024 Oct 16 08:12:41.251540 gs1-leaf68 ERR swss#orchagent: :- addNeighbor: Failed to create neighbor 00:62:0b:5e:2b:52 on Vlan10, rv:-1
2024 Oct 16 08:12:41.251606 gs1-leaf68 ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_NEIGHBOR, status: SAI_STATUS_FAILURE
2024 Oct 16 08:12:41.251606 gs1-leaf68 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Oct 16 08:13:31.487430 gs1-leaf68 WARNING swss#supervisor-proc-exit-listener: message repeated 59 times: [ Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).]
2024 Oct 16 08:13:31.487430 gs1-leaf68 WARNING swss#supervisor-proc-exit-listener: Process 'orchagent' is stuck in namespace 'host' (2.0 minutes).
2024 Oct 16 08:13:41.311514 gs1-leaf68 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on notify
2024 Oct 16 08:13:41.311588 gs1-leaf68 ERR swss#orchagent: :- wait: failed to get response for notify
2024 Oct 16 08:13:41.311653 gs1-leaf68 ERR swss#orchagent: :- handleSaiFailure: Failed to take sai failure dump -1
2024 Oct 16 08:13:42.380627 gs1-leaf68 INFO swss#supervisord 2024-10-16 08:13:42,379 WARN exited: orchagent (terminated by SIGABRT (core dumped); not expected)
2024 Oct 16 08:13:43.386455 gs1-leaf68 WARNING swss#supervisor-proc-exit-listener: message repeated 11 times: [ Process 'orchagent' is stuck in namespace 'host' (2.0 minutes).]
2024 Oct 16 08:13:43.386740 gs1-leaf68 INFO swss#supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'
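
The syncd lines above repeat once per second for a single stuck call, so it can help to collapse them to the longest reported span per operation. The following is a minimal Python sketch, not part of SONiC. The regular expression is only assumed to match the `threadFunction: time span ...` wording seen in this log, and `longest_spans` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches both "time span 821 ms for '<op>'" and
# "time span WD exceeded 30826 ms for <op>" as they appear in the log above.
SPAN_RE = re.compile(
    r"threadFunction: time span (?:WD exceeded )?(\d+) ms for '?(.*?)'?$"
)

def longest_spans(log_lines):
    """Return {operation: longest reported time span in ms}."""
    longest = defaultdict(int)
    for line in log_lines:
        match = SPAN_RE.search(line.rstrip())
        if match:
            span_ms, operation = int(match.group(1)), match.group(2)
            longest[operation] = max(longest[operation], span_ms)
    return dict(longest)

if __name__ == "__main__":
    sample = [
        "NOTICE syncd#syncd: :- threadFunction: time span 821 ms for "
        "'create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{\"ip\":\"fe80::262:bff:fe5e:2b52\"}'",
        "ERR syncd#syncd: :- threadFunction: time span WD exceeded 30826 ms for "
        "create:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{\"ip\":\"fe80::262:bff:fe5e:2b52\"}",
    ]
    for operation, span_ms in longest_spans(sample).items():
        print(f"{span_ms} ms  {operation}")
```

Run against a full syslog, this would show whether the neighbor-entry create is the only call that stalls or whether other SAI calls are slow in the same window.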

Logs in sairedis.rec for the timeframe:

2024-10-16.08:11:41.249093|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"fe80::262:bff:fe5e:2b52","rif":"oid:0x6000000000a74","switch_id":"oid:0x21000000000000"}|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=00:62:0B:5E:2B:52
2024-10-16.08:12:41.251101|E|SAI_STATUS_FAILURE
2024-10-16.08:12:41.251221|a|SYNCD_INVOKE_DUMP
2024-10-16.08:13:41.311092|A|SAI_STATUS_FAILURE
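
The gap between the create request and the failure status in this recording can be measured directly from the timestamps. This is a small Python sketch that assumes only that each record starts with a `YYYY-MM-DD.HH:MM:SS.ffffff` timestamp followed by `|`, as in the lines above. `record_time` and `seconds_between` are hypothetical helper names.

```python
from datetime import datetime

def record_time(record: str) -> datetime:
    """Parse the leading timestamp of a '|'-separated recording line."""
    stamp = record.split("|", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%d.%H:%M:%S.%f")

def seconds_between(first: str, second: str) -> float:
    """Elapsed seconds from the first record to the second."""
    return (record_time(second) - record_time(first)).total_seconds()

if __name__ == "__main__":
    request = "2024-10-16.08:11:41.249093|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:..."
    failure = "2024-10-16.08:12:41.251101|E|SAI_STATUS_FAILURE"
    print(f"{seconds_between(request, failure):.6f} s")
```

For the two records above this comes to roughly 60 seconds, which lines up with the orchagent `TIMEOUT on getresponse` entry logged at 08:12:41 in the syslog.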

Logs in swss.rec:

2024-10-16.08:11:41.248774|NEIGH_TABLE:Vlan10:fe80::262:bff:fe5e:2b52|SET|neigh:00:62:0b:5e:2b:52|family:IPv6
2024-10-16.08:15:32.743657|recording started

Output of show version:

SONiC Software Version: SONiC.202405.659137-760d27732
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: 760d27732
Build date: Thu Oct  3 13:25:08 UTC 2024
Built by: azureuser@01fe05d0c000001

Platform: x86_64-dellemc_s5248f_c3538-r0
HwSKU: DellEMC-S5248f-P-25G
ASIC: broadcom
ASIC Count: 1
Serial Number: 61H6SR3
Model Number: 006Y6V
Hardware Revision: N/A
Uptime: 09:14:14 up  1:08,  2 users,  load average: 1.61, 1.87, 1.84
Date: Wed 16 Oct 2024 09:14:14

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-syncd-brcm             202405.659137-760d27732   9bea6f23d2b7   717MB
docker-syncd-brcm             latest                    9bea6f23d2b7   717MB
docker-gbsyncd-broncos        202405.659137-760d27732   3d73280bbd8d   354MB
docker-gbsyncd-broncos        latest                    3d73280bbd8d   354MB
docker-gbsyncd-credo          202405.659137-760d27732   553177005016   327MB
docker-gbsyncd-credo          latest                    553177005016   327MB
docker-teamd                  202405.659137-760d27732   a7618f11df00   344MB
docker-teamd                  latest                    a7618f11df00   344MB
docker-orchagent              202405.659137-760d27732   42d20a1dc365   357MB
docker-orchagent              latest                    42d20a1dc365   357MB
docker-sflow                  202405.659137-760d27732   fd4e276d4027   345MB
docker-sflow                  latest                    fd4e276d4027   345MB
docker-nat                    202405.659137-760d27732   1d1e3d97d102   347MB
docker-nat                    latest                    1d1e3d97d102   347MB
docker-fpm-frr                202405.659137-760d27732   000be03cc2d0   376MB
docker-fpm-frr                latest                    000be03cc2d0   376MB
docker-macsec                 latest                    7f173bad12f0   347MB
docker-dhcp-relay             latest                    31b58703e695   325MB
docker-platform-monitor       202405.659137-760d27732   160bd68ad0a6   442MB
docker-platform-monitor       latest                    160bd68ad0a6   442MB
docker-snmp                   202405.659137-760d27732   6066ecc6f573   355MB
docker-snmp                   latest                    6066ecc6f573   355MB
docker-eventd                 202405.659137-760d27732   0276054db396   316MB
docker-eventd                 latest                    0276054db396   316MB
docker-router-advertiser      202405.659137-760d27732   4445bf788014   316MB
docker-router-advertiser      latest                    4445bf788014   316MB
docker-lldp                   202405.659137-760d27732   103c85628e25   361MB
docker-lldp                   latest                    103c85628e25   361MB
docker-mux                    202405.659137-760d27732   e6aa6c110442   368MB
docker-mux                    latest                    e6aa6c110442   368MB
docker-sonic-gnmi             202405.659137-760d27732   d8c8a2ac0f3c   400MB
docker-sonic-gnmi             latest                    d8c8a2ac0f3c   400MB
docker-database               202405.659137-760d27732   609198162fd8   324MB
docker-database               latest                    609198162fd8   324MB
docker-sonic-mgmt-framework   202405.659137-760d27732   341964b243a5   402MB
docker-sonic-mgmt-framework   latest                    341964b243a5   402MB

@bingwang-ms

DennisChiuEC commented 1 month ago

I thought the error that led to the syncd crash had been fixed by this PR, which is already included in 202405: https://github.com/sonic-net/sonic-platform-daemons/pull/533

For the cause, see item 2 in the description of the above PR.

The lane count of Ethernet9 is 1. https://github.com/sonic-net/sonic-buildimage/blob/master/device/dell/x86_64-dellemc_s5248f_c3538-r0/platform.json#L549

The media_settings define the preemphasis for 4 lanes on Port 10. https://github.com/sonic-net/sonic-buildimage/blob/master/device/dell/x86_64-dellemc_s5248f_c3538-r0/media_settings.json#L210
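
The mismatch described here, a 1-lane port being handed four per-lane preemphasis values, can be checked offline before a module is inserted. The following is a minimal Python sketch, not the xcvrd implementation. It assumes a simplified shape for the two files: a comma-separated `lanes` string per port, and a `{"lane0": ..., "lane1": ...}` mapping per SerDes attribute. The function names and the sample values are hypothetical, apart from 0x124A08, which is the 1198600 reported in the syncd log.

```python
def lane_count(lanes_field: str) -> int:
    """Count lanes in a comma-separated lanes string such as "1" or "1,2,3,4"."""
    return len([lane for lane in lanes_field.split(",") if lane.strip()])

def serdes_lane_mismatches(port_lanes: str, media_entry: dict) -> dict:
    """Return {attribute: (values_given, port_lane_count)} for every
    per-lane attribute whose value count differs from the port's lane count."""
    expected = lane_count(port_lanes)
    mismatches = {}
    for attribute, per_lane in media_entry.items():
        if isinstance(per_lane, dict) and len(per_lane) != expected:
            mismatches[attribute] = (len(per_lane), expected)
    return mismatches

if __name__ == "__main__":
    # Hypothetical fragments: a single-lane port and a 4-lane media entry.
    port_lanes = "1"
    media_entry = {
        "preemphasis": {
            "lane0": "0x124A08",
            "lane1": "0x124A08",
            "lane2": "0x124A08",
            "lane3": "0x124A08",
        }
    }
    print(serdes_lane_mismatches(port_lanes, media_entry))
    # → {'preemphasis': (4, 1)}
```

An empty result would mean the media settings entry is consistent with the port's lane count. A non-empty one corresponds to the `Port lane count 4 is different from supported lane count 1` error reported by the SAI.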

rlebedys commented 1 month ago

@DennisChiuEC thank you for responding. Indeed, it looks like the initial issue I filed is fixed in the 202405 release. However, we are now facing the issue described here: https://github.com/sonic-net/sonic-buildimage/issues/20331#issuecomment-2416232001

When a 100G QSFP module is installed, some kind of timeout occurs while trying to create a neighbor entry. It goes on for 30 seconds and then triggers a restart. Once the containers have restarted, the switch operates as expected.

Do you know what could be causing this? @bingwang-ms