sonic-net / SONiC

Landing page for Software for Open Networking in the Cloud (SONiC) - https://sonic-net.github.io/SONiC/
2.23k stars 1.12k forks source link

teamd crashes until max restarts reached #1021

Open sjhwilkes opened 2 years ago

sjhwilkes commented 2 years ago

Built from current master as needed ICCPD for MCLAG. Appears to works fine in lab conditions - with very minimal traffic. When deployed onto pair of DX010 with 10 ports channels to various models of Brocade switch/Sophos firewall/Linux end system. Teamd when running is very verbose in syslog, but there's no obvious cause to the shutdowns. I'm going to try building again from 202111 - it is going to be very difficult to replace builds in production at this time, so if anyone can suggest any troubleshooting I can do that would be good. -Could Teamd be dying due to ICCPD / MCLAGs?

madhukar-kamarapu commented 2 years ago

Is it possible to share the techsupport? If yes, please share the same.

sjhwilkes commented 2 years ago

Inserted one syslog here, the entire (67MB) Techsupport is here, https://www.dropbox.com/sh/26wrl87kobxy3f6/AABzGYPLct482KK9x_TUO3IEa?dl=0 Many core files from orchagent. Looking at the teamd entries in the log I don't see it recording that it's being restarted syslog.5.gz .

madhukar-kamarapu commented 2 years ago

Hi,

The dropbox link is not accessible.

I looked into the syslog.5 file. Observed that teamd did not crash; there was an FDB update failure that resulted in orchagent to exit. As a result all the services in the system have restarted.

Jun 24 22:41:01.002138 celestica-c2 NOTICE iccpd#mclagsyncd: :- mclagsyncdSetRemoteIfState: Set mlag 1, remote interface PortChannel11 to down
Jun 24 22:41:01.002453 celestica-c2 WARNING swss#orchagent: :- meta_sai_on_fdb_event_single: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000009af","mac":"00:E0:EC:BB:07:E3","switch_id":"oid:0x21000000000000"} doesn't exist but received AGED event
Jun 24 22:41:01.004686 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1130 FDB MAC:00-E0-EC-BB-06-5A port_tid:0x0 tgid:10 port_type:LAG vlan_id:2 vid:0x2
Jun 24 22:41:01.004686 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1140 Setting Type SAI_FDB_ENTRY_TYPE_STATIC:
Jun 24 22:41:01.004686 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_bcm_l2_addr_add:154 FDB Add-[Ok]: MAC:00-E0-EC-BB-06-5A vfi:0x2
Jun 24 22:41:01.006477 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1130 FDB MAC:00-E0-EC-BB-06-5A port_tid:0x0 tgid:10 port_type:LAG vlan_id:2 vid:0x2
Jun 24 22:41:01.006477 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1224 Setting Type SAI_FDB_ENTRY_TYPE_STATIC_MACMOVE:
Jun 24 22:41:01.006477 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_bcm_l2_addr_add:154 FDB Add-[Ok]: MAC:00-E0-EC-BB-06-5A vfi:0x2
Jun 24 22:41:01.008210 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1130 FDB MAC:00-E0-EC-BB-06-5A port_tid:0x0 tgid:10 port_type:LAG vlan_id:2 vid:0x2
Jun 24 22:41:01.008210 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1280 L2 LAG: 0
Jun 24 22:41:01.008210 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_bcm_l2_addr_add:154 FDB Add-[Ok]: MAC:00-E0-EC-BB-06-5A vfi:0x2
Jun 24 22:41:01.008357 celestica-c2 INFO syncd#syncd: [none] SAI_API_FDB:brcm_sai_set_fdb_entry_attribute:1325 Setting Port port_tid:0x0 tgid=0 port_type:LAG
Jun 24 22:41:01.008357 celestica-c2 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_xgs_common_neighbor_mac_db_update:105 Mac entry not found - skip nbr processing. add 0, dir 8
Jun 24 22:41:01.008526 celestica-c2 INFO syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_xgs_common_neighbor_mac_db_update:105 Mac entry not found - skip nbr processing. add 1, dir 9
Jun 24 22:41:01.009468 celestica-c2 NOTICE swss#orchagent: :- addFdbEntry: fdbEvent: AddFdbEntry: Add MCLAG MAC with state mclag remote fdb table Mac: 00:e0:ec:bb:06:5a Vlan: 2 port:PortChannel01 type:dynamic
Jun 24 22:41:01.010308 celestica-c2 ERR swss#orchagent: :- meta_sai_validate_fdb_entry: object key SAI_OBJECT_TYPE_FDB_ENTRY:{"bvid":"oid:0x260000000009af","mac":"00:E0:EC:BB:07:E3","switch_id":"oid:0x21000000000000"} doesn't exist
Jun 24 22:41:01.010308 celestica-c2 ERR swss#orchagent: :- addFdbEntry: macUpdate-Failed for attr.id=0x0 for FDB 00:e0:ec:bb:07:e3 in Vlan2 on PortChannel01, rv:-5
Jun 24 22:41:01.010308 celestica-c2 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_FDB, status: SAI_STATUS_INVALID_PARAMETER
Jun 24 22:41:01.043072 celestica-c2 NOTICE iccpd#mclagsyncd: :- setFdbEntry: del fdb entry from ASIC_DB:key =Vlan2:00:e0:ec:bb:06:5a
Jun 24 22:41:01.043577 celestica-c2 NOTICE iccpd#mclagsyncd: :- setFdbEntry: del fdb entry from ASIC_DB:key =Vlan2:00:e0:ec:bb:07:e3
Jun 24 22:41:01.047116 celestica-c2 NOTICE iccpd#mclagsyncd: :- setFdbEntry: del fdb entry from ASIC_DB:key =Vlan2:60:9c:9f:ba:1d:34
Jun 24 22:41:01.301435 celestica-c2 INFO kernel: [252229.103699] igb 0000:00:14.0 eth0: igb: eth0 NIC Link is Down
Jun 24 22:41:01.743405 celestica-c2 INFO swss#supervisord 2022-06-24 22:41:01,742 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
Jun 24 22:41:02.010703 celestica-c2 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:eth0 admin:1 oper:0 addr:00:e0:ec:bb:06:5a ifindex:2 master:0
Jun 24 22:41:02.763361 celestica-c2 INFO swss#supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'
Jun 24 22:41:02.765060 celestica-c2 INFO swss#supervisord 2022-06-24 22:41:02,764 WARN received SIGTERM indicating exit request
Jun 24 22:41:02.765619 celestica-c2 INFO swss#supervisord 2022-06-24 22:41:02,764 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd, arp_update, ndppd, neighsyncd, vlanmgrd, intfmgrd, portmgrd, buffermgrd, vrfmgrd, nbrmgrd, vxlanmgrd, fdbsyncd, tunnelmgrd to die
Jun 24 22:41:02.767831 celestica-c2 INFO swss#supervisord 2022-06-24 22:41:02,767 INFO stopped: tunnelmgrd (terminated by SIGTERM)
Jun 24 22:41:02.769533 celestica-c2 INFO swss#supervisord 2022-06-24 22:41:02,768 INFO stopped: fdbsyncd (terminated by SIGTERM)
sjhwilkes commented 2 years ago

OK that explains all the core files maybe - are there any specific files that would help determine the root cause?

Tomasjokeyang commented 1 year ago

Nov 10 02:04:02.552700 sonic INFO swss#supervisord 2022-11-10 02:04:02,548 INFO spawned: 'orchagent' with pid 44 Nov 10 02:04:03.004439 sonic INFO swss#/supervisord: orchagent barefoot Nov 10 02:04:03.010349 sonic INFO swss#/supervisord: orchagent b4:56:b9:f0:02:4d Nov 10 02:04:03.017882 sonic INFO swss#/supervisord: orchagent enable Nov 10 02:04:03.043191 sonic NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent --- Nov 10 02:04:03.043629 sonic NOTICE swss#orchagent: :- initialize: acting as a sairedis SERVER Nov 10 02:04:03.043792 sonic NOTICE swss#orchagent: :- loadFromFile: no context config specified, will load default context config Nov 10 02:04:03.043952 sonic NOTICE swss#orchagent: :- insert: added switch: idx 0, hwinfo '' Nov 10 02:04:03.044144 sonic NOTICE swss#orchagent: :- RedisChannel: creating notification thread Nov 10 02:04:03.044549 sonic NOTICE swss#orchagent: :- clear_local_state: clearing local state Nov 10 02:04:03.044735 sonic NOTICE swss#orchagent: :- initialize: init client/server sai: SAI_STATUS_SUCCESS Nov 10 02:04:03.044897 sonic NOTICE swss#orchagent: :- loadFromFile: no server config specified, will load default Nov 10 02:04:03.045051 sonic NOTICE swss#orchagent: :- ZeroMQSelectableChannel: binding on ipc:///tmp/saiServer Nov 10 02:04:03.048780 sonic NOTICE swss#orchagent: :- initialize: starting server thread Nov 10 02:04:03.048968 sonic NOTICE swss#orchagent: :- initialize: init client/server sai: SAI_STATUS_SUCCESS Nov 10 02:04:03.049124 sonic NOTICE swss#orchagent: :- zmqPollThread: begin Nov 10 02:04:03.049278 sonic NOTICE swss#orchagent: :- serverThreadFunction: begin Nov 10 02:04:03.049432 sonic NOTICE swss#orchagent: :- setRecordingFilename: setting recording filename : sairedis.rec Nov 10 02:04:03.049911 sonic NOTICE swss#orchagent: :- stopRecording: stopped recording Nov 10 02:04:03.050090 sonic NOTICE swss#orchagent: :- startRecording: started recording: sairedis.rec Nov 10 02:04:03.050247 sonic NOTICE swss#orchagent: :- initSaiRedis: Enable redis pipeline Nov 10 02:04:03.050401 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: INIT_VIEW Nov 10 02:04:03.050563 sonic WARNING swss#orchagent: :- wait: got not expected response: SAI_STATUS_SUCCESS:getresponse Nov 10 02:04:04.019768 sonic INFO swss#supervisord 2022-11-10 02:04:04,019 INFO success: orchagent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) Nov 10 02:10:03.148911 sonic ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on notify Nov 10 02:10:03.148911 sonic ERR swss#orchagent: :- wait: failed to get response for notify Nov 10 02:10:03.148911 sonic ERR swss#orchagent: :- initSaiRedis: Failed to notify syncd INIT_VIEW, rv:-1 gSwitchId 0 Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: end server thread begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- serverThreadFunction: end Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: end server thread end Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread end Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- clear_local_state: clearing local state Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- meta_init_db: begin Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- meta_init_db: end Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: end Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- stopRecording: stopped recording Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- stopRecording: closed recording file: sairedis.rec Nov 10 02:10:03.150727 sonic NOTICE swss#orchagent: :- uninitialize: end Nov 10 02:10:03.150923 sonic NOTICE swss#orchagent: :- uninitialize: end Nov 10 02:10:03.150963 sonic NOTICE swss#orchagent: :- zmqPollThread: ending pool thread, since run is false Nov 10 02:10:03.151002 sonic NOTICE swss#orchagent: :- zmqPollThread: end Nov 10 02:10:03.151677 sonic NOTICE swss#orchagent: :- ~ZeroMQSelectableChannel: ending zmq poll thread for channel ipc:///tmp/saiServer Nov 10 02:10:03.151713 sonic NOTICE swss#orchagent: :- ~ZeroMQSelectableChannel: ended zmq poll thread for channel ipc:///tmp/saiServer Nov 10 02:10:03.152163 sonic NOTICE swss#orchagent: :- uninitialize: end Nov 10 02:10:03.279845 sonic INFO swss#supervisord 2022-11-10 02:10:03,279 INFO exited: orchagent (exit status 1; not expected) Nov 10 02:10:04.292074 sonic INFO swss#/supervisor-proc-exit-listener: Process 'orchagent' exited unexpectedly. Terminating supervisor 'swss'