sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
728 stars 1.39k forks source link

syncd container is getting killed twice upon critical syncd process crash #16989

Open deepak-singhal0408 opened 11 months ago

deepak-singhal0408 commented 11 months ago

Description

Steps to reproduce the issue:

  1. ps aux | grep syncd
  2. Kill the syncd process.
  3. Observe the syslog

Describe the results you received:

**Oct 24 17:39:49.758858  INFO systemd[1]: Stopping syncd service...** <<<<<<<<<<<<<<<<<<<<
Oct 24 17:39:49.773551  NOTICE root: Stopping syncd service...
Oct 24 17:39:49.783564  NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
Oct 24 17:39:49.794935  NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:39:54.767370  NOTICE CCmisApi: System is not ready - one or more services are not up
Oct 24 17:39:54.813670  NOTICE root: Stopped syncd service...
Oct 24 17:39:54.828542  NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:39:54.846956  INFO systemd[1]: syncd.service: Succeeded.
Oct 24 17:39:54.848412  INFO systemd[1]: Stopped syncd service.
**Oct 24 17:39:54.856646  INFO systemd[1]: Starting syncd service...** <<<<<<<<<<<<<<<<<
Oct 24 17:39:54.872095  NOTICE root: Starting syncd service...
Oct 24 17:39:54.883024  NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
Oct 24 17:39:54.896816  NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:39:54.899749  NOTICE root: Stopping swss service...
Oct 24 17:39:54.912609  NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Oct 24 17:39:56.013094  NOTICE root: Started syncd service...
Oct 24 17:39:56.026844  NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:39:56.053775  NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Oct 24 17:39:56.054462  INFO systemd[1]: Started syncd service.
Oct 24 17:39:56.104088  NOTICE root: Starting lldp service...
Oct 24 17:39:56.136806  NOTICE root: Started lldp service
Oct 24 17:39:59.570926  NOTICE CCmisApi: System is not ready - one or more services are not up
Oct 24 17:39:59.623829  NOTICE root: Stopped swss service...
Oct 24 17:39:59.654071  NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
Oct 24 17:39:59.688459  NOTICE admin: Stopping teamd service...
Oct 24 17:40:00.191384  INFO systemd[1]: sysstat-collect.service: Succeeded.
Oct 24 17:40:00.337610  INFO systemd[1]: logrotate.service: Succeeded.
Oct 24 17:40:06.665963  NOTICE admin: Stopped teamd service...
Oct 24 17:40:06.673940  INFO systemd[1]: teamd.service: Succeeded.
Oct 24 17:40:06.708588  NOTICE admin: Stopping radv service...
Oct 24 17:40:09.369558  NOTICE admin: Stopped radv service...
Oct 24 17:40:09.376939  INFO systemd[1]: radv.service: Succeeded.
Oct 24 17:40:09.405921  NOTICE admin: Stopping bgp service...
Oct 24 17:40:20.228029  NOTICE admin: Stopped bgp service...
Oct 24 17:40:20.346309  INFO systemd[1]: bgp.service: Succeeded.
**Oct 24 17:40:20.373463  INFO systemd[1]: Stopping syncd service...** <<<<<<<<<<<<<<<<<<<<<
Oct 24 17:40:20.386888  NOTICE root: Stopping syncd service...
Oct 24 17:40:20.395759  NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
Oct 24 17:40:20.407234  NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:40:29.825951  NOTICE root: Stopped syncd service...
Oct 24 17:40:29.842229  NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:40:29.862177  INFO systemd[1]: syncd.service: Succeeded.
Oct 24 17:40:29.864007  INFO systemd[1]: Stopped syncd service.
Oct 24 17:40:29.873463  INFO systemd[1]: swss.service: Succeeded.
Oct 24 17:40:37.214299  INFO dbus-daemon[532]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.13' (uid=0 pid=614840 comm="systemd-analyze plot ")
Oct 24 17:40:37.399191  INFO dbus-daemon[532]: [system] Successfully activated service 'org.freedesktop.hostname1'
Oct 24 17:40:59.877346  INFO systemd[1]: swss.service: Scheduled restart job, restart counter is at 1.
Oct 24 17:40:59.893925  NOTICE root: Stopping snmp service...
Oct 24 17:41:07.453738  INFO systemd[1]: systemd-hostnamed.service: Succeeded.
Oct 24 17:41:09.254548  NOTICE root: Stopped snmp service...
Oct 24 17:41:09.262144  INFO systemd[1]: snmp.service: Succeeded.
Oct 24 17:41:09.265761  INFO systemd[1]: Stopped switch state service.
Oct 24 17:41:09.274634  INFO systemd[1]: Starting switch state service...
Oct 24 17:41:09.312947  NOTICE root: Starting swss service...
Oct 24 17:41:09.323634  NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Oct 24 17:41:09.336832  NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Oct 24 17:41:10.800737  NOTICE root: Started swss service...
Oct 24 17:41:10.815371  NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
Oct 24 17:41:10.848829  INFO systemd[1]: Started switch state service.
Oct 24 17:41:10.867701  NOTICE root: Starting snmp service...
Oct 24 17:41:10.931986  INFO systemd[1]: Starting syncd service...
Oct 24 17:41:10.948503  NOTICE root: Starting syncd service...
Oct 24 17:41:10.968107  NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
Oct 24 17:41:10.991026  NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:41:12.398649  NOTICE root: Started syncd service...
Oct 24 17:41:12.414634  NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Oct 24 17:41:12.438583  INFO systemd[1]: Started syncd service.
Oct 24 17:41:12.494648  NOTICE admin: Starting radv service...
Oct 24 17:41:12.536698  NOTICE root: Starting lldp service...
Oct 24 17:41:12.611767  NOTICE root: Started lldp service
Oct 24 17:41:13.593609  NOTICE root: Started snmp service...
Oct 24 17:41:14.360264  NOTICE admin: Started radv service...
Oct 24 17:41:14.432015  NOTICE admin: Starting bgp service...
Oct 24 17:41:16.387606  NOTICE admin: Started bgp service...
Oct 24 17:41:16.425832  INFO swss.sh[617647]: Failed to start dhcp_relay.service: Unit dhcp_relay.service is masked.
Oct 24 17:41:16.475916  NOTICE admin: Starting teamd service...
Oct 24 17:41:18.517372  NOTICE admin: Started teamd service...
Oct 24 17:41:20.893415  NOTICE syncd#syncd: :- syncd_ipc_task_main: IPC service is online
Oct 24 17:41:24.752925  INFO bgp#bgpmon: bgpmon service started
Oct 24 17:41:30.943524  INFO swss#restore_neighbor: restore_neighbors service is started
Oct 24 17:41:30.951538  INFO swss#restore_neighbor: restore_neighbors service is skipped as warm restart not enabled

Describe the results you expected:

syncd container should stop only once.

Output of show version:

SONiC Software Version: SONiC.20220531.36 SONiC OS Version: 11 Distribution: Debian 11.7 Kernel: 5.10.0-18-2-amd64

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

deepak-singhal0408 commented 11 months ago

RCA: As per the current syncd.service systemd config, syncd service immediately starts after exiting. As part of syncd exit, dependent services also get exited(swss, teamd). As part of swss exit, it stops dependent services (bgp, syncd etc)..

Thats why second time syncd service stop is observed.