Open anamehra opened 1 year ago
@anamehra Could you add more info if seen in latest 202205, and steps for repro
I hit the issue again on 202205.
I brought up a Supervisor without any config db file. Then did deploy-mg from sonic-mgmt. All bgp dockers went down except one. I see a timing condition when bgp service was restarted as part of load_minigraph and then hostcfgd masking the service. bgp13 service goes down but as the docker was started around same time, it did not down:
Nov 29 21:20:49.407555 sonic NOTICE cisco: Stopping bgp13 service...
Nov 29 21:20:49.659314 sonic NOTICE cisco: Warm boot flag: bgp13 false.
Nov 29 21:20:49.721835 sonic NOTICE cisco: Fast boot flag: bgp13 false.
Nov 29 21:20:51.369372 sonic DEBUG container: read_data: config:True feature:bgp13 fields:[('set_owner', 'local'), ('no_fallback_to_local', False), ('state', 'disabled')] val:['local', False, 'disabled']
Nov 29 21:20:51.372539 sonic DEBUG container: read_data: config:False feature:bgp13 fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Nov 29 21:20:51.377978 sonic DEBUG container: container_stop: bgp13: set_owner:local current_owner:none remote_state:none docker_id:bgp13
Nov 29 21:20:52.082265 sonic INFO bgp13#supervisord 2023-11-29 21:20:52,080 WARN received SIGTERM indicating exit request
Nov 29 21:20:52.082265 sonic INFO bgp13#supervisord 2023-11-29 21:20:52,080 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd, bgpmon, fpmsyncd, staticroutebfd, containercfgd to die
Nov 29 21:20:52.087363 sonic INFO bgp13#supervisord 2023-11-29 21:20:52,087 INFO stopped: containercfgd (exit status 143)
Nov 29 21:20:53.089609 sonic INFO bgp13#supervisord 2023-11-29 21:20:53,089 INFO stopped: staticroutebfd (exit status 0)
Nov 29 21:20:53.090714 sonic INFO bgp13#supervisord 2023-11-29 21:20:53,090 INFO stopped: fpmsyncd (terminated by SIGTERM)
Nov 29 21:20:53.091379 sonic INFO bgp13#supervisord 2023-11-29 21:20:53,091 INFO stopped: bgpmon (terminated by SIGTERM)
Nov 29 21:20:55.094025 sonic INFO bgp13#supervisord 2023-11-29 21:20:55,093 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die
Nov 29 21:20:58.096824 sonic INFO bgp13#supervisord 2023-11-29 21:20:58,096 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die
Nov 29 21:21:01.100217 sonic INFO bgp13#supervisord 2023-11-29 21:21:01,099 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die
Nov 29 21:21:01.547321 sonic INFO container: docker cmd: wait for bgp13
Nov 29 21:21:01.548201 sonic INFO container: docker cmd: stop for bgp13
Nov 29 21:21:01.662206 sonic NOTICE cisco: Stopped bgp13 service...
Nov 29 21:21:01.775324 sonic NOTICE systemd[1]: bgp@13.service: Control process exited, code=exited, status=1/FAILURE
Nov 29 21:21:01.775378 sonic WARNING systemd[1]: bgp@13.service: Failed with result 'exit-code'.
Nov 29 21:21:40.627266 sonic INFO hostcfgd: Updating feature 'bgp@13' systemd config file related to auto-restart ...
Nov 29 21:21:40.627368 sonic INFO hostcfgd: Feature 'bgp@13' systemd config file related to auto-restart is updated!
Nov 29 21:21:42.910335 sonic NOTICE python3: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-down-ctr":{"ctr_name":"bgp13","timestamp":"2023-11-29T21:21:42.910193Z"}}
Nov 29 21:21:46.352367 sonic NOTICE cisco: Starting bgp13 service...
Nov 29 21:21:46.528101 sonic NOTICE cisco: Warm boot flag: bgp13 false.
Nov 29 21:21:46.578191 sonic NOTICE cisco: Fast boot flag: bgp13 .
Nov 29 21:21:47.269205 sonic INFO bgp.sh[40618]: Removing obsolete bgp13 container with HWSKU /usr/share/sonic/device/x86_64-8800_rp_o-r0/13
Nov 29 21:21:47.359159 sonic INFO bgp.sh[40778]: bgp13
Nov 29 21:21:47.365148 sonic INFO bgp.sh[40618]: Creating new bgp13 container with HWSKU Cisco-8800-RP
Nov 29 21:21:49.228210 sonic DEBUG container: read_data: config:True feature:bgp13 fields:[('set_owner', 'local'), ('no_fallback_to_local', False), ('state', 'disabled')] val:['local', False, 'disabled']
Nov 29 21:21:49.231453 sonic DEBUG container: read_data: config:False feature:bgp13 fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Nov 29 21:21:49.233142 sonic DEBUG container: container_start: bgp13: set_owner:local fallback:True remote_state:none server_connected:false
Nov 29 21:21:49.477865 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'stop', 'bgp@13.service']'
Nov 29 21:21:49.489713 sonic WARNING systemd[1]: bgp@13.service: Control process exited, code=killed, status=15/TERM
Nov 29 21:21:49.561933 sfd-01 WARNING systemd[1]: bgp@13.service: Failed with result 'signal'.
Nov 29 21:21:49.564464 sfd-01 INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'disable', 'bgp@13.service']'
Nov 29 21:21:49.835122 sfd-01 INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'mask', 'bgp@13.service']'
Nov 29 21:21:49.847375 sfd-01 INFO hostcfgd[41326]: Created symlink /etc/systemd/system/bgp@13.service → /dev/null.
Nov 29 21:21:53.680650 sfd-01 INFO bgp13#rsyslogd: [origin software="rsyslogd" swVersion="8.2302.0" x-pid="47" x-info="https://www.rsyslog.com"] start
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,464 INFO Included extra file "/etc/supervisor/conf.d/containercfgd.conf" during parsing
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,465 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,465 INFO Set uid to user 0 succeeded
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,490 INFO RPC interface 'supervisor' initialized
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,490 CRIT Server 'unix_http_server' running without any HTTP authentication checking
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,490 INFO supervisord started with pid 1
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:52,493 INFO spawned: 'dependent-startup' with pid 43
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:52,494 INFO spawned: 'supervisor-proc-exit-listener' with pid 44
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,662 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,662 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,666 INFO spawned: 'rsyslogd' with pid 47
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,666 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.763962 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,762 INFO spawned: 'zebra' with pid 51
Nov 29 21:21:53.763962 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,762 INFO success: zebra entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.766833 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,766 INFO spawned: 'containercfgd' with pid 52
Nov 29 21:21:53.811325 sfd-01 INFO bgp13#supervisord: zebra 2023/11/29 21:21:53 ZEBRA: [NNACN-54BDA][EC 4043309110] Disabling MPLS support (no kernel support)
Nov 29 21:21:53.866085 sfd-01 NOTICE bgp13#containercfgd[52]: Connected to CONFIG DB
Nov 29 21:21:53.873146 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,871 INFO spawned: 'zsocket' with pid 57
Nov 29 21:21:53.873146 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,872 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.892714 sfd-01 INFO bgp13#root: It took 0.012434 seconds to wait for zebra to be ready to accept connections
Nov 29 21:21:53.894870 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,892 INFO exited: zsocket (exit status 0; expected)
Nov 29 21:21:53.920204 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,919 INFO spawned: 'zsocket' with pid 65
Nov 29 21:21:53.920204 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,919 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.941018 sfd-01 INFO bgp13#root: It took 0.0132363 seconds to wait for zebra to be ready to accept connections
Nov 29 21:21:53.942481 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,940 INFO exited: zsocket (exit status 0; expected)
root@sfd-01:/home/cisco# systemctl status bgp@13
Warning: The unit file, source configuration file or drop-ins of bgp@13.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● bgp@13.service
Loaded: masked (Reason: Unit bgp@13.service is masked.)
Drop-In: /etc/systemd/system/bgp@13.service.d
└─auto_restart.conf
Active: inactive (dead) since Wed 2023-11-29 21:21:49 UTC; 2h 12min ago
Nov 29 21:21:47 sfd-01 bgp.sh[40888]: cp: failed to access '/etc/sonic/frr/13': Permission denied
Nov 29 21:21:47 sfd-01 bgp.sh[40890]: 5eeab82e6f54fe4b3ae656dba013401302d31e135b1ef2012ad66518ed78b788
Nov 29 21:21:49 sfd-01 container[41051]: container_start: BEGIN
Nov 29 21:21:49 sfd-01 container[41051]: read_data: config:True feature:bgp13 fields:[('set_owner', 'local'), ('no_fallback_to_local', False), ('state', 'disabled')] val:['local', False, 'disabled']
Nov 29 21:21:49 sfd-01 container[41051]: read_data: config:False feature:bgp13 fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Nov 29 21:21:49 sfd-01 container[41051]: read_data: config:False feature:SERVER fields:[('connected', 'false'), ('update_time', '')] val:['false', '']
Nov 29 21:21:49 sfd-01 container[41051]: container_start: bgp13: set_owner:local fallback:True remote_state:none server_connected:false
Nov 29 21:21:49 sfd-01 systemd[1]: bgp@13.service: Control process exited, code=killed, status=15/TERM
Nov 29 21:21:49 sfd-01 systemd[1]: bgp@13.service: Failed with result 'signal'.
Nov 29 21:21:49 sfd-01 systemd[1]: Stopped BGP container.
root@sfd-01:/home/cisco# docker ps | grep bgp
5eeab82e6f54 docker-fpm-frr:latest "/usr/bin/docker_ini…" 2 hours ago Up 2 hours bgp13
root@sfd-01:/home/cisco# docker logs bgp13
/usr/local/lib/python3.9/dist-packages/supervisor/options.py:473: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c.
self.warnings.warn(
2023-11-29 21:21:51,464 INFO Included extra file "/etc/supervisor/conf.d/containercfgd.conf" during parsing
2023-11-29 21:21:51,465 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2023-11-29 21:21:51,465 INFO Set uid to user 0 succeeded
2023-11-29 21:21:51,490 INFO RPC interface 'supervisor' initialized
2023-11-29 21:21:51,490 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-11-29 21:21:51,490 INFO supervisord started with pid 1
2023-11-29 21:21:52,493 INFO spawned: 'dependent-startup' with pid 43
2023-11-29 21:21:52,494 INFO spawned: 'supervisor-proc-exit-listener' with pid 44
2023-11-29 21:21:53,662 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-29 21:21:53,662 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-29 21:21:53,666 INFO spawned: 'rsyslogd' with pid 47
2023-11-29 21:21:53,666 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,762 INFO spawned: 'zebra' with pid 51
2023-11-29 21:21:53,762 INFO success: zebra entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,766 INFO spawned: 'containercfgd' with pid 52
2023-11-29 21:21:53,871 INFO spawned: 'zsocket' with pid 57
2023-11-29 21:21:53,872 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,892 INFO exited: zsocket (exit status 0; expected)
2023-11-29 21:21:53,919 INFO spawned: 'zsocket' with pid 65
2023-11-29 21:21:53,919 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,940 INFO exited: zsocket (exit status 0; expected)
2023-11-29 21:21:53,951 INFO spawned: 'zsocket' with pid 73
2023-11-29 21:21:53,952 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,983 INFO exited: zsocket (exit status 0; expected)
2023-11-29 21:21:53,995 INFO spawned: 'staticd' with pid 81
2023-11-29 21:21:53,995 INFO success: staticd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,997 INFO spawned: 'bgpd' with pid 82
2023-11-29 21:21:53,998 INFO success: bgpd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,190 INFO spawned: 'bgpcfgd' with pid 89
2023-11-29 21:21:54,191 INFO success: bgpcfgd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,200 INFO spawned: 'bgpmon' with pid 90
2023-11-29 21:21:54,200 INFO success: bgpmon entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,215 INFO spawned: 'fpmsyncd' with pid 91
2023-11-29 21:21:54,216 INFO success: fpmsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,224 INFO spawned: 'staticroutebfd' with pid 92
2023-11-29 21:21:54,224 INFO success: staticroutebfd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,765 INFO success: containercfgd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-29 21:21:54,895 INFO exited: dependent-startup (exit status 0; expected)
Description
Looks like https://github.com/sonic-net/sonic-buildimage/issues/15667 issue is not fixed yet. his issue is still seen intermittently with the 202205 branch having the fix. I see same signatures where during the first after load_minigraph, when hostcfg calls systemctl stop and docker has just started, the docker fails to go down. Same as mentioned in the original description of #15667 .
Steps to reproduce the issue:
1. 2. 3.
Describe the results you received:
Describe the results you expected:
Output of
show version
:Output of
show techsupport
:Additional information you deem important (e.g. issue happens only occasionally):