sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
729 stars 1.4k forks source link

chassis: Sup: 'container_checker' status failed (4) -- Unexpected running containers: bgp12 #16707

Open anamehra opened 1 year ago

anamehra commented 1 year ago

Description

Looks like https://github.com/sonic-net/sonic-buildimage/issues/15667 issue is not fixed yet. his issue is still seen intermittently with the 202205 branch having the fix. I see same signatures where during the first after load_minigraph, when hostcfg calls systemctl stop and docker has just started, the docker fails to go down. Same as mentioned in the original description of #15667 .

Steps to reproduce the issue:

1. 2. 3.

Describe the results you received:

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

judyjoseph commented 1 year ago

@anamehra Could you add more info if seen in latest 202205, and steps for repro

anamehra commented 10 months ago

I hit the issue again on 202205.

I brought up a Supervisor without any config db file. Then did deploy-mg from sonic-mgmt. All bgp dockers went down except one. I see a timing condition when bgp service was restarted as part of load_minigraph and then hostcfgd masking the service. bgp13 service goes down but as the docker was started around same time, it did not down:

Nov 29 21:20:49.407555 sonic NOTICE cisco: Stopping bgp13 service...                                                                                                                                                                                            
Nov 29 21:20:49.659314 sonic NOTICE cisco: Warm boot flag: bgp13 false.                                                                                                                                                                                         
Nov 29 21:20:49.721835 sonic NOTICE cisco: Fast boot flag: bgp13 false.                                                                                                                                                                                         
Nov 29 21:20:51.369372 sonic DEBUG container: read_data: config:True feature:bgp13 fields:[('set_owner', 'local'), ('no_fallback_to_local', False), ('state', 'disabled')] val:['local', False, 'disabled']                                                     
Nov 29 21:20:51.372539 sonic DEBUG container: read_data: config:False feature:bgp13 fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']                                                                 
Nov 29 21:20:51.377978 sonic DEBUG container: container_stop: bgp13: set_owner:local current_owner:none remote_state:none docker_id:bgp13                                                                                                                       
Nov 29 21:20:52.082265 sonic INFO bgp13#supervisord 2023-11-29 21:20:52,080 WARN received SIGTERM indicating exit request                                                                                                                                       
Nov 29 21:20:52.082265 sonic INFO bgp13#supervisord 2023-11-29 21:20:52,080 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd, bgpmon, fpmsyncd, staticroutebfd, containercfgd to die                                     
Nov 29 21:20:52.087363 sonic INFO bgp13#supervisord 2023-11-29 21:20:52,087 INFO stopped: containercfgd (exit status 143)                                                                                                                                       
Nov 29 21:20:53.089609 sonic INFO bgp13#supervisord 2023-11-29 21:20:53,089 INFO stopped: staticroutebfd (exit status 0)                                                                                                                                        
Nov 29 21:20:53.090714 sonic INFO bgp13#supervisord 2023-11-29 21:20:53,090 INFO stopped: fpmsyncd (terminated by SIGTERM)                                                                                                                                      
Nov 29 21:20:53.091379 sonic INFO bgp13#supervisord 2023-11-29 21:20:53,091 INFO stopped: bgpmon (terminated by SIGTERM)                                                                                                                                        
Nov 29 21:20:55.094025 sonic INFO bgp13#supervisord 2023-11-29 21:20:55,093 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die                                                                                      
Nov 29 21:20:58.096824 sonic INFO bgp13#supervisord 2023-11-29 21:20:58,096 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die                                                                                      
Nov 29 21:21:01.100217 sonic INFO bgp13#supervisord 2023-11-29 21:21:01,099 INFO waiting for supervisor-proc-exit-listener, rsyslogd, staticd, zebra, bgpd, bgpcfgd to die                                                                                      
Nov 29 21:21:01.547321 sonic INFO container: docker cmd: wait for bgp13                                                                                                                                                                                         
Nov 29 21:21:01.548201 sonic INFO container: docker cmd: stop for bgp13                                                                                                                                                                                         
Nov 29 21:21:01.662206 sonic NOTICE cisco: Stopped bgp13 service...                                                                                                                                                                                             
Nov 29 21:21:01.775324 sonic NOTICE systemd[1]: bgp@13.service: Control process exited, code=exited, status=1/FAILURE                                                                                                                                           
Nov 29 21:21:01.775378 sonic WARNING systemd[1]: bgp@13.service: Failed with result 'exit-code'.                                                                                                                                                                
Nov 29 21:21:40.627266 sonic INFO hostcfgd: Updating feature 'bgp@13' systemd config file related to auto-restart ...                                                                                                                                           
Nov 29 21:21:40.627368 sonic INFO hostcfgd: Feature 'bgp@13' systemd config file related to auto-restart is updated!                                                                                                                                            
Nov 29 21:21:42.910335 sonic NOTICE python3: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-down-ctr":{"ctr_name":"bgp13","timestamp":"2023-11-29T21:21:42.910193Z"}}                                                                                   
Nov 29 21:21:46.352367 sonic NOTICE cisco: Starting bgp13 service...                                                                                                                                                                                            
Nov 29 21:21:46.528101 sonic NOTICE cisco: Warm boot flag: bgp13 false.                                                                                                                                                                                         
Nov 29 21:21:46.578191 sonic NOTICE cisco: Fast boot flag: bgp13 .                                                                                                                                                                                              
Nov 29 21:21:47.269205 sonic INFO bgp.sh[40618]: Removing obsolete bgp13 container with HWSKU /usr/share/sonic/device/x86_64-8800_rp_o-r0/13                                                                                                                    
Nov 29 21:21:47.359159 sonic INFO bgp.sh[40778]: bgp13                                                                                                                                                                                                          
Nov 29 21:21:47.365148 sonic INFO bgp.sh[40618]: Creating new bgp13 container with HWSKU Cisco-8800-RP                                                                                                                                                          
Nov 29 21:21:49.228210 sonic DEBUG container: read_data: config:True feature:bgp13 fields:[('set_owner', 'local'), ('no_fallback_to_local', False), ('state', 'disabled')] val:['local', False, 'disabled']                                                     
Nov 29 21:21:49.231453 sonic DEBUG container: read_data: config:False feature:bgp13 fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']                                                                 
Nov 29 21:21:49.233142 sonic DEBUG container: container_start: bgp13: set_owner:local fallback:True remote_state:none server_connected:false                                                                                                                    
Nov 29 21:21:49.477865 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'stop', 'bgp@13.service']'                                                                                                                                                      
Nov 29 21:21:49.489713 sonic WARNING systemd[1]: bgp@13.service: Control process exited, code=killed, status=15/TERM                                                                                                                                            
Nov 29 21:21:49.561933 sfd-01 WARNING systemd[1]: bgp@13.service: Failed with result 'signal'.                                                                                                                                                                  
Nov 29 21:21:49.564464 sfd-01 INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'disable', 'bgp@13.service']'                                                                                                                                                  
Nov 29 21:21:49.835122 sfd-01 INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'mask', 'bgp@13.service']'                                                                                                                                                     
Nov 29 21:21:49.847375 sfd-01 INFO hostcfgd[41326]: Created symlink /etc/systemd/system/bgp@13.service → /dev/null.                                                                                                                                             
Nov 29 21:21:53.680650 sfd-01 INFO bgp13#rsyslogd: [origin software="rsyslogd" swVersion="8.2302.0" x-pid="47" x-info="https://www.rsyslog.com"] start                                                                                                          
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,464 INFO Included extra file "/etc/supervisor/conf.d/containercfgd.conf" during parsing                                                                                                
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,465 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing                                                                                                  
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,465 INFO Set uid to user 0 succeeded                                                                                                                                                   
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,490 INFO RPC interface 'supervisor' initialized                                                                                                                                        
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,490 CRIT Server 'unix_http_server' running without any HTTP authentication checking                                                                                                    
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:51,490 INFO supervisord started with pid 1                                                                                                                                                
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:52,493 INFO spawned: 'dependent-startup' with pid 43                                                                                                                                      
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:52,494 INFO spawned: 'supervisor-proc-exit-listener' with pid 44                                                                                                                          
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,662 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)                                                                      
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,662 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,666 INFO spawned: 'rsyslogd' with pid 47                                                                                                                                               
Nov 29 21:21:53.699499 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,666 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.763962 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,762 INFO spawned: 'zebra' with pid 51                                      
Nov 29 21:21:53.763962 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,762 INFO success: zebra entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.766833 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,766 INFO spawned: 'containercfgd' with pid 52     
Nov 29 21:21:53.811325 sfd-01 INFO bgp13#supervisord: zebra 2023/11/29 21:21:53 ZEBRA: [NNACN-54BDA][EC 4043309110] Disabling MPLS support (no kernel support)
Nov 29 21:21:53.866085 sfd-01 NOTICE bgp13#containercfgd[52]: Connected to CONFIG DB                                                                                                                                                                            
Nov 29 21:21:53.873146 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,871 INFO spawned: 'zsocket' with pid 57
Nov 29 21:21:53.873146 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,872 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.892714 sfd-01 INFO bgp13#root: It took 0.012434 seconds to wait for zebra to be ready to accept connections
Nov 29 21:21:53.894870 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,892 INFO exited: zsocket (exit status 0; expected)
Nov 29 21:21:53.920204 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,919 INFO spawned: 'zsocket' with pid 65
Nov 29 21:21:53.920204 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,919 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Nov 29 21:21:53.941018 sfd-01 INFO bgp13#root: It took 0.0132363 seconds to wait for zebra to be ready to accept connections
Nov 29 21:21:53.942481 sfd-01 INFO bgp13#supervisord 2023-11-29 21:21:53,940 INFO exited: zsocket (exit status 0; expected)

root@sfd-01:/home/cisco# systemctl status bgp@13
Warning: The unit file, source configuration file or drop-ins of bgp@13.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● bgp@13.service
     Loaded: masked (Reason: Unit bgp@13.service is masked.)
    Drop-In: /etc/systemd/system/bgp@13.service.d
             └─auto_restart.conf
     Active: inactive (dead) since Wed 2023-11-29 21:21:49 UTC; 2h 12min ago

Nov 29 21:21:47 sfd-01 bgp.sh[40888]: cp: failed to access '/etc/sonic/frr/13': Permission denied
Nov 29 21:21:47 sfd-01 bgp.sh[40890]: 5eeab82e6f54fe4b3ae656dba013401302d31e135b1ef2012ad66518ed78b788
Nov 29 21:21:49 sfd-01 container[41051]: container_start: BEGIN
Nov 29 21:21:49 sfd-01 container[41051]: read_data: config:True feature:bgp13 fields:[('set_owner', 'local'), ('no_fallback_to_local', False), ('state', 'disabled')] val:['local', False, 'disabled']
Nov 29 21:21:49 sfd-01 container[41051]: read_data: config:False feature:bgp13 fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Nov 29 21:21:49 sfd-01 container[41051]: read_data: config:False feature:SERVER fields:[('connected', 'false'), ('update_time', '')] val:['false', '']
Nov 29 21:21:49 sfd-01 container[41051]: container_start: bgp13: set_owner:local fallback:True remote_state:none server_connected:false
Nov 29 21:21:49 sfd-01 systemd[1]: bgp@13.service: Control process exited, code=killed, status=15/TERM
Nov 29 21:21:49 sfd-01 systemd[1]: bgp@13.service: Failed with result 'signal'.
Nov 29 21:21:49 sfd-01 systemd[1]: Stopped BGP container.

root@sfd-01:/home/cisco# docker ps | grep bgp
5eeab82e6f54   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours             bgp13

root@sfd-01:/home/cisco# docker logs bgp13
/usr/local/lib/python3.9/dist-packages/supervisor/options.py:473: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c.
  self.warnings.warn(
2023-11-29 21:21:51,464 INFO Included extra file "/etc/supervisor/conf.d/containercfgd.conf" during parsing
2023-11-29 21:21:51,465 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2023-11-29 21:21:51,465 INFO Set uid to user 0 succeeded
2023-11-29 21:21:51,490 INFO RPC interface 'supervisor' initialized
2023-11-29 21:21:51,490 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-11-29 21:21:51,490 INFO supervisord started with pid 1
2023-11-29 21:21:52,493 INFO spawned: 'dependent-startup' with pid 43
2023-11-29 21:21:52,494 INFO spawned: 'supervisor-proc-exit-listener' with pid 44
2023-11-29 21:21:53,662 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-29 21:21:53,662 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-29 21:21:53,666 INFO spawned: 'rsyslogd' with pid 47
2023-11-29 21:21:53,666 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,762 INFO spawned: 'zebra' with pid 51
2023-11-29 21:21:53,762 INFO success: zebra entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,766 INFO spawned: 'containercfgd' with pid 52
2023-11-29 21:21:53,871 INFO spawned: 'zsocket' with pid 57
2023-11-29 21:21:53,872 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,892 INFO exited: zsocket (exit status 0; expected)
2023-11-29 21:21:53,919 INFO spawned: 'zsocket' with pid 65
2023-11-29 21:21:53,919 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,940 INFO exited: zsocket (exit status 0; expected)
2023-11-29 21:21:53,951 INFO spawned: 'zsocket' with pid 73
2023-11-29 21:21:53,952 INFO success: zsocket entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,983 INFO exited: zsocket (exit status 0; expected)
2023-11-29 21:21:53,995 INFO spawned: 'staticd' with pid 81
2023-11-29 21:21:53,995 INFO success: staticd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:53,997 INFO spawned: 'bgpd' with pid 82
2023-11-29 21:21:53,998 INFO success: bgpd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,190 INFO spawned: 'bgpcfgd' with pid 89
2023-11-29 21:21:54,191 INFO success: bgpcfgd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,200 INFO spawned: 'bgpmon' with pid 90
2023-11-29 21:21:54,200 INFO success: bgpmon entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,215 INFO spawned: 'fpmsyncd' with pid 91
2023-11-29 21:21:54,216 INFO success: fpmsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,224 INFO spawned: 'staticroutebfd' with pid 92
2023-11-29 21:21:54,224 INFO success: staticroutebfd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-11-29 21:21:54,765 INFO success: containercfgd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-11-29 21:21:54,895 INFO exited: dependent-startup (exit status 0; expected)