sonic-net / sonic-buildimage

SNMP Container Failing with mgmt VRF #16730

Open justindthomas opened 1 year ago

justindthomas commented 1 year ago

Description

On a Dell N3248TE-ON, I configured the management interface to use the mgmt VRF. The SNMP container stopped working after that (although I didn't notice immediately).

Steps to reproduce the issue:

  1. Enable the management VRF.
  2. Reboot (or reload the config).

Describe the results you received:

I only have one PSU, so the PSU 2 messages below can be ignored.

$ sudo show system-health summary
System status summary

  System status LED  blink_yellow
  Services:
    Status: Not OK
    Not Running: snmp:snmpd, snmp:snmp-subagent
  Hardware:
    Status: Not OK
    Reasons: PSU 2 is missing or not available
             PSU2 Fan is missing
$ docker logs snmp
2023-09-27 13:16:01,071 INFO Included extra file "/etc/supervisor/conf.d/containercfgd.conf" during parsing
2023-09-27 13:16:01,071 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2023-09-27 13:16:01,071 INFO Set uid to user 0 succeeded
Unlinking stale socket /var/run/supervisor.sock
2023-09-27 13:16:01,410 INFO RPC interface 'supervisor' initialized
2023-09-27 13:16:01,411 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-09-27 13:16:01,412 INFO supervisord started with pid 1
2023-09-27 13:16:02,422 INFO spawned: 'dependent-startup' with pid 7
2023-09-27 13:16:02,438 INFO spawned: 'supervisor-proc-exit-listener' with pid 8
2023-09-27 13:16:02,486 INFO spawned: 'start' with pid 9
2023-09-27 13:16:03,550 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:03,553 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:03,553 INFO success: start entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-09-27 13:16:05,795 INFO spawned: 'rsyslogd' with pid 17
2023-09-27 13:16:07,619 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:07,921 INFO exited: start (exit status 0; expected)
2023-09-27 13:16:07,980 INFO spawned: 'containercfgd' with pid 25
2023-09-27 13:16:08,066 INFO spawned: 'snmpd' with pid 26
2023-09-27 13:16:09,468 INFO success: snmpd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:09,469 INFO success: containercfgd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:09,470 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:10,507 WARN received SIGTERM indicating exit request
2023-09-27 13:16:10,519 INFO waiting for dependent-startup, supervisor-proc-exit-listener, rsyslogd, containercfgd to die
2023-09-27 13:16:10,662 INFO stopped: containercfgd (exit status 143)
2023-09-27 13:16:10,699 INFO exited: dependent-startup (exit status 3; expected)
2023-09-27 13:16:11,712 INFO stopped: rsyslogd (exit status 0)
2023-09-27 13:16:11,716 INFO stopped: supervisor-proc-exit-listener (terminated by SIGTERM)
/usr/local/lib/python3.9/dist-packages/supervisor/options.py:473: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  self.warnings.warn(
2023-09-27 13:16:46,980 INFO Included extra file "/etc/supervisor/conf.d/containercfgd.conf" during parsing
2023-09-27 13:16:46,980 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2023-09-27 13:16:46,981 INFO Set uid to user 0 succeeded
2023-09-27 13:16:46,990 INFO RPC interface 'supervisor' initialized
2023-09-27 13:16:46,991 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-09-27 13:16:46,992 INFO supervisord started with pid 1
2023-09-27 13:16:47,995 INFO spawned: 'dependent-startup' with pid 7
2023-09-27 13:16:47,999 INFO spawned: 'supervisor-proc-exit-listener' with pid 8
2023-09-27 13:16:48,002 INFO spawned: 'start' with pid 9
2023-09-27 13:16:48,652 INFO success: start entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2023-09-27 13:16:49,273 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:49,274 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:49,403 INFO spawned: 'rsyslogd' with pid 17
2023-09-27 13:16:50,114 INFO exited: start (exit status 0; expected)
2023-09-27 13:16:51,117 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:51,148 INFO spawned: 'snmpd' with pid 25
2023-09-27 13:16:51,209 INFO spawned: 'containercfgd' with pid 26
2023-09-27 13:16:51,472 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:52,481 INFO spawned: 'snmpd' with pid 27
2023-09-27 13:16:52,483 INFO success: containercfgd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-09-27 13:16:52,676 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:54,709 INFO spawned: 'snmpd' with pid 28
2023-09-27 13:16:54,840 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:57,895 INFO spawned: 'snmpd' with pid 29
2023-09-27 13:16:58,036 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:58,042 INFO gave up: snmpd entered FATAL state, too many start retries too quickly
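The restart loop in the log above can be summarized with a short script. This is a minimal sketch, assuming the supervisord line format shown in this report; the function name `summarize` and the regular expressions are illustrative and are not part of SONiC or supervisord.

```python
import re
from collections import Counter

# Matches supervisord lines such as:
#   2023-09-27 13:16:51,472 INFO exited: snmpd (exit status 1; not expected)
EXIT_RE = re.compile(r"INFO exited: (\S+) \(exit status (\d+); not expected\)")
# Matches lines such as:
#   ... INFO gave up: snmpd entered FATAL state, too many start retries too quickly
FATAL_RE = re.compile(r"INFO gave up: (\S+) entered FATAL state")


def summarize(log_text):
    """Return (unexpected-exit counts per program, set of programs that went FATAL)."""
    exits = Counter()
    fatal = set()
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            exits[match.group(1)] += 1
            continue
        match = FATAL_RE.search(line)
        if match:
            fatal.add(match.group(1))
    return exits, fatal


if __name__ == "__main__":
    sample = """\
2023-09-27 13:16:50,114 INFO exited: start (exit status 0; expected)
2023-09-27 13:16:51,472 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:52,676 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:54,840 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:58,036 INFO exited: snmpd (exit status 1; not expected)
2023-09-27 13:16:58,042 INFO gave up: snmpd entered FATAL state, too many start retries too quickly
"""
    exits, fatal = summarize(sample)
    print(dict(exits), sorted(fatal))
```

Feeding it the saved output of docker logs snmp would show how many times snmpd exited unexpectedly before supervisord gave up. Expected exits, such as the start program, are ignored.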

The following was run manually inside the SNMP container. 10.200.0.2 is the IP address of my eth0 interface in the mgmt VRF.

root@sonic:/# /usr/sbin/snmpd -LOw -u Debian-snmp -g Debian-snmp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd.pid
Error opening specified endpoint "udp:[10.200.0.2]:161"
Server Exiting with code 1
root@sonic:/#
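The "Error opening specified endpoint" message means snmpd could not open its listening socket on that address. One way to check this independently of snmpd is to attempt the same UDP bind from a small script. The sketch below is hedged: it assumes a Linux host, it assumes the VRF device is named mgmt, and it is only a guess (not confirmed in this issue) that the bind fails because the address is not available outside the VRF scope. The helper name `try_udp_bind` is illustrative, and port 0 is used so the script does not need permission to bind port 161.

```python
import errno
import socket

# Linux value of SO_BINDTODEVICE, used if this Python build does not export it.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def try_udp_bind(address, port=0, device=None):
    """Try to bind a UDP socket; return (True, None) or (False, errno name)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            if device is not None:
                # Scope the socket to a device such as a VRF (usually needs root).
                sock.setsockopt(
                    socket.SOL_SOCKET, SO_BINDTODEVICE, device.encode() + b"\0"
                )
            sock.bind((address, port))
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        return True, None


if __name__ == "__main__":
    # 10.200.0.2 is the eth0 address from this report; "mgmt" is an assumed device name.
    print("default scope:", try_udp_bind("10.200.0.2"))
    print("bound to mgmt:", try_udp_bind("10.200.0.2", device="mgmt"))
```

If the first call fails with EADDRNOTAVAIL and the second succeeds, that would suggest snmpd needs to be started within the management VRF context rather than the default one.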

Describe the results you expected:

The SNMP container to continue operating.

Output of show version:

I run sudo show version because the command complains about dmidecode without sudo, but that's a separate issue.

$ sudo show version

SONiC Software Version: SONiC.master.372158-6e3519ea5
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-23-2-amd64
Build commit: 6e3519ea5
Build date: Tue Sep 26 12:40:01 UTC 2023
Built by: AzDevOps@vmss-soni00236O

Platform: x86_64-dellemc_n3248te_c3338-r0
HwSKU: DellEMC-N3248TE
ASIC: broadcom
ASIC Count: 1
Serial Number: 4GNXV43
Model Number: 0WNWT9
Hardware Revision:
Uptime: 00:21:00 up 11:07,  1 user,  load average: 1.66, 1.69, 1.86
Date: Thu 28 Sep 2023 00:21:00

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        latest                    cc45da41f975   349MB
docker-gbsyncd-broncos        master.372158-6e3519ea5   cc45da41f975   349MB
docker-gbsyncd-credo          latest                    f92ef16874e9   320MB
docker-gbsyncd-credo          master.372158-6e3519ea5   f92ef16874e9   320MB
docker-syncd-brcm             latest                    d75b29a1c076   673MB
docker-syncd-brcm             master.372158-6e3519ea5   d75b29a1c076   673MB
docker-orchagent              latest                    c9a7d8c767dd   336MB
docker-orchagent              master.372158-6e3519ea5   c9a7d8c767dd   336MB
docker-sflow                  latest                    c237dca05920   325MB
docker-sflow                  master.372158-6e3519ea5   c237dca05920   325MB
docker-teamd                  latest                    484fc6d98aff   324MB
docker-teamd                  master.372158-6e3519ea5   484fc6d98aff   324MB
docker-nat                    latest                    b6c8fc185930   327MB
docker-nat                    master.372158-6e3519ea5   b6c8fc185930   327MB
docker-fpm-frr                latest                    2da17b1d30d6   355MB
docker-fpm-frr                master.372158-6e3519ea5   2da17b1d30d6   355MB
docker-dhcp-relay             latest                    45c8b8528600   307MB
docker-macsec                 latest                    dfb8e087d649   326MB
docker-eventd                 latest                    c1e41f7a8894   299MB
docker-eventd                 master.372158-6e3519ea5   c1e41f7a8894   299MB
docker-platform-monitor       latest                    e80f40f1fd0c   419MB
docker-platform-monitor       master.372158-6e3519ea5   e80f40f1fd0c   419MB
docker-snmp                   latest                    46d4b18ffbd7   338MB
docker-snmp                   master.372158-6e3519ea5   46d4b18ffbd7   338MB
docker-sonic-telemetry        latest                    2c400217beee   386MB
docker-sonic-telemetry        master.372158-6e3519ea5   2c400217beee   386MB
docker-router-advertiser      latest                    187575c1fb26   299MB
docker-router-advertiser      master.372158-6e3519ea5   187575c1fb26   299MB
docker-lldp                   latest                    e7232b5b0a8e   341MB
docker-lldp                   master.372158-6e3519ea5   e7232b5b0a8e   341MB
docker-mux                    latest                    d7ba03f26579   348MB
docker-mux                    master.372158-6e3519ea5   d7ba03f26579   348MB
docker-database               latest                    4966acba31e6   299MB
docker-database               master.372158-6e3519ea5   4966acba31e6   299MB
docker-sonic-mgmt-framework   latest                    42078cbfacab   416MB
docker-sonic-mgmt-framework   master.372158-6e3519ea5   42078cbfacab   416MB

Output of show techsupport:

techsupport.txt

Additional information you deem important (e.g. issue happens only occasionally):

The dump file is 60MB and exceeds GitHub's limits.

justindthomas commented 1 year ago

Is it possible that this PR might address this? https://github.com/sonic-net/sonic-buildimage/pull/17044

justindthomas commented 11 months ago

I re-added the management VRF settings to see if the PR I mentioned above solved the issue. It did not.

2023-11-21 23:42:04,907 - supervisord_dependent_startup - [INFO   ] New event: Service snmpd went from BACKOFF to STARTING
2023-11-21 23:42:04,940 - supervisord_dependent_startup - [INFO   ] Services:
2023-11-21 23:42:04,954 - supervisord_dependent_startup - [INFO   ]  - rsyslogd                       RUNNING                         dependent_startup: True   priority:    1
2023-11-21 23:42:04,970 - supervisord_dependent_startup - [INFO   ]  - start                          EXITED                          dependent_startup: False  wait_for: 'rsyslogd:RUNNING'  priority:    1
2023-11-21 23:42:04,986 - supervisord_dependent_startup - [INFO   ]  - containercfgd                  RUNNING                         dependent_startup: True   wait_for: 'rsyslogd:RUNNING'  priority:   99
2023-11-21 23:42:04,992 - supervisord_dependent_startup - [INFO   ]  - snmpd                          STARTING                        dependent_startup: True   wait_for: 'start:EXITED'  priority:    3
2023-11-21 23:42:05,022 - supervisord_dependent_startup - [INFO   ]  - snmp-subagent                  STOPPED                         dependent_startup: True   wait_for: 'snmpd:RUNNING'  priority:    4
2023-11-21 23:42:05,044 - supervisord_dependent_startup - [INFO   ] Services not yet running (2): snmpd, snmp-subagent
2023-11-21 23:42:06,210 - supervisord_dependent_startup - [INFO   ]
2023-11-21 23:42:06,211 - supervisord_dependent_startup - [INFO   ] New event: Service snmpd went from STARTING to BACKOFF
2023-11-21 23:42:06,230 - supervisord_dependent_startup - [INFO   ] Services:
2023-11-21 23:42:06,233 - supervisord_dependent_startup - [INFO   ]  - rsyslogd                       RUNNING                         dependent_startup: True   priority:    1
2023-11-21 23:42:06,246 - supervisord_dependent_startup - [INFO   ]  - start                          EXITED                          dependent_startup: False  wait_for: 'rsyslogd:RUNNING'  priority:    1
2023-11-21 23:42:06,253 - supervisord_dependent_startup - [INFO   ]  - containercfgd                  RUNNING                         dependent_startup: True   wait_for: 'rsyslogd:RUNNING'  priority:   99
2023-11-21 23:42:06,262 - supervisord_dependent_startup - [INFO   ]  - snmpd                          FATAL                           dependent_startup: True   wait_for: 'start:EXITED'  priority:    3
2023-11-21 23:42:06,264 - supervisord_dependent_startup - [INFO   ]  - snmp-subagent                  STOPPED                         dependent_startup: True   wait_for: 'snmpd:RUNNING'  priority:    4
2023-11-21 23:42:06,272 - supervisord_dependent_startup - [INFO   ] Services not yet running (2): snmpd, snmp-subagent
2023-11-21 23:42:06,274 - supervisord_dependent_startup - [INFO   ]
2023-11-21 23:42:06,274 - supervisord_dependent_startup - [INFO   ] New event: Service snmpd went from BACKOFF to FATAL
2023-11-21 23:42:06,294 - supervisord_dependent_startup - [INFO   ] Services:
2023-11-21 23:42:06,297 - supervisord_dependent_startup - [INFO   ]  - rsyslogd                       RUNNING                         dependent_startup: True   priority:    1
2023-11-21 23:42:06,302 - supervisord_dependent_startup - [INFO   ]  - start                          EXITED                          dependent_startup: False  wait_for: 'rsyslogd:RUNNING'  priority:    1
2023-11-21 23:42:06,305 - supervisord_dependent_startup - [INFO   ]  - containercfgd                  RUNNING                         dependent_startup: True   wait_for: 'rsyslogd:RUNNING'  priority:   99
2023-11-21 23:42:06,310 - supervisord_dependent_startup - [INFO   ]  - snmpd                          FATAL                           dependent_startup: True   wait_for: 'start:EXITED'  priority:    3
2023-11-21 23:42:06,312 - supervisord_dependent_startup - [INFO   ]  - snmp-subagent                  STOPPED                         dependent_startup: True   wait_for: 'snmpd:RUNNING'  priority:    4
2023-11-21 23:42:06,324 - supervisord_dependent_startup - [INFO   ] Services not yet running (1): snmp-subagent
jdt@sonic:~$ sudo show system-health summary
System status summary

  System status LED  blink_yellow
  Services:
    Status: Not OK
    Not Running: container_checker, telemetry, snmp:snmpd, snmp:snmp-subagent
  Hardware:
    Status: OK
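To track which services are affected across builds, the "Not Running" line can be pulled out of the summary programmatically. This is a minimal sketch that assumes the text layout shown above; `not_running_services` is an illustrative helper, not a SONiC API.

```python
def not_running_services(summary_text):
    """Return the entries listed after 'Not Running:' in a system-health summary."""
    for line in summary_text.splitlines():
        # Split on the first colon only, since entries such as snmp:snmpd contain colons.
        label, sep, value = line.strip().partition(":")
        if sep and label == "Not Running":
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


if __name__ == "__main__":
    summary = """\
System status summary

  System status LED  blink_yellow
  Services:
    Status: Not OK
    Not Running: container_checker, telemetry, snmp:snmpd, snmp:snmp-subagent
  Hardware:
    Status: OK
"""
    services = not_running_services(summary)
    print(services)
    # Keep only the entries that belong to the snmp container.
    print([name for name in services if name.startswith("snmp:")])
```

Filtering on the snmp: prefix separates the SNMP failures from the container_checker and telemetry entries.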

The telemetry error is new but, as far as I know, unrelated to the management VRF. I think it is being addressed in a separate issue.

justindthomas commented 11 months ago

This may be related to https://github.com/sonic-net/sonic-buildimage/issues/16187