sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
741 stars 1.43k forks source link

[snmpd] Snmpd ioctl 35123 returned -1 logs appear during config reload #9675

Open dgsudharsan opened 2 years ago

dgsudharsan commented 2 years ago

Description

When a scaled LAG configuration is added and config reload is performed snmpd emits below log before shutting down. 'Jan 2 21:35:03.621782 r-panther-13 ERR snmp#snmpd[21]: ioctl 35123 returned -1\n'

I believe it might be due to some race condition where interface is deleted but snmpd tries to access it.

This issue occurs very rarely and hard to reproduce.

Steps to reproduce the issue:

  1. Add the configuration from sysdump
  2. Perform config reload.

Describe the results you received:

Describe the results you expected:

Output of show version:

SONiC Software Version: SONiC.202012.196-16a392915_Internal
Distribution: Debian 10.11
Kernel: 4.19.0-12-2-amd64
Build commit: 16a392915
Build date: Sun Jan  2 13:59:36 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Uptime: 21:42:43 up 21 min,  1 user,  load average: 1.93, 4.41, 3.71

Docker images:
REPOSITORY                    TAG                             IMAGE ID            SIZE
docker-syncd-mlnx             202012.196-16a392915_Internal   860cc402667b        969MB
docker-syncd-mlnx             latest                          860cc402667b        969MB
docker-platform-monitor       202012.196-16a392915_Internal   6b2004bdecf8        682MB
docker-platform-monitor       latest                          6b2004bdecf8        682MB
docker-sflow                  202012.196-16a392915_Internal   69fb819114b5        391MB
docker-sflow                  latest                          69fb819114b5        391MB
docker-snmp                   202012.196-16a392915_Internal   2c5f8c75c09b        422MB
docker-snmp                   latest                          2c5f8c75c09b        422MB
docker-teamd                  202012.196-16a392915_Internal   f1c74fa41ba1        390MB
docker-teamd                  latest                          f1c74fa41ba1        390MB
docker-router-advertiser      202012.196-16a392915_Internal   7c14699b69ca        380MB
docker-router-advertiser      latest                          7c14699b69ca        380MB
docker-nat                    202012.196-16a392915_Internal   d5d60742e3ce        393MB
docker-nat                    latest                          d5d60742e3ce        393MB
docker-lldp                   202012.196-16a392915_Internal   4086facc8fc3        420MB
docker-lldp                   latest                          4086facc8fc3        420MB
docker-database               202012.196-16a392915_Internal   eb326db92704        380MB
docker-database               latest                          eb326db92704        380MB
docker-sonic-mgmt-framework   202012.196-16a392915_Internal   ad183bf610a8        793MB
docker-sonic-mgmt-framework   latest                          ad183bf610a8        793MB
docker-orchagent              202012.196-16a392915_Internal   9fa6d1f72aa2        409MB
docker-orchagent              latest                          9fa6d1f72aa2        409MB
docker-sonic-telemetry        202012.196-16a392915_Internal   e03068e245e7        469MB
docker-sonic-telemetry        latest                          e03068e245e7        469MB
docker-mux                    202012.196-16a392915_Internal   7dcd0b71d7dd        432MB
docker-mux                    latest                          7dcd0b71d7dd        432MB
docker-fpm-frr                202012.196-16a392915_Internal   c280b5311557        409MB
docker-fpm-frr                latest                          c280b5311557        409MB
docker-dhcp-relay             202012.196-16a392915_Internal   edcf26f69c29        393MB
docker-dhcp-relay             latest                          edcf26f69c29        393MB

Output of show techsupport:

(paste your output here or download and attach the file here )

sysdump_test_lags_scale.tar.gz

Additional information you deem important (e.g. issue happens only occasionally):

yxieca commented 2 years ago

Scaled LAG: many LAGs on the system, each LAG has 1 member to maximize the number of LAGs, in this test, all ports are in LAGs (# of LAG == # of physical ports). Repro rate: 1 out of 10 retries.

yxieca commented 2 years ago

@dgsudharsan please check if SNMP and TEAMD are both SWSS service dependents and shutdown by SWSS service script. If so, please try shutdown SNMP first as you suggested.

vivekrnv commented 2 years ago
Jan  2 21:34:53.168231 r-panther-13 INFO systemd[1]: Stopping SNMP container…
Jan  2 21:34:57.379650 r-panther-13 INFO systemd[1]: Stopping TEAMD container…
Jan  2 21:35:03.621782 r-panther-13 ERR snmp#snmpd[21]: ioctl 35123 returned -1\n
Jan  2 21:35:04.254619 r-panther-13 INFO snmp#supervisord 2022-01-02 21:35:04,253 INFO stopped: snmpd (exit status 0)
Jan  2 21:35:06.385508 r-panther-13 INFO teamd#supervisord 2022-01-02 21:35:06,380 INFO stopped: teammgrd (exit status 0)
Jan  2 21:35:07.810544 r-panther-13 NOTICE root: Stopping swss service…
Jan  2 21:35:15.853449 r-panther-13 INFO swss#supervisord 2022-01-02 21:35:15,852 INFO stopped: orchagent (terminated by SIGTERM)

Hi @yxieca,

SNMPD & TEAMD are both SWSS dependent services and from the logs it is clear that stop jobs for teamd and snmp are running concurrently and once both are finished, stop job for swss is started. To start snmp before teamd, a explicit dependency on the teamd service has to be given for snmp. I can test doing this but the problem is, i couldn't repro this to actually verify if this fix solves the issue.

And even though this log is seen, snmpd has exited gracefully and given that this is a non-functional issue, the solution (assuming it works) seems overkill. Also, this solution might potentially increase the time for config reload.

Let me know what you think