sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
736 stars 1.42k forks source link

Rsyslogd in teamd container start failed with empty rsyslog.conf caused by sonic-cfggen generate conf failed #13674

Open lerry-lee opened 1 year ago

lerry-lee commented 1 year ago

Description

In t0-sonic and multi-asic KVM-test, sometimes test_pretest.py::test_disable_rsyslog_rate_limit will fail. Error message like below:

test_pretest.py::test_disable_rsyslog_rate_limit[vlab-02] 
-------------------------------- live log call ---------------------------------
06:18:51 __init__.pytest_runtest_call             L0040 ERROR  | Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 1464, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 81, in get_result
    _reraise(*ex)  # noqa
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 174, in pytest_pyfunc_call
    testfunction(**testargs)
  File "/var/src/sonic-mgmt/tests/test_pretest.py", line 172, in test_disable_rsyslog_rate_limit
    duthost.modify_syslog_rate_limit(feature_name, rl_option='disable')
  File "/var/src/sonic-mgmt/tests/common/devices/multi_asic.py", line 476, in modify_syslog_rate_limit
    self.sonichost.shell_cmds(cmds=cmds)
  File "/var/src/sonic-mgmt/tests/common/devices/base.py", line 90, in _run
    raise RunAnsibleModuleFail("run module {} failed".format(self.module_name), res)
RunAnsibleModuleFail: run module shell_cmds failed, Ansible Results =>
{"changed": false, "cmds": ["docker exec -i teamd sed -i 's/^\\$SystemLogRateLimit/#\\$SystemLogRateLimit/g' /etc/rsyslog.conf", "docker exec -i teamd supervisorctl restart rsyslogd"], "delta": "0:00:02.559184", "end": "2023-02-02 06:18:50.970187", "failed": true, "msg": "At least running one of the commands failed", "results": [{"cmd": "docker exec -i teamd sed -i 's/^\\$SystemLogRateLimit/#\\$SystemLogRateLimit/g' /etc/rsyslog.conf", "rc": 0, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, {"cmd": "docker exec -i teamd supervisorctl restart rsyslogd", "rc": 7, "stderr": "", "stderr_lines": [], "stdout": "rsyslogd: ERROR (not running)\nrsyslogd: ERROR (spawn error)\n", "stdout_lines": ["rsyslogd: ERROR (not running)", "rsyslogd: ERROR (spawn error)"]}], "start": "2023-02-02 06:18:48.411003"}

The reason is that rsyslogd process can not enter running state because its' rsyslog.conf is empty. Further, we found that, in teamd.sh, when start teamd, it will update rsyslog.conf use sonic-cfggen, related code is below:

sonic-cfggen -d -t /usr/share/sonic/templates/rsyslog-container.conf.j2 -a "{\"target_ip\": \"$TARGET_IP\", \"container_name\": \"$CONTAINER_NAME\" }"  > $TMP_FILE
...
docker cp $TMP_FILE ${DOCKERNAME}:/etc/rsyslog.conf

And if this command fails, it will generate an empty file and copy it to teamd container. Currently, it uses -d parameter which means read from configDB. We see the sonic-cfggen failed reason is: image

But in the aspect of the service dependency control, we found that teamd will start after database. So, the connection to Redis meets some problems.

Steps to reproduce the issue:

1. 2. 3.

Describe the results you received:

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

lerry-lee commented 1 year ago

PR #13669 is a hotfix to unblock some PR tests.

stephenxs commented 1 year ago

Hi @lerry-lee Did you see the traceback from the syslog when the issue occurred? I observed a similar issue #13791 which I also suspect sonic-cfggen aborted somehow but I didn't see the traceback from the log.

lerry-lee commented 1 year ago

Hi @lerry-lee Did you see the traceback from the syslog when the issue occurred? I observed a similar issue #13791 which I also suspect sonic-cfggen aborted somehow but I didn't see the traceback from the log. @stephenxs After issue occurred, I checked the syslog and saw the error msg like the screenshot in the top description. I saw your observed issue also related to sonic-cfggen -d, I think it might be some related cause, because -d requires Redis, so I think some problems occurred when getting connection to Redis.

stephenxs commented 1 year ago

Hi @lerry-lee Did you see the traceback from the syslog when the issue occurred? I observed a similar issue #13791 which I also suspect sonic-cfggen aborted somehow but I didn't see the traceback from the log. @stephenxs After issue occurred, I checked the syslog and saw the error msg like the screenshot in the top description. I saw your observed issue also related to sonic-cfggen -d, I think it might be some related cause, because -d requires Redis, so I think some problems occurred when getting connection to Redis.

Thanks. I also suspect they share the same cause. The only difference is that I didn't see the traceback from the dumps. I'm going to try using -a to avoid redis db access

arlakshm commented 1 year ago

can we run the tcpdump on lo interface when the team starts to check the packet dump on why the connection to redis is failing.

arlakshm commented 1 year ago

please update the description with the testcase where this failure is seen.

lerry-lee commented 1 year ago

please update the description with the testcase where this failure is seen.

Refined.