sonic-net / sonic-mgmt

Configuration management examples for SONiC

[syslog] test_syslog_rate_limit.py fail when test on syncd container #11181

Closed wen587 closed 10 months ago

wen587 commented 11 months ago

Description

The test fails when it randomly picks the acms or syncd container to test the rate limiter. Related code: https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-containercfgd/containercfgd/containercfgd.py#L158

The root cause is that there is no containercfgd running in these two containers to restart rsyslogd when `config syslog rate-limit-container` is run against them.

Steps to reproduce the issue:

  1. To simplify the process, we can just configure the rate limiter for acms and check the syslog.
  2. Configure the syslog rate limiter for either acms or syncd:
    admin@str3-xx-02:~$ show syslog rate-limit-c | grep acms
    acms          300         20000
    admin@str3-xx-02:~$ docker exec -i acms bash -c 'pidof rsyslogd'
    14
    admin@str3-xx-02:~$  sudo config syslog rate-limit-container acms -b 100 -i 10
    admin@str3-xx-02:~$ show syslog rate-limit-c | grep acms
    acms          10          100
    admin@str3-xx-02:~$ docker exec -i acms bash -c 'pidof rsyslogd'
    14                   <===== rsyslogd didn't restart after config

    The syslog shows that the command to update the rate limits is sent, but rsyslogd is not restarted.

    admin@str3-xx-02:~$ show logging -f
    Jan  3 10:07:27.337988 str3-xx-02 INFO swss#supervisord: orchagent
    Jan  3 10:07:27.383581 str3-xx-02 INFO python[263883]: ansible-ansible.legacy.file Invoked with dest=/tmp/loganalyzer.py _original_basename=system_msg_handler.py recurse=False state=file path=/tmp/loganalyzer.py force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S unsafe_writes=False _diff_peek=None src=None modification_time=None access_time=None mode=None owner=None group=None seuser=None serole=None selevel=None setype=None attributes=None
    Jan  3 10:07:27.905952 str3-xx-02 NOTICE CCmisApi: Configured syslog acms rate-limits: interval=10,        burst=100

Describe the results you received: rsyslogd in the acms or syncd container did not restart, so the test fails because it waits forever for the PID to refresh.
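
The hang comes from waiting without a bound for the rsyslogd PID to change. A minimal sketch of a bounded wait follows. `wait_for_pid_change` and its `get_pid` callable are hypothetical names used for illustration, not the actual sonic-mgmt test helpers. In a real run, `get_pid` would wrap something like `docker exec -i acms bash -c 'pidof rsyslogd'`.

```python
import time


def wait_for_pid_change(get_pid, old_pid, timeout=30.0, interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_pid() until it returns a PID different from old_pid.

    Returns the new PID, or None if the timeout expires first, so the
    caller can fail fast instead of waiting forever.
    get_pid is a hypothetical callable returning the current PID as a string.
    """
    deadline = clock() + timeout
    while True:
        pid = get_pid()
        # An empty result means rsyslogd may be mid-restart; keep polling.
        if pid and pid != old_pid:
            return pid
        if clock() >= deadline:
            return None
        sleep(interval)
```

With a bound like this, a container that never restarts rsyslogd produces a clear test failure rather than a stuck run.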

Describe the results you expected: Need mellanox team to confirm if the behavior is expected.

Below is the passing case for teamd. rsyslogd in the container restarts and the test passes.

admin@str3-xx-02:~$ show syslog rate-limit-c | grep teamd
teamd         300         20000
admin@str3-xx-02:~$ docker exec -i teamd bash -c 'pidof rsyslogd'
20
admin@str3-xx-02:~$ sudo config syslog rate-limit-container teamd -b 100 -i 10
admin@str3-xx-02:~$ docker exec -i teamd bash -c 'pidof rsyslogd'
3414           <========= rsyslog restart and pid update

The syslog shows that the rate limit is configured and rsyslogd in the container restarts:

admin@str3-xx-02:~$ show logging -f
Jan  3 10:08:46.214710 str3-xx-02 NOTICE teamd#containercfgd[220]: Configure syslog rate limit interval=10, burst=100
Jan  3 10:08:46.214968 str3-xx-02 NOTICE CCmisApi: Configured syslog teamd rate-limits: interval=10,        burst=100
Jan  3 10:08:46.889775 str3-xx-02 INFO teamd#rsyslogd: [origin software="rsyslogd" swVersion="8.2302.0" x-pid="3414" x-info="https://www.rsyslog.com"] start
Jan  3 10:08:46.889775 str3-xx-02 INFO teamd#supervisord 2024-01-03 10:08:46,873 INFO waiting for rsyslogd to stop
Jan  3 10:08:46.889775 str3-xx-02 INFO teamd#supervisord 2024-01-03 10:08:46,881 INFO stopped: rsyslogd (exit status 0)
Jan  3 10:08:46.889775 str3-xx-02 INFO teamd#supervisord 2024-01-03 10:08:46,883 INFO spawned: 'rsyslogd' with pid 3414
Jan  3 10:08:47.885389 str3-xx-02 INFO teamd#supervisord 2024-01-03 10:08:47,885 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

Additional information you deem important:

**Output of `show version`:**

    admin@str3-xx-02:~$ show ver
    SONiC Software Version: SONiC.20230531.13
    SONiC OS Version: 11
    Distribution: Debian 11.8
    Kernel: 5.10.0-23-2-amd64
    Build commit: c30894f156
    Build date: Tue Dec 26 09:33:21 UTC 2023
    Built by: cloudtest@d5b2c2fcc000000
Junchao-Mellanox commented 10 months ago

Hi, I don't have a system that runs on the Cisco platform. There is no acms container running on the Nvidia platform. And the syncd container for different vendors might have a different docker build file. Could you please check:

  1. Is containercfgd running in the syncd/acms container?
  2. Is containercfgd installed in those containers?
wen587 commented 10 months ago

Hi @Junchao-Mellanox, I checked on a 2700 platform. We do have an acms container there, but no containercfgd is running in it.

admin@str-msn2700-01:~$ docker ps | grep acms
3b832b238f3b   docker-acms:latest                "/usr/local/bin/supe…"   25 hours ago     Up About an hour             acms
admin@str-msn2700-01:~$ docker exec -it acms bash
root@str-msn2700-01:/# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.3  30512 26116 pts/0    Ss+  05:31   0:02 /usr/bin/python3 /usr/local/bin/supervisord
root           7  0.0  0.3 124864 27296 pts/0    Sl   05:31   0:01 python3 /usr/bin/supervisor-proc-exit-listener --container-name acms
root           8  0.0  0.2  37296 21840 pts/0    S    05:31   0:00 python3 /usr/bin/start.py
root           9  0.0  0.3  40556 24384 pts/0    S    05:31   0:00 python3 /usr/bin/CA_cert_downloader.py
root          10  0.0  0.1  13852  9872 pts/0    S    05:31   0:00 python3 /usr/bin/cert_converter.py
root          14  0.0  0.0 222184  4032 pts/0    Sl   05:31   0:00 /usr/sbin/rsyslogd -n
root         609  0.0  0.0   4160  3288 pts/1    Ss   07:01   0:00 bash
root         616  0.0  0.0   6756  2848 pts/1    R+   07:01   0:00 ps -aux
root@str-msn2700-01:/#
admin@str-msn2700-01:~$ docker exec -it acms bash
root@str-msn2700-01:/# ls /usr/local/bin/containercfgd
/usr/local/bin/containercfgd
wen587 commented 10 months ago

By comparison, I can see containercfgd running in eventd:


admin@str-msn2700-01:~$ docker exec -it eventd bash
root@str-msn2700-01:/# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.1  0.3  30520 26272 pts/0    Ss+  06:52   0:00 /usr/bin/python3 /usr/local/bin/supervisord
root           9  0.0  0.3 124840 27548 pts/0    Sl   06:52   0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name eventd
root          12  0.0  0.0 222184  6132 pts/0    Sl   06:52   0:00 /usr/sbin/rsyslogd -n -iNONE
root          17  0.0  0.2  40924 24048 pts/0    S    06:52   0:00 python3 /usr/local/bin/containercfgd
root          19  0.1  0.2 559180 16148 pts/0    Sl   06:52   0:01 /usr/bin/eventd
root         100  0.3  0.0   4160  3344 pts/1    Ss   07:04   0:00 bash
root         106  0.0  0.0   6756  2948 pts/1    R+   07:04   0:00 ps -aux
root@str-msn2700-01:/#
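
The manual comparison of the two `ps -aux` listings can be scripted. The sketch below parses captured `ps -aux` text and reports whether a process such as containercfgd appears in it. `running_commands` and `is_running` are hypothetical helper names, and the code assumes the standard 11-column `ps -aux` layout shown in the listings.

```python
def running_commands(ps_output):
    """Return the COMMAND column of each process line in `ps -aux` output."""
    lines = ps_output.strip().splitlines()
    commands = []
    for line in lines[1:]:  # skip the header row
        # The first 10 columns are single tokens; COMMAND is the rest of the line.
        fields = line.split(None, 10)
        if len(fields) == 11:
            commands.append(fields[10])
    return commands


def is_running(ps_output, name):
    """Return True if any process command line mentions `name`."""
    return any(name in cmd for cmd in running_commands(ps_output))
```

Fed the eventd listing, `is_running(text, "containercfgd")` is true; fed the acms listing, it is false, even though the binary exists at `/usr/local/bin/containercfgd`.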
Junchao-Mellanox commented 10 months ago

Thanks. But I don't see the acms container on my side. Could you please point me to the docker folder in sonic-buildimage? I don't see it in https://github.com/sonic-net/sonic-buildimage/tree/master/dockers. Maybe it is a private container on your side?

wen587 commented 10 months ago

Hi Junchao, I found that the acms container is for internal use only. That's why you cannot see it.

> And the syncd container for different vendors might have a different docker build file. Could you please check:

Do you mean syncd is built differently by each vendor from the same source code? If so, maybe we should skip the syncd syslog rate-limit test.

Junchao-Mellanox commented 10 months ago

Thanks for the confirmation. For syncd, do you see the issue on the Nvidia/Mellanox platform? We don't see it in our local regression. To my understanding, each vendor should maintain the FEATURE table for their platforms. If a feature does not support syslog rate limiting, they should set FEATURE.support_syslog_rate_limit to false. For example, if Cisco does not support syslog rate limiting for syncd, they should have the following in the FEATURE table:

{
    "FEATURE": {
        "syncd": {
            "support_syslog_rate_limit": "false"
        }
    }
}

The test case will ignore such services.
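
As an illustration of that filtering, here is a minimal sketch that picks out the services a test would skip. It assumes the configuration is available as a plain dict shaped like the JSON fragment in this comment. `services_to_skip` is a hypothetical name, not the function sonic-mgmt actually uses, and only an explicit `"false"` is treated as an opt-out because the thread does not say how a missing field is handled.

```python
def services_to_skip(config_db):
    """Return services whose FEATURE entry explicitly opts out of rate limiting.

    config_db is assumed to be a dict with a top-level "FEATURE" table.
    Services without the field are not listed; how they should be treated
    is left to the caller.
    """
    feature = config_db.get("FEATURE", {})
    return sorted(
        name
        for name, attrs in feature.items()
        if str(attrs.get("support_syslog_rate_limit", "")).lower() == "false"
    )
```

A test could then draw its random container from the remaining services only.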

wen587 commented 10 months ago

I don't see the issue on the Nvidia/Mellanox platform. Thanks. I will close this issue and add a check for other platforms.

wen587 commented 10 months ago
admin@str3-msn4700-01:~$ show ver

SONiC Software Version: SONiC.20230531.14
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 25f341a9dc
Build date: Sat Jan  6 18:28:53 UTC 2024
Built by: cloudtest@107a37f6c000000

Platform: x86_64-mlnx_msn4700-r0
HwSKU: Mellanox-SN4700-O8C48
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2102X08020
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 10:59:19 up  2:24,  2 users,  load average: 3.31, 3.03, 2.82
Date: Tue 09 Jan 2024 10:59:19
...
admin@str3-msn4700-01:~$ show syslog rate-limit-c
SERVICE       INTERVAL    BURST
------------  ----------  -------
acms          300         20000
bgp           300         20000
database      300         20000
dhcp_relay    300         20000
eventd        300         20000
gnmi          300         20000
lldp          300         20000
macsec        300         20000
mux           300         20000
pmon          300         20000
radv          300         20000
restapi       300         20000
snmp          300         20000
swss          300         20000
syncd         300         20000
teamd         300         20000
telemetry     300         20000
vnet-monitor  300         20000
admin@str3-msn4700-01:~$ docker exec -i eventd bash -c 'pidof rsyslogd'
129
admin@str3-msn4700-01:~$ sudo config syslog rate-limit-container eventd -b 100 -i 10
admin@str3-msn4700-01:~$ docker exec -i eventd bash -c 'pidof rsyslogd'
129
admin@str3-msn4700-01:~$ docker exec -i restapi bash -c 'pidof rsyslogd'
243
admin@str3-msn4700-01:~$ sudo config syslog rate-limit-container restapi -b 100 -i 10
admin@str3-msn4700-01:~$ docker exec -i restapi bash -c 'pidof rsyslogd'
243
admin@str3-msn4700-01:~$ 
admin@str3-msn4700-01:~$ docker exec -i syncd bash -c 'pidof rsyslogd'
467
admin@str3-msn4700-01:~$ sudo config syslog rate-limit-container syncd -b 100 -i 10
admin@str3-msn4700-01:~$ docker exec -i syncd bash -c 'pidof rsyslogd'
467
admin@str3-msn4700-01:~$ show syslog rate-limit-c
SERVICE       INTERVAL    BURST
------------  ----------  -------
acms          300         20000
bgp           300         20000
database      300         20000
dhcp_relay    300         20000
eventd        10          100
gnmi          300         20000
lldp          300         20000
macsec        300         20000
mux           300         20000
pmon          300         20000
radv          300         20000
restapi       10          100
snmp          300         20000
swss          300         20000
syncd         10          100
teamd         300         20000
telemetry     300         20000
vnet-monitor  300         20000

Hi @Junchao-Mellanox, I found one issue on Mellanox and also on other platforms: configuring the rate limiter on any container does not restart its rsyslogd, and no error is reported. The issue persists after loading the minigraph. Do you have any idea?

Junchao-Mellanox commented 10 months ago

What is the output of `config syslog rate-limit-feature --help`? If the subcommand `rate-limit-feature` exists, please make sure your sonic-mgmt contains this PR: https://github.com/sonic-net/sonic-mgmt/pull/10986
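
One way to script this check is to run the subcommand with `--help` and look at the exit status. This is a sketch under the assumption that the `config` CLI exits non-zero for an unknown subcommand. `subcommand_exists` is a hypothetical helper, not part of sonic-mgmt.

```python
import subprocess


def subcommand_exists(argv, run=subprocess.run):
    """Return True if running `argv --help` exits with status 0.

    Assumption: the CLI returns a non-zero status when the subcommand
    is unknown. `run` is injectable so the check can be exercised
    without a real device.
    """
    result = run(
        list(argv) + ["--help"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0
```

On the device this would be called as `subcommand_exists(["config", "syslog", "rate-limit-feature"])`.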

wen587 commented 10 months ago

I saw the issue happening widely in 20230531.14, which doesn't include your sonic-mgmt PR. I will keep this issue open and check whether it still occurs in our internal tests after that PR is merged to 202305.

Junchao-Mellanox commented 10 months ago

Thanks.

There was a recent change related to syslog rate limiting, and the feature is disabled by default in that change. So we need to explicitly enable it in sonic-mgmt before running the test.

wen587 commented 10 months ago

Closing this now that the nightly test case passes.