Closed wen587 closed 10 months ago
Hi, I don't have a system that runs on a Cisco platform. There is no acms container running on the Nvidia platform, and the syncd container may have a different Docker build file for each vendor. Could you please check:
Hi @Junchao-Mellanox, I checked one on a 2700 platform. It does have an acms container, but containercfgd is not running in acms.
admin@str-msn2700-01:~$ docker ps | grep acms
3b832b238f3b docker-acms:latest "/usr/local/bin/supe…" 25 hours ago Up About an hour acms
admin@str-msn2700-01:~$ docker exec -it acms bash
root@str-msn2700-01:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 30512 26116 pts/0 Ss+ 05:31 0:02 /usr/bin/python3 /usr/local/bin/supervisord
root 7 0.0 0.3 124864 27296 pts/0 Sl 05:31 0:01 python3 /usr/bin/supervisor-proc-exit-listener --container-name acms
root 8 0.0 0.2 37296 21840 pts/0 S 05:31 0:00 python3 /usr/bin/start.py
root 9 0.0 0.3 40556 24384 pts/0 S 05:31 0:00 python3 /usr/bin/CA_cert_downloader.py
root 10 0.0 0.1 13852 9872 pts/0 S 05:31 0:00 python3 /usr/bin/cert_converter.py
root 14 0.0 0.0 222184 4032 pts/0 Sl 05:31 0:00 /usr/sbin/rsyslogd -n
root 609 0.0 0.0 4160 3288 pts/1 Ss 07:01 0:00 bash
root 616 0.0 0.0 6756 2848 pts/1 R+ 07:01 0:00 ps -aux
root@str-msn2700-01:/#
admin@str-msn2700-01:~$ docker exec -it acms bash
root@str-msn2700-01:/# ls /usr/local/bin/containercfgd
/usr/local/bin/containercfgd
By comparison, I can see containercfgd running in eventd:
admin@str-msn2700-01:~$ docker exec -it eventd bash
root@str-msn2700-01:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 0.3 30520 26272 pts/0 Ss+ 06:52 0:00 /usr/bin/python3 /usr/local/bin/supervisord
root 9 0.0 0.3 124840 27548 pts/0 Sl 06:52 0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name eventd
root 12 0.0 0.0 222184 6132 pts/0 Sl 06:52 0:00 /usr/sbin/rsyslogd -n -iNONE
root 17 0.0 0.2 40924 24048 pts/0 S 06:52 0:00 python3 /usr/local/bin/containercfgd
root 19 0.1 0.2 559180 16148 pts/0 Sl 06:52 0:01 /usr/bin/eventd
root 100 0.3 0.0 4160 3344 pts/1 Ss 07:04 0:00 bash
root 106 0.0 0.0 6756 2948 pts/1 R+ 07:04 0:00 ps -aux
root@str-msn2700-01:/#
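The manual comparison above could be scripted. This is a minimal sketch, assuming `ps -aux` output keeps the COMMAND column last as in the listings above; `has_process` is a hypothetical helper for illustration, not something that exists in sonic-mgmt.

```python
def has_process(ps_output: str, needle: str) -> bool:
    """Return True if any COMMAND column in `ps -aux` output mentions `needle`."""
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the USER/PID/... header row
        # ps -aux prints 10 columns before COMMAND, so split at most 10 times
        fields = line.split(None, 10)
        if len(fields) == 11 and needle in fields[10]:
            return True
    return False


# Abbreviated samples copied from the listings above.
ACMS_PS = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.3 30512 26116 pts/0 Ss+ 05:31 0:02 /usr/bin/python3 /usr/local/bin/supervisord
root 14 0.0 0.0 222184 4032 pts/0 Sl 05:31 0:00 /usr/sbin/rsyslogd -n
"""
EVENTD_PS = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 12 0.0 0.0 222184 6132 pts/0 Sl 06:52 0:00 /usr/sbin/rsyslogd -n -iNONE
root 17 0.0 0.2 40924 24048 pts/0 S 06:52 0:00 python3 /usr/local/bin/containercfgd
"""

print(has_process(ACMS_PS, "containercfgd"))
print(has_process(EVENTD_PS, "containercfgd"))
```

Note that the binary being present on disk (as `ls /usr/local/bin/containercfgd` shows for acms) is not enough; the check has to look at running processes.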
Thanks. But I don't see an acms container on my side. Could you please point me to the docker folder in sonic-buildimage? I don't see it in https://github.com/sonic-net/sonic-buildimage/tree/master/dockers Maybe it is a private container on your side?
Hi Junchao, I found that the acms container is for internal use only. That's why you cannot see it.
And syncd container for different vendor might have different docker build file. Could you please check:
Do you mean syncd is built differently by each vendor from the same source code? If so, maybe we should bypass the syncd syslog rate-limit test.
Thanks for the confirmation. For syncd, do you see the issue on the Nvidia/Mellanox platform? We don't see it in our local regression. To my understanding, each vendor should maintain the FEATURE table for their platforms. If a feature does not support syslog rate limit, they should set FEATURE.support_syslog_rate_limit to false. For example, if Cisco does not support syslog rate limit for syncd, they should have the following in the FEATURE table:
{
"FEATURE": {
"syncd": {
"support_syslog_rate_limit": "false"
}
}
}
The test case will ignore such services.
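To illustrate the idea, here is a rough sketch of how a test could pick only the services that have not opted out. The function name `services_to_test` and the rule that a missing field counts as "supported" are assumptions made for this example; the actual selection logic in sonic-mgmt may differ.

```python
import json


def services_to_test(feature_table: dict) -> list:
    """Return the services that do not opt out of syslog rate limiting."""
    selected = []
    for name, attrs in sorted(feature_table.items()):
        # Assumption: a missing field means the service supports the feature.
        flag = attrs.get("support_syslog_rate_limit", "true")
        if flag.lower() != "false":
            selected.append(name)
    return selected


# Hypothetical config snippet in the same shape as the example above.
config = json.loads("""
{
  "FEATURE": {
    "syncd": {"support_syslog_rate_limit": "false"},
    "eventd": {"support_syslog_rate_limit": "true"},
    "teamd": {}
  }
}
""")

print(services_to_test(config["FEATURE"]))
```

With this input, syncd is skipped while eventd and teamd remain candidates.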
I don't see the issue on the Nvidia/Mellanox platform. Thanks. I will close this issue and add a check for other platforms.
admin@str3-msn4700-01:~$ show ver
SONiC Software Version: SONiC.20230531.14
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 25f341a9dc
Build date: Sat Jan 6 18:28:53 UTC 2024
Built by: cloudtest@107a37f6c000000
Platform: x86_64-mlnx_msn4700-r0
HwSKU: Mellanox-SN4700-O8C48
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2102X08020
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 10:59:19 up 2:24, 2 users, load average: 3.31, 3.03, 2.82
Date: Tue 09 Jan 2024 10:59:19
...
admin@str3-msn4700-01:~$ show syslog rate-limit-c
SERVICE INTERVAL BURST
------------ ---------- -------
acms 300 20000
bgp 300 20000
database 300 20000
dhcp_relay 300 20000
eventd 300 20000
gnmi 300 20000
lldp 300 20000
macsec 300 20000
mux 300 20000
pmon 300 20000
radv 300 20000
restapi 300 20000
snmp 300 20000
swss 300 20000
syncd 300 20000
teamd 300 20000
telemetry 300 20000
vnet-monitor 300 20000
admin@str3-msn4700-01:~$ docker exec -i eventd bash -c 'pidof rsyslogd'
129
admin@str3-msn4700-01:~$ sudo config syslog rate-limit-container eventd -b 100 -i 10
admin@str3-msn4700-01:~$ docker exec -i eventd bash -c 'pidof rsyslogd'
129
admin@str3-msn4700-01:~$ docker exec -i restapi bash -c 'pidof rsyslogd'
243
admin@str3-msn4700-01:~$ sudo config syslog rate-limit-container restapi -b 100 -i 10
admin@str3-msn4700-01:~$ docker exec -i restapi bash -c 'pidof rsyslogd'
243
admin@str3-msn4700-01:~$
admin@str3-msn4700-01:~$ docker exec -i syncd bash -c 'pidof rsyslogd'
467
admin@str3-msn4700-01:~$ sudo config syslog rate-limit-container syncd -b 100 -i 10
admin@str3-msn4700-01:~$ docker exec -i syncd bash -c 'pidof rsyslogd'
467
admin@str3-msn4700-01:~$ show syslog rate-limit-c
SERVICE INTERVAL BURST
------------ ---------- -------
acms 300 20000
bgp 300 20000
database 300 20000
dhcp_relay 300 20000
eventd 10 100
gnmi 300 20000
lldp 300 20000
macsec 300 20000
mux 300 20000
pmon 300 20000
radv 300 20000
restapi 10 100
snmp 300 20000
swss 300 20000
syncd 10 100
teamd 300 20000
telemetry 300 20000
vnet-monitor 300 20000
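The two tables above can be compared mechanically to confirm which services actually picked up the new values. This is only a sketch: `parse_rate_limits` and `changed_services` are hypothetical helpers, and the sample tables are abbreviated from the output above.

```python
def parse_rate_limits(table: str) -> dict:
    """Parse SERVICE/INTERVAL/BURST rows into {service: (interval, burst)}."""
    limits = {}
    for line in table.strip().splitlines():
        parts = line.split()
        # Keep only data rows: a service name followed by two integers.
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            limits[parts[0]] = (int(parts[1]), int(parts[2]))
    return limits


def changed_services(before: dict, after: dict) -> dict:
    """Return the services whose (interval, burst) differs between snapshots."""
    return {svc: after[svc] for svc in after if before.get(svc) != after[svc]}


BEFORE = """SERVICE INTERVAL BURST
------------ ---------- -------
eventd 300 20000
restapi 300 20000
swss 300 20000
syncd 300 20000
"""
AFTER = """SERVICE INTERVAL BURST
------------ ---------- -------
eventd 10 100
restapi 10 100
swss 300 20000
syncd 10 100
"""

print(changed_services(parse_rate_limits(BEFORE), parse_rate_limits(AFTER)))
```

In the transcript the configuration is applied for eventd, restapi and syncd, yet `pidof rsyslogd` returns the same PID before and after each change.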
Hi @Junchao-Mellanox, I found one issue on Mellanox and also on other platforms. Configuring the rate limit on any container does not trigger a restart (the rsyslogd PID stays the same), and no error is reported. The issue persists after loading minigraph. Do you have any idea?
What is the output of config syslog rate-limit-feature --help? If the subcommand rate-limit-feature exists, please make sure your sonic-mgmt contains this PR https://github.com/sonic-net/sonic-mgmt/pull/10986
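A test could run the same probe automatically. The sketch below assumes the CLI exits with a non-zero status when asked for `--help` on a subcommand that does not exist; `subcommand_exists` is a hypothetical helper, and the demo uses stub runners instead of a real switch.

```python
import subprocess
from types import SimpleNamespace


def subcommand_exists(argv, run=subprocess.run) -> bool:
    """Return True if running `argv + ['--help']` exits with status 0."""
    try:
        result = run(argv + ["--help"], capture_output=True, text=True)
    except FileNotFoundError:
        # The base executable itself is not installed.
        return False
    return result.returncode == 0


# Stub runners standing in for an image with and without the subcommand.
def fake_present(argv, **kwargs):
    return SimpleNamespace(returncode=0)


def fake_missing(argv, **kwargs):
    return SimpleNamespace(returncode=2)


cmd = ["config", "syslog", "rate-limit-feature"]
print(subcommand_exists(cmd, run=fake_present))
print(subcommand_exists(cmd, run=fake_missing))
```

On a real device the default `run=subprocess.run` would be used, probably with `sudo` prepended to the command.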
I saw the issue happening widely in 20230531.14, which doesn't include your sonic-mgmt PR. I will keep this issue open and check whether it still occurs in our internal tests after that PR is merged to 202305.
Thanks.
There was a recent change related to syslog rate limit. The feature is disabled by default in that change, so we need to explicitly enable it in sonic-mgmt before running the test.
Close it after the nightly test case passes.
Description The test fails when it randomly picks the acms or syncd container to test the rate limiter. Related code: https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-containercfgd/containercfgd/containercfgd.py#L158 The root cause is that no containercfgd is running in these two containers to perform the restart when config syslog rate-limit-container is run on them.
Steps to reproduce the issue:
acms or syncd
From syslog, it does send the command to update the rate limits. But there is no rsyslog restart.
Describe the results you received: The acms or syncd container didn't restart, so the test fails because it waits forever for the PID to refresh.
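One way to keep the test from hanging is to bound the wait. This is a minimal sketch, not the existing sonic-mgmt code: `wait_for_pid_change` is a hypothetical helper, and `get_pid` stands for any callable that returns the current rsyslogd PID (for example a wrapper around `docker exec -i <container> bash -c 'pidof rsyslogd'`).

```python
import time


def wait_for_pid_change(get_pid, old_pid, timeout=30.0, interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_pid() until it differs from old_pid.

    Returns the new PID, or None if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        pid = get_pid()
        if pid is not None and pid != old_pid:
            return pid
        if clock() >= deadline:
            return None
        sleep(interval)


def no_sleep(_seconds):
    """Sleep stub so the demo below finishes instantly."""


# Simulated container where rsyslogd restarts on the third poll.
pids = iter([467, 467, 512])
print(wait_for_pid_change(lambda: next(pids), 467, sleep=no_sleep))

# Simulated container where rsyslogd never restarts.
print(wait_for_pid_change(lambda: 467, 467, timeout=0.0, sleep=no_sleep))
```

A `None` result lets the test fail with a clear message, or skip the container, instead of blocking indefinitely.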
Describe the results you expected: Need the Mellanox team to confirm whether this behavior is expected. If acms or syncd is expected not to restart after the rate limit is configured, we should improve the test or bypass the acms and syncd rate-limit test.
Below is the pass case of teamd. The container restarts and the test passes. From syslog, it does configure the rate limit and the container restarts.
Additional information you deem important: