Closed dgsudharsan closed 2 days ago
The log from other instance
Jun 21 01:10:56.212929 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd Traceback (most recent call last):
Jun 21 01:10:56.212929 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1717, in wrapper
Jun 21 01:10:56.214568 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd return fun(self, *args, **kwargs)
Jun 21 01:10:56.214568 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_common.py", line 508, in wrapper
Jun 21 01:10:56.214958 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd raise raise_from(err, None)
Jun 21 01:10:56.214980 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "<string>", line 3, in raise_from
Jun 21 01:10:56.215272 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_common.py", line 506, in wrapper
Jun 21 01:10:56.215572 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd return fun(self)
Jun 21 01:10:56.215572 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1780, in _parse_stat_file
Jun 21 01:10:56.216045 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd data = bcat("%s/%s/stat" % (self._procfs_path, self.pid))
Jun 21 01:10:56.216045 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_common.py", line 851, in bcat
Jun 21 01:10:56.216358 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd return cat(fname, fallback=fallback, _open=open_binary)
Jun 21 01:10:56.216385 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_common.py", line 839, in cat
Jun 21 01:10:56.216837 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd with _open(fname) as f:
Jun 21 01:10:56.216837 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_common.py", line 799, in open_binary
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd FileNotFoundError: [Errno 2] No such file or directory: '/proc/537/stat'
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd During handling of the above exception, another exception occurred:
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd Traceback (most recent call last):
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/bin/dhcprelayd", line 8, in <module>
Jun 21 01:10:56.217223 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd sys.exit(main())
Jun 21 01:10:56.217305 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/dhcp_utilities/dhcprelayd/dhcprelayd.py", line 308, in main
Jun 21 01:10:56.217646 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd dhcprelayd.wait()
Jun 21 01:10:56.217646 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/dhcp_utilities/dhcprelayd/dhcprelayd.py", line 133, in wait
Jun 21 01:10:56.217951 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd self._check_dhcp_relay_processes()
Jun 21 01:10:56.217951 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/dhcp_utilities/dhcprelayd/dhcprelayd.py", line 166, in _check_dhcp_relay_processes
Jun 21 01:10:56.217951 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd running_cmds = get_target_process_cmds("dhcrelay")
Jun 21 01:10:56.217951 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/dhcp_utilities/common/utils.py", line 163, in get_target_process_cmds
Jun 21 01:10:56.218300 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd if proc.name() == process_name:
Jun 21 01:10:56.218300 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/__init__.py", line 656, in name
Jun 21 01:10:56.218750 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd name = self._proc.name()
Jun 21 01:10:56.218750 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1717, in wrapper
Jun 21 01:10:56.219165 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd return fun(self, *args, **kwargs)
Jun 21 01:10:56.219165 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1831, in name
Jun 21 01:10:56.219575 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd name = self._parse_stat_file()['name']
Jun 21 01:10:56.219592 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd File "/usr/local/lib/python3.9/dist-packages/psutil/_pslinux.py", line 1726, in wrapper
Jun 21 01:10:56.220157 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd raise NoSuchProcess(self.pid, self._name)
Jun 21 01:10:56.220368 r-leopard-70 INFO dhcp_relay#supervisord: dhcprelayd psutil.NoSuchProcess: process no longer exists (pid=537)
Jun 21 01:10:56.231538 r-leopard-70 INFO dhcp_relay#supervisord 2024-06-21 01:10:56,231 INFO exited: dhcprelayd (exit status 1; not expected)
@yxieca @yaqiangz Can you please help prioritize this bug?
Hi @dgsudharsan , could you help to provide the reproduce step?
@yaqiangz As mentioned there is no reproduce step. Traceback is seen without any external trigger
@dgsudharsan please provide more details (hwsku, topology, image commit id etc.) and try reinstall image
I tried latest build for 202311 (commit: 74b81ffcd76290dd2eeff4b66e742a705f76c165) in t0/m0, didn't hit this issue, dhcprelayd works well
admin@sonic:~$ date
Tue 09 Jul 2024 08:35:05 AM UTC
admin@sonic:~$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b3e1cf10590f docker-snmp:latest "/usr/local/bin/supe…" 22 minutes ago Up 22 minutes snmp
f217dc52e560 docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 22 minutes ago Up 22 minutes mgmt-framework
bf36a280787a docker-lldp:latest "/usr/bin/docker-lld…" 22 minutes ago Up 22 minutes lldp
cf05127f3511 docker-sonic-gnmi:latest "/usr/local/bin/supe…" 22 minutes ago Up 22 minutes gnmi
cea36c5938bc 02e737784b02 "/usr/bin/docker_ini…" 22 minutes ago Up 22 minutes dhcp_relay
ac87c236084a docker-platform-monitor:latest "/usr/bin/docker_ini…" 23 minutes ago Up 23 minutes pmon
e4e3426988c0 docker-router-advertiser:latest "/usr/bin/docker-ini…" 23 minutes ago Up 23 minutes radv
8c7ae91d5787 docker-syncd-mlnx:latest "/usr/local/bin/supe…" 23 minutes ago Up 23 minutes syncd
35b4592988f5 docker-teamd:latest "/usr/local/bin/supe…" 23 minutes ago Up 23 minutes teamd
59cfd960818f docker-fpm-frr:latest "/usr/bin/docker_ini…" 23 minutes ago Up 23 minutes bgp
3d1526b8df0c docker-orchagent:latest "/usr/bin/docker-ini…" 23 minutes ago Up 23 minutes swss
835d927e4793 docker-eventd:latest "/usr/local/bin/supe…" 23 minutes ago Up 23 minutes eventd
179af8c019a0 docker-database:latest "/usr/local/bin/dock…" 24 minutes ago Up 24 minutes database
admin@sonic:~$ ps aux | grep dhc
root 12389 0.0 0.0 11216 5960 ? Ss 08:12 0:00 /bin/bash /usr/local/bin/dhcp_relay.sh wait
root 12474 0.0 0.0 11348 5880 ? S 08:12 0:00 /bin/bash /usr/bin/dhcp_relay.sh wait
root 12475 0.0 0.5 68520 42920 ? S 08:12 0:00 python3 /usr/local/bin/container wait dhcp_relay
root 12948 0.0 0.3 124160 26724 pts/0 Sl 08:12 0:00 python3 /usr/bin/supervisor-proc-exit-listener --container-name dhcp_relay
root 14013 0.0 0.1 17892 10004 pts/0 S 08:12 0:00 /usr/sbin/dhcp6relay
root 14014 0.0 0.2 39376 23624 pts/0 S 08:12 0:00 /usr/bin/python3 /usr/local/bin/dhcprelayd
root 14015 0.0 0.0 34508 6664 pts/0 Sl 08:12 0:00 /usr/sbin/dhcrelay -d -m discard -a %h:%p %P --name-alias-map-file /tmp/port-name-alias-map.txt -id Vlan1000 -iu PortChannel101 -iu PortChannel102 -iu PortChannel103 -iu PortChannel104 192.0.0.1 192.0.0.2 192.0.0.3 192.0.0.4 192.0.0.5 192.0.0.6 192.0.0.7 192.0.0.8 192.0.0.9 192.0.0.10 192.0.0.11 192.0.0.12 192.0.0.13 192.0.0.14 192.0.0.15 192.0.0.16 192.0.0.17 192.0.0.18 192.0.0.19 192.0.0.20 192.0.0.21 192.0.0.22 192.0.0.23 192.0.0.24 192.0.0.25 192.0.0.26 192.0.0.27 192.0.0.28 192.0.0.29 192.0.0.30 192.0.0.31 192.0.0.32 192.0.0.33 192.0.0.34 192.0.0.35 192.0.0.36 192.0.0.37 192.0.0.38 192.0.0.39 192.0.0.40 192.0.0.41 192.0.0.42 192.0.0.43 192.0.0.44 192.0.0.45 192.0.0.46 192.0.0.47 192.0.0.48
root 14017 0.0 0.1 99868 9768 pts/0 Sl 08:12 0:00 /usr/bin/rsyslog_plugin -r /etc/rsyslog.d/dhcp_relay_regex.json -m sonic-events-dhcp-relay
root 14035 0.0 0.1 101380 10704 pts/0 Sl 08:12 0:00 /usr/sbin/dhcpmon -id Vlan1000 -iu PortChannel101 -iu PortChannel102 -iu PortChannel103 -iu PortChannel104 -im eth0
admin 25090 0.0 0.0 6868 648 pts/0 S+ 08:35 0:00 grep dhc
admin@sonic:~$ sudo cat /var/log/syslog | grep -a dhcprelayd
Jul 9 08:12:54.466893 sonic INFO dhcp_relay#supervisord 2024-07-09 08:12:54,466 INFO spawned: 'dhcprelayd' with pid 43
Jul 9 08:12:55.541800 sonic INFO dhcp_relay#supervisord 2024-07-09 08:12:55,540 INFO success: dhcprelayd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
admin@sonic:~$
Based on commit: https://github.com/sonic-net/sonic-buildimage/tree/156b067c875967618232c02cb51b163e5e287e45 HW SKU: Mellanox-SN2700-A1-D48C8 Topology: T0
@yaqiangz Please note that this is a very rare statistical issue. Even for us it is not seen often. As mentioned in description, it happened only twice until now.
@yaqiangz, Seems to me the issue is in here https://github.com/sonic-net/sonic-buildimage/blob/202311/src/sonic-dhcp-utilities/dhcp_utilities/common/utils.py#L162
name() is not a safe call and might throw a psutil.NoSuchProcess exception. For a daemon dhcprelayd, the exception should be handled and process should not exit.
Now, we don't know what the pid 262 is for and it's hard to figure out cause the process has ended anyway and the issue is very hard to repro.
It could be any temporary process which might be spawned by a dhcpv4 related process or maybe even by rsyslogd and rsyslog plugin.
Jul 3 06:21:02.974878 r-panther-40 INFO dhcp_relay#supervisord: dhcprelayd psutil.NoSuchProcess: process no longer exists (pid=262)
Jul 3 06:21:02.989391 r-panther-40 INFO dhcp_relay#supervisord 2024-07-03 06:21:02,988 INFO exited: dhcprelayd (exit status 1; not expected)
Based on commit: https://github.com/sonic-net/sonic-buildimage/tree/156b067c875967618232c02cb51b163e5e287e45 HW SKU: Mellanox-SN2700-A1-D48C8 Topology: T0
@yaqiangz Please note that this is a very rare statistical issue. Even for us it is not seen often. As mentioned in description, it happened only twice until now.
@yaqiangz, Seems to me the issue is in here https://github.com/sonic-net/sonic-buildimage/blob/202311/src/sonic-dhcp-utilities/dhcp_utilities/common/utils.py#L162
name() is not a safe call and might throw a psutil.NoSuchProcess exception. For a daemon dhcprelayd, the exception should be handled and process should not exit.
Now, we don't know what the pid 262 is for and it's hard to figure out cause the process has ended anyway and the issue is very hard to repro.
It could be any temporary process which might be spawned by a dhcpv4 related process or maybe even by rsyslogd and rsyslog plugin.
Jul 3 06:21:02.974878 r-panther-40 INFO dhcp_relay#supervisord: dhcprelayd psutil.NoSuchProcess: process no longer exists (pid=262) Jul 3 06:21:02.989391 r-panther-40 INFO dhcp_relay#supervisord 2024-07-03 06:21:02,988 INFO exited: dhcprelayd (exit status 1; not expected)
Got it, will check for fixing
Description
The below issue happened twice in 202311 without any specific trigger. Dhcprelayd crashes and results in below error logs
There is no logs from dhcrelay during this time and hence no information on why the process is unavailable.
Steps to reproduce the issue:
1. 2. 3.
Describe the results you received:
Describe the results you expected:
Output of
show version
:Output of
show techsupport
:Additional information you deem important (e.g. issue happens only occasionally):