sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
739 stars 1.43k forks source link

[202305][LLDP]: After config reload lldpd stopped accepting the configuration #18329

Open nazariig opened 8 months ago

nazariig commented 8 months ago

Description

The issue seems to be related to the socket file lock:

https://github.com/sonic-net/sonic-buildimage/commit/3d0a0da24f56d8d391315e94847fa4ea5c4e277e

Upgrade lldpd to 1.0.16
This fixes lldpcli hitting some error related to being unable to get a
lock on /var/run/lldpd.socket. This version is the version in Debian
Bookworm, even though lldpd is in the lldp container, and that is on
Bullseye. This is because there is a change that went into 1.0.12 that
uses a separate lock file for the lldpd socket instead of locking the
socket file itself. This appears to cause problems in an unprivileged
docker container for unknown reasons (privileged docker container is
fine). Bullseye is on 1.0.11, which isn't new enough to have this
change.

I can't see any specific system capability that might address this.
Rather than debugging this further, just upgrade to the Bookworm
version.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Although the ports do exist in the kernel, neither management interface nor data ports can't be configured until the config reload or switch reboot. This could be something related to socket existence check which is implemented in waitfor_lldp_ready.sh.

LLDP supervisor config:

root@r-panther-42:/# cat /etc/supervisor/conf.d/supervisord.conf
[supervisord]
logfile_maxbytes=1MB
logfile_backups=2
nodaemon=true

[eventlistener:dependent-startup]
command=python3 -m supervisord_dependent_startup
autostart=true
autorestart=unexpected
startretries=0
exitcodes=0,3
events=PROCESS_STATE
buffer_size=1024

[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name lldp
events=PROCESS_STATE_EXITED,PROCESS_STATE_RUNNING
autostart=true
autorestart=unexpected
buffer_size=1024

[program:rsyslogd]
command=/usr/sbin/rsyslogd -n -iNONE
priority=1
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true

[program:start]
command=/usr/bin/start.sh
priority=2
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=rsyslogd:running

[program:lldpd]
# https://github.com/vincentbernat/lldpd/commit/9856f2792c301116cc4a3fcfba91b9672ee5db1f
# - `-d` means to stay in foreground, log to syslog
# - `-dd` means to stay in foreground, log warnings to console
# - `-ddd` means to stay in foreground, log warnings and info to console
# - `-dddd` means to stay in foreground, log all to console
command=/usr/sbin/lldpd -d -I Ethernet*,eth0 -C eth0
priority=3
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=start:exited

[program:waitfor_lldp_ready]
command=/usr/bin/waitfor_lldp_ready.sh
priority=3
autostart=false
autorestart=false
startsecs=0
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=lldpd:running

[program:lldp-syncd]
command=/usr/bin/env python3 -m lldp_syncd
priority=4
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=waitfor_lldp_ready:exited

[program:lldpmgrd]
command=/usr/bin/lldpmgrd
priority=5
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=lldp-syncd:running

LLDP wait for script:

​root@r-panther-42:/# cat /usr/bin/waitfor_lldp_ready.sh
#!/bin/bash

# Current lldpd version has a bug.
# When lldpd starts it is in the pause state by default
# But then it execute 'lldpcli resume' to configure and unpause itself.
# When lldpd execute lldpcli, it doesn't check the return code
# Sometimes lldpcli returns failure, but lldpd doesn't catch it
# and keeps working paused and unconfigured
#
# The fix below addresses the issue.
#

# wait until lldpd started
until [[ -e /var/run/lldpd.socket ]];
do
    sleep 1;
done

# Manually try to resume lldpd, until it's successful
while /bin/true;
do
    lldpcli -u /var/run/lldpd.socket -c /etc/lldpd.conf -c /etc/lldpd.d resume > /dev/null && break
    sleep 1
done

LLDP config:

admin@r-panther-42:~$ docker exec -ti lldp bash
root@r-panther-42:/# cat /etc/lldpd.conf | head -1
configure ports eth0 lldp portidsubtype local eth0

Log

SWSS/SYNCD service is started:

Feb 26 19:43:21.394043 r-panther-42 NOTICE root: Starting swss service...
Feb 26 19:43:22.382202 r-panther-42 INFO swss.sh[23626]: Starting existing swss container with HWSKU Mellanox-SN2700-A1
Feb 26 19:43:25.166204 r-panther-42 NOTICE root: Started swss service...
Feb 26 19:43:25.216332 r-panther-42 INFO systemd[1]: Started switch state service.
Feb 26 19:43:25.222574 r-panther-42 INFO systemd[1]: Starting syncd service...
Feb 26 19:43:25.230298 r-panther-42 NOTICE root: Starting syncd service...
Feb 26 19:43:25.711354 r-panther-42 INFO syncd.sh[24389]: Starting MST (Mellanox Software Tools) driver set
Feb 26 19:43:26.957013 r-panther-42 INFO systemd[1]: Started TEAMD container.
Feb 26 19:43:26.967284 r-panther-42 INFO systemd[1]: Stopping Network initialization...
Feb 26 19:43:27.657556 r-panther-42 INFO systemd[1]: Stopped Network initialization.
Feb 26 19:43:27.668668 r-panther-42 INFO systemd[1]: Starting Network initialization...
Feb 26 19:43:32.960313 r-panther-42 INFO syncd.sh[26234]: Starting existing syncd container with HWSKU Mellanox-SN2700-A1
Feb 26 19:43:33.176340 r-panther-42 NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent ---
Feb 26 19:43:33.649144 r-panther-42 NOTICE root: Started syncd service...
Feb 26 19:43:33.666094 r-panther-42 INFO systemd[1]: Started syncd service.

Port init is done:

Feb 26 19:43:41.737808 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet0
Feb 26 19:43:41.757179 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet4
Feb 26 19:43:41.780421 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet8
Feb 26 19:43:41.810538 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet12
Feb 26 19:43:41.833031 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet16
Feb 26 19:43:41.857917 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet20
Feb 26 19:43:41.877781 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet24
Feb 26 19:43:41.909191 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet28
Feb 26 19:43:41.929525 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet32
Feb 26 19:43:41.953059 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet36
Feb 26 19:43:41.983255 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet40
Feb 26 19:43:42.009888 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet44
Feb 26 19:43:42.028287 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet48
Feb 26 19:43:42.060420 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet52
Feb 26 19:43:42.083768 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet56
Feb 26 19:43:42.112423 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet60
Feb 26 19:43:42.134683 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet64
Feb 26 19:43:42.172179 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet68
Feb 26 19:43:42.195500 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet72
Feb 26 19:43:42.228765 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet76
Feb 26 19:43:42.252150 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet80
Feb 26 19:43:42.286598 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet84
Feb 26 19:43:42.334951 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet88
Feb 26 19:43:42.399563 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet92
Feb 26 19:43:42.475412 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet96
Feb 26 19:43:42.596003 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet100
Feb 26 19:43:42.660309 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet104
Feb 26 19:43:42.698390 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet108
Feb 26 19:43:42.744415 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet112
Feb 26 19:43:42.809595 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet116
Feb 26 19:43:42.852316 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet120
Feb 26 19:43:42.897432 r-panther-42 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet124

LLDP service is started:

Feb 26 19:43:45.506069 r-panther-42 INFO systemd[1]: Starting LLDP container...
Feb 26 19:43:45.515324 r-panther-42 NOTICE admin: Starting lldp service...
Feb 26 19:43:45.627023 r-panther-42 INFO lldp.sh[28648]: Starting existing lldp container with HWSKU Mellanox-SN2700-A1
Feb 26 19:43:46.220388 r-panther-42 NOTICE admin: Started lldp service...
Feb 26 19:43:46.226205 r-panther-42 INFO systemd[1]: Started LLDP container.

Feb 26 19:43:53.826357 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:53,822 INFO spawned: 'lldpd' with pid 23
Feb 26 19:43:53.894200 r-panther-42 INFO lldp#lldpd[25]: protocol LLDP enabled
Feb 26 19:43:53.894339 r-panther-42 INFO lldp#lldpd[25]: protocol CDPv1 disabled
Feb 26 19:43:53.894412 r-panther-42 INFO lldp#lldpd[25]: protocol CDPv2 disabled
Feb 26 19:43:53.894477 r-panther-42 INFO lldp#lldpd[25]: protocol SONMP disabled
Feb 26 19:43:53.894538 r-panther-42 INFO lldp#lldpd[25]: protocol EDP disabled
Feb 26 19:43:53.894602 r-panther-42 INFO lldp#lldpd[25]: protocol FDP disabled
Feb 26 19:43:53.894660 r-panther-42 INFO lldp#lldpd[25]: libevent 2.1.12-stable initialized with epoll method
Feb 26 19:43:53.900523 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/main] protocol LLDP enabled
Feb 26 19:43:53.900664 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/main] protocol CDPv1 disabled
Feb 26 19:43:53.900738 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/main] protocol CDPv2 disabled
Feb 26 19:43:53.900804 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/main] protocol SONMP disabled
Feb 26 19:43:53.900865 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/main] protocol EDP disabled
Feb 26 19:43:53.900952 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/main] protocol FDP disabled
Feb 26 19:43:53.901028 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/event] libevent 2.1.12-stable initialized with epoll method

LLDPD failed to find eth0 on init:

Feb 26 19:43:53.940991 r-panther-42 WARNING lldp#lldpcli[24]: cannot find port eth0
Feb 26 19:43:53.941054 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [WARN/lldpctl] cannot find port eth0
Feb 26 19:43:53.941082 r-panther-42 INFO lldp#lldpcli[24]: an error occurred while executing last command
Feb 26 19:43:53.941108 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/lldpctl] an error occurred while executing last command
Feb 26 19:43:53.958406 r-panther-42 INFO lldp#lldpcli[24]: management pattaren set to new value 10.210.25.56
Feb 26 19:43:53.958540 r-panther-42 INFO lldp#lldpcli[24]: system name set to new value r-panther-42
Feb 26 19:43:53.958614 r-panther-42 INFO lldp#lldpcli[24]: description set to new value SONiC Software Version: SONiC.202305_2311_POC.3-c121fc9fa_Internal - HwSku: Mellanox-SN2700-A1 - Distribution: Debian 11.9 - Kernel: 5.10.0-23-2-amd64
Feb 26 19:43:53.958680 r-panther-42 INFO lldp#lldpcli[24]: lldpd should resume operations
Feb 26 19:43:53.958744 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/lldpctl] management pattaren set to new value 10.210.25.56
Feb 26 19:43:53.958809 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/lldpctl] system name set to new value r-panther-42
Feb 26 19:43:53.958873 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/lldpctl] description set to new value SONiC Software Version: SONiC.202305_2311_POC.3-c121fc9fa_Internal - HwSku: Mellanox-SN2700-A1 - Distribution: Debian 11.9 - Kernel: 5.10.0-23-2-amd64
Feb 26 19:43:53.958936 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/lldpctl] lldpd should resume operations
Feb 26 19:43:54.956140 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:54,955 INFO success: lldpd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

LLDPD failed to configure the rest of the ports:

Feb 26 19:43:56.003443 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:55,992 INFO spawned: 'waitfor_lldp_ready' with pid 26
Feb 26 19:43:56.005959 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:56,003 INFO success: waitfor_lldp_ready entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Feb 26 19:43:56.068109 r-panther-42 INFO lldp#supervisord: waitfor_lldp_ready 2024-02-26T17:43:56 [WARN/lldpctl] cannot find port eth0
Feb 26 19:43:56.069148 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:56,068 INFO exited: waitfor_lldp_ready (exit status 0; expected)
Feb 26 19:43:56.123111 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:56,114 INFO spawned: 'lldp-syncd' with pid 29
Feb 26 19:43:56.305959 r-panther-42 INFO lldp#lldp-syncd [lldp_syncd] INFO: Starting SONiC LLDP sync daemon...
Feb 26 19:43:57.174441 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:57,174 INFO success: lldp-syncd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Feb 26 19:43:58.267943 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:58,252 INFO spawned: 'lldpmgrd' with pid 33
Feb 26 19:43:58.665275 r-panther-42 INFO lldp#lldpmgrd[33]: Starting up...
Feb 26 19:43:58.832474 r-panther-42 INFO lldp#lldpmgrd[33]: Command failed '['lldpcli', 'configure', 'ports', 'Ethernet120', 'lldp', 'portidsubtype', 'local', 'etp31', 'description', 'ARISTA15T0:Ethernet1']': 2024-02-26T17:43:58 [WARN/lldpctl] cannot find port Ethernet120#012 - cmd failed 1 times, retrying again
Feb 26 19:44:51.374898 r-panther-42 ERR lldp#lldpmgrd[33]: Command failed '['lldpcli', 'configure', 'ports', 'Ethernet120', 'lldp', 'portidsubtype', 'local', 'etp31',
'description', 'ARISTA15T0:Ethernet1']': 2024-02-26T17:44:51 [WARN/lldpctl] cannot find port Ethernet120#012 - command was failed 6 times, disabling retry

Steps to reproduce the issue:

So far the issue was seen only once, no steps to reproduce or probability is too low

Describe the results you received:

Feb 26 19:43:53.940991 r-panther-42 WARNING lldp#lldpcli[24]: cannot find port eth0
Feb 26 19:43:53.941054 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [WARN/lldpctl] cannot find port eth0
Feb 26 19:43:53.941082 r-panther-42 INFO lldp#lldpcli[24]: an error occurred while executing last command
Feb 26 19:43:53.941108 r-panther-42 INFO lldp#supervisord: lldpd 2024-02-26T17:43:53 [INFO/lldpctl] an error occurred while executing last command

Feb 26 19:43:58.832474 r-panther-42 INFO lldp#lldpmgrd[33]: Command failed '['lldpcli', 'configure', 'ports', 'Ethernet120', 'lldp', 'portidsubtype', 'local', 'etp31', 'description', 'ARISTA15T0:Ethernet1']': 2024-02-26T17:43:58 [WARN/lldpctl] cannot find port Ethernet120#012 - cmd failed 1 times, retrying again
Feb 26 19:44:51.374898 r-panther-42 ERR lldp#lldpmgrd[33]: Command failed '['lldpcli', 'configure', 'ports', 'Ethernet120', 'lldp', 'portidsubtype', 'local', 'etp31','description', 'ARISTA15T0:Ethernet1']': 2024-02-26T17:44:51 [WARN/lldpctl] cannot find port Ethernet120#012 - command was failed 6 times, disabling retry

Describe the results you expected:

No errors are expected

Output of show version:

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

prabhataravind commented 8 months ago

@saiarcot895 please help take a look to see if this is an lldpd issue that needs to be fixed. This is seen on sonic 202305 with lldpd version much older than 1.0.16

saiarcot895 commented 7 months ago

I don't think this is related to the commit I made for Bookworm, upgrading lldp to 1.0.16. Based on these logs, the script that checks for the socket exited successfully:

Feb 26 19:43:56.003443 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:55,992 INFO spawned: 'waitfor_lldp_ready' with pid 26
Feb 26 19:43:56.005959 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:56,003 INFO success: waitfor_lldp_ready entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Feb 26 19:43:56.068109 r-panther-42 INFO lldp#supervisord: waitfor_lldp_ready 2024-02-26T17:43:56 [WARN/lldpctl] cannot find port eth0
Feb 26 19:43:56.069148 r-panther-42 INFO lldp#supervisord 2024-02-26 17:43:56,068 INFO exited: waitfor_lldp_ready (exit status 0; expected)

Instead, it looks as if either the lldp container somehow got started in a separate/private network namespace with none of the interfaces, or all of the network interfaces including the management interface eth0 were somehow not present in the container. The lldp container should be using the host's network namespace, and thus all of the interfaces outside of the container should be present within the container.

bingwang-ms commented 6 months ago

@saiarcot895 Do we know why lldp container is not running in the host's network namespace?

saiarcot895 commented 6 months ago

Not sure on that; it could be that the issue isn't related to network namespaces at all, but is something else entirely (maybe it got the index of the interface early in startup, then the interfaces went away and came back, resulting in new indices, and lldpd is still trying to get the interface with those old indices? This feels unlikely to me, though). Without a live repro though, this will be a bit difficult to debug further.