sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
728 stars 1.39k forks source link

[interfaces-config][ZTP]When ZTP detects config file and exits, sometimes it leaves netdev interfaces in down state #16651

Open dgsudharsan opened 1 year ago

dgsudharsan commented 1 year ago

Description

When ZTP detects config file, it exits. However sometimes interfaces-config.service in the middle of execution and it leaves the interfaces down

Sep 20 16:54:59.725443 c-panther-01 WARNING systemd[1]: interfaces-config.service: State 'final-sigterm' timed out. Killing.
Sep 20 16:54:59.726565 c-panther-01 NOTICE systemd[1]: interfaces-config.service: Killing process 21253 (python3) with signal SIGKILL.
Sep 20 16:54:59.726796 c-panther-01 WARNING systemd[1]: interfaces-config.service: Failed with result 'signal'.
Sep 20 16:54:59.726923 c-panther-01 INFO systemd[1]: interfaces-config.service: Unit process 21253 (python3) remains running after unit stopped.
Sep 20 16:54:59.727051 c-panther-01 INFO systemd[1]: Stopped Update interfaces configuration.
Sep 20 16:54:59.727166 c-panther-01 WARNING systemd[1]: interfaces-config.service: Found left-over process 21253 (python3) in control group while starting unit. Ignoring.
Sep 20 16:54:59.728026 c-panther-01 WARNING systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 20 16:54:59.728158 c-panther-01 INFO systemd[1]: Starting Update interfaces configuration...
Sep 20 16:55:01.619274 c-panther-01 INFO interfaces-config.sh[28122]: net.ipv6.conf.eth0.accept_ra_defrtr = 0
Sep 20 16:55:01.619374 c-panther-01 INFO interfaces-config.sh[28122]: net.ipv6.conf.eth0.accept_ra = 0
Sep 20 16:55:01.629469 c-panther-01 INFO systemd[1]: Stopping Network initialization...
Sep 20 16:55:01.659205 c-panther-01 INFO networking[28124]: networking: Deconfiguring network interfaces
Sep 20 16:55:04.540590 c-panther-01 INFO sonic-ztp[6282]: Restarted network discovery.
Sep 20 16:55:04.540766 c-panther-01 INFO systemd[1]: Started Delays snmp container until SONiC has started.
Sep 20 16:55:04.542831 c-panther-01 INFO sonic-ztp[6282]: Configuration file '/etc/sonic/config_db.json' detected. Shutting do

From ZTP log, the restarting interfaces-config takes more than 90 seconds and the process is killed.

Sep 20 16:53:29.622270 c-panther-01 INFO sonic-ztp[6282]: Restarting network discovery.
Sep 20 16:55:04.540590 c-panther-01 INFO sonic-ztp[6282]: Restarted network discovery.

We can see the interfaces going down here

23
Sep 20 16:55:02.091920 c-panther-01 ERR dhclient[28142]: no link-local IPv6 address for eth0
Sep 20 16:55:02.092090 c-panther-01 ERR dhclient[28142]:
Sep 20 16:55:02.092172 c-panther-01 ERR dhclient[28142]: If you think you have received this message due to a bug rather
Sep 20 16:55:02.092246 c-panther-01 ERR dhclient[28142]: than a configuration issue please read the section on submitting
Sep 20 16:55:02.092317 c-panther-01 ERR dhclient[28142]: bugs on either our web page at www.isc.org or in the README file
Sep 20 16:55:02.092457 c-panther-01 ERR dhclient[28142]: before submitting a bug.  These pages explain the proper
Sep 20 16:55:02.092551 c-panther-01 ERR dhclient[28142]: process and the information we find helpful for debugging.
Sep 20 16:55:02.092642 c-panther-01 ERR dhclient[28142]:
Sep 20 16:55:02.092722 c-panther-01 ERR dhclient[28142]: exiting.
Sep 20 16:55:02.093930 c-panther-01 INFO networking[28128]: error: cmd '/sbin/dhclient -6 -r -pf /run/dhclient6.eth0.pid -lf /var/lib/dhcp/dhclient6.eth0.leases eth0 -D LL' failed: returned 1
Sep 20 16:55:02.178418 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet24 admin:0 oper:0 addr:24:8a:07:f3:d5:40 ifindex:128 master:76 type:sx_netdev
Sep 20 16:55:02.179463 c-panther-01 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet24(ok:down) to state db
Sep 20 16:55:02.179463 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet24 admin:0 oper:0 addr:24:8a:07:f3:d5:40 ifindex:128 master:76 type:sx_netdev
Sep 20 16:55:02.183178 c-panther-01 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet24(ok:down) to state db
Sep 20 16:55:02.183178 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet24 admin:0 oper:0 addr:24:8a:07:f3:d5:40 ifindex:128 master:76 type:sx_netdev
Sep 20 16:55:02.185420 c-panther-01 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet24(ok:down) to state db
Sep 20 16:55:02.188372 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:PortChannel1020 admin:1 oper:0 addr:24:8a:07:f3:d5:40 ifindex:76 master:0 type:team
Sep 20 16:55:02.250285 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet28 admin:0 oper:0 addr:24:8a:07:f3:d5:40 ifindex:129 master:76 type:sx_netdev
Sep 20 16:55:02.254183 c-panther-01 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet28(ok:down) to state db
Sep 20 16:55:02.254256 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet28 admin:0 oper:0 addr:24:8a:07:f3:d5:40 ifindex:129 master:76 type:sx_netdev
Sep 20 16:55:02.255330 c-panther-01 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet28(ok:down) to state db
Sep 20 16:55:02.255397 c-panther-01 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet28 admin:0 oper:0 addr:24:8a:07:f3:d5:40 ifindex:129 master:76 type:sx_netdev
Sep 20 16:55:02.257547 c-panther-01 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet28(ok:down) to state db

In addition reload of minigraph is done `` Sep 20 16:52:51.299532 c-panther-01 INFO python[19281]: ansible-command Invoked with _uses_shell=True _raw_params=config load_minigraph -y warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None

Ethernet24: flags=4098<BROADCAST,MULTICAST> mtu 9100 ether 24:8a:07:f3:d5:40 txqueuelen 1000 (Ethernet) RX packets 5 bytes 627 (627.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 18 bytes 2632 (2.5 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Ethernet28: flags=4098<BROADCAST,MULTICAST> mtu 9100 ether 24:8a:07:f3:d5:40 txqueuelen 1000 (Ethernet) RX packets 22 bytes 1759 (1.7 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 31 bytes 3922 (3.8 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Ethernet32: flags=4098<BROADCAST,MULTICAST> mtu 9100 ether 24:8a:07:f3:d5:40 txqueuelen 1000 (Ethernet) RX packets 4 bytes 517 (517.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 23 bytes 2693 (2.6 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Ethernet36: flags=4098<BROADCAST,MULTICAST> mtu 9100 ether 24:8a:07:f3:d5:40 txqueuelen 1000 (Ethernet) RX packets 19 bytes 1509 (1.4 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 20 bytes 2893 (2.8 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

#### Steps to reproduce the issue:
1.  Perform config reload minigraph with ZTP enabled
2. Sometimes the netdev interfaces are down due to interfaces-service.config getting killed which results in PortChannels down
3.

#### Describe the results you received:
Netdev interfaces are down. PortChannels are down

#### Describe the results you expected:
Netdev interfaces should not go down.

#### Output of `show version`:

SONiC Software Version: SONiC.202205_3_rc.5-0cd56cda1_Internal SONiC OS Version: 11 Distribution: Debian 11.7 Kernel: 5.10.0-18-2-amd64 Build commit: 0cd56cda1 Build date: Mon Sep 4 17:23:26 UTC 2023 Built by: sw-r2d2-bot@r-build-sonic-ci02-244

Platform: x86_64-mlnx_msn2700-r0 HwSKU: Mellanox-SN2700-D48C8 ASIC: mellanox ASIC Count: 1 Serial Number: MT1706K08716 Model Number: MSN2700-CS2FO Hardware Revision: A2 Uptime: 08:54:25 up 13:13, 2 users, load average: 1.04, 0.71, 0.84 Date: Thu 21 Sep 2023 08:54:25

Docker images: REPOSITORY TAG IMAGE ID SIZE docker-macsec latest b949af156ace 332MB docker-dhcp-relay latest e350fd90a2d9 321MB docker-syncd-mlnx 202205_3_rc.5-0cd56cda1_Internal 793bf19c90f2 905MB docker-syncd-mlnx latest 793bf19c90f2 905MB docker-sonic-telemetry 202205_3_rc.5-0cd56cda1_Internal 6ae9dc6fa870 394MB docker-sonic-telemetry latest 6ae9dc6fa870 394MB docker-teamd 202205_3_rc.5-0cd56cda1_Internal 4016c0b7065b 330MB docker-teamd latest 4016c0b7065b 330MB docker-snmp 202205_3_rc.5-0cd56cda1_Internal 4cfea9652133 364MB docker-snmp latest 4cfea9652133 364MB docker-router-advertiser 202205_3_rc.5-0cd56cda1_Internal 2591f4f97335 314MB docker-router-advertiser latest 2591f4f97335 314MB docker-platform-monitor 202205_3_rc.5-0cd56cda1_Internal 1b307a3cf3c8 750MB docker-platform-monitor latest 1b307a3cf3c8 750MB docker-orchagent 202205_3_rc.5-0cd56cda1_Internal bacc58208db7 347MB docker-orchagent latest bacc58208db7 347MB docker-mux 202205_3_rc.5-0cd56cda1_Internal ad53db10c906 362MB docker-mux latest ad53db10c906 362MB docker-lldp 202205_3_rc.5-0cd56cda1_Internal d493a34ab3a1 356MB docker-lldp latest d493a34ab3a1 356MB docker-fpm-frr 202205_3_rc.5-0cd56cda1_Internal d3c8380dfb68 359MB docker-fpm-frr latest d3c8380dfb68 359MB docker-database 202205_3_rc.5-0cd56cda1_Internal 9afae57d44af 314MB docker-database latest 9afae57d44af 314MB docker-sonic-mgmt-framework 202205_3_rc.5-0cd56cda1_Internal ce6f2b2f3843 433MB docker-sonic-mgmt-framework latest ce6f2b2f3843 433MB docker-sflow 202205_3_rc.5-0cd56cda1_Internal b186e61e3057 303MB docker-sflow latest b186e61e3057 303MB docker-nat 202205_3_rc.5-0cd56cda1_Internal 74672dd53a4b 305MB docker-nat latest 74672dd53a4b 305MB urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh 1.3.5-202205-13 83e467d24196 310MB

#### Output of `show techsupport`:

(paste your output here or download and attach the file here )



#### Additional information you deem important (e.g. issue happens only occasionally):

<!--
     Also attach debug file produced by `sudo generate_dump`
-->
[sonic_dump_c-panther-01_20230921_085409.tar.gz](https://github.com/sonic-net/sonic-buildimage/files/12694504/sonic_dump_c-panther-01_20230921_085409.tar.gz)
dgsudharsan commented 1 year ago

@rajendra-dendukuri Can you please help to investigate this issue? We have seen it twice that when ZTP exits, it leaves netdev interfaces in down state because interfaces-config was just started before ZTP exit (I assume interfaces config was killed abruptly due to ZTP exit)

dgsudharsan commented 11 months ago

@rajendra-dendukuri @adyeung Any updates on this issue?

rajendra-dendukuri commented 11 months ago

@dgsudharsan I observed that you are trying to run config minigraph command (using ansible) in parallel to ztp operation. I recommend if ZTP is not to be used to push configuration, the ztp service should be stopped. I was not able to repro the issue as I don't have a valid minigraph xml file for the switch that I have. If you have a setup where this issue is easily seen, we can have a debug call to understand it further.