sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Intermittent Monit's routeCheck Program failure after config reload on Nokia-7215 platform #17323

Closed tudupa closed 8 months ago

tudupa commented 11 months ago

Description

On the 202305 branch, on the Nokia 7215 platform, monit reports a failure because the routeCheck program (routecheck.py) ends with status failed, as shown below:

Program 'routeCheck'
  status                       Status failed
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              255
  last output                  Failure results: {{
                                   "missed_FRR_routes": [
                                       {
                                           "destSelected": true,
                                           "distance": 20,
                                           "failed": true,
                                           "installedNexthopGroupId": 384,
                                           "internalFlags": 8,
                                           "internalNextHopActiveNum": 4,
                                           "internalNextHopNum": 4,
                                           "internalStatus": 168,
                                           "metric": 0,
                                           "nexthopGroupId": 384,
                                           "nexthops": [
                                               {
                                                   "active": true,
                                                   "afi": "ipv4",
                                                   "fib": true,
                                                   "flags": 3,
                                                   "interfaceIndex": 368,
                                                   "interfaceName": "PortChannel101",
                                                   "ip": "10.0.0.57",
                                                   "weight": 1
                                               },
  data collected               Tue, 28 Nov 2023 18:29:56
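
A minimal sketch of how the routeCheck block above could be picked out of the monit status text and checked programmatically. The function names `parse_monit_program` and `route_check_failed` are hypothetical helpers, not part of monit or SONiC, and the parsing assumes the two-space field indentation seen in the output above.

```python
import re


def parse_monit_program(status_text, program="routeCheck"):
    """Return the fields printed for one Program block as a dict.

    Assumes the layout shown in this issue: an unindented header such as
    Program 'routeCheck', followed by fields indented by two spaces.
    """
    fields = {}
    in_block = False
    for line in status_text.splitlines():
        # Any unindented "<Type> '<name>'" line starts a new block.
        header = re.match(r"^(\S+) '(.+)'\s*$", line)
        if header:
            in_block = header.group(1) == "Program" and header.group(2) == program
            continue
        if not in_block:
            continue
        # Field lines: exactly two leading spaces, name, 2+ spaces, value.
        # Deeply indented continuation lines (the JSON dump) do not match.
        m = re.match(r"^ {2}(\S(?:.*?\S)?)\s{2,}(.*)$", line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields


def route_check_failed(status_text):
    """True when the routeCheck block reports 'Status failed'."""
    fields = parse_monit_program(status_text)
    return fields.get("status", "").lower() == "status failed"
```

On a device, the text passed in would be the captured output of "sudo monit status"; here it is treated as a plain string so the sketch stays independent of the platform.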

We see this issue intermittently after a config reload is performed on the platform. Debugging the above error showed that it is related to IP routes not being programmed in the kernel.

The output of "show ip route" on the platform is below. The multipath BGP routes (those with four next hops over the PortChannels) are in the queued state (q), meaning they have not yet been programmed in the kernel, although they are present in the ASIC_DB and APPL_DB of SONiC.

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

B>q0.0.0.0/0 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                 via 10.0.0.59, PortChannel103, 1d00h09m
  q                 via 10.0.0.61, PortChannel105, 1d00h09m
  q                 via 10.0.0.63, PortChannel106, 1d00h09m
C>*10.0.0.56/31 is directly connected, PortChannel101, 1d00h09m
C>*10.0.0.58/31 is directly connected, PortChannel103, 1d00h09m
C>*10.0.0.60/31 is directly connected, PortChannel105, 1d00h09m
C>*10.0.0.62/31 is directly connected, PortChannel106, 1d00h09m
C>*10.0.0.64/31 is directly connected, Ethernet46, 1d00h09m
C>*10.0.0.66/31 is directly connected, Ethernet47, 1d00h09m
C>*10.1.0.32/32 is directly connected, Loopback0, 1d00h11m
B>*100.1.0.29/32 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
B>*100.1.0.30/32 [20/0] via 10.0.0.59, PortChannel103, 1d00h09m
B>*100.1.0.31/32 [20/0] via 10.0.0.61, PortChannel105, 1d00h09m
B>*100.1.0.32/32 [20/0] via 10.0.0.63, PortChannel106, 1d00h09m
B>*100.1.0.33/32 [20/0] via 10.0.0.65, Ethernet46, 1d00h09m
B>*100.1.0.34/32 [20/0] via 10.0.0.67, Ethernet47, 1d00h09m
C>*152.148.144.0/21 is directly connected, eth0, 1d00h11m
C>*192.168.0.0/24 is directly connected, Vlan1000, 1d00h10m
B>*192.168.1.64/26 [20/0] via 10.0.0.65, Ethernet46, 1d00h09m
B>*192.168.1.128/26 [20/0] via 10.0.0.67, Ethernet47, 1d00h09m
B>q192.168.1.192/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                        via 10.0.0.59, PortChannel103, 1d00h09m
  q                        via 10.0.0.61, PortChannel105, 1d00h09m
  q                        via 10.0.0.63, PortChannel106, 1d00h09m
B>q192.168.2.0/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                      via 10.0.0.59, PortChannel103, 1d00h09m
  q                      via 10.0.0.61, PortChannel105, 1d00h09m
  q                      via 10.0.0.63, PortChannel106, 1d00h09m
B>q192.168.2.64/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                       via 10.0.0.59, PortChannel103, 1d00h09m
  q                       via 10.0.0.61, PortChannel105, 1d00h09m
  q                       via 10.0.0.63, PortChannel106, 1d00h09m
B>q192.168.2.128/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
.
.
.
.
B>q192.169.104.64/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                         via 10.0.0.59, PortChannel103, 1d00h09m
  q                         via 10.0.0.61, PortChannel105, 1d00h09m
  q                         via 10.0.0.63, PortChannel106, 1d00h09m
B>q192.169.104.128/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                          via 10.0.0.59, PortChannel103, 1d00h09m
  q                          via 10.0.0.61, PortChannel105, 1d00h09m
  q                          via 10.0.0.63, PortChannel106, 1d00h09m
B>q192.169.104.192/26 [20/0] via 10.0.0.57, PortChannel101, 1d00h09m
  q                          via 10.0.0.59, PortChannel103, 1d00h09m
  q                          via 10.0.0.61, PortChannel105, 1d00h09m
  q                          via 10.0.0.63, PortChannel106, 1d00h09m
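
To list the queued prefixes without reading the table by eye, the output above can be filtered on the flag columns described in its legend. This is a rough sketch: `queued_prefixes` is a hypothetical helper, and it assumes the three-column prefix format shown here (protocol code, selected marker, FIB/queued/rejected marker) for IPv4 routes only.

```python
import re

# Column 1: protocol code, column 2: '>' if selected, column 3: '*', 'q' or 'r'.
ROUTE_LINE = re.compile(r"^([A-Za-z])([> ])([*qr ])\s*(\d+\.\d+\.\d+\.\d+/\d+)\s")


def queued_prefixes(show_ip_route_text):
    """Return the prefixes whose first line carries the 'q' (queued) flag."""
    queued = []
    for line in show_ip_route_text.splitlines():
        m = ROUTE_LINE.match(line)
        # Indented continuation lines (extra next hops) never match,
        # so each prefix is reported once.
        if m and m.group(3) == "q":
            queued.append(m.group(4))
    return queued
```

Feeding it the text of the "show ip route" output pasted above would return only the prefixes marked `B>q`, leaving out the connected and single-next-hop BGP routes marked `*`.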

The output of "sudo ip route show" shows that none of the above queued routes are present in the kernel.

sudo ip route show
10.0.0.56/31 dev PortChannel101 proto kernel scope link src 10.0.0.56 
10.0.0.58/31 dev PortChannel103 proto kernel scope link src 10.0.0.58 
10.0.0.60/31 dev PortChannel105 proto kernel scope link src 10.0.0.60 
10.0.0.62/31 dev PortChannel106 proto kernel scope link src 10.0.0.62 
10.0.0.64/31 dev Ethernet46 proto kernel scope link src 10.0.0.64 
10.0.0.66/31 dev Ethernet47 proto kernel scope link src 10.0.0.66 
100.1.0.29 nhid 354 via 10.0.0.57 dev PortChannel101 proto bgp src 10.1.0.32 metric 20 
100.1.0.30 nhid 385 via 10.0.0.59 dev PortChannel103 proto bgp src 10.1.0.32 metric 20 
100.1.0.31 nhid 386 via 10.0.0.61 dev PortChannel105 proto bgp src 10.1.0.32 metric 20 
100.1.0.32 nhid 355 via 10.0.0.63 dev PortChannel106 proto bgp src 10.1.0.32 metric 20 
100.1.0.33 nhid 611 via 10.0.0.65 dev Ethernet46 proto bgp src 10.1.0.32 metric 20 
100.1.0.34 nhid 617 via 10.0.0.67 dev Ethernet47 proto bgp src 10.1.0.32 metric 20 
152.148.144.0/21 dev eth0 proto kernel scope link src 152.148.150.123 
192.168.0.0/24 dev Vlan1000 proto kernel scope link src 192.168.0.1 
192.168.1.64/26 nhid 611 via 10.0.0.65 dev Ethernet46 proto bgp src 10.1.0.32 metric 20 
192.168.1.128/26 nhid 617 via 10.0.0.67 dev Ethernet47 proto bgp src 10.1.0.32 metric 20 
240.127.1.0/24 dev docker0 proto kernel scope link src 240.127.1.1 linkdown
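
The comparison described above (queued routes in FRR versus routes actually in the kernel) can be sketched as a set difference. `kernel_prefixes` and `missing_from_kernel` are hypothetical helper names; the sketch assumes IPv4 output in the format shown above, where host routes are printed without a /32 suffix and the default route is printed as "default".

```python
import ipaddress


def kernel_prefixes(ip_route_text):
    """Collect destination prefixes from 'ip route show' text."""
    prefixes = set()
    for line in ip_route_text.splitlines():
        if not line or line[0].isspace():
            continue  # blank lines and indented "nexthop via ..." lines
        dest = line.split()[0]
        if dest == "default":
            dest = "0.0.0.0/0"
        try:
            # A bare address such as 100.1.0.29 is treated as a /32.
            prefixes.add(ipaddress.ip_network(dest, strict=False))
        except ValueError:
            continue  # first token is not a prefix, e.g. an echoed command
    return prefixes


def missing_from_kernel(expected_prefixes, ip_route_text):
    """Return the expected prefixes that have no kernel route, in order."""
    installed = kernel_prefixes(ip_route_text)
    return [
        p for p in expected_prefixes
        if ipaddress.ip_network(p, strict=False) not in installed
    ]
```

Given the queued prefixes from "show ip route" as `expected_prefixes` and the kernel output above as `ip_route_text`, every queued prefix would be reported as missing, which matches the observation in this issue.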

Steps to reproduce the issue:

  1. Bring up the M0 topology.
  2. Run "sudo config reload -y -f".
  3. After the platform reloads, wait for it to initialise completely, then check the "sudo monit status -B" output to see whether routeCheck shows the above error.
  4. If the status is "Ok", repeat from step 2. (The failure is usually seen after 7-10 reloads.)
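
The steps above can be sketched as a loop, since the failure usually needs several reloads to appear. This is an illustration only: `reproduce`, `route_check_failed`, the `settle_seconds` wait and the reload limit are assumptions, not values from this issue, and the command runner is injectable so the logic can be exercised without a device.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def route_check_failed(status_text):
    """True if the routeCheck block in monit status text shows 'Status failed'."""
    lines = status_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() != "Program 'routeCheck'":
            continue
        for field in lines[i + 1:]:
            if not field.startswith(" "):
                break  # next unindented header: end of this block
            if field.split()[:1] == ["status"]:
                return "Status failed" in field
    return False


def reproduce(max_reloads=10, settle_seconds=600, runner=run, sleep=time.sleep):
    """Reload repeatedly; return the attempt number that failed, or None."""
    for attempt in range(1, max_reloads + 1):
        runner(["sudo", "config", "reload", "-y", "-f"])
        # Assumed wait for services and monit to finish initialising.
        sleep(settle_seconds)
        status = runner(["sudo", "monit", "status", "-B"])
        if route_check_failed(status):
            return attempt
    return None
```

Calling `reproduce()` on the device would stop at the first reload after which routeCheck reports a failure, leaving the system in the failed state for inspection.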

Describe the results you received:

The status of the routeCheck program in monit is failed, and the affected routes in "show ip route" are in the queued state.

Describe the results you expected:

The following is the output of "sudo monit" when the issue is not present.

Program 'routeCheck'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Tue, 28 Nov 2023 18:46:11

The ip routes are present in the output of "show ip route". Note that they are not in the queued state.

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

B>*0.0.0.0/0 [20/0] via 10.0.0.57, PortChannel101, 01:57:06
  *                 via 10.0.0.59, PortChannel103, 01:57:06
  *                 via 10.0.0.61, PortChannel105, 01:57:06
  *                 via 10.0.0.63, PortChannel106, 01:57:06
C>*10.0.0.56/31 is directly connected, PortChannel101, 02:43:03
C>*10.0.0.58/31 is directly connected, PortChannel103, 02:43:03
C>*10.0.0.60/31 is directly connected, PortChannel105, 02:42:56
C>*10.0.0.62/31 is directly connected, PortChannel106, 02:42:44
C>*10.0.0.64/31 is directly connected, Ethernet46, 02:42:37
C>*10.0.0.66/31 is directly connected, Ethernet47, 02:42:37
C>*10.1.0.32/32 is directly connected, Loopback0, 02:44:13
B>*100.1.0.29/32 [20/0] via 10.0.0.57, PortChannel101, 01:57:07
B>*100.1.0.30/32 [20/0] via 10.0.0.59, PortChannel103, 01:57:06
B>*100.1.0.31/32 [20/0] via 10.0.0.61, PortChannel105, 01:57:06
B>*100.1.0.32/32 [20/0] via 10.0.0.63, PortChannel106, 01:57:06
B>*100.1.0.33/32 [20/0] via 10.0.0.65, Ethernet46, 01:57:06
B>*100.1.0.34/32 [20/0] via 10.0.0.67, Ethernet47, 01:57:06
C>*152.148.144.0/21 is directly connected, eth0, 02:44:13
C>*192.168.0.0/24 is directly connected, Vlan1000, 02:43:21
B>*192.168.1.64/26 [20/0] via 10.0.0.65, Ethernet46, 01:57:06
B>*192.168.1.128/26 [20/0] via 10.0.0.67, Ethernet47, 01:57:06
B>*192.168.1.192/26 [20/0] via 10.0.0.57, PortChannel101, 01:57:06
  *                        via 10.0.0.59, PortChannel103, 01:57:06
  *                        via 10.0.0.61, PortChannel105, 01:57:06
  *                        via 10.0.0.63, PortChannel106, 01:57:06
B>*192.168.2.0/26 [20/0] via 10.0.0.57, PortChannel101, 01:57:06
  *                      via 10.0.0.59, PortChannel103, 01:57:06
  *                      via 10.0.0.61, PortChannel105, 01:57:06
  *                      via 10.0.0.63, PortChannel106, 01:57:06
B>*192.168.2.64/26 [20/0] via 10.0.0.57, PortChannel101, 01:57:06
.
.
.

The output of "sudo ip route show":

default nhid 1553 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1 
10.0.0.56/31 dev PortChannel101 proto kernel scope link src 10.0.0.56 
10.0.0.58/31 dev PortChannel103 proto kernel scope link src 10.0.0.58 
10.0.0.60/31 dev PortChannel105 proto kernel scope link src 10.0.0.60 
10.0.0.62/31 dev PortChannel106 proto kernel scope link src 10.0.0.62 
10.0.0.64/31 dev Ethernet46 proto kernel scope link src 10.0.0.64 
10.0.0.66/31 dev Ethernet47 proto kernel scope link src 10.0.0.66 
100.1.0.29 nhid 1540 via 10.0.0.57 dev PortChannel101 proto bgp src 10.1.0.32 metric 20 
100.1.0.30 nhid 1554 via 10.0.0.59 dev PortChannel103 proto bgp src 10.1.0.32 metric 20 
100.1.0.31 nhid 1555 via 10.0.0.61 dev PortChannel105 proto bgp src 10.1.0.32 metric 20 
100.1.0.32 nhid 1556 via 10.0.0.63 dev PortChannel106 proto bgp src 10.1.0.32 metric 20 
100.1.0.33 nhid 1579 via 10.0.0.65 dev Ethernet46 proto bgp src 10.1.0.32 metric 20 
100.1.0.34 nhid 1580 via 10.0.0.67 dev Ethernet47 proto bgp src 10.1.0.32 metric 20 
152.148.144.0/21 dev eth0 proto kernel scope link src 152.148.150.121 
192.168.0.0/24 dev Vlan1000 proto kernel scope link src 192.168.0.1 
192.168.1.64/26 nhid 1579 via 10.0.0.65 dev Ethernet46 proto bgp src 10.1.0.32 metric 20 
192.168.1.128/26 nhid 1580 via 10.0.0.67 dev Ethernet47 proto bgp src 10.1.0.32 metric 20 
192.168.1.192/26 nhid 1553 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1 
192.168.2.0/26 nhid 1553 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1 
192.168.2.64/26 nhid 1553 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1 
192.168.2.128/26 nhid 1553 proto bgp src 10.1.0.32 metric 20 
        nexthop via 10.0.0.57 dev PortChannel101 weight 1 
        nexthop via 10.0.0.59 dev PortChannel103 weight 1 
        nexthop via 10.0.0.61 dev PortChannel105 weight 1 
        nexthop via 10.0.0.63 dev PortChannel106 weight 1
.
.

Output of show version:

SONiC Software Version: SONiC.20230531.06
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-armmp
Build commit: f572350569
Build date: Fri Oct 13 06:42:32 UTC 2023
Built by: cloudtest@312b76e7c000001

Platform: armhf-nokia_ixs7215_52x-r0
HwSKU: Nokia-7215
ASIC: marvell
ASIC Count: 1
Serial Number: NK232410037
Model Number: 3HE16794AARE01
Hardware Revision: 4
Uptime: 18:56:33 up 1 day,  2:18,  1 user,  load average: 1.46, 1.20, 1.18
Date: Tue 28 Nov 2023 18:56:33

Docker images:
REPOSITORY                 TAG           IMAGE ID       SIZE
docker-orchagent           20230531.06   088f5a4c5461   351MB
docker-orchagent           latest        088f5a4c5461   351MB
docker-fpm-frr             20230531.06   0628314f7860   357MB
docker-fpm-frr             latest        0628314f7860   357MB
docker-teamd               20230531.06   9ca0e2245dc1   341MB
docker-teamd               latest        9ca0e2245dc1   341MB
docker-macsec              latest        bd757008b280   342MB
docker-platform-monitor    20230531.06   f6da00cb1eb9   598MB
docker-platform-monitor    latest        f6da00cb1eb9   598MB
docker-syncd-mrvl          20230531.06   1def980c3bbc   425MB
docker-syncd-mrvl          latest        1def980c3bbc   425MB
docker-dhcp-relay          latest        7dce5cb021f8   336MB
docker-eventd              20230531.06   4e416222842f   329MB
docker-eventd              latest        4e416222842f   329MB
docker-snmp                20230531.06   be42d64f81c4   365MB
docker-snmp                latest        be42d64f81c4   365MB
docker-lldp                20230531.06   18e8f56212b4   334MB
docker-lldp                latest        18e8f56212b4   334MB
docker-mux                 20230531.06   88633297752f   342MB
docker-mux                 latest        88633297752f   342MB
docker-sonic-gnmi          20230531.06   93c5e41ad697   401MB
docker-sonic-gnmi          latest        93c5e41ad697   401MB
docker-database            20230531.06   b423e463c728   329MB
docker-database            latest        b423e463c728   329MB
docker-acms                20230531.06   5806d48a79e6   337MB
docker-acms                latest        5806d48a79e6   337MB
docker-sonic-telemetry     20230531.06   917f9f22acb8   401MB
docker-sonic-telemetry     latest        917f9f22acb8   401MB
docker-router-advertiser   20230531.06   5b3c6169d604   329MB
docker-router-advertiser   latest        5b3c6169d604   329MB

Output of show techsupport:

NA

Additional information you deem important (e.g. issue happens only occasionally):

This issue happens intermittently with multiple config reloads.

stepanblyschak commented 11 months ago

@tudupa Is the issue persistent, or do the routes eventually get the * (FIB) flag?

tudupa commented 11 months ago

@stepanblyschak The routes remain in the queued state until we do a config reload.

prgeor commented 11 months ago

@prsunny could you take a look?