sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[warm-reboot] Warm-reboot does not work with control plane assistant configuration #12699

Closed stepanblyschak closed 1 year ago

stepanblyschak commented 1 year ago

Description

Recently, warm-reboot with a control plane assistant setup started failing at the orchagent restart check:

Nov  2 14:38:07.123764 r-tigris-25 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK failed, orchagent is not ready for warm restart with status NOT_READY
Nov  2 14:38:07.123764 r-tigris-25 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 5
Nov  2 14:38:07.123837 r-tigris-25 NOTICE swss#orchagent: :- doTask: RESTARTCHECK notification for orchagent
Nov  2 14:38:07.123837 r-tigris-25 NOTICE swss#orchagent: :- doTask: orchagent|NoFreeze:false|SkipPendingTaskCheck:false
Nov  2 14:38:07.123879 r-tigris-25 WARNING swss#orchagent: :- addTunnelUser: Unable to find EVPN VTEP. user=0 remote_vtep=192.168.8.1
Nov  2 14:38:07.123879 r-tigris-25 WARNING swss#orchagent: :- addOperation: Vxlan tunnelPort doesn't exist: 192.168.8.1
Nov  2 14:38:07.123912 r-tigris-25 NOTICE swss#orchagent: :- warmRestartCheck: WarmRestart check found pending tasks:
Nov  2 14:38:07.123924 r-tigris-25 NOTICE swss#orchagent: :- warmRestartCheck:     VXLAN_REMOTE_VNI_TABLE:Vlan1000:192.168.8.1|SET|vni:1000
Nov  2 14:38:07.123924 r-tigris-25 NOTICE swss#orchagent: :- warmRestartCheck: Restart check result: NOT_READY
Nov  2 14:38:07.123996 r-tigris-25 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK failed, orchagent is not ready for warm restart with status NOT_READY
Nov  2 14:38:07.130843 r-tigris-25 NOTICE admin: fastfast-reboot failure (10) cleanup ...
Nov  2 14:38:07.136823 r-tigris-25 NOTICE admin: Tearing down control plane assistant: 10.213.84.41 ...
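
For readers unfamiliar with the restart check, the log above can be read as: orchagent walks its orchs, finds a task that was received but not yet applied, and therefore answers NOT_READY. The following is a minimal sketch of that idea only, not the sonic-swss implementation; `ToyOrch`, `collectPendingTasks` and `restartCheck` are hypothetical names introduced for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an orch: it only tracks tasks
// that were received but could not be applied yet.
struct ToyOrch
{
    std::string table;                 // e.g. "VXLAN_REMOTE_VNI_TABLE"
    std::vector<std::string> pending;  // e.g. "Vlan1000:192.168.8.1|SET|vni:1000"
};

// Collects the pending tasks of every orch, each prefixed by its table name.
std::vector<std::string> collectPendingTasks(const std::vector<ToyOrch>& orchs)
{
    std::vector<std::string> out;
    for (const auto& orch : orchs)
    {
        for (const auto& task : orch.pending)
        {
            out.push_back(orch.table + ":" + task);
        }
    }
    return out;
}

// Reports "READY" only when no orch has anything left pending.
std::string restartCheck(const std::vector<ToyOrch>& orchs)
{
    return collectPendingTasks(orchs).empty() ? "READY" : "NOT_READY";
}
```

In this model, a single leftover VXLAN_REMOTE_VNI_TABLE entry is enough to fail the check, which matches the "found pending tasks" line in the log.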

The issue started after a SAI implementation update. Our latest SAI supports P2P tunnels (SAI_TUNNEL_PEER_MODE_P2P), so the logic in orchdaemon now chooses the P2P implementation of the EVPN remote VNI orch:

https://github.com/sonic-net/sonic-swss/blob/master/orchagent/orchdaemon.cpp#L432:

    if (vxlan_tunnel_orch->isDipTunnelsSupported())
    {
        EvpnRemoteVnip2pOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2pOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }
    else
    {
        EvpnRemoteVnip2mpOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2mpOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
        gDirectory.set(evpn_remote_vni_orch);
        m_orchList.push_back(evpn_remote_vni_orch);
    }

As a result, this error occurs only with EvpnRemoteVnip2pOrch.

With the old SAI, EvpnRemoteVnip2mpOrch is used and the error does not appear; only a warning is printed:

Nov 11 16:57:54.068035 r-tigris-25 NOTICE swss#orchagent: :- addOperation: Vxlan tunnel 'neigh_adv' was added
Nov 11 16:57:54.071833 r-tigris-25 WARNING swss#orchagent: :- createTunnelHw: creation src = 0
Nov 11 16:57:54.071848 r-tigris-25 NOTICE swss#orchagent: :- create_tunnel: create_tunnel:encapmaplist[0]=0x290000000015dd
Nov 11 16:57:54.071862 r-tigris-25 NOTICE swss#orchagent: :- create_tunnel: create_tunnel:encapmaplist[1]=0x290000000015df
Nov 11 16:57:54.089346 r-tigris-25 NOTICE swss#orchagent: :- addBridgePort: Add bridge port Port_SRC_VTEP_10.1.0.32 to default 1Q bridge
Nov 11 16:57:54.091246 r-tigris-25 NOTICE swss#orchagent: :- addOperation: Vxlan tunnel map entry 'map_1' for tunnel 'neigh_adv' was created
Nov 11 16:57:54.100728 r-tigris-25 NOTICE swss#orchagent: :- attach: Attached next hop observer of route 192.168.8.0/25 for destination IP 192.168.8.1
Nov 11 16:57:54.100811 r-tigris-25 NOTICE swss#orchagent: :- updateNextHop: Updating mirror session neighbor_advertiser with route 192.168.8.0/25
Nov 11 16:57:54.100843 r-tigris-25 NOTICE swss#orchagent: :- updateNextHop:     next hop IPs: 10.0.0.1@PortChannel101,10.0.0.5@PortChannel102,10.0.0.9@PortChannel103,10.0.0.13@PortChannel104
Nov 11 16:57:54.100858 r-tigris-25 NOTICE swss#orchagent: :- updateNextHop: Updated mirror session state db neighbor_advertiser nexthop to 10.0.0.1@PortChannel101
Nov 11 16:57:54.100884 r-tigris-25 NOTICE swss#orchagent: :- getNeighborInfo: Mirror session neighbor_advertiser neighbor is PortChannel101
Nov 11 16:57:54.104051 r-tigris-25 NOTICE swss#orchagent: :- activateSession: Activated mirror session neighbor_advertiser
Nov 11 16:57:54.104051 r-tigris-25 NOTICE swss#orchagent: :- createEntry: Created mirror session neighbor_advertiser
Nov 11 16:57:54.105541 r-tigris-25 WARNING swss#orchagent: :- addOperation: Remote VNI add: Source VTEP not found. remote=192.168.8.1 vid=1000
Nov 11 16:57:57.078562 r-tigris-25 NOTICE swss#orchagent: :- add: Successfully created ACL rule rule_arp in table EVERFLOW
Nov 11 16:57:57.083350 r-tigris-25 NOTICE swss#orchagent: :- add: Successfully created ACL rule rule_nd in table EVERFLOWV6
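
One reading consistent with the two logs is that the P2P orch keeps the remote VNI task queued for retry when the tunnel port is missing, while the P2MP orch logs a warning and drops it. This is an inference from the logs, not something confirmed against the sonic-swss source. The toy model below only illustrates the difference between "keep for retry" and "warn and drop"; `Handler`, `drain`, `p2pStyleAdd` and `p2mpStyleAdd` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical handler signature: return true when the task is done (applied
// or deliberately dropped), false when it should be kept and retried later.
using Handler = std::function<bool(const std::string& key)>;

// Toy consumer loop: runs the handler over every queued task and erases only
// the ones the handler reports as done. Returns how many tasks stay pending.
std::size_t drain(std::map<std::string, std::string>& toSync, const Handler& handler)
{
    for (auto it = toSync.begin(); it != toSync.end();)
    {
        if (handler(it->first))
        {
            it = toSync.erase(it);
        }
        else
        {
            ++it;
        }
    }
    return toSync.size();
}

// Assumed P2P-style behaviour: a missing tunnel port keeps the task for retry.
bool p2pStyleAdd(bool tunnelPortExists)
{
    return tunnelPortExists;
}

// Assumed P2MP-style behaviour: a missing source VTEP is only warned about,
// and the task is treated as done either way.
bool p2mpStyleAdd(bool /*sourceVtepExists*/)
{
    return true;
}
```

Under these assumptions, the P2P-style handler leaves one entry behind for a restart check to find, and the P2MP-style handler leaves none.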

Steps to reproduce the issue:

  1. Ensure SAI supports P2P tunnels (SAI_TUNNEL_PEER_MODE_P2P)
  2. Run warm-reboot with the control plane assistant: sudo warm-reboot -c X.X.X.X
  3. Observe the warm-reboot failure.

Describe the results you received:

Warm-reboot failure on orchagent restart check.

Describe the results you expected:

Warm-reboot passes.

Output of show version:

SONiC Software Version: SONiC.202205.52-247c8dd99_Internal
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: 247c8dd99
Build date: Mon Oct 31 19:37:44 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn3800-r0
HwSKU: Mellanox-SN3800-D112C8
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2208X03840
Model Number: MSN3800-CS2FOS
Hardware Revision: A3
Uptime: 15:45:20 up  4:15,  1 user,  load average: 0.75, 1.07, 1.11
Date: Wed 02 Nov 2022 15:45:20

Docker images:
REPOSITORY                    TAG                            IMAGE ID       SIZE
docker-platform-monitor       202205.52-247c8dd99_Internal   cb38dfa7caf5   867MB
docker-platform-monitor       latest                         cb38dfa7caf5   867MB
docker-syncd-mlnx             202205.52-247c8dd99_Internal   6093bc7e05bc   862MB
docker-syncd-mlnx             latest                         6093bc7e05bc   862MB
docker-orchagent              202205.52-247c8dd99_Internal   8cba3aa5f8cd   478MB
docker-orchagent              latest                         8cba3aa5f8cd   478MB
docker-fpm-frr                202205.52-247c8dd99_Internal   34013381887d   489MB
docker-fpm-frr                latest                         34013381887d   489MB
docker-teamd                  202205.52-247c8dd99_Internal   3d4dc22f7fa3   459MB
docker-teamd                  latest                         3d4dc22f7fa3   459MB
docker-mux                    202205.52-247c8dd99_Internal   3ba56487d725   492MB
docker-mux                    latest                         3ba56487d725   492MB
docker-database               202205.52-247c8dd99_Internal   d34b60889a16   443MB
docker-database               latest                         d34b60889a16   443MB
docker-snmp                   202205.52-247c8dd99_Internal   02ffce4f4d2d   488MB
docker-snmp                   latest                         02ffce4f4d2d   488MB
docker-macsec                 latest                         9cd320e755bf   461MB
docker-sonic-telemetry        202205.52-247c8dd99_Internal   25543496d124   524MB
docker-sonic-telemetry        latest                         25543496d124   524MB
docker-dhcp-relay             latest                         2e0b89ffb3df   453MB
docker-lldp                   202205.52-247c8dd99_Internal   b500360afa1e   486MB
docker-lldp                   latest                         b500360afa1e   486MB
docker-router-advertiser      202205.52-247c8dd99_Internal   d6bce1ebf47b   443MB
docker-router-advertiser      latest                         d6bce1ebf47b   443MB
docker-nat                    202205.52-247c8dd99_Internal   53221ae69f78   431MB
docker-nat                    latest                         53221ae69f78   431MB
docker-sflow                  202205.52-247c8dd99_Internal   cac026510f73   429MB
docker-sflow                  latest                         cac026510f73   429MB
docker-sonic-mgmt-framework   202205.52-247c8dd99_Internal   aac866542364   558MB
docker-sonic-mgmt-framework   latest                         aac866542364   558MB

Output of show techsupport:


sonic_dump_r-tigris-25_20221102_154508.tar.gz

Additional information you deem important (e.g. issue happens only occasionally):

azure-pipelines-wrapper[bot] commented 1 year ago

Thanks for opening this issue!

stepanblyschak commented 1 year ago

@vaibhavhd @prsunny Since this is related to VXLAN and warm reboot, could you please check?