sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Chassis: Orchagent crashes are seen in Voq chassis while running sonic-mgmt PC and voq suites #20507

Open saksarav-nokia opened 1 month ago

saksarav-nokia commented 1 month ago

Description

With PR https://github.com/sonic-net/sonic-swss/pull/3269, the orchagent crashes are seen while running sonic-mgmt PC and Voq suites.

Steps to reproduce the issue:

  1. Run PC and Voq suites with latest master

Describe the results you received:

Orchagent crashed multiple times

Describe the results you expected:

No crashes

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

saksarav-nokia commented 1 month ago

Already discussed the issue and the fix with @arlakshm and @abdosi. Testing the fix.

kenneth-arista commented 1 month ago

@saksarav-nokia can you paste the crash backtrace into this issue? We suspect that you may be encountering a backtrace similar to the one documented here: https://github.com/sonic-net/sonic-buildimage/issues/20605

saksarav-nokia commented 1 month ago

@kenneth-arista

t 28 19:45:49.253094 ixre-egl-board1 NOTICE syncd1#syncd: [07:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 15 if_id 536923493
2024 Oct 28 19:45:49.253400 ixre-egl-board1 NOTICE syncd0#syncd: [06:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 38 if_id 536923495
2024 Oct 28 19:45:49.253738 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed next hop 3.3.3.31 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.254133 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed next hop 3.3.3.31 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.254564 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.255134 ixre-egl-board1 NOTICE syncd1#syncd: [07:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 16 if_id 536923500
2024 Oct 28 19:45:49.255147 ixre-egl-board1 NOTICE swss1#nbrmgrd: :- delKernelRoute: IPv4 Route Del cmd: /sbin/ip route del 3.3.3.31/32
2024 Oct 28 19:45:49.255440 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed next hop 3333::3:19 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.255519 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.256189 ixre-egl-board1 NOTICE swss0#nbrmgrd: :- delKernelRoute: IPv4 Route Del cmd: /sbin/ip route del 3.3.3.31/32
2024 Oct 28 19:45:49.256231 ixre-egl-board1 NOTICE syncd0#syncd: [06:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 39 if_id 536923502
2024 Oct 28 19:45:49.256231 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.256375 ixre-egl-board1 ERR swss1#orchagent: :- meta_generic_validation_remove: object 0x104000000002165 reference count is 672, can't remove
2024 Oct 28 19:45:49.256444 ixre-egl-board1 ERR swss1#orchagent: :- removeNeighbor: Failed to remove next hop 10.0.0.163 on ixre-egl-board27|asic0|Ethernet120, rv:-17
2024 Oct 28 19:45:49.256502 ixre-egl-board1 ERR swss1#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP, status: SAI_STATUS_OBJECT_IN_USE
2024 Oct 28 19:45:49.256527 ixre-egl-board1 NOTICE swss1#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Oct 28 19:45:49.256751 ixre-egl-board1 NOTICE syncd1#syncd: :- processNotifySyncd: Invoking SAI failure dump
2024 Oct 28 19:45:49.256934 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed next hop 3333::3:19 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.257924 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0

arlakshm commented 2 weeks ago

@saksarav-nokia has a PR to fix this.

saksarav-nokia commented 2 weeks ago

The IMM has two ASICs, with 2 port channels in each ASIC and 2 port members in each port channel. An IP address is configured on each port channel and BGP is enabled. The neighbors and routes are learned on these port channels.

In the sonic-mgmt PC suite, the test case po-update removes the port members from one of the port channels, removes the IP address configured on that port channel, creates a new port channel, adds the same port members to the new port channel, and adds the same IP address to the new port channel.

In the remote ASIC, orchagent attempts to remove the neighbor and nexthop for the old port channel before routeOrch has removed all the routes learned on the old port channel. Since routes are still pending, the old nexthop and neighbor are not removed. Then the neighbor and nexthop for the new port channel are added. If the neighbor is learned on a remote system port in the remote ASIC, the nexthop is added with the inband port's alias as its alias, so the key (ip, alias) is the same for both the old nexthop and the new nexthop. When the new nexthop is added, addNextHop calls hasNextHop to check whether a nexthop with the key (ip-address, alias) already exists. Since the old nexthop has not been removed yet, hasNextHop returns true, but the assert(!hasNextHop) does not trigger a crash. So addNextHop replaces the old nexthop (with the old rif-id) with the new nexthop (with the new rif-id) in the nexthop map.

After all the routes learned on the old port channel are removed, the old neighbor and old nexthop are removed. Since the old nexthop was replaced with the new nexthop in the map, when orchagent tries to delete the old nexthop, it actually deletes the new nexthop from SAI. Then, when it tries to remove the old neighbor, SAI returns an error, because orchagent removed the new nexthop from SAI instead of the old nexthop and the old neighbor is still referenced by the old nexthop in SAI. orchagent crashes when SAI returns the error.

The same issue is seen when a config reload is done on the remote IMM, and sometimes even with a reboot.
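
To make the sequence above easier to follow, here is a minimal Python sketch of the key collision. It is a toy model, not orchagent or SAI code: `ToySai`, `ToyNeighOrch`, `reproduce`, and the rif-id values 100 and 200 are all hypothetical names and numbers made up for illustration, and the IP and alias are only sample values. The only thing it tries to show is that a map keyed by (ip, alias) silently loses the old nexthop id when the new one is added, so the deferred removal hits the wrong SAI object.

```python
class ObjectInUse(Exception):
    """Hypothetical stand-in for SAI_STATUS_OBJECT_IN_USE."""


class ToySai:
    """Toy model of SAI state: next hop objects and the neighbors they reference."""

    def __init__(self):
        self.next_hops = {}      # nh_id -> (ip, rif_id)
        self.neighbors = set()   # (ip, rif_id)
        self._next_id = 1

    def create_neighbor(self, ip, rif_id):
        self.neighbors.add((ip, rif_id))

    def create_next_hop(self, ip, rif_id):
        nh_id = self._next_id
        self._next_id += 1
        self.next_hops[nh_id] = (ip, rif_id)
        return nh_id

    def remove_next_hop(self, nh_id):
        del self.next_hops[nh_id]

    def remove_neighbor(self, ip, rif_id):
        # A neighbor cannot be removed while a next hop still points at it.
        if (ip, rif_id) in self.next_hops.values():
            raise ObjectInUse(f"neighbor {ip} on rif {rif_id} still referenced")
        self.neighbors.remove((ip, rif_id))


class ToyNeighOrch:
    """Toy model of the nexthop map, keyed by (ip, alias) as described above."""

    def __init__(self, sai):
        self.sai = sai
        self.next_hops = {}      # (ip, alias) -> nh_id

    def has_next_hop(self, ip, alias):
        return (ip, alias) in self.next_hops

    def add_next_hop(self, ip, alias, rif_id):
        # No effective guard here, mirroring the assert that does not fire:
        # an existing entry with the same key is simply overwritten.
        self.next_hops[(ip, alias)] = self.sai.create_next_hop(ip, rif_id)

    def remove_next_hop(self, ip, alias):
        nh_id = self.next_hops.pop((ip, alias))
        self.sai.remove_next_hop(nh_id)


def reproduce():
    sai = ToySai()
    orch = ToyNeighOrch(sai)
    ip, alias = "3.3.3.31", "Ethernet-IB0"   # sample values only
    old_rif, new_rif = 100, 200              # made-up rif-ids

    # Old port channel: neighbor and nexthop are programmed.
    sai.create_neighbor(ip, old_rif)
    orch.add_next_hop(ip, alias, old_rif)

    # Removal of the old pair is deferred (routes pending); the new pair
    # arrives with the same (ip, alias) key and overwrites the map entry.
    sai.create_neighbor(ip, new_rif)
    collided = orch.has_next_hop(ip, alias)
    orch.add_next_hop(ip, alias, new_rif)

    # Deferred removal of the "old" nexthop now deletes the new SAI object.
    orch.remove_next_hop(ip, alias)
    try:
        sai.remove_neighbor(ip, old_rif)
    except ObjectInUse as exc:
        return collided, str(exc)
    return collided, None


if __name__ == "__main__":
    print(reproduce())
```

In this toy model the old next hop object is left behind in `ToySai` with nothing in the map pointing to it, which is why removing the old neighbor is refused. The real code paths and the actual fix are in the PR mentioned earlier in this thread; this sketch makes no claim about how that fix is implemented.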