sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

orchagent crashed when adding the members, addresses from an existing portchannel to a newly created portchannel #17665

Open mannytaheri opened 9 months ago

mannytaheri commented 9 months ago

Description

orchagent crashed in testcase test_po_update_io_no_loss[lc]::teardown. The issue is also reproducible manually. The orchagent crash is triggered by:

  1. Removing the members and IPv4/IPv6 addresses of an existing PortChannel.
  2. Then creating a new PortChannel and adding the members and IPv4/IPv6 addresses from the old PortChannel to the newly created PortChannel.

Please see syslog attached.

syslog.txt

Jan 3 21:40:15.387019 ixre-egl-board3 NOTICE swss1#orchagent: :- removeRouterIntfs: Remove router interface for port PortChannel106
Jan 3 21:40:15.492690 ixre-egl-board3 NOTICE swss1#orchagent: :- addNextHopGroup: Create next hop group 10.0.0.1@Ethernet-IB1,10.0.0.5@Ethernet-IB1,10.0.0.7@PortChannel999,10.0.0.11@Ethernet184
Jan 3 21:40:15.521978 ixre-egl-board3 NOTICE swss1#orchagent: :- addNextHopGroup: Create next hop group 10.0.0.7@PortChannel999,10.0.0.11@Ethernet184
Jan 3 21:40:15.711079 ixre-egl-board3 NOTICE swss0#orchagent: :- removeNextHopGroup: Delete next hop group fc00::2@PortChannel102,fc00::a@Ethernet64,fc00::e@Ethernet-IB0,fc00::16@Ethernet-IB0
Jan 3 21:40:15.717072 ixre-egl-board3 NOTICE swss0#orchagent: :- removeNextHopGroup: Delete next hop group fc00::e@Ethernet-IB0,fc00::16@Ethernet-IB0
Jan 3 21:40:15.749977 ixre-egl-board3 NOTICE swss0#orchagent: :- addLag: Create an empty LAG ixre-egl-board3|asic1|PortChannel999 lid:2000000000a95
Jan 3 21:40:15.754730 ixre-egl-board3 NOTICE syncd0#syncd: :- removeRif: Trying to remove nonexisting router interface counter from Id 0x6000000000847
Jan 3 21:40:15.755829 ixre-egl-board3 ERR swss0#orchagent: :- meta_generic_validation_remove: object 0x6000000000847 reference count is 4, can't remove
Jan 3 21:40:15.756968 ixre-egl-board3 ERR swss0#orchagent: :- removeRouterIntfs: Failed to remove router interface for port ixre-egl-board3|asic1|PortChannel106, rv:-17
Jan 3 21:40:15.758146 ixre-egl-board3 ERR swss0#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_ROUTER_INTERFACE, status: SAI_STATUS_OBJECT_IN_USE
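To pick the failing entries out of a longer syslog, the lines can be split into fields and filtered by severity. The following is a minimal sketch that assumes every entry follows the layout seen in the excerpt above (timestamp, host, severity, `container#process`, then `:- function: message`). The names `ENTRY`, `parse_entry`, and `errors_only` are hypothetical helpers, not part of any SONiC tool.

```python
import re

# Assumed layout, inferred only from the excerpt in this issue:
#   <Mon D HH:MM:SS.usec> <host> <LEVEL> <container#process>: :- <function>: <message>
ENTRY = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<host>\S+) (?P<level>[A-Z]+) "
    r"(?P<source>\S+?): :- (?P<func>\w+): (?P<msg>.*)"
)


def parse_entry(line):
    """Return the fields of one syslog entry as a dict, or None if it does not match."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None


def errors_only(lines):
    """Keep only the entries logged at ERR severity."""
    parsed = (parse_entry(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "ERR"]
```

Applied to the excerpt above, `errors_only` would keep the three ERR entries from swss0#orchagent, ending with the `handleSaiRemoveStatus` line that reports `SAI_STATUS_OBJECT_IN_USE`.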

Steps to reproduce the issue:

  1. sudo config portchannel -n asic1 member del PortChannel106 Ethernet208
  2. sudo config portchannel -n asic1 member del PortChannel106 Ethernet216
  3. sudo config interface -n asic1 ip remove PortChannel106 10.0.0.6/31
  4. sudo config interface -n asic1 ip remove PortChannel106 fc00::d/126
  5. sudo config portchannel -n asic1 add PortChannel999
  6. sudo config portchannel -n asic1 member add PortChannel999 Ethernet208
  7. sudo config portchannel -n asic1 member add PortChannel999 Ethernet216
  8. sudo config interface -n asic1 ip add PortChannel999 10.0.0.6/31
  9. sudo config interface -n asic1 ip add PortChannel999 fc00::d/126
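The nine steps above can also be generated and replayed from a short script, which makes it easier to repeat the sequence with or without a pause between steps. This is only a sketch: `build_repro_commands` and `run_commands` are hypothetical helper names, and the command strings are the ones listed in the steps above. `dry_run` defaults to `True` so that nothing is executed unless explicitly requested.

```python
import subprocess
import time


def build_repro_commands(namespace, old_pc, new_pc, members, addresses):
    """Build the command sequence from the steps above as argument lists."""
    base = ["sudo", "config"]
    commands = []
    # Steps 1-2: remove the members from the old PortChannel.
    for member in members:
        commands.append(base + ["portchannel", "-n", namespace, "member", "del", old_pc, member])
    # Steps 3-4: remove the addresses from the old PortChannel.
    for address in addresses:
        commands.append(base + ["interface", "-n", namespace, "ip", "remove", old_pc, address])
    # Step 5: create the new PortChannel.
    commands.append(base + ["portchannel", "-n", namespace, "add", new_pc])
    # Steps 6-7: add the same members to the new PortChannel.
    for member in members:
        commands.append(base + ["portchannel", "-n", namespace, "member", "add", new_pc, member])
    # Steps 8-9: add the same addresses to the new PortChannel.
    for address in addresses:
        commands.append(base + ["interface", "-n", namespace, "ip", "add", new_pc, address])
    return commands


def run_commands(commands, pause_after=None, pause_seconds=0.0, dry_run=True):
    """Print (dry run) or execute each command; optionally sleep after one step (1-based)."""
    for step, command in enumerate(commands, start=1):
        if dry_run:
            print(f"{step}. {' '.join(command)}")
        else:
            subprocess.run(command, check=True)
        if step == pause_after:
            time.sleep(pause_seconds)


if __name__ == "__main__":
    steps = build_repro_commands(
        "asic1", "PortChannel106", "PortChannel999",
        ["Ethernet208", "Ethernet216"], ["10.0.0.6/31", "fc00::d/126"],
    )
    run_commands(steps)  # dry run: only prints the nine commands
```

Passing `pause_after=4` with a non-zero `pause_seconds` would insert a delay between the last removal and the creation of the new PortChannel, which may help to check whether the crash is timing dependent.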

prsunny commented 9 months ago

Do you have any delays between the above commands, or do you copy-paste the whole set? Would you try with a sleep of 3-4 sec between step 4 and step 5 and share the result?

linqingxuan commented 8 months ago

In which version did you encounter this problem? Please add a screenshot or the text output of "show version".

mannytaheri commented 8 months ago

show ver

SONiC Software Version: SONiC.20220532.54
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: b9e6caad98
Build date: Tue Jan 9 00:13:06 UTC 2024
Built by: cloudtest@95bebd0dc000000

volodymyrsamotiy commented 8 months ago

The fix will be in the sonic-mgmt test as well as in swss, to add some protection. @prsunny, will discuss with swss owners and triage further.

saksarav-nokia commented 8 months ago

@prsunny, we had an orchagent crash with the same signature when we removed the ports from a PortChannel and then removed its IPv4 and IPv6 addresses. What I noticed in the log/code is that addNeighbor adds the remote system neighbor against the remote system port and increments the RIF reference count of the remote system port. However, when addNextHop adds the next hop, it adds it against the Inband port with the RIF ID of the remote system port, but increases the RIF reference count of the Inband port instead of the remote system port. When the neighbor is removed in removeNeighbor, it decreases the RIF reference count of the remote system port. But when the next hop is removed in removeNextHop, it also decreases the reference count of the remote system port, not of the Inband port that addNextHop incremented. I think the SWSS PR https://github.com/sonic-net/sonic-swss/pull/1686 made this change in addNextHop as part of MPLS support. Since the RIF ID of the remote system port is used for the next hop, we should be increasing the reference count of the remote system port in addNextHop, right? Please let me know your thoughts on this.
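The accounting described in this comment can be illustrated with a toy model. This is not orchagent code: the keys `remote_system_port` and `inband_port` and the function `run_cycle` are hypothetical names that only replay the increments and decrements as described above, with one neighbor and one next hop.

```python
from collections import Counter


def run_cycle(next_hop_add_target):
    """Replay one add/remove cycle and return the resulting RIF reference counts.

    next_hop_add_target is the port whose count the modeled addNextHop increments.
    """
    refs = Counter()
    refs["remote_system_port"] += 1   # addNeighbor: counted on the remote system port
    refs[next_hop_add_target] += 1    # addNextHop: counted on the chosen port
    refs["remote_system_port"] -= 1   # removeNextHop: decrements the remote system port
    refs["remote_system_port"] -= 1   # removeNeighbor: decrements the remote system port
    return dict(refs)


# Behavior as described in the comment: addNextHop counts on the Inband port.
mismatched = run_cycle("inband_port")

# Proposed behavior: addNextHop counts on the remote system port.
balanced = run_cycle("remote_system_port")
```

In this model the mismatched cycle leaves a count of 1 on the Inband port that nothing ever releases, while the balanced cycle returns every count to zero. A leftover reference of this kind would be consistent with a router interface that cannot be removed because it is still reported as in use, although the model does not reproduce the exact counts seen in the log.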

prsunny commented 8 months ago

I see. Would you provide a fix PR?

saksarav-nokia commented 8 months ago

Yes. We are testing the fix for both PortChannel scenarios, and once we confirm that the fix works, I will create a PR.

saksarav-nokia commented 8 months ago

Created PR https://github.com/sonic-net/sonic-swss/pull/3042

kenneth-arista commented 8 months ago

This is the same as https://github.com/sonic-net/sonic-buildimage/issues/17204