sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
737 stars 1.43k forks source link

LAG member flapping causes sporadic SwSS crash with Broadcom SAI #14876

Open lukasstockner opened 1 year ago

lukasstockner commented 1 year ago

Description

When a LAG member port becomes enabled and then quickly disabled again, a race condition can occur with the Broadcom SAI that causes an error which crashes orchagent.

The problem appears to be that setting SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE to true is not idempotent - when a LAG member which is already disabled for egress is disabled again, instead of just not doing anything, set_lag_member_attribute fails with SAI_STATUS_ITEM_NOT_FOUND:

Apr 26 12:03:12.077585 test-dx030 ERR syncd#syncd: [none] SAI_API_NEIGHBOR:brcm_sai_xgs_neighbor_bcast_update_lag_member:850 mcast egress remove failed with error Entry not found (0xfffffff9).
Apr 26 12:03:12.077881 test-dx030 ERR syncd#syncd: [none] SAI_API_LAG:brcm_sai_set_lag_member_attribute:1531 bcast ip update failed with error -7.
Apr 26 12:03:12.078199 test-dx030 ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_SET failed in syncd mode: SAI_STATUS_ITEM_NOT_FOUND
Apr 26 12:03:12.078839 test-dx030 ERR syncd#syncd: :- processQuadEvent: VID: oid:0x1b000000000f24 RID: oid:0x16001b0000000a
Apr 26 12:03:12.079096 test-dx030 ERR syncd#syncd: :- processQuadEvent: attr: SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE: true
Apr 26 12:03:12.081376 test-dx030 ERR swss#orchagent: :- set: set status: SAI_STATUS_ITEM_NOT_FOUND
Apr 26 12:03:12.081376 test-dx030 ERR swss#orchagent: :- setDistributionOnLagMember: Failed to disable distribution on LAG member Ethernet21
Apr 26 12:03:12.081376 test-dx030 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_LAG, status: SAI_STATUS_ITEM_NOT_FOUND

This can happen if the LAG member is flapping - teamsyncd will update the key in LAG_MEMBER_TABLE with status enabled and then soon after with disabled. If orchagent does not process the event fast enough, both will be handled in the same iteration, which will cause it to ignore all but the latest SET event since it assumes that the other one is outdated (see Consumer::addToSync).

Therefore, setDistributionOnLagMember will be called to disable egress on a LAG member where it is already disabled, and the error will occur.

Steps to reproduce the issue:

  1. Create a PortChannel with two members
  2. Either:
    • Bring a member port up and down repeatedly and wait for the error to occur (can take days due to the timing-related nature of the bug)
    • Send SIGSTOP to swss, bring the member port up and down, then send SIGCONT (this will cause it to miss the first SET until the second one exists, and therefore reliably trigger the bug)
    • Patch setDistributionOnLagMember to set the same attribute twice, then bring a member port down

Describe the results you received:

Setting SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE to true when it's already true fails with SAI_STATUS_ITEM_NOT_FOUND and the control plane restarts.

Describe the results you expected:

Setting SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE to true when it's already true should not do anything.

Output of show version:

Reproduced both in 202111 (with Broadcom SAI 6.1.0.3) and 202205 (with Broadcom SAI 7.1.36.4)

Output of show techsupport:

Additional information you deem important (e.g. issue happens only occasionally):

A sufficient workaround is to query the attribute before setting it, and only setting it when it's different.

gechiang commented 1 year ago

@adyeung will check with SAI community what is the proper handling of attribute already set and got another same set attribute again. Adam will also ask BRCM SAI team to address this in the meanwhile. The fix for this needs to be backported to other previous branches (202205). @prsunny can you help check if this is also a problem for 202012 based image?

prsunny commented 1 year ago

Yes, this is present in 202012 based images as well. In fact SAI_LAG_MEMBER_ATTR_EGRESS_DISABLEwas introduced from 201911

adyeung commented 1 year ago

@lukasstockner Thanks for submitting the issue. SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in BRCM SAI is more than an attribute set, there is a SDK tbl resource that needs to be removed when a LAG member port is disabled. After the first call successfully remove the tbl resource, the subsequent call to the same without _ENABLE will fail to find the required resource to process the _DISABLE hence the RC _NOT_FOUND. I am following up internally to explore options with BRCM SAI team to address the expectation.

adyeung commented 1 year ago

@lukasstockner Would you happen to know if the same attribute set returns differently on non BRCM device? Just curious.

lukasstockner commented 1 year ago

@lukasstockner Would you happen to know if the same attribute set returns differently on non BRCM device? Just curious.

I don't have a device from a different vendor for testing here, so no idea, sorry.