Open lukasstockner opened 1 year ago
@adyeung will check with the SAI community what the proper handling is when an attribute that is already set receives the same set request again. Adam will also ask the BRCM SAI team to address this in the meantime. The fix for this needs to be backported to previous branches (202205). @prsunny can you help check whether this is also a problem for 202012-based images?
Yes, this is present in 202012-based images as well. In fact, SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE was introduced in 201911.
@lukasstockner Thanks for submitting the issue. SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in BRCM SAI is more than an attribute set: there is an SDK tbl resource that needs to be removed when a LAG member port is disabled. After the first call successfully removes the tbl resource, a subsequent call to the same without an intervening _ENABLE fails to find the resource required to process the _DISABLE, hence the RC _NOT_FOUND. I am following up internally to explore options with the BRCM SAI team to address the expectation.
@lukasstockner Would you happen to know if the same attribute set returns differently on non BRCM device? Just curious.
I don't have a device from a different vendor for testing here, so no idea, sorry.
Description
When a LAG member port becomes enabled and then quickly disabled again, a race condition can occur with the Broadcom SAI that causes an error which crashes orchagent.
The problem appears to be that setting SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE to true is not idempotent: when a LAG member that is already disabled for egress is disabled again, set_lag_member_attribute fails with SAI_STATUS_ITEM_NOT_FOUND instead of doing nothing.
This can happen if the LAG member is flapping: teamsyncd will update the key in LAG_MEMBER_TABLE with status enabled and then soon after with status disabled. If orchagent does not process the events fast enough, both will be handled in the same iteration, which will cause it to ignore all but the latest SET event since it assumes that the other one is outdated (see Consumer::addToSync).
Therefore, setDistributionOnLagMember will be called to disable egress on a LAG member where it is already disabled, and the error will occur.
Steps to reproduce the issue:
swss, bring the member port up and down, then send SIGCONT (this will cause it to miss the first SET until the second one exists, and therefore reliably trigger the bug)
setDistributionOnLagMember to set the same attribute twice, then bring a member port down
Describe the results you received:
Setting SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE to true when it's already true fails with SAI_STATUS_ITEM_NOT_FOUND and the control plane restarts.
Describe the results you expected:
Setting SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE to true when it's already true should not do anything.
Output of show version:
Reproduced both in 202111 (with Broadcom SAI 6.1.0.3) and 202205 (with Broadcom SAI 7.1.36.4)
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
A sufficient workaround is to query the attribute before setting it, and to set it only when the value differs.
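
To make the failure mode and the workaround concrete, here is a small toy simulation. It is not orchagent or SAI code: FakeLagMemberSai, coalesce, set_distribution, run and the key PortChannel1:Ethernet0 are all hypothetical names made up for illustration, and the model assumes the member is already egress-disabled before the flap. It only mirrors the behaviour described in this issue: events for the same key are collapsed to the latest SET, and a redundant disable returns SAI_STATUS_ITEM_NOT_FOUND unless the current value is queried first.

```python
# Toy model only: none of these names exist in sonic-swss or the SAI headers.
SAI_STATUS_SUCCESS = "SAI_STATUS_SUCCESS"
SAI_STATUS_ITEM_NOT_FOUND = "SAI_STATUS_ITEM_NOT_FOUND"


class FakeLagMemberSai:
    """Hypothetical stand-in for a SAI whose egress-disable set is not idempotent."""

    def __init__(self):
        # Assumption: the member is already disabled for egress before the flap.
        self.egress_disable = True

    def get_egress_disable(self):
        return self.egress_disable

    def set_egress_disable(self, value):
        # Models the reported behaviour: a second disable without an enable
        # in between fails instead of being a no-op.
        if value and self.egress_disable:
            return SAI_STATUS_ITEM_NOT_FOUND
        self.egress_disable = value
        return SAI_STATUS_SUCCESS


def coalesce(events):
    """Keep only the latest SET per key within one iteration."""
    latest = {}
    for key, status in events:
        latest[key] = status
    return list(latest.items())


def set_distribution(sai, enable, guarded):
    """Apply the member status; with guarded=True, skip writes that change nothing."""
    disable = not enable
    if guarded and sai.get_egress_disable() == disable:
        return SAI_STATUS_SUCCESS  # workaround: value already matches, do not set
    return sai.set_egress_disable(disable)


def run(events, guarded):
    """Process one coalesced batch of events against a fresh fake SAI."""
    sai = FakeLagMemberSai()
    return [set_distribution(sai, status == "enabled", guarded)
            for _key, status in coalesce(events)]


# A flap that arrives within a single iteration: enabled, then disabled.
flap = [("PortChannel1:Ethernet0", "enabled"),
        ("PortChannel1:Ethernet0", "disabled")]

unguarded = run(flap, guarded=False)
guarded = run(flap, guarded=True)
print("without query:", unguarded)
print("with query:   ", guarded)
```

In this model the unguarded path reproduces the error because only the final disabled event survives coalescing, while the guarded path treats the redundant disable as a no-op. The extra read on every update is the cost of the workaround.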