Open anamehra opened 10 months ago
Hi @rlhui, @abdosi, please assign this to me for now. Can we plan to discuss this in the upcoming chassis meeting? Thanks.
@vperumal , @rajendrat , for your viz.
Why do we need the BFD sessions to be the first to come up, before BGP? We may not want that, since BFD is for monitoring/resiliency and is not necessarily needed in normal cases, unlike BGP, which is critical.
What functional issue are we seeing because of this delay?
A priority queue is already present in swss.
Dual ToR has a mechanism to delay BGP session bring-up using an FRR configuration.
Thanks @arlakshm , I will check on that.
> What functional issue are we seeing because of this delay?
Hi @rlhui, no functional impact has been observed as such, but the overall bring-up of all BGP paths gets delayed.
> Dual ToR has a mechanism to delay BGP session bring-up using an FRR configuration.
Hi @arlakshm, I tried the following config, but it did not help much:
bgp graceful-restart restart-time 240
bgp graceful-restart select-defer-time 45
@arlakshm please include this in sonic-common-infra subgroup as one high priority problem to solve, thanks.
I created a PR to fix https://github.com/sonic-net/sonic-buildimage/issues/19569. Can someone verify that it also fixes this issue? https://github.com/sonic-net/sonic-swss/pull/3269
The issue in orchagent is that a large volume of low-priority events may block high-priority events. More detail can be found in issue #19569.
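To illustrate the kind of starvation described in the comment above, here is a minimal sketch in Python. It is not the orchagent code and not the change in the linked PR; the event names, the `high_priority_wait` function and the `batch_limit` parameter are all hypothetical. It assumes a loop that, once it starts on low-priority work, drains either the whole backlog or at most a bounded batch before rechecking for high-priority events.

```python
from collections import deque

# Hypothetical event names, used only for illustration.
HIGH, LOW = "bfd_state_change", "route_update"

def high_priority_wait(low_backlog, batch_limit=None):
    """Count the low-priority events handled before one high-priority
    event that arrives just after low-priority processing begins.

    batch_limit=None models a loop that drains the whole low-priority
    backlog once it starts on it. An integer models a loop that rechecks
    for high-priority events after at most that many low-priority events.
    """
    low = deque([LOW] * low_backlog)
    high = deque()
    handled_low = 0
    arrived = False
    while low or high:
        if high:
            # High-priority work is always served first when it is seen.
            high.popleft()
            return handled_low
        budget = len(low) if batch_limit is None else min(batch_limit, len(low))
        for _ in range(budget):
            low.popleft()
            handled_low += 1
            if not arrived:
                # The high-priority event arrives in the middle of the batch.
                high.append(HIGH)
                arrived = True
    return handled_low

unbounded = high_priority_wait(10000)
bounded = high_priority_wait(10000, batch_limit=128)
```

In this toy model the high-priority event waits behind the entire backlog in the unbounded case and behind only one batch in the bounded case. The backlog size and batch limit are arbitrary values, not measurements from a chassis.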
Thanks @liuh-80 , I am validating this fix.
This change LGTM. I see a lot of improvement with port and BFD notification handling.
Description
On a packet chassis, LC-to-LC connectivity over the fabric uses iBGP sessions, and internal BFD over the fabric interfaces provides fault detection for these iBGP sessions. On a scale setup with 3 or more fabric cards, orchagent takes longer to process the BFD session-up notifications from SAI during config reload or reboot. The delay occurs because the same notification queue is used for BFD notifications and BGP route-learning notifications. During bgp/swss docker start, BFD and BGP configuration is applied together. As soon as a few BFD sessions come up, iBGP sessions start establishing, which also starts a flood of route-learning notifications to orchagent. When SAI sends new BFD session-up notifications during this time, their processing is delayed. On a scale setup with 5 FCs, we observe that orchagent may take up to 12 minutes from docker start to process all BFD session-up messages.
If the BGP sessions are kept down during the first ~3 minutes of docker bring-up, the BFD session-up messages are handled on time. If BGP is started after that, session bring-up and route learning proceed normally.
This GitHub issue is to find and implement a better way of handling the BFD and BGP sessions on chassis-packet.
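The ordering effect described above can be pictured with a toy model of a single shared FIFO. This is only a sketch under stated assumptions: the function names are hypothetical, each BFD session-up is assumed to trigger a fixed number of route notifications, and the counts are arbitrary rather than measured. It compares the case where route notifications start as soon as each BFD session comes up with the case where BGP is held down until all BFD session-ups have been queued.

```python
from collections import deque

def last_bfd_position(arrivals):
    """Return the 1-based position at which the final 'bfd' event
    is served from a single FIFO queue (0 if there is none)."""
    queue = deque(arrivals)
    position = 0
    last = 0
    while queue:
        position += 1
        if queue.popleft() == "bfd":
            last = position
    return last

def interleaved(bfd_sessions, routes_per_session):
    """Each BFD session-up is followed at once by the route
    notifications it is assumed to trigger."""
    events = []
    for _ in range(bfd_sessions):
        events.append("bfd")
        events.extend(["route"] * routes_per_session)
    return events

def bgp_delayed(bfd_sessions, routes_per_session):
    """All BFD session-ups are queued first; route notifications
    start only after BGP is brought up later."""
    routes = ["route"] * (bfd_sessions * routes_per_session)
    return ["bfd"] * bfd_sessions + routes

# Arbitrary illustrative sizes: 4 sessions, 1000 routes per session.
shared = last_bfd_position(interleaved(4, 1000))
delayed = last_bfd_position(bgp_delayed(4, 1000))
```

With these sizes the last BFD event is served at position 3004 in the interleaved case and at position 4 when BGP is delayed. The total work is the same in both cases; only the point at which the BFD notifications are handled changes.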