osrg / gobgp

BGP implemented in the Go Programming Language
https://osrg.github.io/gobgp/
Apache License 2.0

graceful-restart: Prefix not advertised to non-gr peer #2596

Open rkojedzinszky opened 2 years ago

rkojedzinszky commented 2 years ago

History: https://github.com/cloudnativelabs/kube-router/issues/1389

I have a gobgp instance configured for graceful restart, and an external peer without graceful restart capability. The symptom is that when gobgp is started, the BGP session gets established; however, no prefixes get advertised to the neighbor. If I then restart the peer, prefixes get advertised.

I wrote a sample program which demonstrates this.

gobgp.zip

Note that graceful restart is configured for ipv4 only. Then, when the BGP session is established, I get the following log entries:

DEBU[0000] Now syncing, suppress sending updates. start deferral timer  Duration=10 Key=192.168.111.21 Topic=Server
DEBU[0001] received update                               Key=192.168.111.21 Topic=Peer attributes="[]" nlri="[]" withdrawals="[]"
DEBU[0001] EOR received                                  AddressFamily=ipv4-unicast Key=192.168.111.21 Topic=Peer
INFO[0001] sync finished                                 Topic=Server

Then, all address families are marked as having received EOR (https://github.com/osrg/gobgp/blob/master/pkg/server/server.go#L1699). But when proceeding, it turns out that p.isGracefulRestartEnabled() returns false, so no updates are sent out.
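The gating described above can be sketched with a tiny model (my own hypothetical names such as peerAdvertisedGR, not GoBGP's actual code): advertisement after sync is gated on graceful restart having been negotiated by both sides, so a locally GR-configured speaker facing a non-GR peer falls into neither advertisement path.

```go
package main

import "fmt"

// peer is a minimal stand-in for GoBGP's peer state; the field names
// here are hypothetical, chosen to mirror the behavior described above.
type peer struct {
	localGRConfigured bool // this speaker is configured for graceful restart
	peerAdvertisedGR  bool // the peer sent the GR capability in its OPEN
}

// isGracefulRestartEnabled models the check that gates sending updates
// after all EORs are received: true only when BOTH sides negotiated GR.
func (p *peer) isGracefulRestartEnabled() bool {
	return p.localGRConfigured && p.peerAdvertisedGR
}

func main() {
	p := &peer{localGRConfigured: true, peerAdvertisedGR: false}
	if p.isGracefulRestartEnabled() {
		fmt.Println("flush updates to peer")
	} else {
		// This is the branch the issue describes: "sync finished" is
		// logged, but no updates are ever sent to the non-GR peer.
		fmt.Println("updates stay suppressed")
	}
}
```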

Kube-router until v1.5.1 configured graceful restart for both ipv4 and ipv6. That made the deferral timeout trigger, and then a different code path sent out advertisements, however only after the timeout.

Once the peer is restarted, the FSM knows that the peer is not configured for graceful restart, and it gets advertisements as soon as the session is up.

For the peer, a simple frr configuration is sufficient:

router bgp 10
 no bgp ebgp-requires-policy
 bgp graceful-restart-disable
 neighbor 192.168.111.1 remote-as 20
rkojedzinszky commented 1 year ago

ping

aauren commented 1 year ago

@fujita - Any chance you could take a look at this one?

@rkojedzinszky has done quite a bit of work tracking down this apparent inconsistency in GoBGP's handling of GracefulRestart when GoBGP is configured for GracefulRestart but the peer is not (see https://github.com/cloudnativelabs/kube-router/issues/1389 for the full history and context). This is causing them quite a headache, as any time kube-router is restarted, they lose all routes until they also restart their peer.

fujita commented 1 year ago

Is this a regression (did an older version of gobgp work)?

rkojedzinszky commented 1 year ago

Is this a regression (did an older version of gobgp work)?

I don't know, as I was only using kube-router. I suspect a change in kube-router triggered this bug. Earlier, a bug in kube-router masked or changed the behavior in a way that made everything seem OK, but it turns out that the applied change in kube-router is correct (https://github.com/cloudnativelabs/kube-router/pull/1327).

aauren commented 1 year ago

Elaborating just a bit in this thread (obviously all work can be found in the PR that @rkojedzinszky linked)...

Previously kube-router used to announce graceful restart for the gobgpapi.Family_AFI_IP and gobgpapi.Family_AFI_IP6 AFI families whether we used them or not. As a matter of fact, at the point in time that @rkojedzinszky linked to, the BGP implementation in kube-router was only capable of working with IPv4 or IPv6 and could not work with both at the same time.

Some BGP peer implementations were fine ignoring an AFI announced for a family that wasn't used, but on others it caused breakages unless users configured an AFI group that they weren't using and usually weren't capable of using.

However, this also seems to have worked around a bug (?) in GoBGP that @rkojedzinszky documented in the description of this issue up above.

fujita commented 1 year ago

Hmm, seems that I can't reproduce this.

btw, in the sample code, why is LocalRestarting set to true? It should not be.

rkojedzinszky commented 1 year ago

@fujita I have created my sample code based on kube-router behavior. I will need some time to pick this up again; I'll report back as soon as I can.

rkojedzinszky commented 1 year ago

Hmm, seems that I can't reproduce this.

btw, in the sample code, why is LocalRestarting set to true? It should not be.

@fujita

I could reproduce it again. So use the attached code, and make sure to use the following frr configuration:

# sh running-config 
Building configuration...

Current configuration:
!
frr version 8.4.2
frr defaults traditional
hostname frr
log syslog informational
no ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
router bgp 10
 bgp graceful-restart-disable
 neighbor 192.168.111.1 remote-as 20
 !
 address-family ipv4 unicast
  neighbor 192.168.111.1 route-map bgp-default in
  neighbor 192.168.111.1 route-map bgp-default out
 exit-address-family
exit
!
route-map bgp-default permit 10
exit
!
end

Pay attention to the bgp graceful-restart-disable statement. Ensure that the sample code is started after frr. Then I observed that after 3 minutes there were still no advertisements:

frr# sh ip bgp summary 

IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.111.55, local AS number 10 vrf-id 0
BGP table version 8
RIB entries 0, using 0 bytes of memory
Peers 1, using 724 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.111.1   4         20        42        41        0    0    0 00:03:31            0        0 N/A

Total number of neighbors 1

However, if I issue a clear ip bgp on frr, the session gets established again, and prefixes get advertised immediately:

frr# clear ip bgp *
frr# sh ip bgp summary 

IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.111.55, local AS number 10 vrf-id 0
BGP table version 10
RIB entries 0, using 0 bytes of memory
Peers 1, using 724 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.111.1   4         20        46        49        0    0    0 00:00:02         Idle        0 N/A

Total number of neighbors 1
frr# sh ip bgp summary 

IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.111.55, local AS number 10 vrf-id 0
BGP table version 10
RIB entries 0, using 0 bytes of memory
Peers 1, using 724 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.111.1   4         20        46        50        0    0    0 00:00:05       Active        0 N/A

Total number of neighbors 1
frr# sh ip bgp summary 

IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.111.55, local AS number 10 vrf-id 0
BGP table version 10
RIB entries 0, using 0 bytes of memory
Peers 1, using 724 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.111.1   4         20        46        50        0    0    0 00:00:12       Active        0 N/A

Total number of neighbors 1
frr# sh ip bgp summary 

IPv4 Unicast Summary (VRF default):
BGP router identifier 192.168.111.55, local AS number 10 vrf-id 0
BGP table version 11
RIB entries 1, using 192 bytes of memory
Peers 1, using 724 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
192.168.111.1   4         20        49        52        0    0    0 00:00:01            1        0 N/A

Total number of neighbors 1

This behavior is consistent with the Cisco IOS implementation too; we have never had issues with frr-cisco setups with GR enabled on one end and disabled on the other.

fujita commented 1 year ago

I can reproduce the issue. The following change fixed the issue.

diff --git a/main.go b/main.go
index 87bb682..eea63c7 100644
--- a/main.go
+++ b/main.go
@@ -91,7 +91,7 @@ func main() {
                                Enabled:         true,
                                RestartTime:     120,
                                DeferralTime:    10,
-                               LocalRestarting: true,
+                               // LocalRestarting: true,
                        },
                        AfiSafis: []*gobgpapi.AfiSafi{
rkojedzinszky commented 1 year ago

@aauren

I created my own example code based on kube-router, where LocalRestarting is also set to true. I think there is a good reason why it is done that way; however, I am not really familiar with such depths of the bgp/gobgp FSM. Can you please comment on it?

aauren commented 1 year ago

@rkojedzinszky This was unfortunately way before my time with kube-router. It was introduced with the graceful-restart feature here: https://github.com/cloudnativelabs/kube-router/pull/220/files#diff-d0dae64b5424393b01d606bfcabfa3e8fd82d2c466eaeb28187a4da833105273R504 and no description was given as to why it was added.

@fujita Thanks for tracking this down! We really appreciate your time and effort! I found a bit of code documentation about this feature here: https://github.com/osrg/gobgp/blob/master/internal/pkg/config/bgp_configs.go#L4246-L4252

However, I'm still not certain that I understand the implications of setting or not setting it and without understanding it better I feel hesitant to change a setting that has been in the graceful restart implementation since day 1 with kube-router.

Is there any chance that you would be willing to clarify what the option does and what the impact to it being set to true or false is?

rkojedzinszky commented 1 year ago

@fujita Also please note that in the sample code, if you enable the IPv6 address family too and follow the same sequence of starting the speakers, then, as commented in the code, a timer will fire soon and advertisements will begin to be sent. This is with LocalRestarting: true. So, it seems to be inconsistent behavior.
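The inconsistency can be illustrated with a tiny model of the two paths (hypothetical names, a sketch of the behavior described in this thread rather than GoBGP internals): when every GR-configured family receives an EOR, the "sync finished" path runs, which for a non-GR peer never advertises; when some family never receives an EOR, the deferral timer eventually flushes updates anyway.

```go
package main

import "fmt"

// advertiseTrigger models which path sends out updates, keyed on which
// GR-configured address families have received an EOR from the peer.
func advertiseTrigger(eorReceived map[string]bool) string {
	for _, got := range eorReceived {
		if !got {
			// With IPv6 also configured for GR but never getting an
			// EOR from the peer, this timer-driven path eventually
			// sends the updates (only after the deferral timeout).
			return "deferral timer fires, flush updates"
		}
	}
	// All EORs arrived: the normal path, which in the reported bug
	// never advertises to a non-GR peer.
	return "sync finished, check isGracefulRestartEnabled"
}

func main() {
	fmt.Println(advertiseTrigger(map[string]bool{"ipv4-unicast": true}))
	fmt.Println(advertiseTrigger(map[string]bool{"ipv4-unicast": true, "ipv6-unicast": false}))
}
```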

fujita commented 1 year ago

LocalRestarting is configuration for a restarting speaker: https://github.com/osrg/gobgp/blob/master/docs/sources/graceful-restart.md#restarting-speaker

It should not be set in your case.

I can't explain the behavior with IPv6 enabled. I need to investigate.

aauren commented 1 year ago

@fujita Thanks again for continuing to pursue this with us.

You mentioned that you didn't think that LocalRestarting should be set in our case. And in the case of @rkojedzinszky's example code, I think I understand what you mean because it is starting for the first time, and yet it is still setting the LocalRestarting value. However, I'm not sure if it would actually be good to disable it for the kube-router case.

First off, to ensure that I understand what this flag does correctly from your documentation: is it correct to say that it lets its peers know that it is recovering from a restart by setting the restart bit? And that this is the same procedure that is outlined by RFC 4724 here:

To re-establish the session with its peer, the Restarting Speaker MUST set the "Restart State" bit in the Graceful Restart Capability of the OPEN message. Unless allowed via configuration, the "Forwarding State" bit for an address family in the capability can be set only if the forwarding state has indeed been preserved for that address family during the restart.

If so, then I'm not sure how to proceed with this setting when it comes to kube-router as a whole. You see, kube-router is most frequently run as a pod that can start and stop at any time. Forwarding state is preserved by the fact that kube-router writes out the routing table to the host's network. However, other than that, kube-router does not preserve state across reboots, so it isn't possible for it to reliably know whether it is starting for the first time, or if it is recovering from a previous run where it had previously established graceful-restart enabled BGP sessions. However, in a typical system, the latter is more common than the former.
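To make the flag concrete: per RFC 4724, the Restart State bit is the most significant bit of the two-octet header of the Graceful Restart capability (4 flag bits followed by a 12-bit Restart Time in seconds). A minimal sketch of that encoding (my own helper, not GoBGP code):

```go
package main

import "fmt"

// encodeGRHeader packs the first two octets of the RFC 4724 Graceful
// Restart capability: 4 flag bits, with R ("Restart State") as the
// most significant bit, followed by a 12-bit Restart Time in seconds.
func encodeGRHeader(restarting bool, restartTime uint16) [2]byte {
	v := restartTime & 0x0fff // restart time occupies the low 12 bits
	if restarting {
		v |= 0x8000 // set the R ("Restart State") bit
	}
	return [2]byte{byte(v >> 8), byte(v)}
}

func main() {
	// A speaker that is NOT recovering from a restart leaves R clear.
	fmt.Printf("%x\n", encodeGRHeader(false, 120)) // 0078
	// Setting LocalRestarting corresponds to R=1 in the OPEN message.
	fmt.Printf("%x\n", encodeGRHeader(true, 120)) // 8078
}
```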

According to the documentation that you linked it says (-r being the same as setting LocalRestarting via the Go API):

Without -r option, the peers which are under helper procedure will immediately withdraw all routes which were advertised from gobgpd.

So, my understanding of all of this put together then, is:

I would assume that sometime later, after route selection is complete, these routes would then come back, but there would be a period of service outage from the routes being withdrawn, correct?

rkojedzinszky commented 1 year ago

@aauren Thanks, great explanation. We are using this feature of kube-router for disruption-free inter-node upgrades; however, for our external connections, we don't want to use GR. This way we expect that if a node goes down for any reason (even just for an upgrade), traffic instantly gets rerouted to another node.

rkojedzinszky commented 1 year ago

@fujita can we move forward somehow?

rkojedzinszky commented 1 year ago

@fujita ping

rkusler-intermedia commented 1 year ago

I'm interested in this one too. Currently we lose our BGP routes every time kubespray restarts the kube-router containers

rkusler-intermedia commented 1 year ago

@rkojedzinszky I understand that you want graceful restarts for BGP sessions between nodes and want it disabled for BGP sessions to your upstream router. In theory, IF you did not need graceful restarts for inter-node BGP sessions, would removing the --bgp-graceful-restart resolve this issue that you're seeing on sessions to the upstream routers?

rkojedzinszky commented 1 year ago

@rkojedzinszky I understand that you want graceful restarts for BGP sessions between nodes and want it disabled for BGP sessions to your upstream router. In theory, IF you did not need graceful restarts for inter-node BGP sessions, would removing the --bgp-graceful-restart resolve this issue that you're seeing on sessions to the upstream routers?

Yes, indeed. If GR is disabled in gobgp, everything works fine, and prefixes get advertised as expected to upstream routers in any scenario.

YutaroHayakawa commented 5 months ago

I think this PR fixed the issue, and the fix is released in v3.26.0.

In Cilium, we could reproduce it with the version using GoBGP v3.23.0, but couldn't with the latest main using v3.26.0. So, once someone else can confirm, I think we can close this issue.

rkojedzinszky commented 5 months ago

@YutaroHayakawa I will take a look in the next days, thanks!

YutaroHayakawa commented 5 months ago

Scratching my comment above. Seems like the way I tried was bad. Even with v3.26.0, the issue was still there.