ipspace / netlab

Making virtual networking labs suck less
https://netlab.tools

[BUG] Dell OS10: Remote AS number is missing on interface EBGP sessions #1163

Closed: ipspace closed this issue 2 months ago

ipspace commented 2 months ago

The current configuration template does not configure the neighbor AS number for interface EBGP sessions, resulting in flapping EBGP sessions and no route propagation.

Device configuration:

router bgp 65000
 router-id 10.0.0.1
 !
 address-family ipv4 unicast
  network 10.0.0.1/32
 !
 template unnumbered_ebgp
 !
 neighbor interface ethernet1/1/1
  description x1
  inherit template unnumbered_ebgp inherit-type ebgp
  send-community standard
  no shutdown
  !
  address-family ipv4 unicast
   soft-reconfiguration inbound
 !
 neighbor interface ethernet1/1/2
  description x2
  inherit template unnumbered_ebgp inherit-type ebgp
  send-community standard
  no shutdown
  !
  address-family ipv4 unicast
   soft-reconfiguration inbound
 !
 neighbor 172.16.0.4
  description x3
  remote-as 65102
  send-community standard
  no shutdown
  !
  address-family ipv4 unicast
   soft-reconfiguration inbound
ssasso commented 2 months ago

So, I found this comment in my code:

! This is an unnumbered eBGP session
! WTF Remote-AS configuration not supported for unnumbered peer

Let's see if this has been fixed with newer versions ;)

Otherwise I will declare a caveat.

ssasso commented 2 months ago
dut(config)# router bgp 65000
dut(config-router-bgp-65000)# neighbor interface ethernet1/1/1
dut(config-router-neighbor)# rem
remote-as         remove-private-as
dut(config-router-neighbor)# remote-as 123
% Error: Remote-AS configuration not supported for unnumbered peer

Again - WTF.

ssasso commented 2 months ago

By the way, the flapping does not seem to be caused by the missing remote AS (OS10 seems to accept whatever AS number comes in the OPEN message), but by this:

Dell (OS10) %BGP_NBR_BKWD_STATE_CHG: Backward state change occurred ADJCHANGE: Session down for Nbr over:ethernet1/1/1 VRF:default
Dell (OS10) %BGP_NBR_BKWD_STATE_CHG: Backward state change occurred UPDATE ERR: Invalid nexthop recvd from Nbr over:ethernet1/1/1 VRF:default
ssasso commented 2 months ago

I found an "interesting" paragraph in the OS10 docs:

Behavior of iBGP unnumbered with cumulus
By default, SmartFabric OS10 has next-hop-self configuration enabled for unnumbered peers under both IPv4 and IPv6 address families.
Routes that are sent to an iBGP unnumbered peer have Next Hop resolved with Next Hop length as 32. In Cumulus, IPv4 NLRI is
advertised with link-local Next Hop and Next Hop length as 16. IPv6 NLRI is advertised with Next Hop unchanged if you do not
configure next-hop-self; otherwise, with next-hop-self configured with link-local address the Next hop length as 16.
IPv4 NLRI with Next Hop length as 16 is accepted only if you enable the link-local-only-nexthop command for that
unnumbered peer. Otherwise, this results in an update error.
IPv6 NLRI with link-local address as Next Hop and length as 16 is accepted only if you enable the link-local-only-nexthop command for that unnumbered peer. Otherwise, this results in an update error.

Enabling that option at the template level seems to solve the issue.

I'm also testing it on OS10-to-OS10 and OS10-to-VyOS topologies.
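
For reference, a minimal sketch of what enabling that option at the template level might look like. Only the command name (link-local-only-nexthop) and the template name come from this thread; its placement under the template's address-family mode and the IPv4-only scope are assumptions and have not been checked against a running OS10 device.

```
router bgp 65000
 !
 template unnumbered_ebgp
  ! Assumption: the command is accepted in the template's address-family mode
  address-family ipv4 unicast
   link-local-only-nexthop
 !
 neighbor interface ethernet1/1/1
  description x1
  inherit template unnumbered_ebgp inherit-type ebgp
  no shutdown
```

If the command turns out to be valid only at the neighbor level, the same line would go under each neighbor's address-family block instead.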

ipspace commented 2 months ago

Now the regular EBGP unnumbered check is failing (bgp/06-unnumbered.yml). Will publish the test results once I manage to get DellOS10 to work at least once in each test ;)

ssasso commented 2 months ago

bgp/06 was the one initially referenced by this issue, and the one I used for the first testing...

This is what I get in my environment:

root@hippo:~/TOPOLOGIES/bugs/bgp06# netlab validate
[WARNING] Initial wait time extended by 30 seconds required by dellos10
[session] Check EBGP sessions with DUT (wait up to 30 seconds) [ node(s): x1,x2,x3 ]
[PASS]    x1: Neighbor eth1 (dut) is in state Established
[PASS]    x2: Neighbor eth1 (dut) is in state Established
[PASS]    x3: Neighbor 172.16.0.1 (dut) is in state Established
[PASS]    Test succeeded

[pfx_x2]  Check whether DUT propagates the X2 prefix [ node(s): x1 ]
[PASS]    x1: The prefix 172.42.42.0/24 is in the BGP table
[PASS]    Test succeeded

[pfx_x3]  Check whether DUT propagates the X3 prefix [ node(s): x1 ]
[PASS]    x1: The prefix 172.42.43.0/24 is in the BGP table
[PASS]    Test succeeded

[SUCCESS] Tests passed: 5
ipspace commented 2 months ago

Looks like I forgot (yet again) to push new code to the test server. In totally unrelated news, the Dell OS10 SSH server failure rate is currently above 80% :(( I hate this crap...

ipspace commented 2 months ago

The new test results are online: https://tests.netlab.tools/_html/dellos10-libvirt

BGP works; the only failed VRF test is the common services VRF using OSPF.

Great job, thank you!