srl-labs / containerlab


srl: ipv6 link-local address conflict between parent (srbase) and child (srbase-default) breaks unnumbered bgp #2280

spoignant-proton commented 3 weeks ago

I've hit this issue while trying to test an unnumbered BGP session (IPv6 link-local addresses) between two SR Linux containers running 24.7.2 or 23.10. It can be reproduced with the following minimal setup consisting of a leaf (srl/ixrd2l) and a spine (srl/ixrd3l): port e1-51 of the leaf is connected to port e1-1 of the spine.

Configuration as in the attached files: spine.txt leaf.txt

The BGP session fails to establish. A packet capture between the two containers shows that the ICMPv6 RAs are correctly exchanged and that one side attempts to establish the session. However, the other side replies with both a SYN/ACK and a RST before any OPEN has been sent, which terminates the session prematurely.

11:48:09.367030 1a:98:1f:ff:00:01 > 1a:ed:07:ff:00:33, ethertype IPv6 (0x86dd), length 94: fe80::1898:1fff:feff:1.40955 > fe80::18ed:7ff:feff:33.179: Flags [S], seq 1125336807, win 16384, options [mss 1024,sackOK,TS val 2445953177 ecr 0,nop,wscale 0], length 0
11:48:09.368821 1a:ed:07:ff:00:33 > 1a:98:1f:ff:00:01, ethertype IPv6 (0x86dd), length 94: fe80::18ed:7ff:feff:33.179 > fe80::1898:1fff:feff:1.40955: Flags [S.], seq 1164848763, ack 1125336808, win 8896, options [mss 8908,sackOK,TS val 3667797282 ecr 2445953177,nop,wscale 0], length 0
11:48:09.368874 1a:98:1f:ff:00:01 > 1a:ed:07:ff:00:33, ethertype IPv6 (0x86dd), length 74: fe80::1898:1fff:feff:1.40955 > fe80::18ed:7ff:feff:33.179: Flags [R], seq 1125336808, win 0, length 0
11:48:09.370306 1a:98:1f:ff:00:01 > 1a:ed:07:ff:00:33, ethertype IPv6 (0x86dd), length 86: fe80::1898:1fff:feff:1.40955 > fe80::18ed:7ff:feff:33.179: Flags [.], ack 1, win 16384, options [nop,nop,TS val 2445953180 ecr 3667797282], length 0
11:48:09.370341 1a:98:1f:ff:00:01 > 1a:ed:07:ff:00:33, ethertype IPv6 (0x86dd), length 143: fe80::1898:1fff:feff:1.40955 > fe80::18ed:7ff:feff:33.179: Flags [P.], seq 1:58, ack 1, win 16384, options [nop,nop,TS val 2445953180 ecr 3667797282], length 57: BGP
11:48:09.372008 1a:ed:07:ff:00:33 > 1a:98:1f:ff:00:01, ethertype IPv6 (0x86dd), length 74: fe80::18ed:7ff:feff:33.179 > fe80::1898:1fff:feff:1.40955: Flags [R], seq 1164848764, win 0, length 0
11:48:09.372036 1a:ed:07:ff:00:33 > 1a:98:1f:ff:00:01, ethertype IPv6 (0x86dd), length 74: fe80::18ed:7ff:feff:33.179 > fe80::1898:1fff:feff:1.40955: Flags [R], seq 1164848764, win 0, length 0

Additionally, pinging the peer's link-local address reports duplicates:

A:LEAF# ping network-instance default fe80::182f:1fff:feff:1%ethernet-1/51.0
Using network instance default
PING fe80::182f:1fff:feff:1%e1-51.0(fe80::182f:1fff:feff:1%e1-51.0) 56 data bytes
64 bytes from fe80::182f:1fff:feff:1%e1-51.0: icmp_seq=1 ttl=64 time=1.14 ms
64 bytes from fe80::182f:1fff:feff:1%e1-51.0: icmp_seq=1 ttl=64 time=3.17 ms (DUP!)
64 bytes from fe80::182f:1fff:feff:1%e1-51.0: icmp_seq=2 ttl=64 time=1.84 ms
64 bytes from fe80::182f:1fff:feff:1%e1-51.0: icmp_seq=2 ttl=64 time=2.78 ms (DUP!)
64 bytes from fe80::182f:1fff:feff:1%e1-51.0: icmp_seq=3 ttl=64 time=1.81 ms
64 bytes from fe80::182f:1fff:feff:1%e1-51.0: icmp_seq=3 ttl=64 time=2.81 ms (DUP!)

After further investigation, I believe this is because the parent interface inside the srbase container has a link-local address enabled that happens to be the same as the one on the child interface inside srbase-default. For example:

admin@SPINE:~$ ip netns exec srbase ip a ls dev e1-1
9384: e1-1@if9383: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9232 qdisc noqueue state UP group default 
    link/ether 1a:2f:1f:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 9
    inet6 fe80::182f:1fff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

admin@SPINE:~$ ip netns exec srbase-default ip a ls dev e1-1.0
10: e1-1.0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:2f:1f:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::182f:1fff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

Therefore, the SYN is received inside both namespaces: srbase rejects it with a RST (which causes the issue), while srbase-default accepts it.
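One way to sanity-check this attribution is to look at which namespace owns a BGP listener: srbase has none, so its stack answers the stray SYN with a RST, while the listener in srbase-default replies with the SYN/ACK. A diagnostic sketch, assuming ss is available inside the node:

# expected to print nothing: no BGP listener in srbase, so its stack RSTs incoming SYNs
sudo ip netns exec srbase ss -6tln 'sport = :179'

# the actual BGP listener lives in srbase-default
sudo ip netns exec srbase-default ss -6tln 'sport = :179'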

To confirm, I ran the following commands to get rid of the link-local address on the parent interfaces:

admin@LEAF:~$ sudo ip netns exec srbase sysctl net.ipv6.conf.e1-51.disable_ipv6=1

admin@SPINE:~$ sudo ip netns exec srbase sysctl net.ipv6.conf.e1-1.disable_ipv6=1

After this, the BGP session establishes successfully and ping no longer shows duplicate packets, which confirms the assumption above.

As another workaround, using vlan-tagging on the p2p interfaces prevents the issue from happening.
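For reference, a minimal sketch of what the vlan-tagging workaround could look like on the leaf side; only the vlan-related additions to the existing subinterface config are shown, the vlan-id value 10 is an arbitrary choice, and the matching single-tagged encapsulation must be configured on the spine side as well:

interface ethernet-1/51 {
    vlan-tagging true
    subinterface 0 {
        type routed
        vlan {
            encap {
                single-tagged {
                    vlan-id 10
                }
            }
        }
    }
}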

hellt commented 3 weeks ago

Hi @spoignant-proton

Unfortunately I can't seem to reproduce this issue. I took your config examples and created a self-contained containerlab topology:

name: v6lla

topology:
  nodes:
    leaf1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      startup-config: |
        interface ethernet-1/51 {
            description "To SPINE:e1-1"
            admin-state enable
            subinterface 0 {
                type routed
                ipv6 {
                    admin-state enable
                    router-advertisement {
                        router-role {
                            admin-state enable
                            max-advertisement-interval 120
                            min-advertisement-interval 30
                        }
                    }
                }
            }
        }
        network-instance default {
            admin-state enable
            interface ethernet-1/51.0 {
            }
            protocols {
                bgp {
                    autonomous-system 65001
                    router-id 100.65.32.4
                    dynamic-neighbors {
                        interface ethernet-1/51.0 {
                            peer-group underlay_fabric
                            allowed-peer-as [
                                65004
                            ]
                        }
                    }
                    afi-safi ipv4-unicast {
                        admin-state enable
                        multipath {
                            allow-multiple-as true
                            maximum-paths 8
                        }
                    }
                    afi-safi ipv6-unicast {
                        admin-state disable
                    }
                    group underlay_fabric {
                        admin-state enable
                    }
                }
            }
        }
    spine1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      startup-config: |
        interface ethernet-1/1 {
            description "To Leaf:e1-51"
            admin-state enable
            subinterface 0 {
                type routed
                ipv6 {
                    admin-state enable
                    router-advertisement {
                        router-role {
                            admin-state enable
                            max-advertisement-interval 120
                            min-advertisement-interval 30
                        }
                    }
                }
            }
        }
        network-instance default {
            admin-state enable
            interface ethernet-1/1.0 {
            }
            protocols {
                bgp {
                    admin-state enable
                    autonomous-system 65004
                    router-id 100.65.32.1
                    dynamic-neighbors {
                        interface ethernet-1/1.0 {
                            peer-group underlay_fabric
                            allowed-peer-as [
                                65001
                            ]
                        }
                    }
                    afi-safi ipv4-unicast {
                        admin-state enable
                        multipath {
                            allow-multiple-as true
                            maximum-paths 8
                        }
                    }
                    afi-safi ipv6-unicast {
                        admin-state disable
                    }
                    group underlay_fabric {
                        admin-state enable
                        failure-detection {
                            enable-bfd true
                            fast-failover true
                        }
                    }
                }
            }
        }
  links:
    - endpoints: [leaf1:e1-51, spine1:e1-1]

This topology embeds the configs, so the nodes should boot with the corresponding config bits applied. Once I deployed this topology, I saw the BGP session established just fine:

--{ running }--[  ]--
A:leaf1# show network-instance default protocols bgp neighbor *
-------------------------------------------------------------------------------------------------------------------------------------------------
BGP neighbor summary for network-instance "default"
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
-------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------
+----------------+-----------------------+----------------+------+---------+-------------+-------------+------------+-----------------------+
|    Net-Inst    |         Peer          |     Group      | Flag | Peer-AS |    State    |   Uptime    |  AFI/SAFI  |    [Rx/Active/Tx]     |
|                |                       |                |  s   |         |             |             |            |                       |
+================+=======================+================+======+=========+=============+=============+============+=======================+
| default        | fe80::187d:1ff:feff:1 | underlay_fabri | D    | 65004   | established | 0d:0h:0m:19 | ipv4-      | [0/0/0]               |
|                | %ethernet-1/51.0      | c              |      |         |             | s           | unicast    |                       |
+----------------+-----------------------+----------------+------+---------+-------------+-------------+------------+-----------------------+
-------------------------------------------------------------------------------------------------------------------------------------------------
Summary:
0 configured neighbors, 0 configured sessions are established, 0 disabled peers
1 dynamic peers

FWIW, my v6 LLAs are different in the two netnses:

--{ running }--[  ]--
A:spine1# bash
admin@spine1:~$ ip netns exec srbase ip a ls dev e1-1
10493: e1-1@if10494: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9232 qdisc noqueue state UP group default 
    link/ether 1a:05:01:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::a8c1:abff:fece:430c/64 scope link 
       valid_lft forever preferred_lft forever

admin@spine1:~$ ip netns exec srbase-default ip a ls dev e1-1.0
4: e1-1.0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:05:01:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::1805:1ff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

spoignant-proton commented 3 weeks ago

Hi @hellt and thanks for your reply,

That's interesting, because in your case you also have the same MAC address, yet the LLAs are different, with the one on the srbase side derived from a different MAC address. It is my understanding that I'm experiencing the issue because the LLAs are the same.
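For context, the kernel normally derives a link-local address from the interface MAC via EUI-64: flip the universal/local bit of the first octet and insert ff:fe in the middle. A minimal sketch of that derivation (plain Python, purely for illustration) shows that the conflicting LLA comes straight from the current MAC, while the non-conflicting srbase LLA above matches the EUI-64 of a different MAC, aa:c1:ab:ce:43:0c, presumably one the veth held before SR Linux reassigned it (that last part is an inference, not confirmed in this thread):

def eui64_lla(mac: str) -> str:
    """Derive the EUI-64 IPv6 link-local address for a MAC address."""
    b = [int(x, 16) for x in mac.split(":")]
    b[0] ^= 0x02  # flip the universal/local bit of the first octet
    words = [0xFE80, 0, 0, 0,
             (b[0] << 8) | b[1],   # first two octets
             (b[2] << 8) | 0xFF,   # third octet + inserted ff
             0xFE00 | b[3],        # inserted fe + fourth octet
             (b[4] << 8) | b[5]]   # last two octets
    return ":".join(f"{w:x}" for w in words)

print(eui64_lla("1a:2f:1f:ff:00:01"))  # fe80:0:0:0:182f:1fff:feff:1 (the conflicting case)
print(eui64_lla("aa:c1:ab:ce:43:0c"))  # fe80:0:0:0:a8c1:abff:fece:430c (srbase, non-failing case)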

I tried to load a new topology from the file you shared on the same host system, and I observe the same as you:

admin@spine1:~$ ip netns exec srbase ip a ls dev e1-1
10779: e1-1@if10780: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9232 qdisc noqueue state UP group default 
    link/ether 1a:b8:01:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::a8c1:abff:fe14:1ab8/64 scope link 
       valid_lft forever preferred_lft forever

admin@spine1:~$ ip netns exec srbase-default ip a ls dev e1-1.0
7: e1-1.0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:b8:01:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::18b8:1ff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

So it looks like something in my topology is preventing the LLA from being derived from that different MAC address, but it is not clear why or how I can avoid it. A key difference is that I'm bootstrapping the containers with their default config and loading the actual config to be tested only later, because my config files are too big to be loaded using the startup-config node attribute.

I'll run further tests to try to narrow this down.

hellt commented 3 weeks ago

A key difference is that I'm bootstrapping the containers with their default config and loading the actual config to be tested only later, because my config files are too big to be loaded using the startup-config node attribute.

Can you share more details about that? I can see how you might not want to include it in the topology file due to its size, but you should be able to use startup-config: cfg1.txt, where cfg1.txt contains the add-on config you want to test, and it should work regardless of the size.
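In topology terms, that would look like the sketch below; cfg1.txt is a placeholder for the actual config file sitting next to the topology file:

topology:
  nodes:
    leaf1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      # external file instead of an inline blob; size is not a limiting factor here
      startup-config: cfg1.txt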

protonjhow commented 3 weeks ago

I think we will pause this because we are under heavy time pressure to complete additional testing for an imminent deployment.

If you don't mind leaving this open a few weeks, we can come back to the deep dive. For now, we have a workaround that works well enough.

spoignant-proton commented 3 weeks ago

So the way the configuration is loaded (either directly using startup-config, or by our own means after the container is created, to overcome the size limit) is unrelated to this issue. As it turns out, the issue is triggered by device creation ordering rules, for example when we want to start leaves only after spines have been created and configured.

The following topology reproduces the issue:

name: v6lla

topology:
  nodes:
    leaf1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      stages:
        create:
          wait-for:
            - node: spine1
              stage: configure
      startup-config: leaf1.txt
    spine1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      startup-config: spine1.txt
  links:
    - endpoints: [leaf1:e1-51, spine1:e1-1]

leaf1.txt spine1.txt

Results:

A:spine1# show network-instance default protocols bgp neighbor
---------------------------------------------------------------------------------------------
BGP neighbor summary for network-instance "default"
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------
+---------+---------+---------+---------+---------+---------+---------+---------+---------+
|  Net-   |  Peer   |  Group  |  Flags  | Peer-AS |  State  | Uptime  | AFI/SAF | [Rx/Act |
|  Inst   |         |         |         |         |         |         |    I    | ive/Tx] |
+=========+=========+=========+=========+=========+=========+=========+=========+=========+
| default | fe80::1 | underla | D       |         | active  | -       |         |         |
|         | 84f:ff: | y_fabri |         |         |         |         |         |         |
|         | feff:33 | c       |         |         |         |         |         |         |
|         | %ethern |         |         |         |         |         |         |         |
|         | et-     |         |         |         |         |         |         |         |
|         | 1/1.0   |         |         |         |         |         |         |         |
+---------+---------+---------+---------+---------+---------+---------+---------+---------+
---------------------------------------------------------------------------------------------
Summary:
0 configured neighbors, 0 configured sessions are established, 0 disabled peers
1 dynamic peers

admin@spine1:~$ ip netns exec srbase ip a ls dev e1-1
10874: e1-1@if10873: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9232 qdisc noqueue state UP group default 
    link/ether 1a:53:01:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::1853:1ff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

admin@spine1:~$ ip netns exec srbase-default ip a ls dev e1-1.0
7: e1-1.0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:53:01:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::1853:1ff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

The issue no longer happens if the create section is removed from the leaf1 node definition.

spoignant-proton commented 3 weeks ago

FYI, I'm using the latest clab version:

    version: 0.59.0
     commit: 9e964727
       date: 2024-10-23T02:44:27Z
     source: https://github.com/srl-labs/containerlab
 rel. notes: https://containerlab.dev/rn/0.59/

hellt commented 3 weeks ago

I can imagine how a delayed start between nodes that share the same link might cause issues.

Do you need this stage after all? Can you live without it? Stages work best when you delay the creation of nodes that do not connect to one another (like islands of your topology) or that do not share the same link segment, e.g. when connecting over a bridge. See the sketch below.
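As an illustration of that guidance, a sketch where the delayed node has no direct link to the node it waits for; the island2-core1 node name is hypothetical, and the wait-for syntax follows the reproducer above:

topology:
  nodes:
    spine1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
    # island2-core1 shares no link with spine1, so delaying its creation is safe
    island2-core1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      stages:
        create:
          wait-for:
            - node: spine1
              stage: configure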

protonjhow commented 3 weeks ago

It's to manage the delivery of the topology inside a nominal host. Our full topology is about 50 nodes, and even with a decent-sized host it gets very sad during the initial standup because the CPUs are murdered by all the provisioning.

Stages let us break up the startup hit; once running, the topology sits happily inside the hardware we have on hand.

spoignant-proton commented 2 weeks ago

Stages were also meant to have certain parts of the network start before others, e.g. having a ready core layer before starting the fabrics. I've tried removing those and replacing them with --max-workers N to reduce the CPU stress when starting the topology. It is not obvious to me why stages should not be used in the first place. Ultimately, if a device must start at the same time as all the other devices it is connected to, then inside our relatively complex topology that amounts to saying all devices must start at the same time, and by reducing concurrency we may hit the same issue again.
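For reference, the flag is passed at deploy time; the topology file name below is a placeholder:

containerlab deploy -t topo.clab.yml --max-workers 4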

spoignant-proton commented 2 weeks ago

As expected, after removing the stages section from all SRL nodes and deploying with --max-workers 4, I'm still hitting the issue:

admin@SPINE01:~$ ip netns exec srbase ip a ls dev e1-1
38: e1-1@if37: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8986 qdisc noqueue state UP group default 
    link/ether 1a:83:1f:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::1883:1fff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

admin@SPINE01:~$ ip netns exec srbase-default ip a ls dev e1-1.0
4: e1-1.0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8968 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:83:1f:ff:00:01 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::1883:1fff:feff:1/64 scope link 
       valid_lft forever preferred_lft forever

spoignant-proton commented 2 weeks ago

Even without stages and --max-workers, the problem still randomly affects a few interfaces in a moderately sized topology. For example, one uplink on this leaf is affected while the others are not:

admin@LEAF01:~$ ip netns exec srbase ip a ls dev e1-51
756: e1-51@if755: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8986 qdisc noqueue state UP group default 
    link/ether 1a:03:07:ff:00:33 brd ff:ff:ff:ff:ff:ff link-netnsid 12
    inet6 fe80::1803:7ff:feff:33/64 scope link 
       valid_lft forever preferred_lft forever

admin@LEAF01:~$ ip netns exec srbase-default ip a ls dev e1-51.0
4: e1-51.0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8968 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:03:07:ff:00:33 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::1803:7ff:feff:33/64 scope link 
       valid_lft forever preferred_lft forever
admin@LEAF01:~$ ip netns exec srbase ip a ls dev e1-52
760: e1-52@if759: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8986 qdisc noqueue state UP group default 
    link/ether 1a:03:07:ff:00:34 brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::a8c1:abff:fe1c:cc45/64 scope link 
       valid_lft forever preferred_lft forever

admin@LEAF01:~$ ip netns exec srbase-default ip a ls dev e1-52.0
5: e1-52.0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8968 qdisc noqueue state UP group default qlen 1000
    link/ether 1a:03:07:ff:00:34 brd ff:ff:ff:ff:ff:ff link-netns srbase
    inet6 fe80::1803:7ff:feff:34/64 scope link 
       valid_lft forever preferred_lft forever

So this all looks like a race condition when creating the containers and wiring up the links. Do we ever need an LLA on the parent (srbase) interface? If not, maybe this can be avoided using the workaround described in my first post.
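Until the root cause is fixed, the first-post sysctl workaround could tentatively be automated per node with containerlab's exec hook, which runs commands inside the container after deployment. A sketch, assuming the srbase namespace and the interface already exist, and that the command runs with sufficient privileges, by the time exec fires:

topology:
  nodes:
    leaf1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux:24.7.2
      exec:
        # drop the conflicting kernel-assigned LLA on the parent interface in srbase
        - ip netns exec srbase sysctl net.ipv6.conf.e1-51.disable_ipv6=1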