projectcalico / bird

Calico's fork of the BIRD protocol stack
89 stars 85 forks source link

IP-in-IP does not compose with ECMP routes #90

Open nelljerram opened 3 years ago

nelljerram commented 3 years ago

In a dual ToR setup (attempt), with ECMP routes to some node loopback IPs and pod /26 blocks, we tried enabling IP-in-IP (to address a separate bootstrapping issue), and observed that most of the routes did not change to IP-in-IP routes:

ubuntu@lance-ipv6:~$ docker exec kind-worker ip r
default via 172.31.12.2 dev eth1 src 172.31.12.3
172.31.10.0/23 src 172.31.10.4
        nexthop dev eth0 weight 1
        nexthop dev eth1 weight 1
172.31.11.0/24 dev eth0 proto kernel scope link src 172.31.11.3
172.31.12.0/24 dev eth1 proto kernel scope link src 172.31.12.3
172.31.20.0/23 src 172.31.10.4
        nexthop via 172.31.11.1 dev eth0 weight 1
        nexthop via 172.31.12.1 dev eth1 weight 1
172.31.21.0/24 via 172.31.11.1 dev eth0
172.31.22.0/24 via 172.31.12.1 dev eth1
192.168.82.0/26 via 172.31.10.3 dev tunl0 proto bird onlink
192.168.110.128/26 proto bird
        nexthop via 172.31.11.1 dev eth0 weight 1
        nexthop via 172.31.12.1 dev eth1 weight 1
blackhole 192.168.162.128/26 proto bird
192.168.162.129 dev calif2ae67c76cf scope link
192.168.162.130 dev calic1faebfe4cf scope link
192.168.162.131 dev calib66283a5cca scope link
192.168.162.133 dev calid730704be43 scope link
192.168.195.192/26 proto bird
        nexthop via 172.31.11.1 dev eth0 weight 1
        nexthop via 172.31.12.1 dev eth1 weight 1

In fact, only the non-ECMP routes change to go via the IP-in-IP tunnel device tunl0.

On reviewing the BIRD code, it's clear this is because of lack of support in the BIRD code. There are two places where the code handles EA_KRT_TUNNEL, and in both places the handling is also conditional on RTD_ROUTER - which means "a route with a single path". In both places we would need to add corresponding code for RTD_MULTIPATH.

Does this matter?

Well, it depends if there are any use cases for ECMP routes while IP-in-IP is in use. The dual ToR work unconditionally added "merge paths on" to the BIRD config, which means that any TSEE deployment would program an ECMP route if it received more than one possible path for a given prefix. We should perhaps make that conditional somehow.

nelljerram commented 3 years ago

I’ve spent some time looking at the work that would be needed, and it's quite hard. I think it would be better for us to spend equivalent time and effort on rethinking and eliminating the BIRD code hacks that we’ve made for IP-in-IP, replacing them with

I will record the relevant history, and my analysis, in following comments here.

nelljerram commented 3 years ago

I began a discussion on the BIRD ML in 2016, about upstreaming our BIRD patches for IP-in-IP: https://bird.network.cz/pipermail/bird-users/2016-September/010629.html

The maintainer (Ondrej)’s response to that was that we should only need to change BIRD to allow setting “onlink” and “iface” from a filter, and then we could achieve everything else we need in config. I now think that is probably true, and the right way to go. In fact, it might only be “onlink” that we need a code change for, because I think there is already a writable “ifname” property that a filter can set.

I began the changes, for extending our current hacks for ECMP, as you can see at https://github.com/neiljerram/bird/commit/fdf04ac68583802de69231d2c486c9d390bada5a.

However, the thing that is hard, about that, is that we don’t have storage equivalent to orig_gw for the possible paths of an ECMP route. We could probably add that, but it would need understanding when and how the route merging happens in BIRD, and from a quick look that did not seem straightforward.

As I researched this, it started looking like we might actually not need orig_gw at all, because that information is probably already available as the BA_NEXT_HOP eattr. That led me down the path of reviewing our overall design here, and of whether we need all our current hacks.

nelljerram commented 3 years ago

Here is a review of our current BIRD code changes, and the filter config that we use in conjunction with those:

  +--------+  [1] [2]  +---------+    [4]    +----------+  [5]
  | Local  |<--export--| BIRD    |--export-->| BGP      |<-------
  | Kernel |           | routing |           | protocol |         BGP peers
  |        |--import-->| table   |<--import--|          |------->
  +--------+    [3]    +---------+     all   +----------+

[1] calico_kernel_programming filter:

    reject if net ~ /rejectcidrs
    reject IPv4 IP pool CIDRs with VXLAN mode
    set EA_KRT_TUNNEL and accept if in IP pool with IP-IP
    accept everything else

    Instead of setting EA_KRT_TUNNEL:

        onlink = true
        gw = nexthop
        ifname = tunl0

[2] when programming route into local kernel

    netlink.c: nl_send_route: if EA_KRT_TUNNEL provided, use it, with orig_gw
    i.e. set gw and oif

        From nl_add_rte, from krt_replace_rte

    This is designed for when BIRD has received a route from a BGP
    peer (which might not be directly connected).  After receipt by
    the local BGP, and import into the routing table, the route will
    be like

        10.65.128.192/26 via <resolved next hop> dev eth0

    When BGP peers are directly connected, <resolved next hop> will be
    the peer IP.  For GCP, <resolved next hop> will be the GCP gateway
    10.240.0.1.  For cases like GCP, where we use IP-IP to get through
    the gateway, we want to program that as

        10.65.128.192/26 via <IP of host for that block> dev tunl0 onlink

    If that route came to us directly from BIRD on the relevant host,
    we could get <IP of host for that block> as the route's "from"
    attribute.  But that would be wrong if it came via an RR.  In both
    cases the right IP is whatever was in the NEXTHOP attribute of the
    BGP UPDATE message, so that's what we should use, and that's what
    our "orig_gw" change is achieving.

[3] when importing route from local kernel to BIRD routing table

    netlink.c: nl_parse_multipath: Always NEF_ONLINK in neigh_find2 call
    netlink.c: nl_parse_route: Always NEF_ONLINK in neigh_find2 call

    krt.c: krt_same_dest: Tunnel flapping change
    krt.c: krt_got_route: Tunnel flapping change

        From nl_announce_route, from nl_parse_route/nl_parse_end

[4] bgp_import_control
      bgp_update_attrs if route came from BGP
        Set next hop to self if:
          No existing next hop ||
          "next hop self" ||
          (!"next hop keep" && more conditions...)
      bgp_create_attrs otherwise
        Set next hop to self if:
          "next hop self" ||
          rta->dest != RTD_ROUTER ||
          rta->gw is none or is link local ||
          (!"next hop keep" && more conditions...)
        Otherwise set next hop to rta->gw

    filter calico_export_to_bgp_peers:

        accept block CIDRs
        reject more specific routes inside block CIDRs
        accept if net ~ /staticroutes
        accept if net ~ IPv4 IP pool CIDRs
        reject

[5] On receipt of UPDATE message from a BGP peer

bgp_rx_update
  bgp_do_rx_update
    bgp_decode_attrs
      Maps BGP UPDATE attrs to struct rta
    bgp_set_next_hop
      direct vs recursive algorithm here
      Calico: set orig_gw to BGP nexthop, before possibly changing the latter
      rta_set_recursive_next_hop
    bgp_rte_update
      rte_update2
        Run import filter
        Add to BIRD routing table

    packets.c: bgp_set_next_hop: 'a->orig_gw = *nexthop' just before rta_set_recursive_next_hop call

       This is called from bgp_do_rx_update (two versions, IPv4 and IPv6),
       from bgp_rx_update, so on receipt of a BGP UPDATE message from a
       peer.

(Another view of the same thing is at https://github.com/projectcalico/bird/pull/37#issuecomment-301123323, but I believe the ASCII art above is a better representation of the BIRD internal architecture, and it also shows where the relevant import and export filters are involved.)

nelljerram commented 3 years ago

(In addition to the points above where code is actually doing things, there are these infrastructure changes, between upstream and our fork, to facilitate the above:

- filter.c: Add EAF_TYPE_STRING
- filter.c: Earlier declaration of two variables used for EAF_TYPE_IP_ADDRESS
- route.h: Add 'orig_gw' field next to 'gw'
- route.h: Add EAF_TYPE_STRING
- krt-sys.h: Add EA_KRT_TUNNEL
- netlink.Y: Add KRT_TUNNEL (x2)
- netlink.c: krt_sys_get_addr: Add EA_KRT_TUNNEL
- config.h: "v1.6.3" -> "v0.3.2+birdv1.6.3"

)

nelljerram commented 3 years ago

Other useful input is from the PR at https://github.com/projectcalico/bird/pull/21. The contributor was proposing to update the iface for a tunnel route in the BIRD routing table, instead of only when exporting it to the local kernel. At the time I favoured a less dramatic change, but I now think the contributor’s approach may have been along better lines.

nelljerram commented 3 years ago

Putting all of that together, it now seems to me that - maybe -

        onlink = true
        gw = nexthop
        ifname = tunl0

Ideally, then, we’d upstream the “onlink” setting capability, and then we’d no longer need our own BIRD fork.

However, there are these things I’m not yet sure about, and that need more review or testing:

Finally, after all that, we still need to work out how to do the equivalent of that import filter code for each path of an ECMP route. But I think it would be much easier to ask about that - and to upstream any further code changes that might be needed for it - if we were using the upstream BIRD code.

nelljerram commented 3 years ago

The first step of the above plan - removing the orig_gw field and using BA_NEXT_HOP instead - has been done by https://github.com/projectcalico/bird/pull/75 and https://github.com/projectcalico/node/pull/380.

Tigera themselves have no current schedule for pursuing other pieces of the plan above, so contributions would be very welcome.

rcarpa commented 1 year ago

Hi team. This message is just to inform you that latest bird release (2.0.12) implemented setting the onlink attribute on routes https://gitlab.nic.cz/labs/bird/-/commit/7144c9ca46f092da33a4e051bbce8f973a3bd8c4

nelljerram commented 1 year ago

Thanks @rcarpa . It would take a bit of work for us to update from our current code based on 1.6.8, but we certainly might do that in future, and it's useful to know that a version of this particular tweak is now upstream.

jjqq2013 commented 1 year ago

I have confirmed that the following version still has the issue.

BIRD version v0.3.3+birdv1.6.8

Is there any version that have fixed this issue?

rcarpa commented 1 year ago

@jjqq2013 , you probably miss-understood my comment. It's about vanila upstream bird 2.0.12 supporting a feature which would allow the calico project to switch from their old fork of bird to upstream bird. When this is done, ECMP support should (tm) improve. I think it would even be possible to advertise multiple routes to the same node via bgp add-path and solve the issue of imbalanced traffic to services having externalTrafficPolicy = Local. However, as of today, there was no work done to perform this migration on the calico side. This issue is not fixed yet. Also, I have no links to the calico development team, but I would assume nobody is even working on it yet ;)

jjqq2013 commented 1 year ago

@rcarpa thank you for your explanation.

I am new to this project.

To be sure that the issue is same, please let me explain the issue I am facing. If it is a different issue, I will create a new ticket.

There is a server with has following ip routes, i.e., ECMP

100.z.z.z/m proto bgp src 100.117.134.89 metric 20
        nexthop via inet6 fe80::920a:84ff:fe7e:f16c dev eth0 weight 1
        nexthop via inet6 fe80::920a:84ff:fe7e:db6c dev eth1 weight 1

What I expect is the BIRD create ip routes like:

...
192.168.x.x/y via 100.z.z.z dev tunl0 proto bird onlink
...

according to actual peer status.

However, the actual result is like:

...
unreachable 192.168.x.x/y proto bird
...

A notable thing is that: if I don't use ECMP, then BIRD works.

i.e., if I change the route from

100.z.z.z/m proto bgp src 100.117.134.89 metric 20
        nexthop via inet6 fe80::920a:84ff:fe7e:f16c dev ens3f1.1037 weight 1
        nexthop via inet6 fe80::920a:84ff:fe7e:db6c dev ens3f0.1037 weight 1

to

100.z.z.z/m nhid 14 via inet6 fe80::920a:84ff:fe7e:f16c dev eth0 proto bgp src 100.117.134.89 metric 20

then BIRD will produce

192.168.x.x/y via 100.z.z.z dev tunl0 proto bird onlink

PS, the BIRD config actually does not matter that much, because it can produce expected above result when not using ECMP.

Just in case, let me paste part of BIRD config:

...
filter calico_kernel_programming {

  if ( net ~ 192.168.0.0/16 ) then {
    krt_tunnel = "tunl0";                     ################## this is the famous hacking
    accept;
  }

  accept;
}

router id 100.117.134.89;

# Configure synchronization between routing tables and kernel.
protocol kernel {
  learn;             # Learn all alien routes from the kernel
  persist;           # Don't remove routes on bird shutdown
  scan time 2;       # Scan kernel routing table every 2 seconds
  import all;
  export filter calico_kernel_programming; # Default is export none
  graceful restart;  # Turn on graceful restart to reduce potential flaps in
                     # routes when reloading BIRD configuration.  With a full
                     # automatic mesh, there is no way to prevent BGP from
                     # flapping since multiple nodes update their BGP
                     # configuration at the same time, GR is not guaranteed to
                     # work correctly in this scenario.
}

...

# Template for all BGP clients
template bgp bgp_template {
  debug { states };
  description "Connection to BGP peer";
  local as 64512;
  multihop;
  gateway recursive; # This should be the default, but just in case.
  import all;        # Import all routes, since we don't know what the upstream
                     # topology is and therefore have to trust the ToR/RR.
  export filter calico_export_to_bgp_peers;  # Only want to export routes for workloads.
  source address 100.117.134.89;  # The local address we use for the TCP connection
  add paths on;
  graceful restart;  # See comment in kernel section about graceful restart.
  connect delay time 2;
  connect retry time 5;
  error wait time 5,30;
}

protocol bgp Mesh_100_89_27_72 from bgp_template {
  neighbor 100.89.27.72 as 64512;
}

... many bgp peers ..., omitted here
jjqq2013 commented 1 year ago

I think the issue I mentioned happens like this:

100.z.z.z/m proto bgp src 100.117.134.89 metric 20
        nexthop via inet6 fe80::920a:84ff:fe7e:f16c dev ens3f1.1037 weight 1
        nexthop via inet6 fe80::920a:84ff:fe7e:db6c dev ens3f0.1037 weight 1

Am I correct?

jjqq2013 commented 1 year ago

If my thought is correct, I guess I need find where the logic of

BIRD want to make sure that the 100.z.z.z is reachable, so BIRD check the kernel ip route table to make it recognize ECMP route (please correct my term if it is wrong).

It would be greatly appreciated If anyone can point out the location.

For now, I am checking Coder' Document such as https://bird.network.cz/?get_doc&v=20&f=prog.html#toc2.2 to understand the source.

jjqq2013 commented 1 year ago

My static analysis shows that BIRD seems can not import following ECMP routes from protocol kernel

100.z.z.z/m proto bgp src 100.117.134.89 metric 20
        nexthop via inet6 fe80::920a:84ff:fe7e:f16c dev ens3f1.1037 weight 1
        nexthop via inet6 fe80::920a:84ff:fe7e:db6c dev ens3f0.1037 weight 1

There is a also proof from the log:

bird: KRT: Received strange multipath route 0.0.0.0/0
bird: KRT: Received strange multipath route 10.0.0.0/24
bird: KRT: Received strange multipath route 10.0.0.0/10
bird: KRT: Received strange multipath route 10.0.0.0/8
bird: KRT: Received strange multipath route 10.0.24.0/23
bird: KRT: Received strange multipath route 10.0.209.132/32
bird: KRT: Received strange multipath route 10.0.209.137/32
bird: KRT: Received strange multipath route 10.1.1.0/24
...

It happens like this:

Normally BIRD prepare hostentry with rt_update_hostentry which initially set RTD_UNREACHABLE, then change to RTD_ROUTER etc according to the kernel route definition.

static int
rt_update_hostentry(rtable *tab, struct hostentry *he)
{
  rta *old_src = he->src;
  int pxlen = 0;

  /* Reset the hostentry */ 
  he->src = NULL;
  he->gw = IPA_NONE;
  he->dest = RTD_UNREACHABLE;                   ######## this is essentially the only place which set RTD_UNREACHABLE
  he->igp_metric = 0;

  net *n = net_route(tab, he->addr, MAX_PREFIX_LENGTH);     ####### this returns NULL or an entry with RTD_UNREACHABLE.

  ...
    he->dest = RTD_ROUTER;
  ...
   he->dest = a->dest;
  ...

(Another place which set RTD_UNREACHABLE is unrelated, it is in nl_parse_route function which just copy kernel route table's RTD_UNREACHABLE)


So, I come up with following workarounds, hope either of them works, the essential idea is to tell BIRD that there is an route to any external routers.

default via x.x.x.x dev x onlink

(since Calico-BIRD only produce kernel route entries for those need IPIP tunnel, any router being treated as alive is OK).

jjqq2013 commented 1 year ago

Fixed this issue in https://github.com/jjqq2013/calico-bird/commit/716e5451f93d4acffb44c501bb307610dc23e9bb.

I have tested in an actual environment, seems works well.

Sorry I don't have much time to create a complete PR.

kavana-14 commented 2 weeks ago

I'm working on Kubernetes networking, with calico cni plugin. I want to configure the BGP BIRD that is available in the ProjectCalico repo. But that static routing format that is supported in vanilla BIRD is not supported by Calico BIRD project.
https://github.com/projectcalico/bird.git //where the repo was cloned installation step: ./configure make make install
in vanilla BIRD we used to configure like this:

protocol static {
        ipv4 { export all; };
        route 10.0.0.0/24 via 55.55.55.44;
        route 10.10.0.0/16 blackhole;
        route 10.20.0.0/20 unreachable;
        route 10.30.50.0/28 prohibit;
        route 20.20.20.0/24 via 2001:1::2;
}

 protocol static {
        ipv6 { export all; };
        route 2001:db8:1::/48 via 5555::6666;
        route 2001:db8:2::/48 blackhole;
        route 2001:db8:3::/48 prohibit;
        route 2001:db8:4::/48 unreachable;
}

but this doesnt work anymore with calico BIRD

I'm expecting the correct format of static routing for the calico bird conf file

nelljerram commented 2 weeks ago

@kavana-14 https://github.com/projectcalico/bird.git is currently still based on BIRD v1.6.8, and the BIRD config format changed between 1.6.x and 2.x. So that explains your observations.

I'm not sure it was a very big format change; I think it was mostly adding the ipv4 and ipv6 subsections. In order to make your config above work with Calico's BIRD, you might only need to remove those, i.e. to use instead

protocol static {
        export all;
        route 10.0.0.0/24 via 55.55.55.44;
        route 10.10.0.0/16 blackhole;
        route 10.20.0.0/20 unreachable;
        route 10.30.50.0/28 prohibit;
        route 20.20.20.0/24 via 2001:1::2;
}

in the BIRD ipv4 config, and

 protocol static {
        export all;
        route 2001:db8:1::/48 via 5555::6666;
        route 2001:db8:2::/48 blackhole;
        route 2001:db8:3::/48 prohibit;
        route 2001:db8:4::/48 unreachable;
}

in the BIRD ipv6 config.

Longer term, we are working on removing the differences between Calico BIRD and BIRD v1.6.8, and once we have done that it should become much easier for us to update to a current upstream BIRD.

kavana-14 commented 2 weeks ago

@nelljerram. With calico-bird we are also observing as below:

  1. bird without enable-ipv6 is not distributing ipv6 routes through BGP and
  2. bird with enable-ipv6 is not distributing ipv4 routes through BGP

Is there a way one address family redistributing routes for both ipv4 and ipv6?

nelljerram commented 2 weeks ago

In current Calico the v4 and v6 BIRD processing are independent of each other.

If you are seeing ipv4 problems, I recommend we focus first on that. What ip r output do you see on the nodes?

kavana-14 commented 2 weeks ago

@nelljerram. Tried the same thing in Vanilla BIRD, that configuration is working for both ipv4 and ipv6 static routes.

output of ip r:

$ ip r
default via 172.27.1.1 dev ens3 proto dhcp src 172.27.1.251 metric 100
32.32.32.32 via 172.27.1.102 dev ens3
169.254.169.254 via 172.27.1.2 dev ens3 proto dhcp src 172.27.1.251 metric 100
172.27.1.0/24 dev ens3 proto kernel scope link src 172.27.1.251 metric 100
172.27.1.1 dev ens3 proto dhcp scope link src 172.27.1.251 metric 100
172.27.1.2 dev ens3 proto dhcp scope link src 172.27.1.251 metric 100
192.168.68.0/26 via 172.27.1.231 dev ens3 proto 80 onlink
blackhole 192.168.161.128/26 proto 80
192.168.161.139 dev cali987dccbe7d7 scope link
192.168.161.140 dev calia0a429faea3 scope link
nelljerram commented 2 weeks ago

Great. So is there a remaining problem at the moment?

kavana-14 commented 2 weeks ago

@nelljerram. I think ip r output was from Vanilla BIRD, But while configuring the Calico's BIRD still the IPv6 address is not parsing.

Part of the code from cf-lex.l

{DIGIT}+\.{DIGIT}+\.{DIGIT}+\.{DIGIT}+ {
  ip4_addr a;
//  printf("Matched IPv4 address: %s\n", yytext);
  if (!ip4_pton(yytext, &a))
    cf_error("Invalid IPv4 address %s", yytext);

#ifdef IPV6
  cf_lval.i32 = ip4_to_u32(a);
  return RTRID;
#else
  cf_lval.a = ipa_from_ip4(a);
  return IPA;
#endif
}

({XIGIT}*::|({XIGIT}*:){3,})({XIGIT}*|{DIGIT}+\.{DIGIT}+\.{DIGIT}+\.{DIGIT}+) {
#ifdef IPV6
  if (ipa_pton(yytext, &cf_lval.a))
    return IPA;
  cf_error("Invalid IPv6 address %s", yytext);
#else
  cf_error("This is an IPv4 router, therefore IPv6 addresses are not supported");

#endif
}

bird.conf:

protocol static {
        export all;
        route 10.0.0.0/16 via 55.55.55.44;
        route 10.10.0.0/16 blackhole;
        route 10.20.0.0/20 unreachable;
        route 10.30.50.0/28 prohibit;
#       route 20.20.20.0/24 via 2001:1::2;
}

protocol static {
    route 2001:db8:1::/48 via 5555::6666;
    route 2001:db8:2::/48 blackhole;
    route 2001:db8:3::/48 prohibit;
    route 2001:db8:4::/48 unreachable;
    export all;
}

output: $ ./bird -p -c ./bird.conf bird: ./bird.conf:This is an IPv4 router, therefore IPv6 addresses are not supported

nelljerram commented 2 weeks ago

Yes, as the error says, Calico BIRD doesn't support forwarding to an IPv4 prefix via an IPv6 gateway.

kavana-14 commented 2 weeks ago

@nelljerram Why don't it support both ipv4 and ipv6 static routes. Why only either ipv4 or ipv6? Only one of those are supported. But same is supported in Vanilla BIRD. I'm new to this concept, could you please elaborate on it.