projectcalico / bird

Calico's fork of the BIRD protocol stack
90 stars 86 forks source link

Use BA_NEXT_HOP attr instead of orig_gw #75

Closed nelljerram closed 4 years ago

nelljerram commented 4 years ago

While working on another BIRD problem, I noticed that we do not need the orig_gw change that we made in our fork, because I think we can access equivalent information as the BA_NEXT_HOP attribute.

Here's a diagram summarising the BIRD internal architecture and where our various hacks are:

  +--------+  [1] [2]  +---------+    [4]    +----------+  [5]
  | Local  |<--export--| BIRD    |--export-->| BGP      |<-------
  | Kernel |           | routing |           | protocol |         BGP peers
  |        |--import-->| table   |<--import--|          |------->
  +--------+    [3]    +---------+     all   +----------+

[1] calico_kernel_programming filter:

    reject if net ~ /rejectcidrs
    reject IPv4 IP pool CIDRs with VXLAN mode
    set EA_KRT_TUNNEL and accept if in IP pool with IP-IP
    accept everything else

[2] when programming route into local kernel

    netlink.c: nl_send_route: if EA_KRT_TUNNEL provided, use it, with orig_gw
    i.e. set gw and oif

        From nl_add_rte, from krt_replace_rte

    This is designed for when BIRD has received a route from a BGP
    peer (which might not be directly connected).  After receipt by
    the local BGP, and import into the routing table, the route will
    be like

        10.65.128.192/26 via <resolved next hop> dev eth0

    When BGP peers are directly connected, <resolved next hop> will be
    the peer IP.  For GCP, <resolved next hop> will be the GCP gateway
    10.240.0.1.  For cases like GCP, where we use IP-IP to get through
    the gateway, we want to program that as

        10.65.128.192/26 via <IP of host for that block> dev tunl0 onlink

    If that route came to us directly from BIRD on the relevant host,
    we could get <IP of host for that block> as the route's "from"
    attribute.  But that would be wrong if it came via an RR.  In both
    cases the right IP is whatever was in the NEXTHOP attribute of the
    BGP UPDATE message, so that's what we should use, and that's what
    our "orig_gw" change is achieving.

[3] when importing route from local kernel to BIRD routing table

    netlink.c: nl_parse_multipath: Always NEF_ONLINK in neigh_find2 call
    netlink.c: nl_parse_route: Always NEF_ONLINK in neigh_find2 call

    krt.c: krt_same_dest: Tunnel flapping change
    krt.c: krt_got_route: Tunnel flapping change

        From nl_announce_route, from nl_parse_route/nl_parse_end

[4] bgp_import_control
      bgp_update_attrs if route came from BGP
        Set next hop to self if:
          No existing next hop ||
          "next hop self" ||
          (!"next hop keep" && more conditions...)
      bgp_create_attrs otherwise
        Set next hop to self if:
          "next hop self" ||
          rta->dest != RTD_ROUTER ||
          rta->gw is none or is link local ||
          (!"next hop keep" && more conditions...)
        Otherwise set next hop to rta->gw

    filter calico_export_to_bgp_peers:

        accept block CIDRs
        reject more specific routes inside block CIDRs
        accept if net ~ /staticroutes
        accept if net ~ IPv4 IP pool CIDRs
        reject

[5] On receipt of UPDATE message from a BGP peer

bgp_rx_update
  bgp_do_rx_update
    bgp_decode_attrs
      Maps BGP UPDATE attrs to struct rta
    bgp_set_next_hop
      direct vs recursive algorithm here
      Calico: set orig_gw to BGP nexthop, before possibly changing the latter
      rta_set_recursive_next_hop
    bgp_rte_update
      rte_update2
        Run import filter
        Add to BIRD routing table

    packets.c: bgp_set_next_hop: 'a->orig_gw = *nexthop' just before rta_set_recursive_next_hop call

       This is called from bgp_do_rx_update (two versions, IPv4 and IPv6),
       from bgp_rx_update, so on receipt of a BGP UPDATE message from a
       peer.

We store orig_gw on receipt of a BGP UPDATE message (step [5]), but the nexthop attribute from the message is also stored as the BA_NEXT_HOP attribute. I previously thought that BA_NEXT_HOP could be overwritten by "next hop self" processing, but in fact that only happens when we're exporting a locally originated route to BGP - so that will never interact with tunnel route programming for a route that we've received from elsewhere. Therefore we can revert the orig_gw field change and use BA_NEXT_HOP instead.

(More generally, I'm doing this as part of a push to reduce or eliminate our diffs w.r.t. upstream BIRD.)

nelljerram commented 4 years ago

Test in progress at https://semaphoreci.com/calico/node/branches/pull-request-380/builds/1