fasaxc opened this issue 9 years ago
Yeah, we should be black-holing this; it's a known outstanding enhancement.
This actually requires an enhancement to our data model so that Felix knows what address ranges it should be blackholing. @tomdee or @spikecurtis, do you want to make sure you include this requirement in your thinking?
More helpful than just black-holing the traffic would be to generate ICMP Unreachables. Knowledge of the IP pools Calico will be using is also helpful for correctly configuring BIRD (to prevent compute nodes from exporting non-Calico routes). calico-docker has this in etcd [https://github.com/Metaswitch/calico-docker/blob/master/docs/etcdStructure.md] and BIRD config. Should be simple to extend it to Felix.
@spikecurtis If we add a 'reject' route rather than a blackhole route, that will get us ICMP Network Unreachable messages. Is that suitable, or are you worried about Network Unreachable being overbroad?
Note that if we use a 'reject' route instead of a 'blackhole' route then we will no longer proxy-arp for things that we don't have routes to. Arguably this is a desired behaviour, but it's worth knowing about.
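For concreteness, the difference between the two options is just the route type that gets programmed. Here's a minimal sketch using the vishvananda/netlink Go library; the CIDR and the library choice are illustrative assumptions, not necessarily how Felix would actually do this:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Example CIDR only; a real implementation would take this from the
	// data model (IPAM block or IP pool).
	_, cidr, err := net.ParseCIDR("10.65.0.0/26")
	if err != nil {
		log.Fatal(err)
	}

	// unix.RTN_BLACKHOLE silently drops matching packets; swapping in
	// unix.RTN_UNREACHABLE gives the 'reject' behaviour discussed above,
	// returning ICMP Destination Unreachable to the sender.
	route := &netlink.Route{
		Dst:  cidr,
		Type: unix.RTN_BLACKHOLE,
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("failed to add route: %v", err)
	}
}
```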
You should also know that by default there's already a route for each subnet on the host, pointing to the dummy interface used by dnsmasq. Not sure what to do about that.
I suppose we need to think about live migration too. While we're moving workloads around, we need to either blackhole or bounce traffic to the new destination. Responding with an unreachable would cause TCP connections to drop.
I think we should blackhole unconditionally (i.e. no ICMP unreachables) both for live migration and for initial container startup. In both there are race conditions which we are not going to try to fix, and the apps in the containers will be more robust to blackhole than ICMP unreachable.
Sad, but true. I'm convinced.
This is fixed in container deployments where we configure bird to blackhole local IPAM blocks. Still an issue in OpenStack?
Well we certainly don't have any blackhole routes in OpenStack deployments (like the ones we have with containers). So yes, I believe this is still an issue in OpenStack.
For the sake of my full understanding... In container-land, I believe we address these through BIRD config. Why do we do that, rather than by having Felix program those routes? (@fasaxc)
We have a similar situation:
This is a typical pod-traffic escape; internal pod traffic should never leave the cluster. It can be even worse if the switch and the router forward these packets out via the default gateway to the internet. The packets will never return and may exhaust our router's SNAT table, which can cause more serious issues.
> This is fixed in container deployments where we configure bird to blackhole local IPAM blocks.
I suggest we should at least block local IPAM blocks. BIRD does this with the Calico BGP networking plugin, but Canal just runs Felix without BIRD.
/cc @fasaxc @neiljerram
I wonder if the way forward here might be for Felix to program the blackhole routes, so that they're present regardless of BIRD or other routing setup?
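To make that concrete, something along these lines could work. This is a rough, hypothetical sketch; the helper name and the idea of taking the block list straight from the datastore are assumptions:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// programBlockBlackholes is a hypothetical helper: given the IPAM blocks
// allocated to this host, it installs a blackhole route for each one so
// that packets for addresses in the block that have no workload route yet
// are dropped locally instead of being forwarded to the default gateway.
// More-specific per-workload routes still take precedence.
func programBlockBlackholes(blocks []*net.IPNet) error {
	for _, block := range blocks {
		err := netlink.RouteAdd(&netlink.Route{
			Dst:  block,
			Type: unix.RTN_BLACKHOLE,
		})
		if err != nil && !errors.Is(err, unix.EEXIST) {
			return fmt.Errorf("adding blackhole for %s: %w", block, err)
		}
	}
	return nil
}

func main() {
	// Example block; in reality these would come from the datastore.
	_, block, _ := net.ParseCIDR("10.65.0.0/26")
	if err := programBlockBlackholes([]*net.IPNet{block}); err != nil {
		log.Fatal(err)
	}
}
```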
I also note - IIUC - that our current Calico/confd solution only programs blackhole routes for the IP blocks on that particular host, not for the entire IP Pool CIDR. So I think all the misrouting problems described above can still happen if a packet gets routed to a host with destination IP inside one of Calico's IP pools but outside all of the block CIDRs for that host - right? To cover that, would it work for us to program but not export blackhole routes for the whole IP Pool CIDRs? (Of course, saying "but not export" implies that we will still need cooperation from BIRD or whatever other routing daemon is running.)
@fasaxc @caseydavenport WDYT?
> that our current Calico/confd solution only programs blackhole routes for the IP blocks on that particular host, not for the entire IP Pool CIDR.
Actually, we create a DaemonSet to blackhole the entire IP Pool CIDR. We know our situation and the side effects of blocking the whole CIDR, but I'm not sure everyone can do it this way.
> To cover that, would it work for us to program but not export blackhole routes for the whole IP Pool CIDRs?
I can't immediately think of a reason this shouldn't work - we should have more specific routes for all the other blocks of addresses in use in the cluster. I'd want to think about it some more though - there may be scenarios where it doesn't fly.
There might be exotic scenarios where Calico shares IP address space with some other thing, and operators still want to use a default route for the IP Pool. Can we add the blackhole route at a low priority to make it easy for operators to overrule if they desire?
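At the kernel level, "low priority" would translate to a high route metric, so an operator route for the same prefix with a lower metric wins. A hedged sketch of that variant, with an arbitrary pool CIDR and metric:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Example pool CIDR and metric only.
	_, pool, err := net.ParseCIDR("10.65.0.0/16")
	if err != nil {
		log.Fatal(err)
	}

	// A high metric ("Priority" in netlink terms) makes this a low-priority
	// route: an operator route for the same prefix with a lower metric
	// would be preferred by the kernel.
	route := &netlink.Route{
		Dst:      pool,
		Type:     unix.RTN_BLACKHOLE,
		Priority: 1000,
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("failed to add pool blackhole: %v", err)
	}
}
```

Worth noting that the metric only helps against competing routes for the exact same prefix; any more-specific route wins regardless of metric, which is what lets the per-block and per-workload routes take precedence over a pool-wide blackhole.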
CNI plugins such as the AWS one mix host and pod IPs within the same subnet, so in those environments we can't assume that the pool is 100% ours.
I believe some users also use IP pools as a way to tell Calico that an external CIDR is reachable without SNAT.
I think that rules out blackholing entire pools; we could blackhole the local blocks though.
> CNI plugins such as the AWS one mix host and pod IPs within the same subnet, so in those environments we can't assume that the pool is 100% ours.
In those scenarios we don't typically use an IP pool though, since we're not doing the networking.
TL;DR: sent packet from Jenkins server to VM IP address that wasn't in compute node's routing table yet; packet got forwarded to default LAN gateway even though it was for an address in the IP pool. Should we black-hole such traffic instead?
We hit this in the test rig; I'm not 100% sure if it's relevant in a real deployment. We're using subnets of 10/8 for our various network IP pools and we have our Jenkins test server set up with a static route for all of 10/8 that goes to one of the compute nodes. Jenkins creates a network, adds a VM to it, and then tries to ssh into the VM.
There is a race where the ssh TCP packets reach the compute node before the route to the VM is in place; until the route exists, the compute node forwards packets that were destined for the VM to its default gateway, which is on our internal LAN.
Since the internal LAN has some routes to (a different) 10/8 (I know, we're being naughty re-using addresses), our packets end up forwarded round the houses and come back (as dest unreachable) via a different route. I think that incorrect route then gets cached, causing further problems (we were seeing Jenkins trying to send packets to the VM via the default gateway instead of via the compute node).