fasaxc opened this issue 9 years ago
Yeah, we should be black-holing this; it's a known outstanding enhancement.
This actually requires an enhancement to our data model so that Felix knows what address ranges it should be blackholing. @tomdee or @spikecurtis, do you want to make sure you include this requirement in your thinking?
More helpful than just black-holing the traffic would be to generate ICMP Unreachables. Knowledge of the IP pools Calico will be using is also helpful for correctly configuring BIRD (to prevent compute nodes from exporting non-Calico routes). calico-docker has this in etcd [https://github.com/Metaswitch/calico-docker/blob/master/docs/etcdStructure.md] and BIRD config. Should be simple to extend it to Felix.
@spikecurtis If we add a 'reject' route rather than a blackhole route, that will get us ICMP Network Unreachable messages. Is that suitable, or are you worried about Network Unreachable being overbroad?
Note that if we use a 'reject' route instead of a 'blackhole' route then we will no longer proxy-arp for things that we don't have routes to. Arguably this is a desired behaviour, but it's worth knowing about.
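For concreteness, the difference between the two options is just the route type that gets programmed. Here's a minimal sketch using the vishvananda/netlink Go library; the CIDR and the library choice are illustrative assumptions, not necessarily how Felix would actually do this:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Example CIDR only; a real implementation would take this from the
	// data model (IPAM block or IP pool).
	_, cidr, err := net.ParseCIDR("10.65.0.0/26")
	if err != nil {
		log.Fatal(err)
	}

	// unix.RTN_BLACKHOLE silently drops matching packets; swapping in
	// unix.RTN_UNREACHABLE gives the 'reject' behaviour discussed above,
	// returning ICMP Destination Unreachable to the sender.
	route := &netlink.Route{
		Dst:  cidr,
		Type: unix.RTN_BLACKHOLE,
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("failed to add route: %v", err)
	}
}
```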
You should also know that by default there's already a route for each subnet on the host, pointing to the dummy interface used by dnsmasq. Not sure what to do about that.
I suppose we need to think about live migration too. While we're moving workloads around, we need to either blackhole or bounce traffic to the new destination. Responding with an unreachable would cause TCP connections to drop.
I think we should blackhole unconditionally (i.e. no ICMP unreachables) both for live migration and for initial container startup. In both there are race conditions which we are not going to try to fix, and the apps in the containers will be more robust to blackhole than ICMP unreachable.
Sad, but true. I'm convinced.
This is fixed in container deployments where we configure bird to blackhole local IPAM blocks. Still an issue in OpenStack?
Well we certainly don't have any blackhole routes in OpenStack deployments (like the ones we have with containers). So yes, I believe this is still an issue in OpenStack.
For the sake of my full understanding... In container-land, I believe we address these through BIRD config. Why do we do that, rather than by having Felix program those routes? (@fasaxc)
We have a similar situation:
This is a typical pod-traffic escape; internal pod traffic should never leave the cluster. It can be even worse if the switch and the router forward these packets out via the default gateway to the internet. The packets will never return and may exhaust our router's SNAT table, which can cause more serious issues.
> This is fixed in container deployments where we configure bird to blackhole local IPAM blocks.
I suggest we should at least block local IPAM blocks. BIRD does this with the Calico BGP networking plugin, but Canal just runs Felix without BIRD.
/cc @fasaxc @neiljerram
I wonder if the way forward here might be for Felix to program the blackhole routes, so that they're present regardless of BIRD or other routing setup?
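To make that concrete, something along these lines could work. This is a rough, hypothetical sketch; the helper name and the idea of taking the block list straight from the datastore are assumptions:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// programBlockBlackholes is a hypothetical helper: given the IPAM blocks
// allocated to this host, it installs a blackhole route for each one so
// that packets for addresses in the block that have no workload route yet
// are dropped locally instead of being forwarded to the default gateway.
// More-specific per-workload routes still take precedence.
func programBlockBlackholes(blocks []*net.IPNet) error {
	for _, block := range blocks {
		err := netlink.RouteAdd(&netlink.Route{
			Dst:  block,
			Type: unix.RTN_BLACKHOLE,
		})
		if err != nil && !errors.Is(err, unix.EEXIST) {
			return fmt.Errorf("adding blackhole for %s: %w", block, err)
		}
	}
	return nil
}

func main() {
	// Example block; in reality these would come from the datastore.
	_, block, _ := net.ParseCIDR("10.65.0.0/26")
	if err := programBlockBlackholes([]*net.IPNet{block}); err != nil {
		log.Fatal(err)
	}
}
```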
I also note - IIUC - that our current Calico/confd solution only programs blackhole routes for the IP blocks on that particular host, not for the entire IP Pool CIDR. So I think all the misrouting problems described above can still happen if a packet gets routed to a host with destination IP inside one of Calico's IP pools but outside all of the block CIDRs for that host - right? To cover that, would it work for us to program but not export blackhole routes for the whole IP Pool CIDRs? (Of course, saying "but not export" implies that we will still need cooperation from BIRD or whatever other routing daemon is running.)
@fasaxc @caseydavenport WDYT?
> that our current Calico/confd solution only programs blackhole routes for the IP blocks on that particular host, not for the entire IP Pool CIDR.
Actually, we create a DaemonSet to blackhole the entire IP Pool CIDR. We know our situation and the side effects of blocking the whole CIDR, but I'm not sure everyone can do it this way.
> To cover that, would it work for us to program but not export blackhole routes for the whole IP Pool CIDRs?
I can't immediately think of a reason this shouldn't work - we should have more specific routes for all the other blocks of addresses in use in the cluster. I'd want to think about it some more though - there may be scenarios where it doesn't fly.
There might be exotic scenarios where Calico shares IP address space with some other thing, and operators still want to use a default route for the IP Pool. Can we add the blackhole route at a low priority to make it easy for operators to overrule if they desire?
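At the kernel level, "low priority" would translate to a high route metric, so an operator route for the same prefix with a lower metric wins. A hedged sketch of that variant, with an arbitrary pool CIDR and metric:

```go
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Example pool CIDR and metric only.
	_, pool, err := net.ParseCIDR("10.65.0.0/16")
	if err != nil {
		log.Fatal(err)
	}

	// A high metric ("Priority" in netlink terms) makes this a low-priority
	// route: an operator route for the same prefix with a lower metric
	// would be preferred by the kernel.
	route := &netlink.Route{
		Dst:      pool,
		Type:     unix.RTN_BLACKHOLE,
		Priority: 1000,
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatalf("failed to add pool blackhole: %v", err)
	}
}
```

Worth noting that the metric only helps against competing routes for the exact same prefix; any more-specific route wins regardless of metric, which is what lets the per-block and per-workload routes take precedence over a pool-wide blackhole.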
CNI plugins such as the AWS one mix host and pod IPs within the same subnet, so in those environments we can't assume that the pool is 100% ours.
I believe some users also use IP pools as a way to tell Calico that an external CIDR is reachable without SNAT.
I think that rules out blackholing entire pools; we could blackhole the local blocks though.
> CNI plugins such as the AWS one mix host and pod IPs within the same subnet, so in those environments we can't assume that the pool is 100% ours.
In those scenarios we don't typically use an IP pool though, since we're not doing the networking.
TL;DR: sent packet from Jenkins server to VM IP address that wasn't in compute node's routing table yet; packet got forwarded to default LAN gateway even though it was for an address in the IP pool. Should we black-hole such traffic instead?
We hit this in the test rig; I'm not 100% sure if it's relevant in a real deployment. We're using subnets of 10/8 for our various network IP pools and we have our Jenkins test server set up with a static route for all of 10/8 that goes to one of the compute nodes. Jenkins creates a network, adds a VM to it, and then tries to ssh into the VM.
There is a race where the ssh TCP packets reach the compute node before the route to the VM is in place; until the route exists, the compute node forwards packets that were destined for the VM to its default gateway, which is on our internal LAN.
Since the internal LAN has some routes to (a different) 10/8 (I know, we're being naughty re-using addresses), our packets end up forwarded round the houses and come back (as dest unreachable) via a different route. I think that incorrect route then gets cached, causing further problems (we were seeing Jenkins trying to send packets to the VM via the default gateway instead of via the compute node).