Hi @djenriquez and thank you for reporting this! Just wanted to let you know that this looks very similar to #6580 and I'm investigating this currently.
Thanks @tgross. I saw it a few more times, but again, not really an issue since it seems to resolve right away on reschedule. Definitely seems to be some kind of race condition.
Ok, so as described in https://github.com/hashicorp/nomad/issues/6580#issuecomment-548596608, it turns out this is separate from the connectivity problem described there.
My theory from #6580 which I'll investigate here:
My hypothesis here is that when we call out to CNI to set up the veth pair, the veth pair is being set up correctly, but then we're getting an error from the bridge or iptables rules setup, which bubbles up as an error at networking_bridge_linux.go#L163-L166. In the second and third attempts, we find that the veth pair is already created, which causes the allocation to fail and get rescheduled. Maybe we're handling what the CNI plugin is giving us incorrectly, or maybe there's a bug in the CNI plugin, but it's definitely a bug either way.
tl;dr
There are two things to do here:
1. Fix the race condition in the firewall and portmap plugins upstream. I'll start on this immediately.
2. Have Nomad create the nomad bridge and iptables chains we need at startup to avoid any undiscovered or future race conditions in the CNI plugins. (See https://github.com/hashicorp/nomad/issues/6618)

Nomad's Connect integration uses CNI plugins to create the network namespaces, the bridge network, container veth pairs, and iptables rules. In the happy path case, we configure CNI using the go-cni
library and pass a chain of CNI plugin configurations. When Nomad gets a Connect-enabled job, it hands off control to the go-cni
library, which invokes the three plugins as external binaries. These external binaries in turn shell out to tools like iptables
on the host. When things go well:
1. The bridge plugin creates the nomad bridge and the veth pair for the container, placing one end inside the container as eth0.
2. The firewall plugin creates some iptables chains and associated rules.
3. The portmap plugin creates some iptables chains and associated rules.

The problem we're seeing is a race condition in creating iptables chains when multiple Connect-enabled task groups are starting on the same client node, which is of course the common case for folks who are trying out our Connect integration in a dev environment.
In order to reproduce this, you need to flush all firewall chains/rules (it's easiest to simply restart the node). One of the allocations will fail. Then Nomad will retry the allocation a few seconds later. At that point everything proceeds as planned, and we'll never hit the race condition again on the same client node.
The error messages we get back from CNI are unhelpful but I was able to determine what was happening by using execsnoop
to trace CNI's invocations of iptables.
Here are relevant excerpts from a log where I ran execsnoop -tx | tee execsnoop.log
:
# in these logs, PID 2886 is Nomad itself
# this is the portmap execution run for task group 1
$ head -1 execsnoop.log && grep 3332 execsnoop.log
TIME(s) PCOMM PID PPID RET ARGS
2.082 portmap 3332 2886 0 "/opt/cni/bin/portmap"
2.087 iptables 3342 3332 0 "/usr/sbin/iptables --version"
2.089 iptables 3343 3332 0 "/usr/sbin/iptables -t nat -S --wait"
2.090 iptables 3345 3332 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait"
2.091 iptables 3346 3332 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.101 iptables 3353 3332 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.103 iptables 3356 3332 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
2.105 iptables 3358 3332 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
2.107 iptables 3359 3332 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
...
# this is the portmap execution run for task group 2
$ head -1 execsnoop.log && grep 3334 execsnoop.log
2.082 portmap 3334 2886 0 "/opt/cni/bin/portmap"
2.089 iptables 3341 3334 0 "/usr/sbin/iptables --version"
2.091 iptables 3344 3334 0 "/usr/sbin/iptables -t nat -S --wait"
2.092 iptables 3347 3334 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.100 iptables 3352 3334 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.102 iptables 3355 3334 0 "/usr/sbin/iptables -t nat -S --wait"
2.104 iptables 3357 3334 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
# this portmap invocation stops here!
So in this situation, we're seeing a timeline like the following:
1. Task group 1's bridge plugin creates the nomad bridge and the veth pair for its container, placing one end inside the container as eth0.
2. Task group 2's bridge plugin tries to create the nomad bridge, then creates the veth pair for its container, placing one end inside the container as eth0.
3. Task group 1's firewall plugin creates some iptables rules (eliding details here as we missed the race in this run).
4. Task group 2's firewall plugin creates some iptables rules.
5. Task group 1's portmap plugin checks for the existence of the CNI-HOSTPORT-MASQ chain and doesn't find it.
6. Task group 2's portmap plugin checks for the existence of the CNI-HOSTPORT-MASQ chain and doesn't find it.
7. Task group 1's portmap plugin creates the CNI-HOSTPORT-MASQ chain.
8. Task group 2's portmap plugin tries to create the CNI-HOSTPORT-MASQ chain, but it exists, so it fails!
9. The error bubbles back up through go-cni and task group 2's allocation starts over at the top.
10. Task group 2's bridge plugin tries to create the nomad bridge and safely no-ops.
11. Task group 2's bridge plugin tries to create the veth pair and finds it already exists, so it fails!

It turns out (through a bunch of experimenting) that there are two opportunities to hit the race: one in the firewall plugin and one in the portmap plugin.
Unfortunately, because we're handing off control to the chain of CNI plugins at this point, there's not really a good way for us to handle this condition inside Nomad itself. We don't have enough information at the call site to determine what's happened. Any error message that bubbles up from the plugin comes from the iptables
call and those aren't really part of the CNI spec API contract. And once we get into this state, no matter how many times we retry we'll get the same result; the only way Nomad can handle this is by giving up on the allocation and starting over.
I dug into the code for the CNI plugins and found that portmap claims to create its iptables chain idempotently but has a TOCTOU (time-of-check/time-of-use) race condition. The firewall plugin doesn't even make this claim. I've verified my theory above by running CNI plugins built from the following patch, which is, uh, "not production ready" :grinning:. Using this patch, the jobs run every time.
diff --git a/plugins/meta/firewall/iptables.go b/plugins/meta/firewall/iptables.go
index faae35c..ded69ce 100644
--- a/plugins/meta/firewall/iptables.go
+++ b/plugins/meta/firewall/iptables.go
@@ -43,7 +43,9 @@ func ensureChain(ipt *iptables.IPTables, table, chain string) error {
}
}
- return ipt.NewChain(table, chain)
+ // DEBUG(tgross): ignore errors
+ ipt.NewChain(table, chain)
+ return nil
}
func generateFilterRule(privChainName string) []string {
diff --git a/plugins/meta/portmap/chain.go b/plugins/meta/portmap/chain.go
index bca8214..129e1c0 100644
--- a/plugins/meta/portmap/chain.go
+++ b/plugins/meta/portmap/chain.go
@@ -41,9 +41,8 @@ func (c *chain) setup(ipt *iptables.IPTables) error {
return err
}
if !exists {
- if err := ipt.NewChain(c.table, c.name); err != nil {
- return err
- }
+ // DEBUG(tgross): ignore errors
+ ipt.NewChain(c.table, c.name)
}
// Add the rules to the chain
I'm going to try to work this up into a production-ready patch for submission to the CNI project.
Also, while the iptables rules for each task group are specific to an allocation, the chains are a client-global resource. So from the Nomad side we could create the nomad bridge and all expected CNI iptables chains on startup, rather than risking concurrent creation by CNI and playing whack-a-mole with their implementation. There's probably some design discussion to be had about whether that's overreaching in terms of changing the client's state without being explicitly asked, but I'd lean towards doing it. (See https://github.com/hashicorp/nomad/issues/6618 for further discussion.)
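A minimal sketch of what that startup initialization could look like (hypothetical names and a toy in-memory model of iptables state, not Nomad's actual client code): create the client-global resources exactly once, serially, before any allocation runs, so there is nothing left for concurrent CNI plugin invocations to race on:

```go
package main

import (
	"fmt"
	"sync"
)

// Chain names the firewall/portmap plugins would otherwise race to create;
// these are the ones visible in the execsnoop trace above.
var expectedChains = []string{"CNI-HOSTPORT-SETMARK", "CNI-HOSTPORT-MASQ"}

// client is a hypothetical stand-in for the Nomad client; chains is a toy
// in-memory model of the host's iptables state.
type client struct {
	initOnce sync.Once
	chains   map[string]bool
}

// initNetworking creates the client-global resources once, serially, at
// startup, before any allocation (and therefore any CNI plugin chain) runs.
func (c *client) initNetworking() {
	c.initOnce.Do(func() {
		for _, chain := range expectedChains {
			c.chains[chain] = true // idempotent create in this toy model
		}
	})
}

func main() {
	c := &client{chains: map[string]bool{}}
	c.initNetworking()
	c.initNetworking() // safe: sync.Once runs the body only once
	fmt.Println(len(c.chains)) // 2
}
```

With the chains guaranteed to exist before any plugin runs, the plugins' check-then-create paths always take the "already exists" branch and the race window never opens.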
cc @nickethier @shoenig and @schmichael as a heads up and for discussion
I've opened https://github.com/containernetworking/plugins/pull/408 for the upstream fix.
https://github.com/containernetworking/plugins/pull/408 was merged but just missed 0.8.3, so we'll want to look out for it being released in their 0.8.4 release and then update the documentation for our Connect integration.
I'll keep this issue open until that's been released and we have our docs updated.
Glad to hear it, @tgross. So this fix would require an update to the CNI binaries; would it also require any changes on the Nomad side?
Right. No changes to Nomad.
https://github.com/containernetworking/plugins/releases/tag/v0.8.4 was just released. I'm going to get that into testing and then we'll update the docs to recommend that version.
Nomad version
Operating system and Environment details
Amazon Linux 2
Issue
An allocation being scheduled failed with:
Reproduction steps
Nothing special: an update to a job was submitted to increase its count and change some meta to force the allocations to be recreated. The rest of the allocations scheduled fine; I just saw this one. It did reschedule just fine on the same host, but I figured this should be logged.
Nomad Client logs (if appropriate)
More logs sent via email.