Hi @djenriquez and thank you for reporting this! Just wanted to let you know that this looks very similar to #6580 and I'm investigating this currently.
Thanks @tgross. I saw it a few more times, but again, not really an issue since it seems to resolve right away on reschedule. Definitely seems to be some kind of race condition.
Ok, so as described in https://github.com/hashicorp/nomad/issues/6580#issuecomment-548596608, it turns out this is separate from the connectivity problem described there.
My theory from #6580 which I'll investigate here:
My hypothesis here is that when we call out to CNI to set up the veth pair, the veth pair is being set up correctly, but then we're getting an error from the bridge or iptables rules setup, which bubbles up as an error at networking_bridge_linux.go#L163-L166. In the second and third attempts, we find that the veth pair is already created, which causes the allocation to fail and get rescheduled. Maybe we're handling what the CNI plugin is giving us incorrectly, or maybe there's a bug in the CNI plugin, but it's definitely a bug either way.
tl;dr
There are two things to do here:
1. Fix the race condition in the firewall and portmap plugins upstream. I'll start on this immediately.
2. Have Nomad create the nomad bridge and iptables chains we need at startup to avoid any undiscovered or future race conditions in the CNI plugins. (See https://github.com/hashicorp/nomad/issues/6618)

Nomad's Connect integration uses CNI plugins to create the network namespaces, the bridge network, container veth pairs, and iptables rules. In the happy path case, we configure CNI using the go-cni
library and pass a chain of CNI plugin configurations. When Nomad gets a Connect-enabled job, it hands off control to the go-cni
library, which invokes the three plugins as external binaries. These external binaries in turn shell out to tools like iptables
on the host. When things go well:
1. The bridge plugin creates the nomad bridge and the veth pair for the container, placing one end inside the container as eth0.
2. The firewall plugin creates some iptables chains and associated rules.
3. The portmap plugin creates some iptables chains and associated rules.

The problem we're seeing is a race condition in creating iptables chains when multiple Connect-enabled task groups are starting on the same client node, which is of course the common case for folks who are trying out our Connect integration in a dev environment.
In order to reproduce this, you need to flush all firewall chains/rules (it's easiest to simply restart the node). One of the allocations will fail. Then Nomad will retry the allocation a few seconds later. At that point everything proceeds as planned, and we'll never hit the race condition again on the same client node.
The error messages we get back from CNI are unhelpful but I was able to determine what was happening by using execsnoop
to trace CNI's invocations of iptables.
Here are relevant excerpts from a log where I ran execsnoop -tx | tee execsnoop.log
:
# in these logs, PID 2886 is Nomad itself
# this is the portmap execution run for task group 1
$ head -1 execsnoop.log && grep 3332 execsnoop.log
TIME(s) PCOMM PID PPID RET ARGS
2.082 portmap 3332 2886 0 "/opt/cni/bin/portmap"
2.087 iptables 3342 3332 0 "/usr/sbin/iptables --version"
2.089 iptables 3343 3332 0 "/usr/sbin/iptables -t nat -S --wait"
2.090 iptables 3345 3332 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait"
2.091 iptables 3346 3332 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.101 iptables 3353 3332 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.103 iptables 3356 3332 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
2.105 iptables 3358 3332 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
2.107 iptables 3359 3332 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
...
# this is the portmap execution run for task group 2
$ head -1 execsnoop.log && grep 3334 execsnoop.log
2.082 portmap 3334 2886 0 "/opt/cni/bin/portmap"
2.089 iptables 3341 3334 0 "/usr/sbin/iptables --version"
2.091 iptables 3344 3334 0 "/usr/sbin/iptables -t nat -S --wait"
2.092 iptables 3347 3334 0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.100 iptables 3352 3334 0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.102 iptables 3355 3334 0 "/usr/sbin/iptables -t nat -S --wait"
2.104 iptables 3357 3334 0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
# this portmap invocation stops here!
So in this situation, we're seeing a timeline like the following:
1. Task group 1's bridge plugin creates the nomad bridge and the veth pair for its container, placing one end inside the container as eth0.
2. Task group 2's bridge plugin tries to create the nomad bridge, then creates the veth pair for its container, placing one end inside the container as eth0.
3. Task group 1's firewall plugin creates some iptables rules (eliding details here as we missed the race in this run).
4. Task group 2's firewall plugin creates some iptables rules.
5. Task group 1's portmap plugin checks for the existence of the CNI-HOSTPORT-MASQ chain and doesn't find it.
6. Task group 2's portmap plugin checks for the existence of the CNI-HOSTPORT-MASQ chain and doesn't find it.
7. Task group 1's portmap plugin creates the CNI-HOSTPORT-MASQ chain.
8. Task group 2's portmap plugin tries to create the CNI-HOSTPORT-MASQ chain, but it exists, so it fails!
9. The error bubbles back up through go-cni and task group 2's allocation starts over at the top.
10. Task group 2's bridge plugin tries to create the nomad bridge and safely no-ops.
11. Task group 2's bridge plugin tries to create the veth pair and finds it already exists, so it fails!

It turns out (through a bunch of experimenting) that there are two opportunities to hit the race: one in the firewall plugin and one in the portmap plugin.
Unfortunately, because we're handing off control to the chain of CNI plugins at this point, there's not really a good way for us to handle this condition inside Nomad itself. We don't have enough information at the call site to determine what's happened. Any error message that bubbles up from the plugin comes from the iptables
call and those aren't really part of the CNI spec API contract. And once we get into this state, no matter how many times we retry we'll get the same result; the only way Nomad can handle this is by giving up on the allocation and starting over.
I dug into the code for the CNI plugins and found that portmap claims to create its iptables chain idempotently but has a TOCTOU (time-of-check/time-of-use) race condition. The firewall plugin doesn't even make this claim. I've verified my theory above by running CNI plugins built from the following patch, which is, uh, "not production ready" :grinning:. Using this patch, the jobs run every time.
diff --git a/plugins/meta/firewall/iptables.go b/plugins/meta/firewall/iptables.go
index faae35c..ded69ce 100644
--- a/plugins/meta/firewall/iptables.go
+++ b/plugins/meta/firewall/iptables.go
@@ -43,7 +43,9 @@ func ensureChain(ipt *iptables.IPTables, table, chain string) error {
}
}
- return ipt.NewChain(table, chain)
+ // DEBUG(tgross): ignore errors
+ ipt.NewChain(table, chain)
+ return nil
}
func generateFilterRule(privChainName string) []string {
diff --git a/plugins/meta/portmap/chain.go b/plugins/meta/portmap/chain.go
index bca8214..129e1c0 100644
--- a/plugins/meta/portmap/chain.go
+++ b/plugins/meta/portmap/chain.go
@@ -41,9 +41,8 @@ func (c *chain) setup(ipt *iptables.IPTables) error {
return err
}
if !exists {
- if err := ipt.NewChain(c.table, c.name); err != nil {
- return err
- }
+ // DEBUG(tgross): ignore errors
+ ipt.NewChain(c.table, c.name)
}
// Add the rules to the chain
I'm going to try to work this up into a production-ready patch for submission to the CNI project.
Also, while the iptables rules for each task group are specific to an allocation, the chains are a client-global resource. So from the Nomad side we could create the nomad bridge and all expected CNI iptables chains on startup, rather than risking concurrent creation by CNI and playing whack-a-mole with their implementation. There's probably some design discussion to be had about whether that's overreaching in terms of changing the client's state without being explicitly asked, but I'd lean towards doing it. (See https://github.com/hashicorp/nomad/issues/6618 for further discussion.)
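A minimal sketch of what that startup initialization could look like (hypothetical names and a toy in-memory model of iptables state, not Nomad's actual client code): create the client-global resources exactly once, serially, before any allocation runs, so there is nothing left for concurrent CNI plugin invocations to race on:

```go
package main

import (
	"fmt"
	"sync"
)

// Chain names the firewall/portmap plugins would otherwise race to create;
// these are the ones visible in the execsnoop trace above.
var expectedChains = []string{"CNI-HOSTPORT-SETMARK", "CNI-HOSTPORT-MASQ"}

// client is a hypothetical stand-in for the Nomad client; chains is a toy
// in-memory model of the host's iptables state.
type client struct {
	initOnce sync.Once
	chains   map[string]bool
}

// initNetworking creates the client-global resources once, serially, at
// startup, before any allocation (and therefore any CNI plugin chain) runs.
func (c *client) initNetworking() {
	c.initOnce.Do(func() {
		for _, chain := range expectedChains {
			c.chains[chain] = true // idempotent create in this toy model
		}
	})
}

func main() {
	c := &client{chains: map[string]bool{}}
	c.initNetworking()
	c.initNetworking() // safe: sync.Once runs the body only once
	fmt.Println(len(c.chains)) // 2
}
```

With the chains guaranteed to exist before any plugin runs, the plugins' check-then-create paths always take the "already exists" branch and the race window never opens.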
cc @nickethier @shoenig and @schmichael as a heads up and for discussion
I've opened https://github.com/containernetworking/plugins/pull/408 for the upstream fix.
https://github.com/containernetworking/plugins/pull/408 was merged but just missed 0.8.3, so we'll want to look out for it being released in their 0.8.4 release and then update the documentation for our Connect integration.
I'll keep this issue open until that's been released and we have our docs updated.
Glad to hear it, @tgross. So this fix would require an update to the CNI binaries; would it also require any changes on the Nomad side?
Right. No changes to Nomad.
https://github.com/containernetworking/plugins/releases/tag/v0.8.4 was just released. I'm going to get that into testing and then we'll update the docs to recommend that version.
Nomad version
Operating system and Environment details
Amazon Linux 2
Issue
An allocation being scheduled failed with:
Reproduction steps
Nothing special: an update to a job was submitted to increase its count and change some meta to force the allocations to be recreated. The rest of the allocations scheduled fine; I just saw this one. It did reschedule just fine on the same host, but I figured this should be logged.
Nomad Client logs (if appropriate)
More logs sent via email.