Docker containers managed by Nomad in bridge network mode are brought back up with broken networks.

Jess3Jane commented 9 months ago

Nomad version

Nomad v1.7.4
BuildDate 2024-02-08T14:34:12Z
Revision 29019121564e2ef7f5e2a227af6b959510bcc142

Though we are hitting it in v1.7.2 as well

Operating system and Environment details

root@client-1:~# uname -a
Linux client-1 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
root@client-1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:    22.04
Codename:   jammy

We have hit this on multiple machines with slightly different versions, though all are Ubuntu 22.04. These are the details of a completely fresh Digital Ocean instance I used to reproduce the bug.

Issue

We have noticed that when we restart the Docker daemon on our machines every Nomad job on the client is brought back up with a busted network. To be more specific, it is brought up with no network. For example, my test container before restarting docker has the following networks:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 82:4c:5d:70:4e:fc brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.26.64.4/20 brd 172.26.79.255 scope global eth0
       valid_lft forever preferred_lft forever

and after restarting the daemon, is brought back up with just loopback:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever

This happens with every container, including the Nomad init container. Docker restarts the containers (as expected), the veths get recreated (as expected), but the containers now lack any interfaces other than loopback (unexpected).

Two things that might be notable: the nomad network changes from <BROADCAST,MULTICAST,UP,LOWER_UP> to <NO-CARRIER,BROADCAST,MULTICAST,UP>, and on machines with systemd-networkd, its logs complain about the veths losing carrier.
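For anyone checking their own fleet for this symptom, a minimal sketch of a helper that counts the non-loopback interfaces in a network namespace follows. The function name count_non_loopback is hypothetical, and it assumes the one-line-per-interface output of `ip -o link show` is piped in on stdin. A result of 0 matches the broken state described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nomad or Docker).
# Reads `ip -o link show` output on stdin and prints how many
# interfaces other than "lo" are present.
count_non_loopback() {
    # With ": " as the separator, field 2 is the interface name,
    # e.g. "lo" or "eth0@if8". Lines without a second field are skipped.
    awk -F': ' 'NF >= 2 && $2 != "lo" { n++ } END { print n + 0 }'
}

# Example usage (assumes the image ships the `ip` tool; otherwise
# nsenter against the container's PID is one alternative):
#   docker exec CONTAINER ip -o link show | count_non_loopback
#   nsenter -t "$(docker inspect -f '{{.State.Pid}}' CONTAINER)" -n \
#       ip -o link show | count_non_loopback
```

Here CONTAINER is a placeholder for a container name or ID from docker ps.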

Reproduction steps

Spin up a fresh Ubuntu 22.04 server (I used a Digital Ocean droplet for our reproduction but we've noticed this happening across our fleet so I don't think they're doing anything weird).
Install docker-ce as per their docs (I used Docker's apt registry to install it).
Install Nomad as per the docs (for the reproduction I specifically used the version of Nomad from HashiCorp's repos).
Install the base CNI plugins by placing the contents of https://github.com/containernetworking/plugins/releases/download/v1.0.0/cni-plugins-linux-amd64-v1.0.0.tgz into /opt/cni/bin
systemctl start docker
systemctl start nomad
Run literally any job (I've included my job file below but we've seen this happen with many jobs)
systemctl restart docker

Expected Result

The ip/port combo that the job binds should be curl-able. It is before docker is restarted.

Actual Result

If you curl the ip/port combo it will complain about having no route to host:

root@client-1:~# curl -v localhost:27846
*   Trying 127.0.0.1:27846...
*   Trying ::1:27846...
* connect to ::1 port 27846 failed: Connection refused
* connect to 127.0.0.1 port 27846 failed: No route to host
* Failed to connect to localhost port 27846 after 3061 ms: No route to host
* Closing connection 0
curl: (7) Failed to connect to localhost port 27846 after 3061 ms: No route to host

This makes sense, as executing ip addr from within the container will now reveal the container has lost its bridge network veth.

Job file (if appropriate)

We've noticed this happen with every job but the job file I used for the reproduction is:

job "jess-test-job" {
    type = "service"
    datacenters = ["*"]
    group "http" {
        network {
            mode = "bridge"
            port "http" {
                to = "80"
            }
        }
        task "whoami" {
            driver = "docker"
            config {
                image = "strm/helloworld-http"
                ports = ["http"]
            }
        }
    }
}

The toy instance I used for reproduction has a broken journal so sadly I have no logs from that to provide. If reproduction turns out to be an issue I'd be happy to send over some logs from one of our actual failing instances but I have a hunch this won't be that hard to reproduce.

Jess3Jane commented 9 months ago

Ah, I failed to mention that the reproduction was done with the default configuration that ships with Nomad so I don't think it's something weird in there breaking things.

p1u3o commented 9 months ago

I have this issue too. It seems to be caused by the Docker/Nomad service being offline for less than heartbeat_grace: Nomad doesn't consider the allocations lost and resumes them, but because Docker was offline the network namespaces are gone.

I worked around it by adding a sleep to the nomad service file which is longer than heartbeat_grace, so allocations are always considered lost and Nomad recreates them, including the network namespaces.

The Nomad cluster I use runs on fast-booting lightweight VMs (boot time under 10s), so it nearly always hits this.

...
[Service]
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecStartPre=/bin/sleep 90
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
...

Maybe https://github.com/hashicorp/nomad/pull/19886 would help when merged.
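The sleep workaround above could also be applied as a systemd drop-in instead of editing the packaged unit file. The following is only a sketch: the function name render_sleep_dropin and the drop-in file name are hypothetical, and the delay is assumed to be chosen longer than heartbeat_grace, as described in the comment above.

```shell
#!/bin/sh
# Hypothetical helper: print a systemd drop-in that delays the start
# of the Nomad agent by the given number of seconds.
# Assumption: the caller picks a delay longer than heartbeat_grace.
render_sleep_dropin() {
    delay="$1"
    # Accept only a non-empty string of digits.
    case "$delay" in
        ''|*[!0-9]*)
            echo "delay must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    printf '[Service]\nExecStartPre=/bin/sleep %s\n' "$delay"
}

# Example usage (the file name 10-start-delay.conf is an assumption):
#   mkdir -p /etc/systemd/system/nomad.service.d
#   render_sleep_dropin 90 > /etc/systemd/system/nomad.service.d/10-start-delay.conf
#   systemctl daemon-reload
```

A drop-in keeps the delay separate from the unit file shipped by the package, so a package upgrade should not silently remove it.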

apollo13 commented 9 months ago

Crosslinking #15086 for visibility.

jrasell commented 8 months ago

Hi @Jess3Jane and thanks for raising this issue with a great reproduction. I was able to reproduce this locally and have included details below for future readers. I'll add this to our backlog.

Host networking, Docker processes, and health check endpoint after initial start.

root@uk1-c1:/home/jrasell# ip addr show veth541d761a
17: veth541d761a@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master nomad state UP group default
    link/ether ea:07:d7:03:b6:b2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::e807:d7ff:fe03:b6b2/64 scope link
       valid_lft forever preferred_lft forever

root@uk1-c1:/home/jrasell# docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS         PORTS     NAMES
f8edd356ec13   redis:7                                    "docker-entrypoint.s…"   4 minutes ago   Up 4 minutes             redis-1f994fe3-06b6-dbc9-2897-72b429a61820
32d148ee127a   gcr.io/google_containers/pause-arm64:3.1   "/pause"                 4 minutes ago   Up 4 minutes             nomad_init_1f994fe3-06b6-dbc9-2897-72b429a61820

root@uk1-c1:/home/jrasell# (printf "PING\r\n";) | nc 192.168.1.121 27080
+PONG

Task events show restart of the Docker processes:

Recent Events:
Time                  Type        Description
2024-02-20T08:36:22Z  Started     Task started by client
2024-02-20T08:36:04Z  Restarting  Task restarting in 17.156781522s
2024-02-20T08:36:04Z  Terminated  Exit Code: 0
2024-02-20T08:31:14Z  Started     Task started by client
2024-02-20T08:31:14Z  Task Setup  Building Task Directory
2024-02-20T08:31:14Z  Received    Task received by client

The health check no longer responds.

root@uk1-c1:/home/jrasell# (printf "PING\r\n";) | nc 192.168.1.121 27080
root@uk1-c1:/home/jrasell#

The Nomad client host machine (I only had this test job running on my cluster) no longer has a virtual interface configured:

ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:5b:f4:27 brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.22/24 metric 100 brd 192.168.121.255 scope global dynamic enp0s1
       valid_lft 55052sec preferred_lft 55052sec
    inet6 fd6b:32d9:3793:3897:5054:ff:fe5b:f427/64 scope global dynamic mngtmpaddr noprefixroute
       valid_lft 2591912sec preferred_lft 604712sec
    inet6 fe80::5054:ff:fe5b:f427/64 scope link
       valid_lft forever preferred_lft forever
3: enp0s2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:1f:6b:0c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.121/24 brd 192.168.1.255 scope global enp0s2
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe1f:6b0c/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:6c:60:7c:18 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:6cff:fe60:7c18/64 scope link
       valid_lft forever preferred_lft forever
11: nomad: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether e2:48:45:4d:96:6e brd ff:ff:ff:ff:ff:ff
    inet 172.26.64.1/20 brd 172.26.79.255 scope global nomad
       valid_lft forever preferred_lft forever
    inet6 fe80::e048:45ff:fe4d:966e/64 scope link
       valid_lft forever preferred_lft forever

HeikoBoettger commented 8 months ago

Not sure whether this is really related, but I have a similar issue with CNI where port forwarding didn't work after all services were restarted (note: I masked the first two octets of the destination IP address):

  | plugin type="portmap" failed (add): unable to setup DNAT: running [/sbin/iptables -t nat -A CNI-DN-231ebe256ae7b6bd9006d -p tcp --dport 8084 -d 127.0.0.1 -j DNAT --to-destination x.y.70.228:8080 --wait]: exit status 4: iptables: Resource temporarily unavailable.
  | pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="portmap" failed (add): unable to setup DNAT: running [/sbin/iptables -t nat -A CNI-DN-231ebe256ae7b6bd9006d -p tcp --dport 8084 -d 127.0.0.1 -j DNAT --to-destination x.y.70.228:8080 --wait]: exit status 4: iptables: Resource temporarily unavailable.
  | failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="portmap" failed (add): unable to setup DNAT: running [/sbin/iptables -t nat -A CNI-DN-231ebe256ae7b6bd9006d -p tcp --dport 8084 -d 127.0.0.1 -j DNAT --to-destination x.y.70.228:8080 --wait]: exit status 4: iptables: Resource temporarily unavailable.

Seems like a race condition to me. In this case I would expect the job to fail and maybe retry later.
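To illustrate the "fail and retry later" behaviour suggested above, here is a minimal sketch of a retry wrapper that re-runs a command only while it exits with one specific status, such as the exit status 4 seen in the iptables errors. This is not how Nomad or the CNI portmap plugin behaves; the function name retry_on_status and its parameters are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: retry a command while it exits with a given
# "transient" status, up to a maximum number of attempts.
# Usage: retry_on_status STATUS MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_on_status() {
    want="$1"; max="$2"; delay="$3"
    shift 3
    attempt=1
    while :; do
        rc=0
        "$@" || rc=$?
        # Return immediately on success or on any other failure status.
        if [ "$rc" -ne "$want" ]; then
            return "$rc"
        fi
        # Give up once the attempt budget is spent, passing the status on.
        if [ "$attempt" -ge "$max" ]; then
            return "$rc"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example usage (SOME_COMMAND is a placeholder):
#   retry_on_status 4 5 2 SOME_COMMAND arg1 arg2
```

Limiting the retry to a single status keeps genuine configuration errors from being retried pointlessly.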

Jess3Jane commented 8 months ago

Apologies for closing this, I think GitHub did something silly with automation.

howdoicomputer commented 7 months ago

I don't need to restart Docker for this to occur. I'm not sure WHAT is triggering the change, but under bridge networking my allocations are started with just a loopback interface.
