hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad+Nebula: Multiple Tasks with Docker Host Network Fail #7761

Open supertylerc opened 4 years ago

supertylerc commented 4 years ago

Nomad version

# nomad version
Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)

Operating system and Environment details

# uname -a
Linux nomad-client-2 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux

Issue

When using Nomad with Slack's Nebula, specifying a network_interface of nebula1 always causes a job with more than one task (or multiple jobs of similar configuration, one task each, with different ports) to fail to allocate: the follow-up evaluation gets stuck in blocked for a reason I haven't been able to determine. Setting network_interface to lo works, so I do not think there is anything wrong with my configuration. It seems most likely that some kind of validation on network_interface prevents me from using a Nebula interface. It's also possible that the way an IP is retrieved for an interface doesn't work with Nebula, though I'm not sure why (I haven't dug deeply into Nomad's code, and I'm unfortunately not well-versed in Go).

To sum up:

- client.network_interface is set to nebula1 (a Nebula overlay interface)
- more than one task requests network resources with a static port (either in one job or across similar jobs)
- the resulting evaluation sits in blocked and no allocations are placed
- switching client.network_interface to lo makes the same jobs allocate normally

Reproduction steps

0: Have a Nebula overlay deployed:

For a lighthouse/server:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse.crt
  key: /etc/nebula/lighthouse.key
static_host_map:
 "10.0.0.1": ["192.168.1.254:4242"]
lighthouse:
  interval: 60
  am_lighthouse: true
  hosts: []
listen:
  host: 0.0.0.0
  port: 4242
tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: true
  tx_queue: 500
  mtu: 1300
  routes:
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      host: any

For a node:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/node.crt
  key: /etc/nebula/node.key
static_host_map:
 "10.0.0.1": ["192.168.1.254:4242"]
lighthouse:
  interval: 60
  am_lighthouse: false
  hosts: ["10.0.0.1"]
listen:
  host: 0.0.0.0
  port: 4242
tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: true
  tx_queue: 500
  mtu: 1300
  routes:
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      host: any
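
For reference, with those files saved somewhere like /etc/nebula/config.yml (the paths are only illustrative), the overlay is started with the nebula binary on each host, and the interface name matches the tun.dev value above:

# paths are illustrative; point -config at wherever the YAML above lives
nebula -config /etc/nebula/config.yml
# confirm the overlay interface exists and carries its 10.0.0.0/16 address
ip addr show nebula1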

1: Have a Nomad server with the following configuration:

datacenter = "dc1"
data_dir = "/opt/nomad"
bind_addr = "10.0.0.1"
server {
  enabled          = true
  bootstrap_expect = 1
  server_join {
    retry_join = ["10.0.0.1"]
  }
}
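
Assuming that file is saved as something like /etc/nomad.d/server.hcl (the path is just an example), the server is started with the usual agent command:

# path is an example
nomad agent -config /etc/nomad.d/server.hcl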

2: Have a Nomad node with the following configuration:

bind_addr = "{{ GetAllInterfaces | include \"network\" \"10.0.0.0/16\" | attr \"address\" }}"
client {
  enabled = true
  server_join {
    retry_join = ["10.0.0.1"]
    retry_max = 3
    retry_interval = "15s"
  }
  network_interface = "nebula1"
}
datacenter = "dc1"
data_dir = "/opt/nomad"
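
The client is started the same way (again, the path is only an example), and nomad node status can be used afterwards to see what the client fingerprinted for the nebula1 interface:

# path is an example
nomad agent -config /etc/nomad.d/client.hcl
# from a machine that can reach the servers; shows the detected address and resources
nomad node status -verbose <node-id>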

Job file (if appropriate)

job "system-proms" {
  datacenters = ["dc1"]
  type = "system"
  group "universal-exporters" {
    task "node-exporter" {
      driver = "docker"
      config {
        image = "prom/node-exporter:v0.18.1"
        network_mode = "host"
      }
      resources {
        network {
          mode = "host"
          port "node" {
            static = "9100"
          }
        }
      }
      service {
        name = "node-exporter"
        port = "node"
      }
    }
    task "bb-exporter" {
      driver = "docker"
      config {
        image = "prom/blackbox-exporter:v0.16.0"
        network_mode = "host"
      }
      resources {
        network {
          mode = "host"
          port "bbe" {
            static = "9115"
          }
        }
      }
      service {
        name = "blackbox-exporter"
        port = "bbe"
      }
    }
  }
}
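
The job is submitted with the standard run command (the file name is only illustrative):

nomad job run system-proms.nomad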

Nomad Client logs (if appropriate)

# nomad job status -verbose system-proms
ID            = system-proms
Name          = system-proms
Submit Date   = 2020-04-20T18:58:10-07:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group           Queued  Starting  Running  Failed  Complete  Lost
universal-exporters  1       0         0        0       0         0

Evaluations
ID                                    Priority  Triggered By   Status    Placement Failures
e977fee0-7557-41bb-1b05-98160aaac003  50        queued-allocs  blocked   N/A - In Progress
a4277084-731a-5793-e060-c508c80a8c37  50        job-register   complete  true

Placement Failure
Task Group "universal-exporters":

Allocations
No allocations placed
# nomad eval status -verbose a4277084-731a-5793-e060-c508c80a8c37
ID                 = a4277084-731a-5793-e060-c508c80a8c37
Create Time        = 2020-04-20T18:58:10-07:00
Modify Time        = 2020-04-20T18:58:10-07:00
Status             = complete
Status Description = complete
Type               = system
TriggeredBy        = job-register
Job ID             = system-proms
Priority           = 50
Placement Failures = true
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

Failed Placements
Task Group "universal-exporters" (failed to place 1 allocation):

# nomad eval status -verbose e977fee0-7557-41bb-1b05-98160aaac003
ID                 = e977fee0-7557-41bb-1b05-98160aaac003
Create Time        = 2020-04-20T18:58:10-07:00
Modify Time        = 2020-04-20T18:58:10-07:00
Status             = blocked
Status Description = created to place remaining allocations
Type               = system
TriggeredBy        = queued-allocs
Priority           = 50
Placement Failures = N/A - In Progress
Previous Eval      = a4277084-731a-5793-e060-c508c80a8c37
Next Eval          = <none>
Blocked Eval       = <none>

supertylerc commented 4 years ago

This seems to occur only if I have job.group.task.resources.network defined in more than one task, and only if I set client.network_interface = "nebula1". So, to amend my summary of conditions: both have to be true at once. A single task with a network block still allocates on nebula1, and multiple tasks allocate fine when the interface is lo.

I've also tried this with a few combinations of specifying the network at the group level, but the result is ultimately the same.
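
For reference, one of those group-level variants looks roughly like the sketch below (illustrative only, not the exact file I ran); the evaluation still ends up blocked:

group "universal-exporters" {
  network {
    mode = "host"
    port "node" {
      static = 9100
    }
    port "bbe" {
      static = 9115
    }
  }
  # tasks unchanged from the original job file, minus their resources.network blocks
}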

davejhilton commented 4 years ago

Out of curiosity... does this problem go away for you if you manually set client.network_speed to what you think your link speed should be? I ran into something similar a while back when playing with nomad + nebula, and (in my case at least) Nomad kept saying network resources were exhausted once roughly one job was running, unless I manually set network_speed. I haven't bothered looking into how nomad detects link speed, but I wonder if it just can't detect that properly for nebula? (Or maybe what it does detect is very constrained?)
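
(For anyone who wants to try that: network_speed is a client-level setting, so it would go alongside network_interface in the client stanza from the repro steps. The value below is only an example.)

client {
  enabled           = true
  network_interface = "nebula1"
  # override Nomad's fingerprinted link speed (Mbit/s); 1000 is just an example value
  network_speed     = 1000
}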