hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad does not expose CNI-issued IP address #16779

Open mrproper opened 1 year ago

mrproper commented 1 year ago

Nomad version

Nomad v1.5.2
BuildDate 2023-03-21T22:54:38Z
Revision 9a2fdb5f53dce81edf2802f0b64962e07596fd03

Operating system and Environment details

Ubuntu 22.10

Issue

When using CNI, Nomad does not show the allocation's IP address anywhere in the Nomad CLI or UI. In fact, the only way to find an allocation's IP address is to look inside the container itself, and many containers do not ship the binaries needed to inspect IP details.

Reproduction steps

Configure CNI:

$ cat frontend.conflist
{
  "cniVersion": "0.4.0",
  "name": "frontend",
  "args": { "cni": { "ips": ["10.0.10.128/24"] }},
  "plugins": [
    {
      "type": "macvlan",
      "master": "bond0.11",
      "ipam": {
        "type": "host-local",
        "dataDir": "/var/run/nomad/cni-frontend-ipam-state",
        "subnet":     "10.0.10.0/24",
        "rangeStart": "10.0.10.2",
        "rangeEnd":   "10.0.10.254",
        "gateway":    "10.0.10.1",
        "routes": [
          { "dst": "0.0.0.0/0" },
          { "dst": "169.254.1.1/32", "gw": "10.0.10.1"}
        ]
      }
    },
    { "type": "portmap", "snat": true, "capabilities": { "portMappings": true } }
  ]
}
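As a sanity check on the host-local IPAM block above, here is a small Go sketch using only the standard library's net/netip package. The type hostLocalRange and the function rangeSize are hypothetical helpers, not part of Nomad or the CNI plugins. The sketch confirms that the gateway 10.0.10.1 sits outside rangeStart..rangeEnd and counts the assignable addresses. Assuming a fresh IPAM state directory, host-local would typically hand out rangeStart (10.0.10.2) first, which is consistent with the address in the agent log further down.

```go
package main

import (
	"fmt"
	"net/netip"
)

// hostLocalRange holds the host-local IPAM fields used in frontend.conflist.
// Hypothetical helper type for illustration only.
type hostLocalRange struct {
	Subnet, RangeStart, RangeEnd, Gateway string
}

// rangeSize checks that rangeStart..rangeEnd lies inside the subnet and that
// the gateway is not assignable, then returns how many addresses the range holds.
func rangeSize(r hostLocalRange) (int, error) {
	prefix, err := netip.ParsePrefix(r.Subnet)
	if err != nil {
		return 0, err
	}
	var addrs [3]netip.Addr
	for i, s := range []string{r.RangeStart, r.RangeEnd, r.Gateway} {
		if addrs[i], err = netip.ParseAddr(s); err != nil {
			return 0, err
		}
	}
	start, end, gw := addrs[0], addrs[1], addrs[2]
	if !prefix.Contains(start) || !prefix.Contains(end) || start.Compare(end) > 0 {
		return 0, fmt.Errorf("range %s-%s is not a valid range inside %s", start, end, prefix)
	}
	if gw.Compare(start) >= 0 && gw.Compare(end) <= 0 {
		return 0, fmt.Errorf("gateway %s is inside the assignable range", gw)
	}
	// Count addresses from start to end inclusive.
	n := 1
	for a := start; a != end; a = a.Next() {
		n++
	}
	return n, nil
}

func main() {
	r := hostLocalRange{Subnet: "10.0.10.0/24", RangeStart: "10.0.10.2", RangeEnd: "10.0.10.254", Gateway: "10.0.10.1"}
	n, err := rangeSize(r)
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Printf("%d assignable addresses starting at %s\n", n, r.RangeStart)
}
```

For this configuration it prints 253 assignable addresses starting at 10.0.10.2.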

Configure the Nomad job:

job "nginx" {
  datacenters = ["dc1"]
  meta {
    deploy = "1"
  }

  group "nginx" {
    count = "1"
    network {
      mode = "cni/frontend"
    }
    task "nginx" {
      driver = "docker"
      config {
        image = "nginx:latest"
      }
    }
    service {
      name         = "www"
      port         = 80
      address_mode = "alloc"
    }
  }
}

Plan and run the job:

$ nomad job plan nginx.nomad
+ Job: "nginx"
+ Task Group: "nginx" (1 create)
  + Task: "nginx" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 nginx.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
$ nomad job run -check-index 0 nginx.nomad
==> 2023-04-04T08:17:38+10:00: Monitoring evaluation "4e42a013"
    2023-04-04T08:17:39+10:00: Evaluation triggered by job "nginx"
    2023-04-04T08:17:39+10:00: Evaluation within deployment: "020affbe"
    2023-04-04T08:17:39+10:00: Allocation "2ff40152" created: node "37df7b89", group "nginx"
    2023-04-04T08:17:39+10:00: Evaluation status changed: "pending" -> "complete"
==> 2023-04-04T08:17:39+10:00: Evaluation "4e42a013" finished with status "complete"
==> 2023-04-04T08:17:39+10:00: Monitoring deployment "020affbe"
  ✓ Deployment "020affbe" successful

    2023-04-04T08:17:58+10:00
    ID          = 020affbe
    Job ID      = nginx
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    nginx       1        1       1        0          2023-04-03T17:24:42Z

Get the status of the job and allocation:

$ nomad job status nginx
ID            = nginx
Name          = nginx
Submit Date   = 2023-04-04T03:13:09+10:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
nginx       0       0         1        0       0         0     0

Latest Deployment
ID          = 020affbe
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
nginx       1        1       1        0          2023-04-03T17:24:42Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
2ff40152  37df7b89  nginx       0        run      running  5h7m ago  5h6m ago
$ nomad alloc status 2ff40152
ID                  = 2ff40152-c4b4-8ce8-9c7e-09ec416a9f88
Eval ID             = 4e42a013
Name                = nginx.nginx[0]
Node ID             = 37df7b89
Node Name           = frontend-1-1
Job ID              = nginx
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 5h7m ago
Modified            = 5h7m ago
Deployment ID       = 020affbe
Deployment Health   = healthy

Task "nginx" is "running"
Task Resources:
CPU        Memory          Disk     Addresses
0/100 MHz  76 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2023-04-03T17:14:32Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2023-04-04T03:14:32+10:00  Started     Task started by client
2023-04-04T03:14:25+10:00  Driver      Downloading image
2023-04-04T03:14:25+10:00  Task Setup  Building Task Directory
2023-04-04T03:14:24+10:00  Received    Task received by client

Relevant logs from the Nomad agent:

    2023-04-03T17:14:25.885Z [DEBUG] client.alloc_runner.runner_hook: received result from CNI: alloc_id=2ff40152-c4b4-8ce8-9c7e-09ec416a9f88 result="{\"Interfaces\":{\"eth0\":{\"IPConfigs\":[{\"IP\":\"10.0.10.2\",\"Gateway\":\"10.0.10.1\"}],\"Mac\":\"3a:15:55:77:bc:cf\",\"Sandbox\":\"/var/run/docker/netns/4179007f7d56\"}},\"DNS\":[{}],\"Routes\":[{\"dst\":\"0.0.0.0/0\"},{\"dst\":\"169.254.1.1/32\",\"gw\":\"10.0.10.1\"}]}"
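The address Nomad received from CNI is already present in this debug line. As a stopgap for digging it out of the logs, the Go sketch below (standard library only; containerIPs and cniResult are hypothetical names, and the struct models only the fields visible in this log line) extracts the quoted result= value, unquotes it, and lists each interface's IPs. It assumes the result value is the last field on the line and uses Go-style escaping, as it does here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// cniResult keeps only the fields of the logged CNI result that this sketch reads.
type cniResult struct {
	Interfaces map[string]struct {
		IPConfigs []struct {
			IP      string
			Gateway string
		}
	}
}

// containerIPs extracts the quoted result="..." value from an agent log line,
// unquotes it, and returns "iface ip" pairs sorted by interface name.
func containerIPs(line string) ([]string, error) {
	i := strings.Index(line, "result=")
	if i < 0 {
		return nil, fmt.Errorf("no result= field in log line")
	}
	raw, err := strconv.Unquote(strings.TrimSpace(line[i+len("result="):]))
	if err != nil {
		return nil, err
	}
	var res cniResult
	if err := json.Unmarshal([]byte(raw), &res); err != nil {
		return nil, err
	}
	// Sort interface names so the output order is deterministic.
	names := make([]string, 0, len(res.Interfaces))
	for name := range res.Interfaces {
		names = append(names, name)
	}
	sort.Strings(names)
	var out []string
	for _, name := range names {
		for _, c := range res.Interfaces[name].IPConfigs {
			out = append(out, name+" "+c.IP)
		}
	}
	return out, nil
}

func main() {
	line := `2023-04-03T17:14:25.885Z [DEBUG] client.alloc_runner.runner_hook: received result from CNI: alloc_id=2ff40152-c4b4-8ce8-9c7e-09ec416a9f88 result="{\"Interfaces\":{\"eth0\":{\"IPConfigs\":[{\"IP\":\"10.0.10.2\",\"Gateway\":\"10.0.10.1\"}],\"Mac\":\"3a:15:55:77:bc:cf\",\"Sandbox\":\"/var/run/docker/netns/4179007f7d56\"}},\"DNS\":[{}],\"Routes\":[{\"dst\":\"0.0.0.0/0\"},{\"dst\":\"169.254.1.1/32\",\"gw\":\"10.0.10.1\"}]}"`
	ips, err := containerIPs(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ips)
}
```

Running it on the log line above prints [eth0 10.0.10.2].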

Expected Result

Within the output of nomad alloc status ${allocid}, you should see the container's IP, something like:

Allocation Addresses:
Label  Dynamic  Address
80     no       10.0.10.2:80

Yes, I'm aware I have no port definition in the job, but adding one seems pointless given CNI. However, when a port definition is added, the plan looks like this:

+/- Job: "nginx"
+/- Task Group: "nginx" (1 create/destroy update)
  + Network {
      Hostname: ""
    + MBits:    "0"
    + Mode:     "cni/frontend"
    + Dynamic Port {
      + HostNetwork: "default"
      + Label:       "http"
      + To:          "80"
      }
    }
  - Network {
      Hostname: ""
    - MBits:    "0"
    - Mode:     "cni/frontend"
    }
    Task: "nginx"

Scheduler dry-run:
- All tasks successfully allocated.

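After running it, the allocation addresses look like this, showing a host address and dynamic port rather than an address from the CNI subnet 10.0.10.0/24: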

Allocation Addresses (mode = "cni/frontend"):
Label  Dynamic  Address
*http  yes      172.25.1.70:24285 -> 80

Actual Result

The container IP is nowhere to be found when inspecting the job or allocation.

maxadamo commented 1 year ago

@mrproper In my case, Nomad registers the IP of the CNI network's gateway instead of the IP of the container. My CNI networks are NATed, and I can reach the container IPs from the host, but when the job is started, the gateway's IP is registered.

One note for the HashiCorp folks: we keep raising issues on GitHub that look more like support requests. I know that there is discuss.hashicorp.com, but Slack would be much better for building a community and getting immediate support. I understand that conversations in Slack vanish, but the fact that we tend to use GitHub demonstrates, IMO, that discuss.hashicorp.com does not really work.

This is my configuration. I am using VXLAN interfaces, and my service gets registered against the gateway, but I want it registered with the IP of the container, which is reachable from the host (because I am using bridging):

{
  "cniVersion": "1.0.0",
  "name": "gitea",
  "plugins": [
    {
      "type": "loopback"
    },
    {
      "type": "macvlan",
      "master": "vxbr11882895",
      "isDefaultGateway": false,
      "forceAddress": false,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "192.168.2.0/24",
              "rangeStart": "192.168.2.2",
              "rangeEnd": "192.168.2.25",
              "gateway": "192.168.2.1"
            }
          ]
        ],
        "routes": [
          {
            "dst": "0.0.0.0/0",
            "gw": "192.168.2.1"
          }
        ],
        "dataDir": "/run/cni/ipam-state"
      }
    },
    {
      "type": "firewall",
      "backend": "iptables",
      "iptablesAdminChainName": "NOMAD-ADMIN"
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      },
      "snat": true
    }
  ]
}

This is the allocation's port information:

nomad alloc status -json 81512e2b-638a-db5a-9650-0d8638b3cda3 | jq '.AllocatedResources.Shared.Ports[]'

{
  "HostIP": "192.168.2.1",
  "Label": "http_gitea",
  "To": 0,
  "Value": 3000
}
{
  "HostIP": "192.168.2.1",
  "Label": "ssh_pass",
  "To": 0,
  "Value": 2222
}
maxadamo commented 1 year ago

In my case, the solution might be here: https://github.com/hashicorp/nomad/pull/12720, though I don't know how to advertise the container IP in the job specification.

jrasell commented 1 year ago

Hi @mrproper, thanks for raising this request, and apologies that this slipped through our triage process. This seems like something we would certainly want to support, so I will put it onto our backlog. When it makes it into our current work, the engineer will assign themself.

"we keep raising issues on GitHub that look more like support requests. I know that there is discuss.hashicorp.com, but Slack would be much better for building a community and getting immediate support. I understand that conversations in Slack vanish, but the fact that we tend to use GitHub demonstrates, IMO, that discuss.hashicorp.com does not really work."

Hi @maxadamo, and thanks for the additional details you've added. In relation to the above sentence, if you're an enterprise customer with a support contract, I would encourage you to reach out to your account manager or via the support process if you require support. Outside of this, we are certainly looking at ways in which we can improve our OSS community interaction story, and I'll pass this information on to the rest of the team.

GitHub is the correct place for bug reports and feature requests, which this seems to be, unless I am mistaken. The engineering team and wider team do our best to support the OSS community, but this cannot generally be immediate due to the number of other factors that influence our working days. I hope this all makes sense.

maxadamo commented 1 year ago

@jrasell Thanks for your reply. For the sake of completeness, I am running Nomad 1.5.4 on Ubuntu 20.04. I'm trying to get my head around this, and I'm also looking at https://github.com/hashicorp/nomad/pull/12720, but:

  1. I can't seem to find a way to advertise the container IP.
  2. Why are we seeing two different behaviors (I get the gateway registered, but @mrproper doesn't get any IP registered)?
  3. Even if I use address_mode = "alloc", I keep getting the IP of the gateway.
  4. Even if I try to use a bogus IP (see screenshot: 192.168.10.10), it keeps registering the IP of the CNI's gateway.


maxadamo commented 1 year ago

I've opened this issue, but I seem to have found a solution: https://github.com/hashicorp/nomad/issues/17107

mrproper commented 1 year ago

Sorry, my response slipped. My allocations register just fine with Consul etc.; it's just that the allocation status doesn't show the CNI IP address that Nomad knows about (as seen in the debug log).
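For anyone needing the address in the meantime, one possible place to look is the allocation's JSON. ASSUMPTION: the output of nomad alloc status -json includes a NetworkStatus object with an Address field holding the CNI-assigned IP. This reflects our understanding of Nomad's allocation API and is not confirmed in this thread, so check it against your version, for example with nomad alloc status -json <alloc-id> | jq -r .NetworkStatus.Address. The Go sketch below decodes only that field from a hypothetical, trimmed sample; cniAddress and allocNetwork are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// allocNetwork models only the part of `nomad alloc status -json` output this
// sketch reads. ASSUMPTION: the allocation JSON carries a NetworkStatus object
// with an Address field; verify this against your Nomad version.
type allocNetwork struct {
	ID            string
	NetworkStatus *struct {
		InterfaceName string
		Address       string
	}
}

// cniAddress returns the address recorded in NetworkStatus, or an error when
// the field is absent or empty.
func cniAddress(raw []byte) (string, error) {
	var a allocNetwork
	if err := json.Unmarshal(raw, &a); err != nil {
		return "", err
	}
	if a.NetworkStatus == nil || a.NetworkStatus.Address == "" {
		return "", fmt.Errorf("allocation %s has no NetworkStatus address", a.ID)
	}
	return a.NetworkStatus.Address, nil
}

func main() {
	// Hypothetical, trimmed sample; the address is the one from the agent debug log.
	sample := []byte(`{"ID":"2ff40152-c4b4-8ce8-9c7e-09ec416a9f88","NetworkStatus":{"InterfaceName":"eth0","Address":"10.0.10.2"}}`)
	addr, err := cniAddress(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(addr)
}
```

If the field turns out to be missing in your version, the function returns an error instead of an empty address, which keeps a wrong assumption from failing silently.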