hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Support Consul Service Mesh on CNI networks #8953

Open timotheencl opened 3 years ago

timotheencl commented 3 years ago

Nomad version

Nomad v0.12.5 (514b0d667b57068badb43795103fb7dd3a9fbea7)

Operating system and Environment details

Ubuntu Focal amd64 20.04.1

Issue

Hi!

I would like to use a CNI macvlan network with Consul Connect so that an ingress gateway can be part of this publicly available network for clients.

However, after setting up the CNI config file, Nomad says that only "bridge" or "host" network modes are accepted.

Error submitting job: Unexpected response code: 500 (1 error occurred:
        * Consul Connect Gateway service requires Task Group with network mode of type "bridge" or "host"

)

Thanks!

My CNI config:

{
    "cniVersion": "0.4.0",
    "name": "mynet",
    "plugins": [
        {
            "type": "macvlan",
            "master": "enp0s10",
            "ipam": {
                "type": "dhcp"
            }
        },
        {
            "type": "portmap",
            "capabilities": {
                "portMappings": true
            },
            "snat": true
        }
    ]
}

And my job file:

job "http-echo" {
  datacenters = ["dc1"]

  group "ingress" {
    count = 2

    network {
      mode = "cni/mynet"
      port "inbound" {
        to     = 8080
      }
    }

    service {
      name = "http-echo-ingress"
      port = "inbound"

      connect {
        gateway {
          proxy {
            connect_timeout = "500ms"
          }
          ingress {
            listener {
              port     = 8080
              protocol = "tcp"

              service {
                name = "http-echo"
              }
            }
          }
        }
      }
    }
  }

  group "api" {
    count = 2

    network {
      mode = "bridge"
    }

    service {
      name = "http-echo"
      port = "5678"

      connect {
        sidecar_service {}
      }

      check {
        expose   = true
        type     = "http"
        path     = "/health"
        interval = "5s"
        timeout  = "2s"
      }
    }

    task "api" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args = [
          "-listen", ":5678",
          "-text", "Hello world !",
        ]
      }
    }
  }
}

nickethier commented 3 years ago

Hey @timotheenicolas

Unfortunately, this is currently not supported with CNI, but I don't know of any technical limitation. Pinging @shoenig to see if this is a simple validation change or a more in-depth one.

timotheencl commented 3 years ago

Thanks :) I think it would be a cool feature to be able to create ingress gateways that bind to their own IP on a macvlan network.

pratheekrebala commented 2 years ago

Just wanted to check if you had a chance to look at this further? Our use case is similar to @timotheenicolas's. We'd like to expose a Connect sidecar service on a CNI-based overlay network. Thank you!

sinisterstumble commented 2 years ago

Would like to see this happen too, similar use case. Thank you!

lgfa29 commented 1 year ago

Hi everyone 👋

I've been looking into this issue, but I can't seem to get Connect working. That may indicate that more work is needed than just removing the validation, or it could be that I'm not configuring my CNI network properly (probably more likely 😅).

I have some custom binaries, built with my changes, at the bottom of this page: https://github.com/hashicorp/nomad/actions/runs/4059725660. This is the diff of what is in the binaries: https://github.com/hashicorp/nomad/compare/d375f6043f2144b2e400ddd19a7e46b9f08cc1ce...120747566d92b118524650693ddf4e315e679688

Would anyone with more experience with CNI be able to test them? One important note: these binaries are for development purposes only and should not be run in production, so make sure you don't accidentally run them with your production data.

I used the sample job file that is generated from nomad job init -short -connect with a few modifications:

Thanks in advance!

lgfa29 commented 1 year ago

I have edited the title here to expand the scope to all CNI networks (so not just macvlan) and to Consul Service Mesh in general (not just ingress gateways).

netdata-be commented 8 months ago

I'm trying to use Consul Connect on my Nomad clusters. However, I'm limited by the fact that I have to lower the MTU on the bridge created by Nomad.

Since it is hard-coded, I'm not able to do that, so I thought of using a custom configuration and referring to it using mode = "cni/xxx".

But that fails because of this issue.
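
As an illustration of the custom-configuration idea, here is a minimal Python sketch that generates a bridge-type CNI config list with a lowered MTU. The network name `lowmtu`, the bridge name `lowmtu0`, the subnet, and the MTU value are all hypothetical, and the plugin chain is an assumption built from the standard CNI `loopback`, `bridge`, and `portmap` plugins rather than a copy of Nomad's built-in bridge configuration.

```python
import json


def bridge_conflist(name, bridge, subnet, mtu):
    """Build a CNI config list for a Linux bridge network with an explicit MTU.

    All values passed in are illustrative; adjust them to the environment.
    """
    return {
        "cniVersion": "0.4.0",
        "name": name,
        "plugins": [
            {"type": "loopback"},
            {
                "type": "bridge",
                "bridge": bridge,      # name of the bridge device to create
                "isGateway": True,     # give the bridge an IP so it can act as gateway
                "ipMasq": True,        # masquerade traffic leaving the bridge
                "mtu": mtu,            # the setting that motivates the custom network
                "ipam": {
                    "type": "host-local",
                    "ranges": [[{"subnet": subnet}]],
                    "routes": [{"dst": "0.0.0.0/0"}],
                },
            },
            {
                "type": "portmap",
                "capabilities": {"portMappings": True},
                "snat": True,
            },
        ],
    }


# Hypothetical values, chosen only for the example.
conf = bridge_conflist("lowmtu", "lowmtu0", "172.30.0.0/20", 1400)
print(json.dumps(conf, indent=2))
```

The printed JSON would be saved as a config list file in the Nomad client's CNI config directory and referenced with mode = "cni/lowmtu". As described in this issue, that network mode is the one currently rejected for Connect services.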

@lgfa29 Is there something I can do to help advance this (older) issue?

lgfa29 commented 8 months ago

Hi @netdata-be 👋

We're not currently working on this issue and I didn't receive feedback on the attempted fix mentioned in https://github.com/hashicorp/nomad/issues/8953#issuecomment-1411344922 and haven't had the time to validate it further.

If I were to build another set of binaries with those changes, would you be able to help validate whether the changes work?

nakermann1973 commented 6 months ago

@lgfa29 - It looks like this would solve most of my questions at https://discuss.hashicorp.com/t/configure-network-pinning-for-jobs/63434. I would be happy to test a patched version of 1.6.x or 1.7.x to validate the changes.

nakermann1973 commented 3 months ago

Hi @lgfa29 I have done some tests using your patch (applied to nomad 1.7.7). A job which includes CNI and consul connect starts correctly, but the health check uses the incorrect address.

I am using a macvlan cni config:

{
  "cniVersion": "1.0.0",
  "name": "vlan107_dhcp",
  "plugins": [
    {
      "type": "macvlan",
      "master": "eth0.107",
      "ipam": {
        "type": "dhcp"
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      },
      "snat": true
    }
  ]
}

Running ip a l in the Envoy sidecar shows that I get an address on vlan107 (assigned via DHCP). This address is also shown correctly in the top-right of the Consul service view.

2: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether fe:a6:92:0f:bc:7f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.107.139/24 brd 172.17.107.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::fca6:92ff:fe0f:bc7f/64 scope link 
       valid_lft forever preferred_lft forever

/secrets/envoy-bootstrap.cmd

connect envoy -grpc-addr unix://alloc/tmp/consul_grpc.sock -http-addr localhost:8500 -admin-bind 127.0.0.2:19001 -address 127.0.0.1:19101 -proxy-id _nomad-task-d04fe8fb-efa7-b2a6-565b-d709d1cf1a2e-group-nodered-nodered-1880-sidecar-proxy -bootstrap

There is a process (envoy?) listening on port 29130 (this is on IP 172.17.107.139):

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name    
tcp        0      0 127.0.0.2:19001         0.0.0.0:*               LISTEN      101        35993407   -                   
tcp        0      0 0.0.0.0:1880            0.0.0.0:*               LISTEN      1000       35996775   -                   
tcp        0      0 0.0.0.0:29130           0.0.0.0:*               LISTEN      101        35993417   -                   

Consul is trying to health check the nomad client's address though (dial tcp 172.17.17.234:29130: i/o timeout).
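
The mismatch can be confirmed from the addresses shown above alone. This is a minimal Python sketch, using only the standard library, that tests whether the address Consul dials lies inside the subnet of the allocation's CNI interface; the two values are copied from the output in this comment.

```python
import ipaddress

# Address and prefix of eth0 inside the allocation, from the `ip a l` output.
alloc_iface = ipaddress.ip_interface("172.17.107.139/24")

# Address Consul dials for the health check, from the error message.
check_target = ipaddress.ip_address("172.17.17.234")

# True only if the check is aimed at the CNI-provided network.
on_cni_network = check_target in alloc_iface.network
print(f"check target on {alloc_iface.network}: {on_cni_network}")
# prints "check target on 172.17.107.0/24: False"
```

The check target is outside the macvlan subnet, which matches the observation that Consul is dialling the Nomad client's address rather than the allocation's.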


From the docs (https://developer.hashicorp.com/nomad/docs/job-specification/service#address_mode), I expected the Consul check of the sidecar to use the IP provided by CNI. My service stanza is:

    service {
      name = "nodered"
      address_mode = "alloc"
      port = 1880
      connect {
        sidecar_service {
          proxy {}
        }
      }
    }

I can do more tests. Please let me know if anything more would help.

lgfa29 commented 3 months ago

Ah nice, thanks for testing it @nakermann1973, I'm glad it kind of works 😅

Health checks are an interesting point. First you need to make sure the Consul agent would be able to reach the service at the IP:port allocated by the CNI plugin. Next we need a way to tell Nomad to use that IP:port as well.

For the first part, I'm not sure there's a single way to fix it. Each environment will need to be configured to fulfill this requirement.

The second part may require some code changes in how Nomad registers the service (and its health check) in Consul. If you run nomad job inspect <job ID> do you see any health checks in the sidecar or your task?
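
For anyone answering that question, a small script can list every check declared in the inspect output. This is a minimal Python sketch, assuming the JSON has a top-level `Job` object with `TaskGroups`, each holding group-level `Services` and `Tasks` with their own `Services`, and `Checks` under each service. Those field names are an assumption about the JSON job format and should be compared with the real output; the `sample` document is hypothetical and trimmed down.

```python
import json


def list_checks(job_doc):
    """Return (group, owner, service, check) tuples for every declared check.

    `owner` is "group" for group-level services, otherwise the task name.
    Field names are assumed, not taken from this thread.
    """
    job = job_doc.get("Job", job_doc)
    found = []
    for group in job.get("TaskGroups") or []:
        # Group-level services, where Connect sidecars are declared.
        for svc in group.get("Services") or []:
            for chk in svc.get("Checks") or []:
                found.append((group["Name"], "group", svc["Name"], chk))
        # Task-level services.
        for task in group.get("Tasks") or []:
            for svc in task.get("Services") or []:
                for chk in svc.get("Checks") or []:
                    found.append((group["Name"], task["Name"], svc["Name"], chk))
    return found


# Hypothetical, trimmed-down document used only to exercise the function.
sample = {
    "Job": {
        "TaskGroups": [
            {
                "Name": "nodered",
                "Services": [
                    {"Name": "nodered", "AddressMode": "alloc", "Checks": None}
                ],
                "Tasks": [{"Name": "nodered", "Services": None}],
            }
        ]
    }
}

checks = list_checks(sample)
print(len(checks), "checks declared in the job")
```

To run it against a real job, the output of nomad job inspect <job ID> can be saved to a file and loaded with `json.load` in place of `sample`.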

And as a last note, I no longer work for HashiCorp, so I probably won't be able to help much on this issue any more.

nakermann1973 commented 2 months ago

I rolled back to the production release because, with this patch, health checks seemed to be failing across multiple services. I didn't dig into it too much, as my focus was on recovering the failing services.

"do you see any health checks in the sidecar or your task"

I don't recall seeing any when I inspected the job.