hashicorp / nomad


Constraint ${attr.consul.grpc} > 0: nodes excluded by filter #12111

Closed: locinus closed this issue 10 months ago

locinus commented 2 years ago

Nomad version

1.2.3

Operating system and Environment details

Debian 10

Issue

While trying to set up a sidecar service using the Docker driver, such as the countdash app from the official example, we run into this failure during deployment:

* Constraint "${attr.consul.grpc} > 0": 2 nodes excluded by filter

We are, however, able to deploy a job with no sidecar successfully.

We looked for configuration errors but are currently clueless as to the origin of this constraint failure. Google has no reference to this error. Any help greatly appreciated!
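
For reference, here is a trimmed sketch of the kind of job that fails for us, modeled on the API half of the official countdash Connect example (the image name and port come from that tutorial; only the datacenter is ours):

job "countdash" {
  datacenters = ["mydc"]

  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpdev/counter-api:v3"
      }
    }
  }
}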

Config files

/etc/consul.d/consul.hcl

datacenter = "mydc"
domain = "mydomain.com"

data_dir = "/home/consul/opt"
encrypt = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/XXXXXX="
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true

ca_file = "/etc/consul.d/certs/mydomain.com-agent-ca.pem"
cert_file = "/etc/consul.d/certs/cert-server-mydomain.com-1.pem"
key_file = "/etc/consul.d/certs/key-server-mydomain.com-1-key.pem"

retry_join = ["xxx.xxx.xxx.xxx", "yyy.yyy.yyy.yyy", "zzz.zzz.zzz.zzz"]

/etc/consul.d/server.hcl

server = true
bootstrap_expect = 3
bind_addr = "yyy.yyy.yyy.yyy"
client_addr = "0.0.0.0"

addresses {
  grpc = "yyy.yyy.yyy.yyy"
}
connect {
  enabled = true
}
ports {
  grpc  = 8502
}
ui_config {
  enabled = true
}
auto_encrypt {
  allow_tls = true
}

/etc/systemd/system/consul.service

[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/consul.d/consul.hcl

[Service]
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d/ -bind yyy.yyy.yyy.yyy
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGTERM
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

/etc/nomad.d/nomad.hcl

name = "server-node-1"
region = "myregion"
datacenter = "mydc"
data_dir = "/home/nomad/opt"
bind_addr = "yyy.yyy.yyy.yyy"

telemetry {
  collection_interval = "1s"
  use_node_name = true
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

/etc/nomad.d/server.hcl

server {
  enabled = true
  bootstrap_expect = 3
  encrypt = "XXXXXXX/XXXXXXXXXXXXXXXXXXXX/XXXXXXXXXXXX/X="
  server_join {
    retry_join = ["xxx.xxx.xxx.xxx", "yyy.yyy.yyy.yyy", "zzz.zzz.zzz.zzz"]
  }
}
advertise {
  http = "yyy.yyy.yyy.yyy"
  rpc  = "yyy.yyy.yyy.yyy"
  serf = "yyy.yyy.yyy.yyy"
}

/etc/systemd/system/nomad.service

[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
User=nomad
Group=nomad
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/nomad agent -config /etc/nomad.d
KillMode=process
KillSignal=SIGINT
LimitNOFILE=65536
LimitNPROC=infinity
Restart=on-failure
RestartSec=2
TasksMax=infinity
OOMScoreAdjust=-1000
[Install]
WantedBy=multi-user.target

/etc/nomad.d/client.hcl

client {
  enabled = true
  network_interface = "eth0"
  cni_path = "/opt/cni/bin/"

  host_volume "mysql" {
    path = "/mnt/bdd_volume/mysql"
    read_only = false
  }
}

/etc/nomad.d/docker.hcl

plugin "docker" {
  config {
    endpoint = "unix:///var/run/docker.sock"
    extra_labels = ["job_name", "job_id", "task_group_name", "task_name", "namespace", "node_name", "node_id"]
    gc {
      image       = true
      image_delay = "3m"
      container   = true
      dangling_containers {
        enabled        = true
        dry_run        = false
        period         = "5m"
        creation_grace = "5m"
      }
    }
    volumes {
      enabled      = true
      selinuxlabel = "z"
    }
    allow_privileged = true
    allow_caps       = ["chown", "net_raw"]
  }
}
shoenig commented 2 years ago

Hi @locinus ! This is a check done by Nomad to make sure the Consul client agent is configured correctly. When using Connect, the grpc port must be activated on the Consul client agents. The error being reported indicates there are no Consul client agents available with the grpc port activated, which is consistent with your /etc/consul.d/consul.hcl file: no grpc port is configured there, and the grpc port in server.hcl presumably only applies to the server agents.

I realize the Consul docs[1] don't really make that clear, but the learn guide[2] does:

Client agents only need to configure the gRPC port.

[1] https://www.consul.io/docs/connect/configuration#agent-configuration
[2] https://learn.hashicorp.com/tutorials/consul/service-mesh-with-envoy-proxy?in=consul/developer-mesh#enable-connect-and-grpc
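
Concretely, each node that runs a Nomad client needs something like this in its Consul agent configuration (a minimal sketch; 8502 is the conventional Consul gRPC port):

ports {
  grpc = 8502
}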

shoenig commented 2 years ago

Nomad could at least document this check and point folks in the right direction.

Xerkus commented 2 years ago

I think I had a similar issue where enabling just the port on the Consul clients was not enough: connect.enabled had to be set to true on the clients as well for the Nomad constraint to pass. The Consul docs state that connect.enabled is only needed on server-mode agents. See the sketch below.
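
That is, on the client agents, in addition to the grpc port (sketch):

connect {
  enabled = true
}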

thangchung commented 2 years ago

It still happens to me in Nomad 1.3.1 when I run it in WSL2. Any progress on this?

tgross commented 2 years ago

@thangchung did you activate the grpc port on the Consul clients as described here: https://github.com/hashicorp/nomad/issues/12111#issuecomment-1049079451 ?

exFalso commented 1 year ago

Just encountered this: we have both connect and the grpc port enabled, and we're still getting this constraint failure.

Nomad version: 1.4.4
Consul version: 1.15.0

This started happening after upgrading to the above versions.

exFalso commented 1 year ago

Digging into this a bit more, our consul client configuration looks like this:

{
  "addresses": {
    "http": "0.0.0.0"
  },
  "bind_addr": "10.51.1.13",
  "connect": {
    "enabled": true
  },
  "data_dir": "/var/lib/consul",
  "enable_local_script_checks": true,
  "enable_script_checks": true,
  "encrypt": ...,
  "ports": {
    "grpc": 8502,
    "http": 8500,
    "https": 8501
  },
  "retry_interval": "15s",
  "retry_join": [
    "10.51.1.10",
    "10.51.1.11",
    "10.51.1.12"
  ],
  "retry_max": 3,
  "server": false,
  "tls": {
    "https": {
      "ca_file": "/etc/ssl/certs/ca.pem",
      "cert_file": "/etc/ssl/certs/cert.pem",
      "key_file": "/etc/ssl/certs/cert-key.pem"
    },
    "internal_rpc": {
      "ca_file": "/etc/ssl/certs/ca.pem",
      "cert_file": "/etc/ssl/certs/cert.pem",
      "key_file": "/etc/ssl/certs/cert-key.pem"
    }
  },
  "ui_config": {
    "enabled": true
  }
}

However, nomad node status -verbose ... returns:

...
Attributes
consul.connect                    = true
consul.datacenter                 = dc1
consul.ft.namespaces              = false
consul.grpc                       = -1
consul.server                     = false
consul.sku                        = oss
consul.version                    = 1.15.0
...

(Yes, we tried bouncing all services.)

We did have some issues with gRPC+TLS and Envoy when upgrading, and we had to change the Consul client configuration; perhaps it's related?

exFalso commented 1 year ago

OK, fixed. For anyone encountering this issue, here is the configuration that worked for us:

...
  "ports": {
    "grpc": 8502,
    "grpc_tls": 8503,
    "http": 8500,
    "https": 8501
  },
...
  "tls": {
    "grpc": {
      "ca_file": "/etc/ssl/certs/ca.pem",
      "cert_file": "/etc/ssl/certs/cert.pem",
      "key_file": "/etc/ssl/certs/cert-key.pem"
    },
    "https": {
      "ca_file": "/etc/ssl/certs/ca.pem",
      "cert_file": "/etc/ssl/certs/cert.pem",
      "key_file": "/etc/ssl/certs/cert-key.pem"
    },
    "internal_rpc": {
      "ca_file": "/etc/ssl/certs/ca.pem",
      "cert_file": "/etc/ssl/certs/cert.pem",
      "key_file": "/etc/ssl/certs/cert-key.pem"
    }
  },
...

In other words, we had to enable both non-TLS gRPC (Envoy throws errors otherwise) AND TLS gRPC (Nomad reports consul.grpc = -1 otherwise).

exFalso commented 1 year ago

Correction: with the above configuration, Envoy starts throwing errors again:

[2023-03-17 10:18:34.733][1][warning][config] [./source/common/config/grpc_stream.h:201] DeltaAggregatedResources gRPC config stream to local_agent closed since 1622s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

So we're back at square one.

exFalso commented 1 year ago

Figured out a working configuration. It seems Nomad marks the node with consul.grpc = -1 unless TLS is configured for gRPC. But if we configure Consul gRPC with TLS, Envoy by default tries to connect to it without TLS, printing that transport error. To make Envoy use TLS correctly, we had to set the following environment variables for the Nomad client:

    CONSUL_HTTP_SSL=true
    CONSUL_CACERT=/etc/ssl/certs/ca.pem
    CONSUL_GRPC_CACERT=/etc/ssl/certs/ca.pem
    CONSUL_GRPC_ADDRESS=127.0.0.1:8503
    CONSUL_CLIENT_CERT=/etc/ssl/certs/cert.pem
    CONSUL_CLIENT_KEY=/etc/ssl/certs/cert-key.pem

Maybe not all of these are required, but this worked.
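
For later readers: the same settings can also be expressed in the Nomad agent's consul block instead of environment variables. A sketch (whether grpc_address and grpc_ca_file are available depends on your Nomad version; see the config docs linked at the end of this thread):

consul {
  grpc_address = "127.0.0.1:8503"
  ssl          = true
  ca_file      = "/etc/ssl/certs/ca.pem"
  cert_file    = "/etc/ssl/certs/cert.pem"
  key_file     = "/etc/ssl/certs/cert-key.pem"
  grpc_ca_file = "/etc/ssl/certs/ca.pem"
}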

carmenmarcos00 commented 1 year ago

I was experiencing the same problem. In my case, we managed to solve it by specifying the ports block in /etc/consul.d/server.hcl as follows:

ports {
  grpc_tls = 8503
  grpc     = 8502
  http     = 8500
  https    = 8501
}

We also had to set the following environment variable to true:

CONSUL_HTTP_SSL=true

I hope this helps and works for you as well.
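
If Nomad runs under systemd (like the unit file earlier in this thread), one place to set this is an Environment= line in the [Service] section, for example:

[Service]
Environment=CONSUL_HTTP_SSL=true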

tgross commented 10 months ago

Doing a little issue cleanup. This is currently documented in the Connect Prerequisites documentation and the consul.grpc_address config documentation.