hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

consul connect sidecar error #10436

Closed pratikbin closed 3 years ago

pratikbin commented 3 years ago

Nomad version

Nomad v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)

Consul v1.9.4 Revision 10bb6cb3b

CNI v9.0.0

Operating system and Environment details

Arch Linux Manjaro 21 on Intel i510xxx 16GB RAM

Issue

While testing Consul Connect locally by following the official guide at https://www.nomadproject.io/docs/integrations/consul-connect, I get these errors in the Nomad log:

Apr 23 22:09:34 ctos nomad[4880]:     2021-04-23T22:09:34.163+0530 [ERROR] client.alloc_runner.runner_hook: error connecting to grpc: alloc_id=203427c4-059d-41de-be02-754c08673676 error="dial tcp 192.168.43.54:8502: connect: connection refused" dest=192.168.43.54:8502
Apr 23 22:09:44 ctos nomad[4880]:     2021-04-23T22:09:44.262+0530 [ERROR] client.alloc_runner.runner_hook: error connecting to grpc: alloc_id=1bb14b1d-000c-8d0c-e2fd-b0585e9e6135 error="dial tcp 192.168.43.54:8502: connect: connection refused" dest=192.168.43.54:8502
Apr 23 22:09:44 ctos nomad[4880]:     2021-04-23T22:09:44.862+0530 [ERROR] client.alloc_runner.runner_hook: error connecting to grpc: alloc_id=203427c4-059d-41de-be02-754c08673676 error="dial tcp 192.168.43.54:8502: connect: connection refused" dest=192.168.43.54:8502
...

Reproduction steps

I am testing Nomad and Consul with these configs:

nomad

data_dir  = "/opt/nomad/"
bind_addr = "{{ GetPrivateInterfaces | include \"network\" \"192.168.43.54/24\" | attr \"address\" }}"
datacenter = "blr1"

server {
  enabled          = true
  bootstrap_expect = 1
}

telemetry {
  collection_interval = "5s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

client {
  enabled       = true
  options = {
    "docker.volumes.enabled" = true,
    "driver.raw_exec.enable" = "1"
  }
}
consul {
  address = "192.168.43.54:8500"
}

consul

datacenter = "blr1"
client_addr = "{{ GetPrivateInterfaces | include \"network\" \"192.168.43.54/24\" | attr \"address\" }}"
bind_addr = "{{ GetPrivateInterfaces | include \"network\" \"192.168.43.54/24\" | attr \"address\" }}"
data_dir = "/opt/consul/"
encrypt = "xxxxxxxx"
server = true
bootstrap_expect = 1
ui_config
{
  enabled = true
}
connect {
  enabled = true
}
telemetry {
  disable_hostname = true
  prometheus_retention_time = "24h"
  disable_compat_1.9 = true
}

Expected Result

The sidecar should run as described in the official docs: https://www.nomadproject.io/docs/integrations/consul-connect

Actual Result

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Consul

Apr 23 22:29:01 ctos consul[4881]:     2021-04-23T22:29:01.911+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1 error="dial tcp 192.168.43.54:20646: connect: connection refused"
Apr 23 22:29:01 ctos consul[4881]:     2021-04-23T22:29:01.911+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1
Apr 23 22:29:03 ctos consul[4881]:     2021-04-23T22:29:03.675+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1 error="dial tcp 192.168.43.54:22377: connect: connection refused"
Apr 23 22:29:03 ctos consul[4881]:     2021-04-23T22:29:03.675+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1
Apr 23 22:29:11 ctos consul[4881]:     2021-04-23T22:29:11.912+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1 error="dial tcp 192.168.43.54:20646: connect: connection refused"
Apr 23 22:29:11 ctos consul[4881]:     2021-04-23T22:29:11.912+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1
Apr 23 22:29:13 ctos consul[4881]:     2021-04-23T22:29:13.676+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1 error="dial tcp 192.168.43.54:22377: connect: connection refused"
Apr 23 22:29:13 ctos consul[4881]:     2021-04-23T22:29:13.676+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1

shoenig commented 3 years ago

Hi @pratikbalar, sorry you're having trouble!

I suspect you still need to open port 8502 on Consul, which is a requirement for making Connect work.

https://www.nomadproject.io/docs/integrations/consul-connect#consul

(fwiw, imho Consul should automatically listen on 8502 if connect is enabled, but currently it does not)
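
For anyone following along, the relevant Consul agent setting looks roughly like the fragment below. This is a minimal sketch based on the linked integration guide, not the reporter's final config; it would be merged into the existing Consul config file alongside the other settings.

```hcl
# Consul agent config fragment (sketch).
# Connect sidecars fetch their xDS configuration over gRPC,
# and the gRPC listener is not opened unless a port is set.
ports {
  grpc = 8502
}

connect {
  enabled = true
}
```

Port 8502 matches the `dest=192.168.43.54:8502` address that the Nomad client was trying to reach in the error log above.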

pratikbin commented 3 years ago

Thanks for the quick reply, @shoenig.

I configured gRPC, but now I'm getting the errors below in Consul. FYI, I'm using Traefik with the Consul catalog; if you can convince me to switch to Envoy, I'm up for it :smile:

Apr 23 22:35:16 ctos consul[18346]:     2021-04-23T22:35:16.372+0530 [ERROR] agent.envoy: Error handling ADS stream: error="rpc error: code = InvalidArgument desc = Envoy 1.11.2 is too old and is not supported by Consul"
Apr 23 22:35:21 ctos consul[18346]:     2021-04-23T22:35:21.628+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1 error="dial tcp 192.168.43.54:22377: connect: connection refused"
Apr 23 22:35:21 ctos consul[18346]:     2021-04-23T22:35:21.629+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1
Apr 23 22:35:22 ctos consul[18346]:     2021-04-23T22:35:22.581+0530 [ERROR] agent.envoy: Error handling ADS stream: error="rpc error: code = InvalidArgument desc = Envoy 1.11.2 is too old and is not supported by Consul"
Apr 23 22:35:24 ctos consul[18346]:     2021-04-23T22:35:24.332+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1 error="dial tcp 192.168.43.54:20646: connect: connection refused"
Apr 23 22:35:24 ctos consul[18346]:     2021-04-23T22:35:24.332+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1
Apr 23 22:35:31 ctos consul[18346]:     2021-04-23T22:35:31.530+0530 [ERROR] agent.envoy: Error handling ADS stream: error="rpc error: code = InvalidArgument desc = Envoy 1.11.2 is too old and is not supported by Consul"
Apr 23 22:35:31 ctos consul[18346]:     2021-04-23T22:35:31.629+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1 error="dial tcp 192.168.43.54:22377: connect: connection refused"
Apr 23 22:35:31 ctos consul[18346]:     2021-04-23T22:35:31.629+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1
Apr 23 22:35:32 ctos consul[18346]:     2021-04-23T22:35:32.417+0530 [ERROR] agent.envoy: Error handling ADS stream: error="rpc error: code = InvalidArgument desc = Envoy 1.11.2 is too old and is not supported by Consul"
Apr 23 22:35:34 ctos consul[18346]:     2021-04-23T22:35:34.333+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1 error="dial tcp 192.168.43.54:20646: connect: connection refused"
Apr 23 22:35:34 ctos consul[18346]:     2021-04-23T22:35:34.333+0530 [WARN]  agent: Check is now critical: check=service:_nomad-task-81e3bc69-6cdb-7a01-aeae-8349109d6853-group-dashboard-count-dashboard-9002-sidecar-proxy:1
Apr 23 22:35:38 ctos consul[18346]:     2021-04-23T22:35:38.641+0530 [ERROR] agent.envoy: Error handling ADS stream: error="rpc error: code = InvalidArgument desc = Envoy 1.11.2 is too old and is not supported by Consul"
Apr 23 22:35:41 ctos consul[18346]:     2021-04-23T22:35:41.630+0530 [WARN]  agent: Check socket connection failed: check=service:_nomad-task-69081a64-8417-07ef-f004-3ff2f972ad38-group-api-count-api-9001-sidecar-proxy:1 error="dial tcp 192.168.43.54:22377: connect: connection refused"

shoenig commented 3 years ago

We're actually huge fans of traefik! They're about to add a native integration with Consul Connect, something I'm at least super excited for :slightly_smiling_face:

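The "Envoy 1.11.2 is too old and is not supported by Consul" error is unexpected: Nomad v1.0+ should automatically use the newest version of Envoy supported by the Consul agent on each node. (Note that all clients, not just servers, need to be up to date.)

A few reasons that might not happen:

- Nomad clients are not v1.0 or higher
- meta.connect.sidecar_image is set to that version of envoy
- you're using a custom sidecar_task with that version of envoy

Do any of those conditions match your environment?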

pratikbin commented 3 years ago

> We're actually huge fans of traefik! They're about to add a native integration with Consul Connect, something I'm at least super excited for :slightly_smiling_face:

That's cool, wop wop traefik

> Nomad clients are not v1.0 or higher

I'm using a single machine with the same version of the Nomad client and server, which is v1.0.4. Same with Consul: v1.9.4.

> meta.connect.sidecar_image is set to that version of envoy

What should I configure this to?

> you're using a custom sidecar_task with that version of envoy

I'm just following https://www.nomadproject.io/docs/integrations/consul-connect, where no explicit Envoy is added in the job config.

shoenig commented 3 years ago

Interesting, can you show the output when making this curl request to the Consul client? E.g.,

$ curl -s localhost:8500/v1/agent/self | jq -r .xDS
{
  "SupportedProxies": {
    "envoy": [
      "1.16.2",
      "1.15.3",
      "1.14.6",
      "1.13.7"
    ]
  }
}

And with trace level logging enabled, what do you see in the Nomad client log line that starts with,

setting task envoy image

You can enable trace logging with -log-level=TRACE or in agent config.
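
As a sketch of the agent-config route mentioned above, trace logging can be set with the `log_level` option in the Nomad agent configuration file. The fragment below assumes it is added to the existing Nomad config and that the agent is restarted afterwards.

```hcl
# Nomad agent config fragment (sketch).
# Equivalent to starting the agent with -log-level=TRACE.
log_level = "TRACE"
```

TRACE output is verbose, so it is probably worth reverting this once the "setting task envoy image" line has been captured.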

shoenig commented 3 years ago

> what should I configure this to?

The meta.connect.sidecar_image can be explicitly set to any image that runs envoy, typically one of the official ones published to docker hub (it would need to be a version of Envoy supported by your version of Consul).

Doing so shouldn't be necessary though; Nomad v1.0 and later query Consul using that /agent/self endpoint above to determine which version of Envoy to use, falling back to that v1.11.2 version if Consul is too old to include the xDS blob in the response payload.
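
If you do want to pin the sidecar image explicitly, a sketch of the Nomad client config might look like the fragment below. The image tag `v1.16.2` is an assumption taken from the example `SupportedProxies` output earlier in this thread; pick a version that your own Consul agent reports as supported.

```hcl
# Nomad client config fragment (sketch).
# Overrides the Envoy image Nomad would otherwise choose automatically.
client {
  enabled = true

  meta {
    # Assumed tag for illustration; use a version listed by
    # /v1/agent/self on your own Consul agent.
    "connect.sidecar_image" = "envoyproxy/envoy:v1.16.2"
  }
}
```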

pratikbin commented 3 years ago

I'll try this today.

pratikbin commented 3 years ago

Wait a minute!! Now it's running. Thanks @shoenig, I guess :smile:

I'll reopen or comment here if I find something or get stuck.

shoenig commented 3 years ago

Glad it works, @pratikbalar!

Actually, I think I finally realized what happened: when Consul is launched without the gRPC port set, it does not include the xDS blob in the /v1/agent/self response. When Nomad sees that the blob is missing, it defaults to the outdated version of Envoy described above.

$ cat consul.hcl
connect {
  enabled = true
}

data_dir = "/tmp/consul"
bind_addr = "127.0.0.1"
$ curl -s localhost:8500/v1/agent/self | jq -r .xDS
null

After the port problem is fixed, the task keeps failing unless the job is recycled. Nomad relaunches the task in-place without making additional queries to Consul, since it assumes it would have gotten the same response, so it keeps using the outdated Envoy version.

patrick-leb commented 3 years ago

If someone stumbles upon this: I had the same issue and realized that one of my consul clients was missing the port configuration for gRPC.
