hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Sidecar proxy doesn't work with TLS enabled cluster #18644

Closed lbik closed 11 months ago

lbik commented 1 year ago

Hello guys,

I'm facing an issue with the sidecar proxy in a TLS-enabled cluster. I am trying to deploy a service that connects through a terminating gateway to a service outside the service mesh. I registered the external service, then deployed a job containing the terminating gateway service and my own service, which should run with a sidecar proxy.

Job.hcl
```
job "testaccount1" {
  datacenters = ["dc1"]
  type        = "service"

  group "gateway" {
    network {
      mode = "bridge"
    }

    service {
      name = "sso-gateway"

      connect {
        gateway {
          proxy {}

          terminating {
            service {
              name = "sso"
            }
          }
        }

        sidecar_task {
          config {
            image = "xxxxxxxxxxx/library/envoy"
          }
        }
      }
    }
  }

  group "testaccount1" {
    count = 1

    network {
      mode = "bridge"

      port "http" {
        to     = 8080
        static = 8080
      }
    }

    service {
      name     = "testaccount1"
      port     = "http"
      provider = "consul"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "sso"
              local_bind_port  = 443
            }
          }
        }

        sidecar_task {
          config {
            image = "xxxxxxxxxx/library/envoy"
          }
        }
      }
    }

    task "testaccount1" {
      driver = "docker"

      env {
      }

      config {
        image = "xxxxxxxxx/account"
        ports = ["http"]

        auth {
          username = xxxxx
          password = xxxxx
        }
      }
    }
  }
}
```

This job deploys both the terminating gateway and my service with its sidecar proxy. However, Consul's health check on that sidecar proxy fails with the error `dial tcp 10.4.5.26:25299: connect: connection refused`. In the Envoy sidecar logs I can see this:

envoy logs
```
[2023-10-03 13:32:50.415][1][info][admin] [source/server/admin/admin.cc:66] admin address: 127.0.0.2:19001
[2023-10-03 13:32:50.416][1][info][config] [source/server/configuration_impl.cc:131] loading tracing configuration
[2023-10-03 13:32:50.416][1][info][config] [source/server/configuration_impl.cc:91] loading 0 static secret(s)
[2023-10-03 13:32:50.416][1][info][config] [source/server/configuration_impl.cc:97] loading 1 cluster(s)
[2023-10-03 13:32:50.467][1][info][config] [source/server/configuration_impl.cc:101] loading 0 listener(s)
[2023-10-03 13:32:50.467][1][info][config] [source/server/configuration_impl.cc:113] loading stats configuration
[2023-10-03 13:32:50.468][1][info][runtime] [source/common/runtime/runtime_impl.cc:463] RTDS has finished initialization
[2023-10-03 13:32:50.468][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:221] cm init: initializing cds
[2023-10-03 13:32:50.468][1][warning][main] [source/server/server.cc:802] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2023-10-03 13:32:50.469][1][info][main] [source/server/server.cc:923] starting main dispatch loop
[2023-10-03 13:33:29.302][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 38s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:33:45.667][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 55s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:34:08.535][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 78s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:34:16.799][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 86s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 13:34:17.366][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 86s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
```

The last messages in the log above made me suspect that gRPC is not working as it should. TLS is enabled in both Nomad and Consul.

nomad server config
```
datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 3
  encrypt          = "xxxxxxxxxx"
}

tls {
  http = true
  rpc  = true

  ca_file   = "/etc/pki/nomad/nomad-agent-ca.pem"
  cert_file = "/etc/pki/nomad/global-server-nomad.pem"
  key_file  = "/etc/pki/nomad/global-server-nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

client {
  enabled = false
}

consul {
  address      = "127.0.0.1:8501"
  token        = "xxxxxxxxxxxxx"
  grpc_ca_file = "/etc/pki/consul/consul-agent-ca.pem"
  grpc_address = "127.0.0.1:8503"
  ca_file      = "/etc/pki/consul/consul-agent-ca.pem"
  cert_file    = "/etc/pki/consul/dc1-server-consul-1.pem"
  key_file     = "/etc/pki/consul/dc1-server-consul-1-key.pem"
  ssl          = true
}

acl {
  enabled = true
}
```

consul server config
```
data_dir       = "/opt/consul"
node_name      = "server2"
client_addr    = "0.0.0.0"
bind_addr      = "10.4.5.22"
advertise_addr = "10.4.5.22"

encrypt                 = "xxxxxxxxxxxxxxxxx"
encrypt_verify_incoming = true
encrypt_verify_outgoing = true

ui_config {
  enabled = true
}

rejoin_after_leave = true

verify_incoming        = true
verify_outgoing        = true
verify_server_hostname = true

ca_file   = "/etc/pki/consul/consul-agent-ca.pem"
cert_file = "/etc/pki/consul/dc1-server-consul-1.pem"
key_file  = "/etc/pki/consul/dc1-server-consul-1-key.pem"

ports = {
  https    = 8501
  http     = 8500
  grpc     = 8502
  grpc_tls = 8503
  dns      = -1
}

acl {
  enabled        = true
  default_policy = "deny"

  tokens {
    default = "xxxxxxxxxxxxx"
  }
}

server           = true
bootstrap_expect = 3

log_level            = "DEBUG"
log_file             = "/var/log/consul/"
log_rotate_max_files = 30
```

used versions
```
Nomad v1.6.2
BuildDate 2023-09-13T16:47:25Z
Revision 73e372ad94033db2ceaf53468b270a31544c23fd
```
```
Consul v1.16.2
Revision 68f81912
Build Date 2023-09-19T19:29:18Z
```

I'm not sure what could be wrong in my case.

Best Regards

lbik commented 1 year ago

I can see that I'm getting the same error with my terminating gateway:

terminating gateway log
```
[2023-10-03 18:56:13.099][1][info][admin] [source/server/admin/admin.cc:66] admin address: 127.0.0.2:19000
[2023-10-03 18:56:13.099][1][info][config] [source/server/configuration_impl.cc:131] loading tracing configuration
[2023-10-03 18:56:13.099][1][info][config] [source/server/configuration_impl.cc:91] loading 0 static secret(s)
[2023-10-03 18:56:13.099][1][info][config] [source/server/configuration_impl.cc:97] loading 1 cluster(s)
[2023-10-03 18:56:13.149][1][info][config] [source/server/configuration_impl.cc:101] loading 0 listener(s)
[2023-10-03 18:56:13.149][1][info][config] [source/server/configuration_impl.cc:113] loading stats configuration
[2023-10-03 18:56:13.149][1][info][runtime] [source/common/runtime/runtime_impl.cc:463] RTDS has finished initialization
[2023-10-03 18:56:13.149][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:221] cm init: initializing cds
[2023-10-03 18:56:13.149][1][warning][main] [source/server/server.cc:802] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2023-10-03 18:56:13.150][1][info][main] [source/server/server.cc:923] starting main dispatch loop
[2023-10-03 18:56:51.110][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 37s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 18:57:05.782][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 52s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 18:57:26.883][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 73s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 18:57:52.637][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 99s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 18:58:16.634][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 123s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
[2023-10-03 18:58:29.846][1][warning][config] [./source/common/config/grpc_stream.h:191] DeltaAggregatedResources gRPC config stream to local_agent closed since 136s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: No such file or directory
```

lgfa29 commented 1 year ago

Hi @lbik 👋

Unfortunately I don't immediately see anything wrong with your setup 🤔

You mentioned TLS being enabled, do you mean that this worked without TLS?

lbik commented 11 months ago

Hi @lgfa29,

I'm really sorry for my late response.

When I tried to reproduce this issue in our unsecured cluster, I found that everything works as expected when the default Envoy image is pulled. I then checked which Envoy image we keep in our private Docker registry and found that it is envoy:distroless. So TLS was not the cause.

lgfa29 commented 11 months ago

No worries @lbik, I'm glad you were able to fix the problem.
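
For readers who hit the same symptom, the resolution described above (the default Envoy image works, the `envoy:distroless` image from the private registry does not) can be sketched as a job fragment. This is a minimal sketch, not the reporter's final job: it assumes that leaving out the `sidecar_task` override is acceptable in your environment, so that Nomad injects the sidecar with its default Envoy image. The service and upstream names are copied from the job earlier in the thread.

```hcl
group "testaccount1" {
  network {
    mode = "bridge"

    port "http" {
      to     = 8080
      static = 8080
    }
  }

  service {
    name     = "testaccount1"
    port     = "http"
    provider = "consul"

    connect {
      sidecar_service {
        proxy {
          upstreams {
            destination_name = "sso"
            local_bind_port  = 443
          }
        }
      }

      # No sidecar_task block here (assumption: no image override is needed).
      # Without it, Nomad runs the sidecar with its default Envoy image
      # instead of the distroless image from the private registry.
    }
  }
}
```

If the image has to come from a private registry, the same idea applies with a `sidecar_task { config { image = ... } }` override pointing at a mirrored copy of a non-distroless Envoy image; the exact tag to mirror is not stated in this thread.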