hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Envoy proxy doesn't work correctly in Consul transparent proxy mode. #21517

Open ruslan-y opened 2 months ago

ruslan-y commented 2 months ago

Hi there!

I'm going to describe my problem in detail, so expect a lot of text, logs, and configs.

Nomad version

Nomad v1.8.0
BuildDate 2024-05-28T17:38:17Z
Revision 28b82e4b2259fae5a62e2ed47395334bea5a24c4

Consul version

Consul v1.19.0
Revision bf0166d8
Build Date 2024-06-12T13:59:10Z

Operating system and Environment details

5.10.0-23-amd64 #1 SMP Debian 5.10.179-2 (2023-07-14) x86_64 GNU/Linux
Nomad client config:

```
name = "host"
region = "global"
datacenter = "dc1"
enable_debug = false
disable_update_check = false
bind_addr = "0.0.0.0"

advertise {
  http = ":4646"
  rpc  = ":4647"
  serf = ":4648"
}

ports {
  http = 4646
  rpc  = 4647
  serf = 4648
}

consul {
  address             = "localhost:8500"
  ssl                 = false
  ca_file             = ""
  grpc_ca_file        = ""
  cert_file           = ""
  key_file            = ""
  token               = ""
  server_service_name = "nomad-servers"
  client_service_name = "nomad-clients"
  tags                = []
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
}

data_dir           = "/var/nomad"
log_level          = "INFO"
enable_syslog      = true
leave_on_terminate = true
leave_on_interrupt = false

tls {
  http                   = true
  rpc                    = true
  ca_file                = "/etc/nomad/ssl/nomad-ca.pem"
  cert_file              = "/etc/nomad/ssl/client.pem"
  key_file               = "/etc/nomad/ssl/client-key.pem"
  rpc_upgrade_mode       = false
  verify_server_hostname = "true"
  verify_https_client    = "true"
}

acl {
  enabled           = true
  token_ttl         = "30s"
  policy_ttl        = "30s"
  replication_token = ""
}

vault {
  enabled               = true
  address               = "https://"
  allow_unauthenticated = true
  create_from_role      = "nomad-cluster"
  task_token_ttl        = ""
  ca_file               = ""
  ca_path               = ""
  cert_file             = ""
  key_file              = ""
  tls_server_name       = ""
  tls_skip_verify       = false
  namespace             = ""
}

telemetry {
  disable_hostname                       = "true"
  collection_interval                    = "15s"
  use_node_name                          = "false"
  publish_allocation_metrics             = "true"
  publish_node_metrics                   = "true"
  filter_default                         = "true"
  prefix_filter                          = []
  disable_dispatched_job_summary_metrics = "false"
  statsite_address                       = ""
  statsd_address                         = ""
  datadog_address                        = ""
  datadog_tags                           = []
  prometheus_metrics                     = "true"
  circonus_api_token                     = ""
  circonus_api_app                       = "nomad"
  circonus_api_url                       = "https://api.circonus.com/v2"
  circonus_submission_interval           = "10s"
  circonus_submission_url                = ""
  circonus_check_id                      = ""
  circonus_check_force_metric_activation = "false"
  circonus_check_instance_id             = ""
  circonus_check_search_tag              = ""
  circonus_check_display_name            = ""
  circonus_check_tags                    = ""
  circonus_broker_id                     = ""
  circonus_broker_select_tag             = ""
}

autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "1s"
  max_trailing_logs         = 250
  server_stabilization_time = "10s"
}
```
Consul client config:

```
{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true,
    "enabled": true,
    "token_ttl": "30s",
    "tokens": {
      "agent": "",
      "agent_recovery": ""
    }
  },
  "addresses": {
    "dns": "172.17.0.1",
    "grpc": "127.0.0.1",
    "grpc_tls": "127.0.0.1",
    "http": "127.0.0.1",
    "https": "127.0.0.1"
  },
  "advertise_addr": "",
  "advertise_addr_wan": "",
  "auto_encrypt": {
    "tls": true
  },
  "bind_addr": "",
  "client_addr": "127.0.0.1",
  "connect": {
    "enabled": true
  },
  "data_dir": "/opt/consul",
  "datacenter": "dc1",
  "disable_update_check": false,
  "domain": "consul",
  "enable_local_script_checks": false,
  "enable_script_checks": false,
  "enable_syslog": true,
  "encrypt": "",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "limits": {
    "http_max_conns_per_client": 400,
    "rpc_max_conns_per_client": 200
  },
  "log_level": "INFO",
  "node_name": "host",
  "performance": {
    "leave_drain_time": "10s",
    "raft_multiplier": 1,
    "rpc_hold_timeout": "30s"
  },
  "ports": {
    "dns": 8600,
    "grpc": 8502,
    "grpc_tls": 8503,
    "http": 8500,
    "https": -1,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  },
  "primary_datacenter": "dc1",
  "raft_protocol": 3,
  "recursors": [
    "1.1.1.1",
    "8.8.8.8"
  ],
  "retry_interval": "30s",
  "retry_join": [
    "",
    "",
    ""
  ],
  "retry_max": 0,
  "server": false,
  "syslog_facility": "local0",
  "tls": {
    "defaults": {
      "ca_file": "/etc/consul/ssl/consul-agent-ca.pem",
      "tls_min_version": "TLSv1_2",
      "verify_incoming": false,
      "verify_outgoing": true
    },
    "https": {
      "verify_incoming": false
    },
    "internal_rpc": {
      "verify_incoming": true,
      "verify_server_hostname": false
    }
  },
  "translate_wan_addrs": false,
  "ui_config": {
    "enabled": false
  }
}
```
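For context, transparent proxy with Nomad bridge networking also depends on the `consul-cni` CNI plugin being installed on the Nomad client. A quick way to check it is present (a sketch; `/opt/cni/bin` is Nomad's default `cni_path` and may differ on other setups):

```
# Verify the consul-cni plugin (required for transparent proxy) and the
# reference CNI plugins are in the client's CNI plugin directory.
# /opt/cni/bin is Nomad's default cni_path; adjust if it is overridden.
ls -l /opt/cni/bin/ | grep -E 'consul-cni|bridge'
```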

Issue

There is a Nomad job (an Envoy proxy) running in the cluster that proxies requests to my-service. I'm using Consul Connect upstreams in my Nomad jobs, and it works perfectly.

Example of my nomad job (upstreams):

```
job "test-proxy-job" {
  datacenters = ["dc1"]
  namespace   = "test"
  type        = "service"

  group "test-proxy-group" {
    count = 1

    vault {
      policies = ["nomad-services"]
    }

    update {
      max_parallel      = 1
      canary            = 1
      auto_revert       = true
      auto_promote      = true
      min_healthy_time  = "10s"
      healthy_deadline  = "5m"
      progress_deadline = "15m"
    }

    network {
      mode = "bridge"
      port "http" {
        static = 28123
        to     = 28162
      }
    }

    service {
      name = "test-proxy"
      port = "28162"
      tags = ["proxy", "test"]

      connect {
        sidecar_task {
          resources {
            cpu    = 500
            memory = 300
          }
          config {
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "debug",
              "--concurrency",
              "${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }
        }

        sidecar_service {
          proxy {
            upstreams {
              destination_name = "my-service"
              local_bind_port  = 10007
            }
          }
        }
      }
    }

    task "test-proxy-task" {
      driver = "docker"
      template {
        data = <
```

If I make a request, I get the expected response from my-service:

curl -v -H 'x-test-envoy: true' http://<external_ip>:28123
*   Trying <external_ip>:28123...
* Connected to <external_ip> (<external_ip>) port 28123
> GET / HTTP/1.1
> Host: <external_ip>:28123
> User-Agent: curl/8.6.0
> Accept: */*
> x-test-envoy: true
>
< HTTP/1.1 200 OK
< server: envoy
< date: Thu, 04 Jul 2024 19:01:21 GMT
< content-type: application/json; charset=utf8
< content-length: 57
< x-envoy-upstream-service-time: 6
<
* Connection #0 to host <external_ip> left intact
{"jsonrpc": "2.0", "id": "test", "result": "ok"}%

When I try to enable transparent proxy in the config, it doesn't work.

Example of my nomad job (transparent proxy):

```
job "test-proxy-job" {
  datacenters = ["dc1"]
  namespace   = "test"
  type        = "service"

  group "test-proxy-group" {
    count = 1

    vault {
      policies = ["nomad-services"]
    }

    update {
      max_parallel      = 1
      canary            = 1
      auto_revert       = true
      auto_promote      = true
      min_healthy_time  = "10s"
      healthy_deadline  = "5m"
      progress_deadline = "15m"
    }

    network {
      mode = "bridge"
      port "http" {
        static = 28123
        to     = 28162
      }
    }

    service {
      name = "test-proxy"
      port = "28162"
      tags = ["proxy", "test"]

      connect {
        sidecar_task {
          resources {
            cpu    = 500
            memory = 300
          }
          config {
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "debug",
              "--concurrency",
              "${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }
        }

        sidecar_service {
          proxy {
            transparent_proxy {}
          }
        }
      }
    }

    task "test-proxy-task" {
      driver = "docker"
      template {
        data = <
```

Requests don't go through; I get a 503 error and no response from my-service:

curl -v -H 'x-test-envoy: true' http://<external_ip>:28123
*   Trying <external_ip>:28123...
* Connected to <external_ip> (<external_ip>) port 28123
> GET / HTTP/1.1
> Host: <external_ip>:28123
> User-Agent: curl/8.6.0
> Accept: */*
> x-test-envoy: true
>
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Thu, 04 Jul 2024 20:55:09 GMT
< server: envoy
<
* Connection #0 to host <external_ip> left intact
upstream connect error or disconnect/reset before headers. reset reason: connection failure%
Some debug logs from Envoy:

```
[2024-07-04 20:55:04.149][21][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_tcp_listener.cc:155] [C205] new connection from :44066
[2024-07-04 20:55:04.149][21][debug][http] [source/common/http/conn_manager_impl.cc:375] [C205] new stream
[2024-07-04 20:55:04.149][21][debug][http] [source/common/http/conn_manager_impl.cc:1118] [C205][S13244765374556234856] request headers complete (end_stream=true):
':authority', ':28123'
':path', '/'
':method', 'GET'
'user-agent', 'curl/8.6.0'
'accept', '*/*'
'x-test-envoy', 'true'
[2024-07-04 20:55:04.149][21][debug][http] [source/common/http/conn_manager_impl.cc:1101] [C205][S13244765374556234856] request end stream
[2024-07-04 20:55:04.149][21][debug][connection] [./source/common/network/connection_impl.h:98] [C205] current connecting state: false
[2024-07-04 20:55:04.149][21][debug][router] [source/common/router/router.cc:478] [C205][S13244765374556234856] cluster 'my-service' match for URL '/'
[2024-07-04 20:55:04.149][21][debug][router] [source/common/router/router.cc:686] [C205][S13244765374556234856] router decoding headers:
':authority', ':28123'
':path', '/'
':method', 'GET'
':scheme', 'http'
'user-agent', 'curl/8.6.0'
'accept', '*/*'
'x-test-envoy', 'true'
'x-forwarded-proto', 'http'
'x-request-id', ''
[2024-07-04 20:55:04.149][21][debug][pool] [source/common/http/conn_pool_base.cc:78] queueing stream due to no available connections (ready=0 busy=0 connecting=0)
[2024-07-04 20:55:04.149][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:291] trying to create new connection
[2024-07-04 20:55:04.149][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:145] creating a new connection (connecting=0)
[2024-07-04 20:55:04.150][21][debug][connection] [./source/common/network/connection_impl.h:98] [C206] current connecting state: true
[2024-07-04 20:55:04.150][21][debug][client] [source/common/http/codec_client.cc:57] [C206] connecting
[2024-07-04 20:55:04.150][21][debug][connection] [source/common/network/connection_impl.cc:941] [C206] connecting to 240.0.41.1:80
[2024-07-04 20:55:04.150][21][debug][connection] [source/common/network/connection_impl.cc:960] [C206] connection in progress
[2024-07-04 20:55:06.029][1][debug][main] [source/server/server.cc:265] flushing stats
[2024-07-04 20:55:07.079][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:354] dns resolution for my-service.virtual.consul started
[2024-07-04 20:55:07.081][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:275] dns resolution for my-service.virtual.consul completed with status 0
[2024-07-04 20:55:07.081][1][debug][upstream] [source/common/upstream/upstream_impl.cc:457] transport socket match, socket default selected for host with address 240.0.41.1:80
[2024-07-04 20:55:07.081][1][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:177] DNS refresh rate reset for my-service.virtual.consul, refresh rate 5000 ms
[2024-07-04 20:55:08.321][15][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_tcp_listener.cc:155] [C207] new connection from 127.0.0.1:54136
[2024-07-04 20:55:08.322][15][debug][connection] [source/common/network/connection_impl.cc:656] [C207] remote close
[2024-07-04 20:55:08.322][15][debug][connection] [source/common/network/connection_impl.cc:250] [C207] closing socket: 0
[2024-07-04 20:55:08.322][15][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_stream_listener_base.cc:121] [C207] adding to cleanup list
[2024-07-04 20:55:09.148][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:793] [C206] connect timeout
[2024-07-04 20:55:09.148][21][debug][connection] [source/common/network/connection_impl.cc:139] [C206] closing data_to_write=0 type=1
[2024-07-04 20:55:09.148][21][debug][connection] [source/common/network/connection_impl.cc:250] [C206] closing socket: 1
[2024-07-04 20:55:09.148][21][debug][client] [source/common/http/codec_client.cc:107] [C206] disconnect. resetting 0 pending requests
[2024-07-04 20:55:09.148][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:484] [C206] client disconnected, failure reason:
[2024-07-04 20:55:09.148][21][debug][router] [source/common/router/router.cc:1279] [C205][S13244765374556234856] upstream reset: reset reason: connection failure, transport failure reason:
[2024-07-04 20:55:09.148][21][debug][http] [source/common/http/filter_manager.cc:996] [C205][S13244765374556234856] Sending local reply with details upstream_reset_before_response_started{connection_failure}
[2024-07-04 20:55:09.148][21][debug][http] [source/common/http/conn_manager_impl.cc:1773] [C205][S13244765374556234856] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '91'
'content-type', 'text/plain'
'date', 'Thu, 04 Jul 2024 20:55:09 GMT'
'server', 'envoy'
[2024-07-04 20:55:09.148][21][debug][http] [source/common/http/conn_manager_impl.cc:1865] [C205][S13244765374556234856] Codec completed encoding stream.
[2024-07-04 20:55:09.148][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:454] invoking idle callbacks - is_draining_for_deletion_=false
[2024-07-04 20:55:10.201][21][debug][connection] [source/common/network/connection_impl.cc:656] [C205] remote close
[2024-07-04 20:55:10.201][21][debug][connection] [source/common/network/connection_impl.cc:250] [C205] closing socket: 0
[2024-07-04 20:55:10.201][21][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_stream_listener_base.cc:121] [C205] adding to cleanup list
[2024-07-04 20:55:11.031][1][debug][main] [source/server/server.cc:265] flushing stats
[2024-07-04 20:55:12.081][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:354] dns resolution for my-service.virtual.consul started
[2024-07-04 20:55:12.083][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:275] dns resolution for my-service.virtual.consul completed with status 0
[2024-07-04 20:55:12.083][1][debug][upstream] [source/common/upstream/upstream_impl.cc:457] transport socket match, socket default selected for host with address 240.0.41.1:80
[2024-07-04 20:55:12.083][1][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:177] DNS refresh rate reset for my-service.virtual.consul, refresh rate 5000 ms
```
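The debug log above shows Envoy dialing the virtual address 240.0.41.1:80 directly and hitting a connect timeout. One thing that can be checked is whether the CNI-installed redirect rules are actually present in the allocation's network namespace (a sketch only; the namespace path is where Nomad usually creates bridge-mode namespaces, and the chain names/port are Consul's transparent-proxy defaults, so adjust for your environment):

```
# On the Nomad client: list the network namespaces created for allocations
sudo ip netns list

# Dump the NAT rules inside the allocation's namespace and look for the
# Consul transparent-proxy chains that should redirect outbound traffic
# (default outbound listener port 15001) into the Envoy sidecar.
# <alloc_id> is a placeholder for the allocation's namespace name.
sudo nsenter --net=/var/run/netns/<alloc_id> iptables -t nat -S | grep -i consul
```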

If you go inside the Envoy proxy container and try to send a request locally (to the Envoy listener), you see the same issue:

root@d87055ce8426:/# curl -v localhost:28162
*   Trying 127.0.0.1:28162...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 28162 (#0)
> GET / HTTP/1.1
> Host: localhost:28162
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Thu, 04 Jul 2024 22:23:30 GMT
< server: envoy
< 
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: connection failure

At the same time, a request to my-service by its Consul name (and virtual IP) succeeds:

root@d87055ce8426:/# curl -v my-service.virtual.consul
*   Trying 240.0.41.1:80...
* TCP_NODELAY set
* Connected to my-service.virtual.consul (240.0.41.1) port 80 (#0)
> GET / HTTP/1.1
> Host: my-service.virtual.consul
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: fasthttp
< Date: Thu, 04 Jul 2024 22:24:07 GMT
< Content-Type: application/json; charset=utf8
< Content-Length: 57
< 
* Connection #0 to host my-service.virtual.consul left intact
{"jsonrpc": "2.0", "id": "test", "result": "ok"}

I also have three other services in the cluster running with transparent proxy, and there is connectivity between them. So I guess the problem is with the Envoy proxy (or my Envoy proxy configuration). I tried different versions of Envoy, including the latest.
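For comparing the working upstreams job with the failing transparent-proxy one, the sidecar's Envoy admin interface can show what endpoints Envoy actually resolves for the my-service cluster (a sketch; 19000 is the usual Consul Connect admin port, and the admin interface may be bound to a different local address/port in a Nomad setup, so adjust as needed):

```
# From inside the sidecar container: list clusters and the endpoints Envoy
# will dial for my-service, then dump the full config for closer inspection.
curl -s localhost:19000/clusters | grep my-service
curl -s localhost:19000/config_dump > /tmp/envoy_config_dump.json
```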

Any help would be appreciated.

ruslan-y commented 1 month ago

Updated Consul to version 1.19.1 and Nomad to version 1.8.2. The problem still exists.