hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.

Envoy proxy doesn't work correctly in Consul transparent proxy mode. #21517

Open ruslan-y opened 2 months ago

ruslan-y commented 2 months ago

Hi there!

I'm going to describe my problem in detail, so expect a lot of text, logs, and configs.

Nomad version

Nomad v1.8.0
BuildDate 2024-05-28T17:38:17Z
Revision 28b82e4b2259fae5a62e2ed47395334bea5a24c4

Consul version

Consul v1.19.0
Revision bf0166d8
Build Date 2024-06-12T13:59:10Z

Operating system and Environment details

5.10.0-23-amd64 #1 SMP Debian 5.10.179-2 (2023-07-14) x86_64 GNU/Linux

Nomad client config

```
name = "host"
region = "global"
datacenter = "dc1"
enable_debug = false
disable_update_check = false
bind_addr = ""
advertise {
  http = ":4646"
  rpc = ":4647"
  serf = ":4648"
}
ports {
  http = 4646
  rpc = 4647
  serf = 4648
}
consul {
  address = "localhost:8500"
  ssl = false
  ca_file = ""
  grpc_ca_file = ""
  cert_file = ""
  key_file = ""
  token = ""
  server_service_name = "nomad-servers"
  client_service_name = "nomad-clients"
  tags = []
  auto_advertise = true
  server_auto_join = true
  client_auto_join = true
}
data_dir = "/var/nomad"
log_level = "INFO"
enable_syslog = true
leave_on_terminate = true
leave_on_interrupt = false
tls {
  http = true
  rpc = true
  ca_file = "/etc/nomad/ssl/nomad-ca.pem"
  cert_file = "/etc/nomad/ssl/client.pem"
  key_file = "/etc/nomad/ssl/client-key.pem"
  rpc_upgrade_mode = false
  verify_server_hostname = "true"
  verify_https_client = "true"
}
acl {
  enabled = true
  token_ttl = "30s"
  policy_ttl = "30s"
  replication_token = ""
}
vault {
  enabled = true
  address = "https://"
  allow_unauthenticated = true
  create_from_role = "nomad-cluster"
  task_token_ttl = ""
  ca_file = ""
  ca_path = ""
  cert_file = ""
  key_file = ""
  tls_server_name = ""
  tls_skip_verify = false
  namespace = ""
}
telemetry {
  disable_hostname = "true"
  collection_interval = "15s"
  use_node_name = "false"
  publish_allocation_metrics = "true"
  publish_node_metrics = "true"
  filter_default = "true"
  prefix_filter = []
  disable_dispatched_job_summary_metrics = "false"
  statsite_address = ""
  statsd_address = ""
  datadog_address = ""
  datadog_tags = []
  prometheus_metrics = "true"
  circonus_api_token = ""
  circonus_api_app = "nomad"
  circonus_api_url = "https://api.circonus.com/v2"
  circonus_submission_interval = "10s"
  circonus_submission_url = ""
  circonus_check_id = ""
  circonus_check_force_metric_activation = "false"
  circonus_check_instance_id = ""
  circonus_check_search_tag = ""
  circonus_check_display_name = ""
  circonus_check_tags = ""
  circonus_broker_id = ""
  circonus_broker_select_tag = ""
}
autopilot {
  cleanup_dead_servers = true
  last_contact_threshold = "1s"
  max_trailing_logs = 250
  server_stabilization_time = "10s"
}
```

Consul client config

```
{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true,
    "enabled": true,
    "token_ttl": "30s",
    "tokens": {
      "agent": "",
      "agent_recovery": ""
    }
  },
  "addresses": {
    "dns": "",
    "grpc": "",
    "grpc_tls": "",
    "http": "",
    "https": ""
  },
  "advertise_addr": "",
  "advertise_addr_wan": "",
  "auto_encrypt": {
    "tls": true
  },
  "bind_addr": "",
  "client_addr": "",
  "connect": {
    "enabled": true
  },
  "data_dir": "/opt/consul",
  "datacenter": "dc1",
  "disable_update_check": false,
  "domain": "consul",
  "enable_local_script_checks": false,
  "enable_script_checks": false,
  "enable_syslog": true,
  "encrypt": "",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "limits": {
    "http_max_conns_per_client": 400,
    "rpc_max_conns_per_client": 200
  },
  "log_level": "INFO",
  "node_name": "host",
  "performance": {
    "leave_drain_time": "10s",
    "raft_multiplier": 1,
    "rpc_hold_timeout": "30s"
  },
  "ports": {
    "dns": 8600,
    "grpc": 8502,
    "grpc_tls": 8503,
    "http": 8500,
    "https": -1,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  },
  "primary_datacenter": "dc1",
  "raft_protocol": 3,
  "recursors": ["", ""],
  "retry_interval": "30s",
  "retry_join": ["", "", ""],
  "retry_max": 0,
  "server": false,
  "syslog_facility": "local0",
  "tls": {
    "defaults": {
      "ca_file": "/etc/consul/ssl/consul-agent-ca.pem",
      "tls_min_version": "TLSv1_2",
      "verify_incoming": false,
      "verify_outgoing": true
    },
    "https": {
      "verify_incoming": false
    },
    "internal_rpc": {
      "verify_incoming": true,
      "verify_server_hostname": false
    }
  },
  "translate_wan_addrs": false,
  "ui_config": {
    "enabled": false
  }
}
```


There is a running Nomad job (Envoy proxy) in the cluster that proxies requests to my-service. I'm using Consul Connect upstreams in my Nomad jobs, and that works perfectly.

Example of my Nomad job (upstreams):

```
job "test-proxy-job" {
  datacenters = ["dc1"]
  namespace = "test"
  type = "service"

  group "test-proxy-group" {
    count = 1

    vault {
      policies = ["nomad-services"]
    }

    update {
      max_parallel = 1
      canary = 1
      auto_revert = true
      auto_promote = true
      min_healthy_time = "10s"
      healthy_deadline = "5m"
      progress_deadline = "15m"
    }

    network {
      mode = "bridge"
      port "http" {
        static = 28123
        to = 28162
      }
    }

    service {
      name = "test-proxy"
      port = "28162"
      tags = ["proxy", "test"]

      connect {
        sidecar_task {
          resources {
            cpu = 500
            memory = 300
          }
          config {
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "debug",
              "--concurrency",
              "${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }
        }

        sidecar_service {
          proxy {
            upstreams {
              destination_name = "my-service"
              local_bind_port = 10007
            }
          }
        }
      }
    }

    task "test-proxy-task" {
      driver = "docker"
      template {
        data = <
```

If I make a request, I get the expected response from my-service:

curl -v -H 'x-test-envoy: true' http://<external_ip>:28123
*   Trying <external_ip>:28123...
* Connected to <external_ip> (<external_ip>) port 28123
> GET / HTTP/1.1
> Host: <external_ip>:28123
> User-Agent: curl/8.6.0
> Accept: */*
> x-test-envoy: true
< HTTP/1.1 200 OK
< server: envoy
< date: Thu, 04 Jul 2024 19:01:21 GMT
< content-type: application/json; charset=utf8
< content-length: 57
< x-envoy-upstream-service-time: 6
* Connection #0 to host <external_ip> left intact
{"jsonrpc": "2.0", "id": "test", "result": "ok"}%

When I enable transparent proxy in the config, it doesn't work.

Example of my Nomad job (transparent proxy):

```
job "test-proxy-job" {
  datacenters = ["dc1"]
  namespace = "test"
  type = "service"

  group "test-proxy-group" {
    count = 1

    vault {
      policies = ["nomad-services"]
    }

    update {
      max_parallel = 1
      canary = 1
      auto_revert = true
      auto_promote = true
      min_healthy_time = "10s"
      healthy_deadline = "5m"
      progress_deadline = "15m"
    }

    network {
      mode = "bridge"
      port "http" {
        static = 28123
        to = 28162
      }
    }

    service {
      name = "test-proxy"
      port = "28162"
      tags = ["proxy", "test"]

      connect {
        sidecar_task {
          resources {
            cpu = 500
            memory = 300
          }
          config {
            args = [
              "-c",
              "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "-l",
              "debug",
              "--concurrency",
              "${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }
        }

        sidecar_service {
          proxy {
            transparent_proxy {}
          }
        }
      }
    }

    task "test-proxy-task" {
      driver = "docker"
      template {
        data = <
```

Requests don't go through; I get a 503 error and no response from my-service:

curl -v -H 'x-test-envoy: true' http://<external_ip>:28123
*   Trying <external_ip>:28123...
* Connected to <external_ip> (<external_ip>) port 28123
> GET / HTTP/1.1
> Host: <external_ip>:28123
> User-Agent: curl/8.6.0
> Accept: */*
> x-test-envoy: true
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Thu, 04 Jul 2024 20:55:09 GMT
< server: envoy
* Connection #0 to host <external_ip> left intact
upstream connect error or disconnect/reset before headers. reset reason: connection failure%

Some debug logs from Envoy:

```
[2024-07-04 20:55:04.149][21][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_tcp_listener.cc:155] [C205] new connection from :44066
[2024-07-04 20:55:04.149][21][debug][http] [source/common/http/conn_manager_impl.cc:375] [C205] new stream
[2024-07-04 20:55:04.149][21][debug][http] [source/common/http/conn_manager_impl.cc:1118] [C205][S13244765374556234856] request headers complete (end_stream=true):
':authority', ':28123'
':path', '/'
':method', 'GET'
'user-agent', 'curl/8.6.0'
'accept', '*/*'
'x-test-envoy', 'true'
[2024-07-04 20:55:04.149][21][debug][http] [source/common/http/conn_manager_impl.cc:1101] [C205][S13244765374556234856] request end stream
[2024-07-04 20:55:04.149][21][debug][connection] [./source/common/network/connection_impl.h:98] [C205] current connecting state: false
[2024-07-04 20:55:04.149][21][debug][router] [source/common/router/router.cc:478] [C205][S13244765374556234856] cluster 'my-service' match for URL '/'
[2024-07-04 20:55:04.149][21][debug][router] [source/common/router/router.cc:686] [C205][S13244765374556234856] router decoding headers:
':authority', ':28123'
':path', '/'
':method', 'GET'
':scheme', 'http'
'user-agent', 'curl/8.6.0'
'accept', '*/*'
'x-test-envoy', 'true'
'x-forwarded-proto', 'http'
'x-request-id', ''
[2024-07-04 20:55:04.149][21][debug][pool] [source/common/http/conn_pool_base.cc:78] queueing stream due to no available connections (ready=0 busy=0 connecting=0)
[2024-07-04 20:55:04.149][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:291] trying to create new connection
[2024-07-04 20:55:04.149][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:145] creating a new connection (connecting=0)
[2024-07-04 20:55:04.150][21][debug][connection] [./source/common/network/connection_impl.h:98] [C206] current connecting state: true
[2024-07-04 20:55:04.150][21][debug][client] [source/common/http/codec_client.cc:57] [C206] connecting
[2024-07-04 20:55:04.150][21][debug][connection] [source/common/network/connection_impl.cc:941] [C206] connecting to
[2024-07-04 20:55:04.150][21][debug][connection] [source/common/network/connection_impl.cc:960] [C206] connection in progress
[2024-07-04 20:55:06.029][1][debug][main] [source/server/server.cc:265] flushing stats
[2024-07-04 20:55:07.079][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:354] dns resolution for my-service.virtual.consul started
[2024-07-04 20:55:07.081][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:275] dns resolution for my-service.virtual.consul completed with status 0
[2024-07-04 20:55:07.081][1][debug][upstream] [source/common/upstream/upstream_impl.cc:457] transport socket match, socket default selected for host with address
[2024-07-04 20:55:07.081][1][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:177] DNS refresh rate reset for my-service.virtual.consul, refresh rate 5000 ms
[2024-07-04 20:55:08.321][15][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_tcp_listener.cc:155] [C207] new connection from
[2024-07-04 20:55:08.322][15][debug][connection] [source/common/network/connection_impl.cc:656] [C207] remote close
[2024-07-04 20:55:08.322][15][debug][connection] [source/common/network/connection_impl.cc:250] [C207] closing socket: 0
[2024-07-04 20:55:08.322][15][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_stream_listener_base.cc:121] [C207] adding to cleanup list
[2024-07-04 20:55:09.148][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:793] [C206] connect timeout
[2024-07-04 20:55:09.148][21][debug][connection] [source/common/network/connection_impl.cc:139] [C206] closing data_to_write=0 type=1
[2024-07-04 20:55:09.148][21][debug][connection] [source/common/network/connection_impl.cc:250] [C206] closing socket: 1
[2024-07-04 20:55:09.148][21][debug][client] [source/common/http/codec_client.cc:107] [C206] disconnect. resetting 0 pending requests
[2024-07-04 20:55:09.148][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:484] [C206] client disconnected, failure reason:
[2024-07-04 20:55:09.148][21][debug][router] [source/common/router/router.cc:1279] [C205][S13244765374556234856] upstream reset: reset reason: connection failure, transport failure reason:
[2024-07-04 20:55:09.148][21][debug][http] [source/common/http/filter_manager.cc:996] [C205][S13244765374556234856] Sending local reply with details upstream_reset_before_response_started{connection_failure}
[2024-07-04 20:55:09.148][21][debug][http] [source/common/http/conn_manager_impl.cc:1773] [C205][S13244765374556234856] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '91'
'content-type', 'text/plain'
'date', 'Thu, 04 Jul 2024 20:55:09 GMT'
'server', 'envoy'
[2024-07-04 20:55:09.148][21][debug][http] [source/common/http/conn_manager_impl.cc:1865] [C205][S13244765374556234856] Codec completed encoding stream.
[2024-07-04 20:55:09.148][21][debug][pool] [source/common/conn_pool/conn_pool_base.cc:454] invoking idle callbacks - is_draining_for_deletion_=false
[2024-07-04 20:55:10.201][21][debug][connection] [source/common/network/connection_impl.cc:656] [C205] remote close
[2024-07-04 20:55:10.201][21][debug][connection] [source/common/network/connection_impl.cc:250] [C205] closing socket: 0
[2024-07-04 20:55:10.201][21][debug][conn_handler] [source/extensions/listener_managers/listener_manager/active_stream_listener_base.cc:121] [C205] adding to cleanup list
[2024-07-04 20:55:11.031][1][debug][main] [source/server/server.cc:265] flushing stats
[2024-07-04 20:55:12.081][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:354] dns resolution for my-service.virtual.consul started
[2024-07-04 20:55:12.083][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:275] dns resolution for my-service.virtual.consul completed with status 0
[2024-07-04 20:55:12.083][1][debug][upstream] [source/common/upstream/upstream_impl.cc:457] transport socket match, socket default selected for host with address
[2024-07-04 20:55:12.083][1][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:177] DNS refresh rate reset for my-service.virtual.consul, refresh rate 5000 ms
```
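
One way to read the timeline in these logs is to pair each connection's `connecting` entry with its `connect timeout` entry and measure the gap. The following is a minimal Python sketch of that idea, not part of any Envoy or Consul tooling. The `ENTRY` pattern and the `connect_timeouts` helper are hypothetical names, and the pattern assumes one debug entry per line, with the bracketed prefix layout seen in the log above.

```python
import re
from datetime import datetime

# Assumed entry layout: [timestamp][thread][level][component] [file:line] [C<id>][S<id>] message
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[\d+\]\[\w+\]\[\w+\] "
    r"\[[^\]]+\] \[C(?P<conn>\d+)\](?:\[S\d+\])? (?P<msg>.*)$"
)


def connect_timeouts(lines):
    """Map connection id -> seconds between 'connecting' and 'connect timeout'."""
    started = {}
    elapsed = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # header continuation lines and entries without a [C..] id
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        conn, msg = match["conn"], match["msg"]
        if msg == "connecting":
            started[conn] = ts
        elif msg == "connect timeout" and conn in started:
            elapsed[conn] = (ts - started[conn]).total_seconds()
    return elapsed


# Three entries copied from the log above.
SAMPLE = [
    "[2024-07-04 20:55:04.150][21][debug][client] "
    "[source/common/http/codec_client.cc:57] [C206] connecting",
    "[2024-07-04 20:55:08.322][15][debug][connection] "
    "[source/common/network/connection_impl.cc:656] [C207] remote close",
    "[2024-07-04 20:55:09.148][21][debug][pool] "
    "[source/common/conn_pool/conn_pool_base.cc:793] [C206] connect timeout",
]

if __name__ == "__main__":
    for conn, seconds in connect_timeouts(SAMPLE).items():
        print(f"C{conn}: connect timed out after {seconds:.3f}s")
```

For the entries above, C206 times out 4.998 s after it starts connecting. That suggests the upstream connection attempt got no answer at all before Envoy gave up, rather than being actively refused.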

If you go inside the Envoy proxy container and try to send a request locally (to the Envoy listener), you see the same issue:

root@d87055ce8426:/# curl -v localhost:28162
*   Trying
* Connected to localhost ( port 28162 (#0)
> GET / HTTP/1.1
> Host: localhost:28162
> User-Agent: curl/7.68.0
> Accept: */*
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Thu, 04 Jul 2024 22:23:30 GMT
< server: envoy
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: connection failure

At the same time, a request to my-service by Consul name (and virtual IP) passes:

root@d87055ce8426:/# curl -v my-service.virtual.consul
*   Trying
* Connected to my-service.virtual.consul ( port 80 (#0)
> GET / HTTP/1.1
> Host: my-service.virtual.consul
> User-Agent: curl/7.68.0
> Accept: */*
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: fasthttp
< Date: Thu, 04 Jul 2024 22:24:07 GMT
< Content-Type: application/json; charset=utf8
< Content-Length: 57
* Connection #0 to host my-service.virtual.consul left intact
{"jsonrpc": "2.0", "id": "test", "result": "ok"}

I also have three other services in the cluster running with transparent proxy, and there is connectivity between them. So I guess the problem is with Envoy proxy (or my configuration of Envoy proxy). I tried different versions of Envoy, including the latest.

Any help would be appreciated.

ruslan-y commented 1 month ago

Updated Consul to version 1.19.1 and Nomad to version 1.8.2. The problem still exists.