hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Envoy Proxy for Terminating Gateway fails to configure dynamic cluster #21370

Open cyclops23 opened 2 months ago

cyclops23 commented 2 months ago

Nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Consul version

Consul v1.18.1
Revision 98cb473c
Build Date 2024-03-26T21:59:08Z

Operating system and Environment details

Linux ip-XX-XX-XXX-XXX 5.15.0-1056-aws #61~20.04.1-Ubuntu SMP Wed Mar 13 17:45:04 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Issue

I'm attempting to set up a terminating gateway for DynamoDB. The Envoy proxy starts successfully, but the dynamic cluster representing the terminating gateway service is never added.

In the Consul / Nomad UIs everything looks good:

[Screenshots: Consul UI and Nomad UI, 2024-04-18]

However, the external service is not accessible through the service mesh.

Reproduction steps

  1. Create an external service in Consul
  2. Create a terminating gateway job in Nomad

I've uploaded the relevant configuration files to https://github.com/cyclops23/nomad-bug-tgw
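
Roughly, the two pieces look like this. This is only a minimal sketch: the address, port, and registration method shown here are illustrative, and the authoritative files are in the repo linked above.

```hcl
# (1) External service definition, registered with the local agent via
#     something like `consul services register ext-dynamodb.hcl`.
#     Address and port are illustrative values.
service {
  name    = "ext-dynamodb"
  address = "dynamodb.us-east-1.amazonaws.com"
  port    = 443
}

# (2) Group-level service stanza of the Nomad terminating gateway job,
#     linking the gateway service to the external service.
service {
  name = "ext-dynamodb-tgw"

  connect {
    gateway {
      proxy {}

      terminating {
        service {
          name = "ext-dynamodb"
        }
      }
    }
  }
}
```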

Expected Result

The external service should be accessible through the terminating gateway (or some meaningful error message should be provided if there is a problem with the configuration).

I expect the dynamic cluster representing the external service to be added to Envoy, as in this example:

[2024-04-18 11:42:34.877][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:222] cm init: initializing cds
[2024-04-18 11:42:34.879][1][info][main] [source/server/server.cc:934] starting main dispatch loop
[2024-04-18 11:42:34.884][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 1 cluster(s), remove 0 cluster(s)
[2024-04-18 11:42:34.919][1][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2024-04-18 11:42:34.921][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:226] cm init: all clusters initialized

Actual Result

Requests to the external service are routed to the terminating gateway and fail.

Inspecting the Envoy logs shows that the cluster for the gateway is never added via xDS:

[info] cm init: initializing cds
[info] starting main dispatch loop
[debug] [Tags: \"ConnectionId\":\"0\"] connected
[debug] [Tags: \"ConnectionId\":\"0\"] connected
[debug] [Tags: \"ConnectionId\":\"0\"] attaching to next stream
[debug] [Tags: \"ConnectionId\":\"0\"] creating stream
[debug] [Tags: \"ConnectionId\":\"0\",\"StreamId\":\"10539488469085721247\"] pool ready
[debug] [Tags: \"ConnectionId\":\"0\",\"StreamId\":\"10539488469085721247\"] upstream headers complete: end_stream=false
[debug] async http request response headers (end_stream=false):\n':status', '200'\n'content-type', 'application/grpc'\n
[debug] Received DeltaDiscoveryResponse for type.googleapis.com/envoy.config.cluster.v3.Cluster at version 
[info] cds: add 0 cluster(s), remove 0 cluster(s)
[info] cds: added/updated 0 cluster(s), skipped 0 unmodified cluster(s)
[debug] maybe finish initialize state: 4
[debug] maybe finish initialize primary init clusters empty: true
[debug] maybe finish initialize secondary init clusters empty: true
[debug] maybe finish initialize cds api ready: true
[info] cm init: all clusters initialized
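
The Envoy admin interface (bound to 127.0.0.2:19000 per the bootstrap config below) can be queried to double-check this. A sketch of the checks, which should turn up no dynamic cluster for the service:

# curl -s http://127.0.0.2:19000/clusters | grep -i dynamodb
# curl -s http://127.0.0.2:19000/stats | grep cluster_manager.cds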

Additional config / debug info

# consul config read -kind terminating-gateway -name ext-dynamodb-tgw
{
    "Kind": "terminating-gateway",
    "Name": "ext-dynamodb-tgw",
    "Services": [
        {
            "Name": "ext-dynamodb",
            "CAFile": "/etc/ssl/certs/Amazon_Root_CA_1.pem",
            "SNI": "dynamodb.us-east-1.amazonaws.com"
        }
    ],
    "CreateIndex": 438709,
    "ModifyIndex": 438709
}
# curl -s -H "X-Consul-Token:${CONSUL_HTTP_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/catalog/service/ext-dynamodb-tgw" | jq  '.[] | { ServiceKind, ServiceName, ServiceID }'
{
  "ServiceKind": "terminating-gateway",
  "ServiceName": "ext-dynamodb-tgw",
  "ServiceID": "_nomad-task-f0b1b6d5-ef0f-ec7c-14c6-3112685453aa-group-ext-dynamodb-tgw-ext-dynamodb-tgw-connect-terminating-ext-dynamodb-tgw"
}
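For completeness, Consul's catalog can also report which services it has linked to the gateway; a sketch of that check, using the same env vars as above:

# curl -s -H "X-Consul-Token:${CONSUL_HTTP_TOKEN}" "${CONSUL_HTTP_ADDR}/v1/catalog/gateway-services/ext-dynamodb-tgw" | jq .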
# cat .envoy_bootstrap.cmd
connect envoy -grpc-addr unix://alloc/tmp/consul_grpc.sock -http-addr 127.0.0.1:8501 -admin-bind 127.0.0.2:19000 -address 127.0.0.1:19100 -proxy-id _nomad-task-f0b1b6d5-ef0f-ec7c-14c6-3112685453aa-group-ext-dynamodb-tgw-ext-dynamodb-tgw-connect-terminating-ext-dynamodb-tgw -bootstrap -gateway terminating -token <REDACTED> -grpc-ca-file /opt/consul/tls/ca.pem -ca-file /opt/consul/tls/ca.pem -client-cert /opt/nomad/tls/cert.pem -client-key /opt/nomad/tls/private-key.pem
# cat .envoy_bootstrap.env
[
    "LANG=C.UTF-8",
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin",
    "HOME=/root",
    "LOGNAME=root",
    "USER=root",
    "SHELL=/bin/sh",
    "INVOCATION_ID=92dfa72f396a42d69b4c3d62c526fcc5",
    "JOURNAL_STREAM=8:27232",
    "CONSUL_HTTP_SSL=true",
    "CONSUL_HTTP_SSL_VERIFY=false",
    "NOMAD_ALLOC_ID=f0b1b6d5-ef0f-ec7c-14c6-3112685453aa",
    "NOMAD_SHORT_ALLOC_ID=f0b1b6d5",
    "NOMAD_ALLOC_NAME=ext-dynamodb-tgw.ext-dynamodb-tgw[0]",
    "NOMAD_GROUP_NAME=ext-dynamodb-tgw",
    "NOMAD_JOB_NAME=ext-dynamodb-tgw",
    "NOMAD_JOB_ID=ext-dynamodb-tgw",
    "NOMAD_NAMESPACE=default",
    "NOMAD_REGION=global"
]
# cat envoy_bootstrap.json
{
  "admin": {
    "access_log": [
      {
        "name": "Consul Listener Filter Log",
        "typedConfig": {
          "@type": "type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog",
          "logFormat": {
            ...
          }
        }
      }
    ],
    "address": {
      "socket_address": {
        "address": "127.0.0.2",
        "port_value": 19000
      }
    }
  },
  "node": {
    "cluster": "terminating-gateway",
    "id": "_nomad-task-f0b1b6d5-ef0f-ec7c-14c6-3112685453aa-group-ext-dynamodb-tgw-ext-dynamodb-tgw-connect-terminating-ext-dynamodb-tgw",
    "metadata": {
      "namespace": "default",
      "partition": "default"
    }
  },
  "layered_runtime": {
    "layers": [
      {
        "name": "base",
        "static_layer": {
          "re2.max_program_size.error_level": 1048576
        }
      }
    ]
  },
  "static_resources": {
    "clusters": [
      {
        "name": "local_agent",
        "ignore_health_on_host_removal": false,
        "connect_timeout": "1s",
        "type": "STATIC",
        "transport_socket": {
          "name": "tls",
          "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
            "common_tls_context": {
              "validation_context": {
                "trusted_ca": {
                  "inline_string": "-----BEGIN CERTIFICATE-----\n<REDACTED\n-----END CERTIFICATE-----\n"
                }
              }
            }
          }
        },
        "typed_extension_protocol_options": {
          "envoy.extensions.upstreams.http.v3.HttpProtocolOptions": {
            "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "explicit_http_config": {
              "http2_protocol_options": {}
            }
          }
        },
        "loadAssignment": {
          "clusterName": "local_agent",
          "endpoints": [
            {
              "lbEndpoints": [
                {
                  "endpoint": {
                    "address": {
                      "pipe": {
                        "path": "alloc/tmp/consul_grpc.sock"
                      }
                    }
                  }
                }
              ]
            }
          ]
        }
      },
      {
        "name": "self_admin",
        "ignore_health_on_host_removal": false,
        "connect_timeout": "5s",
        "type": "STATIC",
        "typed_extension_protocol_options": {
          "envoy.extensions.upstreams.http.v3.HttpProtocolOptions": {
            "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
            "explicit_http_config": {
              "http_protocol_options": {}
            }
          }
        },
        "loadAssignment": {
          "clusterName": "self_admin",
          "endpoints": [
            {
              "lbEndpoints": [
                {
                  "endpoint": {
                    "address": {
                      "socket_address": {
                        "address": "127.0.0.2",
                        "port_value": 19000
                      }
                    }
                  }
                }
              ]
            }
          ]
        }
      }
    ],
    "listeners": [
      {
        "name": "envoy_prometheus_metrics_listener",
        "address": {
          "socket_address": {
            "address": "127.0.0.1",
            "port_value": 9102
          }
        },
        "filter_chains": [
          {
            "filters": [
              {
                "name": "envoy.filters.network.http_connection_manager",
                "typedConfig": {
                  "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
                  "stat_prefix": "envoy_prometheus_metrics",
                  "codec_type": "HTTP1",
                  "route_config": {
                    "name": "self_admin_route",
                    "virtual_hosts": [
                      {
                        "name": "self_admin",
                        "domains": [
                          "*"
                        ],
                        "routes": [
                          {
                            "match": {
                              "path": "/metrics"
                            },
                            "route": {
                              "cluster": "self_admin",
                              "prefix_rewrite": "/stats/prometheus"
                            }
                          },
                          {
                            "match": {
                              "prefix": "/"
                            },
                            "direct_response": {
                              "status": 404
                            }
                          }
                        ]
                      }
                    ]
                  },
                  "http_filters": [
                    {
                      "name": "envoy.filters.http.router",
                      "typedConfig": {
                        "@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
                      }
                    }
                  ]
                }
              }
            ]
          }
        ]
      }
    ]
  },
  "stats_config": {
    "stats_tags": [
      ...
    ],
    "use_all_default_tags": true
  },
  "dynamic_resources": {
    "lds_config": {
      "ads": {},
      "initial_fetch_timeout": "0s",
      "resource_api_version": "V3"
    },
    "cds_config": {
      "ads": {},
      "initial_fetch_timeout": "0s",
      "resource_api_version": "V3"
    },
    "ads_config": {
      "api_type": "DELTA_GRPC",
      "transport_api_version": "V3",
      "grpc_services": {
        "initial_metadata": [
          {
            "key": "x-consul-token",
            "value": "<REDACTED>"
          }
        ],
        "envoy_grpc": {
          "cluster_name": "local_agent"
        }
      }
    }
  }
}
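
Per the dynamic_resources block above, all gateway clusters and listeners must arrive over the delta ADS stream from the local_agent cluster. What Envoy actually received can be inspected via a config dump on the admin port (sketch):

# curl -s http://127.0.0.2:19000/config_dump | jq '.configs[] | select(."@type" | endswith("ClustersConfigDump"))'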

From the agent where the terminating gateway is running:

# curl -s http://127.0.0.0:8500/v1/agent/services | jq '.["_nomad-task-f0b1b6d5-ef0f-ec7c-14c6-3112685453aa-group-ext-dynamodb-tgw-ext-dynamodb-tgw-connect-terminating-ext-dynamodb-tgw"]'
{
  "Kind": "terminating-gateway",
  "ID": "_nomad-task-f0b1b6d5-ef0f-ec7c-14c6-3112685453aa-group-ext-dynamodb-tgw-ext-dynamodb-tgw-connect-terminating-ext-dynamodb-tgw",
  "Service": "ext-dynamodb-tgw",
  "Tags": [],
  "Meta": {
    "external-source": "nomad"
  },
  "Port": 28117,
  "Address": "<REDACTED>",
  "TaggedAddresses": {
    "lan_ipv4": {
      "Address": "<REDACTED>",
      "Port": 28117
    },
    "wan_ipv4": {
      "Address": "<REDACTED>",
      "Port": 28117
    }
  },
  "Weights": {
    "Passing": 1,
    "Warning": 1
  },
  "EnableTagOverride": false,
  "Proxy": {
    "Config": {
      "component_log_level": "upstream:trace,http:trace,router:trace,config:trace",
      "connect_timeout_ms": 5000,
      "envoy_gateway_bind_addresses": {
        "default": {
          "Address": "0.0.0.0",
          "Port": 28117
        }
      },
      "envoy_gateway_no_default_bind": true,
      "envoy_prometheus_bind_addr": "127.0.0.1:9102",
      "log_level": "debug",
      "protocol": "tcp"
    },
    "MeshGateway": {},
    "Expose": {},
    "AccessLogs": {
      "Enabled": true
    }
  },
  "Datacenter": "aws-us-east-1"
}
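
Given envoy_gateway_bind_addresses above, the gateway's public listener should end up bound on 0.0.0.0:28117 once LDS delivers it. Since CDS delivered nothing, LDS is worth checking the same way (sketch):

# curl -s http://127.0.0.2:19000/listeners
# ss -ltn | grep 28117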

Please let me know if there are additional debugging steps you can suggest, or if you need more information on the issue.

tgross commented 2 weeks ago

Hi @cyclops23! Apologies for the long delay in responding to this. Unfortunately I wasn't able to reproduce what you're seeing. I cloned the repository you linked, ran the deploy script, and got the following in the logs for the terminating GW proxy:

[2024-06-25 19:31:44.406][1][info][main] [source/server/server.cc:934] starting main dispatch loop
[2024-06-25 19:31:44.409][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 1 cluster(s), remove 0 cluster(s)
[2024-06-25 19:31:44.466][1][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2024-06-25 19:31:44.467][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:226] cm init: all clusters initialized
[2024-06-25 19:31:44.467][1][info][main] [source/server/server.cc:915] all clusters initialized. initializing init manager
[2024-06-25 19:31:44.470][1][info][upstream] [source/extensions/listener_managers/listener_manager/lds_api.cc:99] lds: add/update listener 'default:0.0.0.0:24076'
[2024-06-25 19:31:44.470][1][info][config] [source/extensions/listener_managers/listener_manager/listener_manager_impl.cc:923] all dependencies initialized. starting workers

Then I ran the following job to act as a test client (I've skipped using transparent proxy here, but that should work as well):

downstream jobspec:

```hcl
job "curl" {
  group "group" {
    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    service {
      name = "count-dashboard"
      port = "8001"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "ext-dynamodb"
              local_bind_port  = 8080
            }
          }
        }
      }
    }

    task "task" {
      driver = "docker"

      config {
        image   = "curlimages/curl:latest"
        command = "tail"
        args    = ["-f"]
        ports   = ["www"]
      }

      resources {
        cpu    = 128
        memory = 256
      }
    }
  }
}
```

That allocation starts up just fine, and I'm able to curl DynamoDB via the upstream:

$ nomad alloc exec -task task 83fd /bin/sh
~ $ curl localhost:8080
healthy: dynamodb.us-east-1.amazonaws.com
~ $ ^C

I'd have you check the Nomad server logs to see what happened when it registered the gateway, but I can see from your consul config read output that everything looks as I'd expect. Here's what mine looks like (with Consul Enterprise):

$ consul config read -kind terminating-gateway -name ext-dynamodb-tgw
{
    "Kind": "terminating-gateway",
    "Name": "ext-dynamodb-tgw",
    "Services": [
        {
            "Namespace": "default",
            "Name": "ext-dynamodb",
            "CAFile": "/etc/ssl/certs/Amazon_Root_CA_1.pem",
            "SNI": "dynamodb.us-east-1.amazonaws.com"
        }
    ],
    "CreateIndex": 1168,
    "ModifyIndex": 1168,
    "Partition": "default",
    "Namespace": "default"
}

At this point I feel pretty confident that Nomad has configured the gateway as you've requested. I'm going to transfer this issue over to the Consul repository, in hopes that folks there will have a better handle on where to look next.