hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

mesh-gateway fails to restart with peered connections in k8s when replicas > 1 #2509

Closed christophermichaeljohnston closed 1 year ago

christophermichaeljohnston commented 1 year ago

Overview of the Issue

consul: 1.15.3 consul-k8s-control-plane: 1.1.2

When running the mesh-gateway in k8s with replicas > 1, the gateways fail to restart (i.e. after a pod failure) on the establishing side of an active peered connection. This looks to be caused by Envoy rejecting the additional endpoints, which causes the mesh-gateway to eventually terminate and k8s to restart it in a loop. The only way to restore service is to reduce the number of mesh-gateway replicas to 1 on both sides of the peered connection.
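
To make the Envoy error below concrete: a LOGICAL_DNS cluster only accepts a single endpoint, so an xDS cluster shaped roughly like the following (purely illustrative values, not the actual generated config) is rejected as soon as more than one lb_endpoint is present:

name: server.<peer-name>.peering.<peer-uuid>.consul
type: LOGICAL_DNS                  # resolves DNS and only ever uses one address
load_assignment:
  cluster_name: server.<peer-name>.peering.<peer-uuid>.consul
  endpoints:
    - lb_endpoints:                # more than one entry here fails Envoy's validation
        - endpoint:
            address:
              socket_address: { address: internal-example.us-east-1.elb.amazonaws.com, port_value: 8443 }
        - endpoint:                # the same hostname can appear more than once when several addresses are advertised
            address:
              socket_address: { address: internal-example.us-east-1.elb.amazonaws.com, port_value: 8443 }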


Reproduction Steps

  1. Stand up 2 clusters in k8s with mesh-gateway replicas > 1.
  2. Peer the clusters (e.g. via the peering CRDs; a sketch follows these steps).
  3. Terminate a mesh-gateway pod on the establishing side; it never successfully starts up again.
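
For reference, the peering in step 2 is typically set up with the PeeringAcceptor and PeeringDialer CRDs; a minimal sketch (peer names and secret names here are illustrative, not taken from this environment):

apiVersion: consul.hashicorp.com/v1alpha1
kind: PeeringAcceptor
metadata:
  name: dc2                      # hypothetical name of the peer being accepted
spec:
  peer:
    secret:
      name: peering-token        # the generated token is written to this k8s secret
      key: data
      backend: kubernetes
---
# the token secret is copied to the other cluster, where the dialer consumes it
apiVersion: consul.hashicorp.com/v1alpha1
kind: PeeringDialer
metadata:
  name: dc1                      # hypothetical name of the peer being dialed
spec:
  peer:
    secret:
      name: peering-token
      key: data
      backend: kubernetes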

Logs

Logs from mesh-gateway:

2023-06-02T13:56:36.666Z+00:00 [info] envoy.upstream(14) cds: add 2 cluster(s), remove 0 cluster(s)
2023-06-02T13:56:36.697Z+00:00 [info] envoy.upstream(14) cds: added/updated 0 cluster(s), skipped 1 unmodified cluster(s)
2023-06-02T13:56:36.697Z+00:00 [warning] envoy.config(14) delta config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-06-02T13:56:36.697Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-06-02T13:56:51.664Z+00:00 [warning] envoy.config(14) gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
2023-06-02T13:56:51.664Z+00:00 [info] envoy.config(14) all dependencies initialized. starting workers
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane.metrics: stopping the merged  server
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane.server-connection-manager: stopping
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane: context done stopping xds server
2023-06-02T13:57:26.031Z [INFO]  consul-dataplane.metrics: stopping consul dp promtheus server
2023-06-02T13:57:26.034Z [INFO]  consul-dataplane: envoy process exited: error="signal: killed"
2023-06-02T13:57:26.036Z [INFO]  consul-dataplane.server-connection-manager: ACL auth method logout succeeded

Logs from consul server:

2023-06-02T13:57:16.602Z [ERROR] agent.envoy.xds.mesh_gateway: got error response from envoy proxy: service_id=consul-mesh-gateway-6bfb4b5b45-2qnl5 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster xdsVersion=v3 nonce=00000001 error="rpc error: code = Internal desc = Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint"
2023-06-02T13:57:16.637Z [ERROR] agent.envoy.xds.mesh_gateway: got error response from envoy proxy: service_id=consul-mesh-gateway-6bfb4b5b45-2qnl5 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster xdsVersion=v3 nonce=00000003 error="rpc error: code = Internal desc = Error adding/updating cluster(s) server.stage-04-use.peering.d43a86f0-f14f-4297-292a-2afe84e214b7.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint"
2023-06-02T13:58:15.982Z [ERROR] agent.envoy: Error receiving new DeltaDiscoveryRequest; closing request channel: error="rpc error: code = Canceled desc = context canceled"
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 key=mesh topic=MeshConfig
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 key=consul topic=ServiceHealth
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 key=consul topic=ServiceHealthConnect
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 topic=ServiceResolver wildcard_subject=true
2023-06-02T13:58:15.986Z [ERROR] agent.proxycfg.server-data-sources: subscribe call failed: err="subscription closed by server, client must reset state and resubscribe" failure_count=1 topic=ServiceList wildcard_subject=true

Expected behavior

It should be possible to run the mesh-gateway component with more than 1 replica.

Environment details

AWS EKS, Kubernetes 1.26

Additional Context

Originally created this in the consul issue tracker (https://github.com/hashicorp/consul/issues/17557) but will close that one, as this repo seems to be the better location for this issue.

Screenshot showing peering status on the establishing side (the dialer side doesn't include the server addresses in the UI):

Screen Shot 2023-07-05 at 12 19 39 PM

I wonder if this has anything to do with: https://github.com/envoyproxy/envoy/issues/14848

jm96441n commented 1 year ago

Hey @christophermichaeljohnston, just want to confirm this still needs investigating and that the resolution in your original issue didn't actually resolve it?

christophermichaeljohnston commented 1 year ago

Correct. This is still an issue. I have an environment up that I can gather information from to help with the investigation.

jm96441n commented 1 year ago

Okay, I'm going to start investigating on our end to replicate and track down the issue (also going to dig into what you've got here once I've got it replicated).

jm96441n commented 1 year ago

Hey @christophermichaeljohnston, I haven't been able to recreate the issue you're seeing. I've been using this setup, which pretty much follows this doc, with mesh replicas set to 2. Can you provide a minimal setup that reproduces the issue?

christophermichaeljohnston commented 1 year ago

Hi @jm96441n. I looked over your setup and mine is very similar. The main difference is that I used an AWS load balancer instead of MetalLB.

  meshGateway:
    enabled: true
    replicas: 3
    service:
      type: LoadBalancer
      annotations: |
        "service.beta.kubernetes.io/aws-load-balancer-internal": "true"

I've also tried using the UI to create the peering connection instead of the CRDs and the result was the same. So I wonder if this is caused by some difference between the AWS load balancer and MetalLB. Did the UI in your test look similar to the screenshot in the original post, with the same server name listed multiple times (or was it addresses)? Does that server name resolve to multiple addresses?

% dig internal-a05f2c2cc74ab48aaa8688d406114ee6-1525383021.us-east-1.elb.amazonaws.com +short
10.147.79.148
10.147.10.164
10.147.59.55

jm96441n commented 1 year ago

So I set up a new cluster on EKS without MetalLB (I was using MetalLB for an attempt at a local recreation using kind) and was not able to replicate. I created a peering connection through the mesh gateways with 2 replicas on the gateway, deleted a mesh-gateway pod on the establishing side, and it came right back up.
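
For context, peering through the mesh gateways is usually enabled with a Mesh config entry along these lines (a minimal sketch based on the standard CRD, not copied from this setup):

apiVersion: consul.hashicorp.com/v1alpha1
kind: Mesh
metadata:
  name: mesh
spec:
  peering:
    # route peering control-plane traffic through the mesh gateways
    peerThroughMeshGateways: true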

I do see the same as you do (multiple server addresses), but my understanding is that you'll see one address per mesh gateway instance; in this case all the mesh gateway instances are behind the same ELB, which is why you see the same address multiple times.

Can you provide a minimal setup that reproduces the issue that I can run?

christophermichaeljohnston commented 1 year ago

This minimal setup reproduces the mesh gateway restart failure. Note that I tried consul 1.16.0 and consul-k8s 1.2.0 with the same result, and that is what the minimal setup uses.

Screen Shot 2023-07-18 at 11 59 31 AM

christophermichaeljohnston commented 1 year ago

This continues to be a problem. Even after reducing replicas to a single mesh gateway, the consul servers don't think the peered connection is healthy. In the UI the peer state is 'Active', but 'consul.peering.healthy' returns 0. Is there a way to enable debug logs in the mesh gateway to get more information on what is happening?

natemollica-nm commented 1 year ago

@christophermichaeljohnston

You can increase the mesh-gateway Envoy log level by either setting the 'logLevel' value for the mesh gateway in the Helm chart (a sketch follows the command output below), or by hitting the Envoy admin interface (port 19000) on a running gateway pod:

$ curl -XPOST localhost:19000/logging\?level\=debug

active loggers:
  admin: debug
  alternate_protocols_cache: debug
  aws: debug
  assert: debug
# ---- component list cut for brevity ----

The above would change all the envoy component log-levels to debug.
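
For the Helm route mentioned above, the level is set under the meshGateway stanza in the values file; a minimal sketch showing only the relevant keys:

meshGateway:
  enabled: true
  # raises the log level of the mesh-gateway dataplane/Envoy containers
  logLevel: debug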

christophermichaeljohnston commented 1 year ago

Thanks. Not sure how I missed logLevel in the helm chart. I've captured logs from when the meshGateway starts to when it stops itself with a /quitquitquit. It's not clear why it self-destructs, other than Envoy never being fully initialized.

envoy.main(14) Envoy is not fully initialized

logs.txt

natemollica-nm commented 1 year ago

@christophermichaeljohnston

Looking through the logs I do see:

2023-09-19T08:42:47-04:00   2023-09-19T12:42:39.723782332Z stderr F 2023-09-19T12:42:39.723Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.88780324-49a4-249e-19e0-82c359abf2f5.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-09-19T08:42:47-04:00   2023-09-19T12:42:39.723777771Z stderr F 2023-09-19T12:42:39.723Z+00:00 [warning] envoy.config(14) delta config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.88780324-49a4-249e-19e0-82c359abf2f5.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint

I see you did attempt to set strict_dns via your proxy-defaults config entry, with no change in behavior.

I'm wondering if the AWS load balancer is interfering at all. I have seen issues with cross-zone load balancing in AWS. Have you tried enabling cross-zone load balancing to see if this helps?
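
For reference, cross-zone load balancing is usually requested with an extra service annotation on the mesh gateway; a sketch assuming the in-tree AWS load balancer annotation (not something verified in this thread):

meshGateway:
  service:
    type: LoadBalancer
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      # ask AWS to spread traffic across zones behind the ELB/NLB
      "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true"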

christophermichaeljohnston commented 1 year ago

Cross-zone load balancing has no impact. It results in the same crash loop with the same suspicious 'LOGICAL_DNS' log message. And consul_peering_healthy is still 0 (unhealthy) even though the peered connection is active.

natemollica-nm commented 1 year ago

@christophermichaeljohnston

I managed to get a working EKS reproduction up, and had no issues when deleting a mesh-gateway pod. It came right back up without issue.

This was tested with envoy_dns_discovery_type set to both LOGICAL_DNS and STRICT_DNS via the proxy-defaults CRD, so I don't suspect that being an issue here. A sketch of that override follows.
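
For reference, that override goes through the free-form config map of the ProxyDefaults CRD; a minimal sketch (not copied from the reproduction):

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  config:
    # controls the Envoy cluster discovery type Consul generates for DNS names
    envoy_dns_discovery_type: STRICT_DNS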

I'd recommend looking into something on the AWS networking or security group permissions side.

Reproduction Info

Versions:

consul-k8s overrides:

global:
  name: consul
  peering:
    enabled: true
  tls:
    enabled: true
    httpsOnly: false
  enterpriseLicense:
    secretName: license
    secretKey: key
    enableLicenseAutoload: true
  enableConsulNamespaces: true
  adminPartitions:
    enabled: true
    name: "default"
  acls:
    manageSystemACLs: true

connectInject:
  enabled: true
  default: true
  replicas: 2
  consulNamespaces:
    mirroringK8S: true
  k8sAllowNamespaces: ['*']
  k8sDenyNamespaces: []

syncCatalog:
  enabled: true
  k8sAllowNamespaces: ["*"]
  consulNamespaces:
    mirroringK8S: true

meshGateway:
  enabled: true
  replicas: 3
  service:
    type: LoadBalancer
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"

server:
  enabled: true
  replicas: 3
  extraConfig: |
    {
      "performance": {
        "raft_multiplier": 3
      },
      "telemetry": {
        "disable_hostname": true
      }
    }

ui:
  enabled: true
  service:
    type: LoadBalancer

vpc module terraform code

Note: I'm also leveraging AWS VPC peering between my EKS clusters.

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2"

  name = local.name

  cidr = var.eks-vpc.vpc
  azs  = local.azs

  private_subnets = [for k, v in local.azs : cidrsubnet(var.eks-vpc.vpc, 4, k)]
  public_subnets  = [for k, v in local.azs : cidrsubnet(var.eks-vpc.vpc, 8, k + 48)]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true # required for eks, default is false

  reuse_nat_ips           = true
  external_nat_ip_ids     = aws_eip.nat.*.id
  map_public_ip_on_launch = true # now required as of 04-2020 for EKS Nodes

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.name}" = "shared"
    "kubernetes.io/role/elb"              = 1
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.name}" = "shared"
    "kubernetes.io/role/internal-elb"     = 1
  }
}

eks module terraform code

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.5.1"

  cluster_name    = local.name
  cluster_version = var.k8s-version

  vpc_id                          = module.vpc.vpc_id
  subnet_ids                      = module.vpc.public_subnets
  cluster_endpoint_public_access  = true

  eks_managed_node_group_defaults = {
    ami_type = "AL2_x86_64"
  }

  eks_managed_node_groups = {
    consul = {
      name = "consul"

      instance_types = [var.eks-node-instance-type]

      min_size       = 1
      max_size       = 5
      desired_size   = 3
    }
  }

  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    ingress_cluster_all = {
      description                   = "Cluster to node all ports/protocols"
      protocol                      = "-1"
      from_port                     = 0
      to_port                       = 0
      type                          = "ingress"
      source_cluster_security_group = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }
}

christophermichaeljohnston commented 1 year ago

I've tried peering clusters in different regions and even within the same region in the same VPC, with the same subnets and wide-open SGs. The issue remains. Once clusters are peered, any mesh-gateway restart on the dialing side results in a crash loop. All I can determine from the logs is that the dataplane never fully initializes, so it terminates itself. And even with only a single mesh-gateway, the Prometheus metrics never indicate that peering is healthy, which is perhaps the cause of the crash loop with multiple mesh-gateways. I can't determine from the logs what part of the dataplane is having issues.

christophermichaeljohnston commented 1 year ago

I did manually create the clusters so perhaps there is something there... I'll try using the provided terraform instead.

christophermichaeljohnston commented 1 year ago

@natemollica-nm

No luck with those additional SG rules. :(

I've updated my consul_mesh_test repo with everything I've used. This test was 2 EKS clusters in the same VPC, using the same subnets.

I've also captured mesh gateway logs at 'trace' level. The logs are similar in that 'Envoy is not fully initialized' appears, which seems to trigger 'consul-dataplane.lifecycle: initiating shutdown'. But the trace level does show 'envoy.connection(13) [C0] read error: Resource temporarily unavailable'. Could this be the cause? What resource is unavailable?

consul-mesh-gateway-6ff977f9b5-cgj2w.log

christophermichaeljohnston commented 1 year ago

hashicorp/consul#19268 looks to be a possible solution to this problem.

david-yu commented 1 year ago

Closing as https://github.com/hashicorp/consul/pull/19268 does fix the issue. This has been isolated to AWS EKS environments and should go out with the next set of Consul patch releases.

PavelPikat commented 8 months ago

I am facing this issue on AKS as well, where the LoadBalancer doesn't get hostnames. I tried using both public and private LBs. The mesh gateway in the second cluster is in a restart loop. These are the logs:

2024-03-10T07:04:30.474Z+00:00 [debug] envoy.main(14) Envoy is not fully initialized, skipping histogram merge and flushing stats
2024-03-10T07:04:35.474Z+00:00 [debug] envoy.main(14) flushing stats
2024-03-10T07:04:35.474Z+00:00 [debug] envoy.main(14) Envoy is not fully initialized, skipping histogram merge and flushing stats
2024-03-10T07:04:40.475Z+00:00 [debug] envoy.main(14) flushing stats
2024-03-10T07:04:40.475Z+00:00 [debug] envoy.main(14) Envoy is not fully initialized, skipping histogram merge and flushing stats
2024-03-10T07:04:42.866Z [INFO]  consul-dataplane.lifecycle: initiating shutdown

P.S. I am evaluating Consul multi-cluster federation using this tutorial https://developer.hashicorp.com/consul/tutorials/kubernetes/kubernetes-mesh-gateways

Here's my Helm values for the second cluster:

      global:
        # The main enabled/disabled setting.
        # If true, servers, clients, Consul DNS and the Consul UI will be enabled.
        enabled: true
        # The name of the datacenter that the agents should register as.
        datacenter: eu1-aks-stg-1
        # Enables TLS across the cluster to verify authenticity of the Consul servers and clients.
        tls:
          enabled: true
          caCert:
            secretName: consul-federation
            secretKey: caCert
          caKey:
            secretName: consul-federation
            secretKey: caKey
        federations:
          enabled: true
          primaryDatacenter: eu1-aks-cp-stg-1
        argocd:
          enabled: true
        gossipEncryption:
          autoGenerate: false
          secretName: consul-federation
          secretKey: gossipEncryptionKey
      transparentProxy:
        defaultEnabled: false
      client:
        nodeSelector: |
          purpose: generic-workload
          kubernetes.io/os: linux
      # Configures values that configure the Consul server cluster.
      server:
        enabled: true
        # The number of server agents to run. This determines the fault tolerance of the cluster.
        replicas: 3
        extraVolumes:
          - type: secret
            name: consul-federation
            items:
              - key: serverConfigJSON
                path: config.json
            load: true
        disruptionBudget:
          enabled: false
        exposeService:
          enabled: true
          type: ClusterIP
        storageClass: managed-csi-premium-zrs
        nodeSelector: |
          purpose: generic-workload
          kubernetes.io/os: linux
        topologySpreadConstraints: |
          - maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: consul
                component: server
      # Contains values that configure the Consul UI.
      ui:
        enabled: false
        # Registers a Kubernetes Service for the Consul UI as a LoadBalancer.
        service:
          enabled: false
          type: ClusterIP
      # Configures and installs the automatic Consul Connect sidecar injector.
      connectInject:
        enabled: true
        nodeSelector: |
          purpose: generic-workload
          kubernetes.io/os: linux
        disruptionBudget:
          enabled: false
        apiGateway:
          managedGatewayClass:
            nodeSelector: |
              purpose: generic-workload
              kubernetes.io/os: linux
            serviceType: LoadBalancer
      meshGateway:
        enabled: true
        logLevel: debug
        nodeSelector: |
          purpose: generic-workload
          kubernetes.io/os: linux
        service:
          type: LoadBalancer
          annotations: |
            "service.beta.kubernetes.io/azure-load-balancer-internal": "false"
      webhookCertManager:
        nodeSelector: |
          purpose: generic-workload
          kubernetes.io/os: linux