Closed christophermichaeljohnston closed 1 year ago
Hey @christophermichaeljohnston, just want to confirm this still needs investigating and that the resolution in your original issue didn't actually resolve it?
Correct. This is still an issue. I have an environment up that I can gather information from to help with the investigation.
Okay, I'm going to start investigating on our end to replicate and track down the issue (also going to dig into what you've got here once I've got it replicated).
Hi @jm96441n. I looked over your setup and mine is very similar. The main difference is that I used an AWS load balancer instead of metallb.
meshGateway:
enabled: true
replicas: 3
service:
type: LoadBalancer
annotations: |
"service.beta.kubernetes.io/aws-load-balancer-internal": "true"
I've also tried using the UI to create the peering connection instead of the CRDs and the result was the same. So I wonder if this is caused by some difference between the AWS load balancer and metallb. Did the UI in your test look similar to the screenshot in the original post, with the same server name listed multiple times (or was it addresses)? Does that server name resolve to multiple addresses?
% dig internal-a05f2c2cc74ab48aaa8688d406114ee6-1525383021.us-east-1.elb.amazonaws.com +short
10.147.79.148
10.147.10.164
10.147.59.55
So I set up a new cluster on EKS without metallb (I was using metallb for an attempt at a local recreation using kind) and was not able to replicate. I created a peering connection through the mesh gateway with 2 replicas on the gateway, deleted a mesh gateway pod on the establishing side, and it came right back up.
I do see the same as you do (multiple server addresses), but my understanding is that you'll see one address per mesh gateway instance; in this case all the mesh gateway instances are behind the same ELB, which is why you see the same address multiple times.
Can you provide a minimal setup that reproduces the issue that I can run?
This minimal setup reproduces the mesh gateway restart failure. Note that I tried consul 1.16.0 and consul-k8s 1.2.0 with the same result, and that's what the minimal setup uses.
This continues to be a problem. Even after reducing replicas to a single mesh gateway, the consul servers don't consider the peered connection healthy. In the UI the peer state is 'Active', but 'consul.peering.healthy' returns 0. Is there a way to enable debug logs in the mesh gateway to get more information on what is happening?
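For context, the peering state can also be read directly from the servers; a sketch below, where the peer name my-peer is illustrative and the commands assume CLI/API access to a server with an ACL token exported:
$ consul peering read -name my-peer
$ curl --silent "${CONSUL_HTTP_ADDR}/v1/peering/my-peer" --header "X-Consul-Token: ${CONSUL_HTTP_TOKEN}"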
@christophermichaeljohnston
You can increase the mesh-gateway Envoy log level by either:
setting the mesh gateway logLevel value in the Helm chart
or
port-forwarding the Envoy admin port (19000) to your local machine and running:
$ curl -XPOST localhost:19000/logging\?level\=debug
active loggers:
admin: debug
alternate_protocols_cache: debug
aws: debug
assert: debug
# ---- component list cut for brevity ----
The above would change all of the Envoy component log levels to debug.
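For the port-forward itself, something like the following works (a sketch; the deployment name and namespace are assumptions based on a release named consul):
$ kubectl port-forward deploy/consul-mesh-gateway 19000:19000 --namespace consul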
Thanks. Not sure how I missed logLevel in the helm chart. I've captured logs from when the mesh gateway starts to when it stops itself with a /quitquitquit. It's not clear why it's doing a self destruct other than Envoy never being fully initialized.
envoy.main(14) Envoy is not fully initialized
@christophermichaeljohnston
Looking through the logs I do see:
2023-09-19T08:42:47-04:00 2023-09-19T12:42:39.723782332Z stderr F 2023-09-19T12:42:39.723Z+00:00 [warning] envoy.config(14) gRPC config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.88780324-49a4-249e-19e0-82c359abf2f5.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
2023-09-19T08:42:47-04:00 2023-09-19T12:42:39.723777771Z stderr F 2023-09-19T12:42:39.723Z+00:00 [warning] envoy.config(14) delta config for type.googleapis.com/envoy.config.cluster.v3.Cluster rejected: Error adding/updating cluster(s) server.stage-04-use.peering.88780324-49a4-249e-19e0-82c359abf2f5.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
I see you did attempt to set strict_dns via your proxy-defaults config entry, with no change in the result.
I'm wondering if the AWS load balancer is interfering at all. I have seen issues with cross-zone load balancing in AWS. Have you tried enabling cross-zone load balancing to see if that helps?
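For reference, a sketch of what that would look like in the mesh gateway service annotations, assuming the in-tree AWS cloud provider is provisioning the ELB:
meshGateway:
  service:
    type: LoadBalancer
    annotations: |
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled": "true"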
Cross-zone load balancing has no impact; it results in the same crash loop with the same suspicious 'LOGICAL_DNS' log message. And consul_peering_healthy is still 0 (unhealthy) even though the peered connection is active.
@christophermichaeljohnston
I managed to get a working EKS reproduction up and saw no issues when deleting a mesh-gateway pod; it came right back up.
This was tested with envoy_dns_discovery_type set to both LOGICAL_DNS and STRICT_DNS via the proxy-defaults CRD, so I don't suspect that being the issue here.
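For reference, roughly how envoy_dns_discovery_type can be set through the ProxyDefaults CRD (STRICT_DNS shown; LOGICAL_DNS is set the same way):
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global            # proxy-defaults config entries must be named "global"
spec:
  config:
    envoy_dns_discovery_type: STRICT_DNS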
I'd recommend looking into something with regard to your AWS networking or security group permissions.
Versions:
consul-k8s overrides:
global:
name: consul
peering:
enabled: true
tls:
enabled: true
httpsOnly: false
enterpriseLicense:
secretName: license
secretKey: key
enableLicenseAutoload: true
enableConsulNamespaces: true
adminPartitions:
enabled: true
name: "default"
acls:
manageSystemACLs: true
connectInject:
enabled: true
default: true
replicas: 2
consulNamespaces:
mirroringK8S: true
k8sAllowNamespaces: ['*']
k8sDenyNamespaces: []
syncCatalog:
enabled: true
k8sAllowNamespaces: ["*"]
consulNamespaces:
mirroringK8S: true
meshGateway:
enabled: true
replicas: 3
service:
type: LoadBalancer
annotations: |
"service.beta.kubernetes.io/aws-load-balancer-internal": "true"
server:
enabled: true
replicas: 3
extraConfig: |
{
"performance": {
"raft_multiplier": 3
},
"telemetry": {
"disable_hostname": true
}
}
ui:
enabled: true
service:
type: LoadBalancer
vpc module terraform code
note: I'm also leveraging aws vpc peering between my eks clusters
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.1.2"
name = local.name
cidr = var.eks-vpc.vpc
azs = local.azs
private_subnets = [for k, v in local.azs : cidrsubnet(var.eks-vpc.vpc, 4, k)]
public_subnets = [for k, v in local.azs : cidrsubnet(var.eks-vpc.vpc, 8, k + 48)]
enable_nat_gateway = true
single_nat_gateway = true
enable_dns_hostnames = true # required for eks, default is false
reuse_nat_ips = true
external_nat_ip_ids = aws_eip.nat.*.id
map_public_ip_on_launch = true # now required as of 04-2020 for EKS Nodes
public_subnet_tags = {
"kubernetes.io/cluster/${local.name}" = "shared"
"kubernetes.io/role/elb" = 1
}
private_subnet_tags = {
"kubernetes.io/cluster/${local.name}" = "shared"
"kubernetes.io/role/internal-elb" = 1
}
}
eks module terraform code
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "19.5.1"
cluster_name = local.name
cluster_version = var.k8s-version
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.public_subnets
cluster_endpoint_public_access = true
eks_managed_node_group_defaults = {
ami_type = "AL2_x86_64"
}
eks_managed_node_groups = {
consul = {
name = "consul"
instance_types = [var.eks-node-instance-type]
min_size = 1
max_size = 5
desired_size = 3
}
}
node_security_group_additional_rules = {
ingress_self_all = {
description = "Node to node all ports/protocols"
protocol = "-1"
from_port = 0
to_port = 0
type = "ingress"
self = true
}
ingress_cluster_all = {
description = "Cluster to node all ports/protocols"
protocol = "-1"
from_port = 0
to_port = 0
type = "ingress"
source_cluster_security_group = true
}
egress_all = {
description = "Node all egress"
protocol = "-1"
from_port = 0
to_port = 0
type = "egress"
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}
}
}
I've tried peering clusters in different regions, and even within the same region in the same VPC with the same subnets and wide-open security groups. The issue remains: once the clusters are peered, any mesh-gateway restart on the dialing side results in a crash loop. All I can determine from the logs is that the dataplane never fully initializes, so it terminates itself. And even with only a single mesh-gateway, the prometheus metrics never indicate that peering is healthy, which is perhaps the cause of the crash loop with multiple mesh-gateways. I can't determine from the logs which part of the dataplane is having issues.
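For reference, this is roughly how the restart is being triggered and observed (a sketch; the component=mesh-gateway label selector and the consul namespace are assumptions based on the default chart labels):
$ kubectl delete pod --namespace consul --selector component=mesh-gateway
$ kubectl get pods --namespace consul --selector component=mesh-gateway --watch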
I did manually create the clusters so perhaps there is something there... I'll try using the provided terraform instead.
@natemollica-nm
No luck with those additional SG rules. :(
I've updated my consul_mesh_test repo with everything I've used. This test was 2 eks clusters in the same vpc, using the same subnets.
I've also captured mesh gateway logs at 'trace' level. The logs are similar in that 'Envoy is not fully initialized' appears, which seems to trigger 'consul-dataplane.lifecycle: initiating shutdown'. But the trace level does include 'envoy.connection(13) [C0] read error: Resource temporarily unavailable'. Could this be the cause? What resource is unavailable?
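For reference, trace logging can be set through the same Helm value mentioned above; a minimal sketch of the relevant block:
meshGateway:
  enabled: true
  logLevel: trace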
#19268 looks to be a possible solution to this problem
Closing as https://github.com/hashicorp/consul/pull/19268 does fix the issue. This has been isolated to AWS EKS environments and should go out with the next set of Consul patch releases.
I am facing this issue on AKS as well, where the LoadBalancer doesn't have a hostname. I tried using both public and private LBs. The mesh gateway in the second cluster is in a restart loop. These are the logs:
2024-03-10T07:04:30.474Z+00:00 [debug] envoy.main(14) Envoy is not fully initialized, skipping histogram merge and flushing stats
2024-03-10T07:04:35.474Z+00:00 [debug] envoy.main(14) flushing stats
2024-03-10T07:04:35.474Z+00:00 [debug] envoy.main(14) Envoy is not fully initialized, skipping histogram merge and flushing stats
2024-03-10T07:04:40.475Z+00:00 [debug] envoy.main(14) flushing stats
2024-03-10T07:04:40.475Z+00:00 [debug] envoy.main(14) Envoy is not fully initialized, skipping histogram merge and flushing stats
2024-03-10T07:04:42.866Z [INFO] consul-dataplane.lifecycle: initiating shutdown
P.S. I am evaluating Consul multi-cluster federation using this tutorial https://developer.hashicorp.com/consul/tutorials/kubernetes/kubernetes-mesh-gateways
Here's my Helm values for the second cluster:
global:
# The main enabled/disabled setting.
# If true, servers, clients, Consul DNS and the Consul UI will be enabled.
enabled: true
# The name of the datacenter that the agents should register as.
datacenter: eu1-aks-stg-1
# Enables TLS across the cluster to verify authenticity of the Consul servers and clients.
tls:
enabled: true
caCert:
secretName: consul-federation
secretKey: caCert
caKey:
secretName: consul-federation
secretKey: caKey
federations:
enabled: true
primaryDatacenter: eu1-aks-cp-stg-1
argocd:
enabled: true
gossipEncryption:
autoGenerate: false
secretName: consul-federation
secretKey: gossipEncryptionKey
transparentProxy:
defaultEnabled: false
client:
nodeSelector: |
purpose: generic-workload
kubernetes.io/os: linux
# Configures values that configure the Consul server cluster.
server:
enabled: true
# The number of server agents to run. This determines the fault tolerance of the cluster.
replicas: 3
extraVolumes:
- type: secret
name: consul-federation
items:
- key: serverConfigJSON
path: config.json
load: true
disruptionBudget:
enabled: false
exposeService:
enabled: true
type: ClusterIP
storageClass: managed-csi-premium-zrs
nodeSelector: |
purpose: generic-workload
kubernetes.io/os: linux
topologySpreadConstraints: |
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: consul
component: server
# Contains values that configure the Consul UI.
ui:
enabled: false
# Registers a Kubernetes Service for the Consul UI as a LoadBalancer.
service:
enabled: false
type: ClusterIP
# Configures and installs the automatic Consul Connect sidecar injector.
connectInject:
enabled: true
nodeSelector: |
purpose: generic-workload
kubernetes.io/os: linux
disruptionBudget:
enabled: false
apiGateway:
managedGatewayClass:
nodeSelector: |
purpose: generic-workload
kubernetes.io/os: linux
serviceType: LoadBalancer
meshGateway:
enabled: true
logLevel: debug
nodeSelector: |
purpose: generic-workload
kubernetes.io/os: linux
service:
type: LoadBalancer
annotations: |
"service.beta.kubernetes.io/azure-load-balancer-internal": "false"
webhookCertManager:
nodeSelector: |
purpose: generic-workload
kubernetes.io/os: linux
Overview of the Issue
consul: 1.15.3 consul-k8s-control-plane: 1.1.2
When running the mesh-gateway in k8s with replicas > 1, the gateways fail to restart (i.e. after a pod failure) on the establishing side of an active peered connection. This looks to be caused by Envoy rejecting the additional endpoints, which causes the mesh-gateway to eventually terminate and k8s to attempt to restart it (a restart loop). The only way to restore service is to reduce the number of mesh-gateway replicas to 1 on both sides of the peered connection.
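One hedged way to see what Envoy actually received for the peering cluster is to dump its configuration from the admin endpoint after port-forwarding port 19000 as described above (the grep filter below is illustrative):
$ curl --silent localhost:19000/config_dump | grep -A 5 '"name": "server.'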
Reproduction Steps
Logs
Logs from mesh-gateway:
Logs from consul server:
Expected behavior
Should be able to run the mesh gateway component with more than 1 replica.
Environment details
AWS K8S 1.26
Additional Context
Created this in the consul issue tracker (https://github.com/hashicorp/consul/issues/17557) but will close that as this seems to be the better location for this issue.
Screenshot showing peering status on the establishing side (the dialer side doesn't include the server addresses in the UI).
I wonder if this has anything to do with: https://github.com/envoyproxy/envoy/issues/14848