hashicorp / consul-k8s

First-class support for Consul Service Mesh on Kubernetes
https://www.consul.io/docs/k8s
Mozilla Public License 2.0

Consul-Sync pointing to same Consul deregisters all services on K8S-Sync Node #860

Closed · webmutation closed this issue 2 years ago

webmutation commented 2 years ago

Overview of the Issue

If two or more instances of consul-sync are running and pointing at the same external Consul cluster, all of the services get deregistered and the sync falls into a loop of deregistering and re-registering services.

Reproduction Steps

  1. Create two EKS clusters
  2. Deploy the Consul Helm chart with the k8s-sync service:

     ```yaml
     # Requires setting the external Consul catalog URL parameter manually in the deployment...
     global:
       enabled: false

     client:
       enabled: false

     externalServers:
       enabled: true
       hosts:

     syncCatalog:
       enabled: true
       k8sDenyNamespaces: ["kube-system", "kube-public"]
     ```

  3. kubectl edit deployment consul-consul-sync-catalog
  4. Change the value to point to the external Consul cluster (a possible non-interactive equivalent is sketched below)
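
A hypothetical one-liner for steps 3-4: the Consul API client used by sync-catalog reads the standard `CONSUL_HTTP_ADDR` environment variable, and the address below is a placeholder, not taken from this issue:

```sh
# Point sync-catalog at the external Consul cluster via the standard
# Consul client environment variable instead of hand-editing the spec.
# consul.example.com is a placeholder; substitute the external server.
kubectl set env deployment/consul-consul-sync-catalog \
  CONSUL_HTTP_ADDR=http://consul.example.com:8500
```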

Then the deregistering of the services starts to occur: some services show up, then all services show up, then all services disappear, and this loop goes on forever.

Expected behavior

Services should not disappear; additional clusters connecting to Consul should simply have their services registered. Services should not be deregistered. This is probably because the special k8s-sync node is being deleted and recreated...

webmutation commented 2 years ago

Tried to work around it by changing the nodeName on the second cluster, but the behaviour is still the same. Also, since there is no health check on the node, the nodes become orphaned (k8s-sync-A).
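
For context, the nodeName change can be expressed in the chart values; a minimal sketch, assuming the chart version in use exposes `syncCatalog.consulNodeName` (whose default is `k8s-sync`):

```yaml
syncCatalog:
  enabled: true
  # Register this cluster's services under a distinct catalog node
  # instead of the shared default "k8s-sync".
  consulNodeName: k8s-sync-A
```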


bondido commented 2 years ago

@webmutation - the working way to handle your scenario is to differentiate the services from each Kubernetes cluster by a tag in the Consul catalog. It's described here - https://github.com/hashicorp/consul-k8s/issues/579 - and confirmed there as the expected method.
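
For anyone landing here, a minimal sketch of that tag-based approach using the chart's `syncCatalog.consulK8STag` value; the tag names are examples, and each cluster gets its own unique tag:

```yaml
# Cluster A values (cluster B would use e.g. k8s-cluster-b)
syncCatalog:
  enabled: true
  # Each sync-catalog instance only manages services carrying its own tag,
  # so it no longer deregisters services synced from other clusters.
  consulK8STag: k8s-cluster-a
```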

webmutation commented 2 years ago

Thank you @bondido, tested and indeed it is working! Services are now staying registered.

However, I wonder if there is a setting to remove orphaned nodes after a timeout period; in other words, how can we avoid having to remove the nodes manually? Is this possible? I was not able to find anything in the charts.

thisisnotashwin commented 2 years ago

Hey @webmutation !! Consul does have a default setting for removing orphaned nodes, which is currently in the range of days. We do not expose this via the Helm chart, and I don't think we intend to do so at the moment, unfortunately. We don't see this as a scenario users are expected to run into in a stable deployment.

webmutation commented 2 years ago

Thanks for the message @thisisnotashwin, it is clear now.

In our case, we have on-demand clusters that live only for a few hours or days, for UAT, training events, or integration testing (specific versions of components being deployed)... I think we will have to write a script to remove the orphaned nodes manually once the cluster is destroyed (something like the sketch below). It should not be a huge issue to handle. Thanks.
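
In case it helps someone, a minimal cleanup sketch against Consul's `/v1/catalog/deregister` endpoint; the node name and address are assumptions matching the discussion above:

```sh
#!/usr/bin/env sh
# Deregister an orphaned sync node after its cluster is destroyed.
# Deregistering a node also removes every service registered on it.
CONSUL_HTTP_ADDR="${CONSUL_HTTP_ADDR:-http://consul.example.com:8500}"
NODE_NAME="${1:-k8s-sync-A}"

curl -sS -X PUT "${CONSUL_HTTP_ADDR}/v1/catalog/deregister" \
  -d "{\"Node\": \"${NODE_NAME}\"}"
```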