@mr-miles How best can we contact you? This issue is very interesting and we would love to be able to touch base with you and get a little bit more detail from you! Thanks.
@nrichu-hcp I'll mail you directly now
Hi everyone, it seems we've hit the same issue. May I know if you have any clues on this? Thanks :) Consul 1.14.3 servers with 14 nodes, in an AWS EKS 1.24 cluster in a single region
I chatted with @nrichu-hcp last week about it, hoping that HashiCorp can give me some more pointers on where to look for the smoking gun. When it happens to a terminating gateway it is a killer for us.
We had another occurrence over the weekend. Looking at the logs and comparing them with a normal cycle due to a deployment, I noted:
Currently I'm suspecting there's some event-ordering dependence in the controller, but I couldn't work out exactly what it is subscribed to in order to investigate further.
We also use Karpenter and that seems to be implicated somehow (it certainly tries to consolidate instances in the middle of the night, which is also when this occurs quite often). Maybe it is shutting the node down rather than draining the pods, so not all the events make it out ... or something.
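One thing I may try to test that theory is marking the gateway pods as non-evictable so Karpenter's consolidation can't remove the node underneath them. Rough sketch only: the label selector is a guess, and the annotation name depends on the Karpenter version (the alpha APIs use karpenter.sh/do-not-evict, v1beta1+ renamed it to karpenter.sh/do-not-disrupt):

```sh
# Quick test only: annotate the running gateway pods directly so Karpenter
# will not consolidate their node away. For anything durable this would go
# into the gateway's pod annotations in the Helm values instead.
# The label selector is a guess - check `kubectl get pods --show-labels`.
kubectl -n consul annotate pods -l component=ingress-gateway \
  "karpenter.sh/do-not-evict=true"
```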
@vorobiovv can you give us more context on how this issue starts?
@mr-miles unfortunately, for us to be able to help you further we need you to be able to reproduce the issue; that way we can take a crack at it ourselves and see where that leads us.
I've had some success reproducing this at last. The steps with the greatest chance of success seem to be:
I'm tailing events from the cluster and see things like this:
{"level":"info","ts":1674767502.7706947,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Error updating Endpoint Slices for Service consul/consul-ingress-gateway: skipping Pod consul-ingress-gateway-787c754bf9-xbrbz for Service consul/consul-ingress-gateway: Node ip-xxx Not Found","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d57f3d267","involvedObject":{"name":"Service/consul-ingress-gateway"},"reason":"FailedToUpdateEndpointSlices","source.component":"endpoint-slice-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}} {"level":"info","ts":1674767502.770735,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Failed to update endpoint consul/consul-ingress-gateway: Operation cannot be fulfilled on endpoints \"consul-ingress-gateway\": the object has been modified; please apply your changes to the latest version and try again","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d58f02a8d","involvedObject":{"name":"Endpoints/consul-ingress-gateway"},"reason":"FailedToUpdateEndpoint","source.component":"endpoint-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}}
In one occurrence, I saw only the EndpointSlice error. In a subsequent attempt I got these two errors.
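For anyone else trying to catch this, the same warnings can also be watched with kubectl directly rather than a separate event tailer (namespace assumed to be consul, as in the logs above):

```sh
# Watch warning events in the consul namespace; the reasons seen above are
# FailedToUpdateEndpointSlices and FailedToUpdateEndpoint.
kubectl -n consul get events --watch --field-selector type=Warning
```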
I also observe that the endpoints controller doesn't receive the event about the service to deregister, so it doesn't seem like the problem is in the service catalog.
Last of all, I picked through the Helm chart and noticed the connect-injector pod doesn't have any anti-affinity or topologySpreadConstraints by default. Does that make it possible that, in some circumstances, there is nothing running to pick the events up?
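As an experiment, a spread constraint could be added so that a single node consolidation can't take out every injector replica at once. This is only a sketch: the deployment name assumes a Helm release called consul, the label selector is a guess, and a durable change would go through the Helm values rather than a patch (which Helm will overwrite on the next upgrade):

```sh
# Spread connect-injector replicas across nodes. Patching the deployment
# directly is only for testing - Helm reverts it on the next upgrade.
kubectl -n consul patch deployment consul-connect-injector -p '
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: consul
              component: connect-injector
'
```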
How long a backlog of events will the controller receive when it becomes leader, if the original leader is brutally killed?
@nrichu-hcp - have you tried to reproduce the issue? Do you need any more information? There are a few tickets across the various repositories relating to orphans in the service catalog that all implicate node removal.
The following PR, which may address your issues, is now merged: https://github.com/hashicorp/consul-k8s/pull/2571. This should be released in consul-k8s 1.2.x, 1.1.x, and 1.0.x by mid-August in our next set of patch releases. I'll go ahead and close this, as it is also a duplicate of https://github.com/hashicorp/consul-k8s/issues/2491 and https://github.com/hashicorp/consul-k8s/issues/1817
Overview of the Issue
We are running a Consul service mesh on Kubernetes (EKS) with external servers and ingress/terminating gateways. Most of the time this works great!
I've noticed over time that the pods in the mesh are not always deregistering themselves from Consul cleanly when they are shut down. I am seeing:
The cluster is in our dev environment, and I thought it was down to the level of churn as we were iterating on a few things. However, I am still seeing it every few days even with limited deployments. I suspect a race condition, maybe due to a leadership change or some timeout. I also suspect it may be related to Karpenter rearranging the nodes periodically, but I don't have any evidence for that specifically.
It has quite an impact, since Envoy still attempts to route to the now-non-existent pods in a round-robin way, so every other request gets a 503 response, which makes things very broken.
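For reference, one way to spot the stragglers is to compare what Consul has registered with the pods that actually exist. The service name, label selector, and address/token environment variables below are placeholders:

```sh
# List the instances Consul still has registered for a service...
curl -s "$CONSUL_HTTP_ADDR/v1/health/service/my-upstream" \
  -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  | jq -r '.[].Service | "\(.ID)\t\(.Address)"'

# ...and compare against the pods that actually exist for it.
kubectl get pods -o wide -l app=my-upstream
```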
Questions
I have looked through the server and pod logs but haven't found anything useful. Is there a particular namespace or phrase that would be useful to search for in the logs?
What are the mechanics of deregistration when using the consul dataplane in a pod? How does it guard against stragglers when there's an unclean shutdown of the pod/node?
Are there any other areas where you think deregistration might not complete? This would help guide some more specific testing.
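In the meantime, is manually deregistering the orphaned instances via the catalog API a reasonable stopgap? Something like the sketch below, where the node name and service ID are placeholders taken from the health endpoint output:

```sh
# One-off cleanup of an orphaned instance straight from the catalog.
# Node and ServiceID come from the /v1/health/service/<name> output.
curl -s -X PUT "$CONSUL_HTTP_ADDR/v1/catalog/deregister" \
  -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  -d '{"Node": "ip-10-0-0-12.ec2.internal", "ServiceID": "my-upstream-787c754bf9-xbrbz-my-upstream"}'
```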
Consul info for both Client and Server
Consul 1.14.3 servers, 5 nodes, in an AWS EKS 1.21 cluster in the same region.
Consul clients installed via Helm chart 1.0.2.