hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Orphan sidecar proxies affect service mesh #15908

Closed mr-miles closed 1 year ago

mr-miles commented 1 year ago

Overview of the Issue

We are running a consul service mesh on k8s (eks) with external servers and ingress/terminating gateways. Most of the time this is working great!

I've noticed over time that pods in the mesh are not always unregistering themselves from Consul cleanly when they are shut down. I am seeing:

The cluster is in our dev environment, and I initially thought it was down to the level of churn while we were iterating on a few things. However, I am still seeing it every few days even with limited deployments. I suspect a race condition, perhaps due to a leadership change or some timeout. I also have a suspicion that it may be related to Karpenter rearranging the nodes periodically, but I don't have any evidence for that specifically.

It has a big impact, since Envoy keeps routing to the now-non-existent pods in a round-robin way, so every other request gets a 503 response, which makes things very broken.
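To make the orphans concrete: the stale instances remain visible in the Consul catalog after the pods are gone. Below is a minimal sketch that lists them with the official Consul Go API client; the service name "static-server" is hypothetical, and api.DefaultConfig() picks up CONSUL_HTTP_ADDR / CONSUL_HTTP_TOKEN from the environment. Any entry whose address no longer matches a running pod is one of the orphans described above.

```go
// list_instances.go: list every catalog instance of a service so stale
// entries pointing at pods that no longer exist can be spotted by hand.
// A minimal sketch only; "static-server" is a hypothetical service name.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// The catalog keeps returning instances that were never deregistered,
	// which is why Envoy still round-robins traffic to them.
	instances, _, err := client.Catalog().Service("static-server", "", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, inst := range instances {
		fmt.Printf("node=%s id=%s addr=%s:%d\n",
			inst.Node, inst.ServiceID, inst.ServiceAddress, inst.ServicePort)
	}
}
```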

Questions

Consul info for both Client and Server

Consul 1.14.3 servers, 5 nodes, in an AWS EKS 1.21 cluster in the same region. Consul clients installed via Helm chart 1.0.2.

nrichu-hcp commented 1 year ago

@mr-miles How best can we contact you? This issue is very interesting and we would love to touch base with you to get a bit more detail. Thanks.

mr-miles commented 1 year ago

@nrichu-hcp I'll mail you directly now

vorobiovv commented 1 year ago

Hi everyone, it seems we've hit the same issue. May I know if you have any clues on this? Thanks :) Consul 1.14.3 servers with 14 nodes in an AWS EKS 1.24 cluster in a single region.

mr-miles commented 1 year ago

I chatted with @nrichu-hcp last week about it, hoping that HashiCorp can give me some more pointers on where to look for the smoking gun. When it happens to a terminating gateway it is a killer for us.

We had an instance over the weekend. Looking at the logs and comparing with a cycle due to a deployment, I noted:

Currently I suspect there is some event-ordering dependence in the controller, but I couldn't work out exactly what it is subscribed to in order to investigate further.

Also, we use Karpenter and that seems to be implicated somehow (it certainly tries to consolidate instances in the middle of the night, and that also seems to be when this occurs quite often). Maybe it is shutting the node down rather than the pods, and not all of the events are making it out ... or something.

nrichu-hcp commented 1 year ago

@vorobiovv can you give us more context on how this issue starts?

@mr-miles unfortunately, for us to be able to help you further, we do need you to reproduce the issue; that way we can take a crack at it ourselves and see where that leads us.

mr-miles commented 1 year ago

I've had some success reproducing this at last. The steps with the greatest chance of success seem to be:

I'm tailing events from the cluster and see things like this:

{"level":"info","ts":1674767502.7706947,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Error updating Endpoint Slices for Service consul/consul-ingress-gateway: skipping Pod consul-ingress-gateway-787c754bf9-xbrbz for Service consul/consul-ingress-gateway: Node ip-xxx Not Found","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d57f3d267","involvedObject":{"name":"Service/consul-ingress-gateway"},"reason":"FailedToUpdateEndpointSlices","source.component":"endpoint-slice-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}} {"level":"info","ts":1674767502.770735,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Failed to update endpoint consul/consul-ingress-gateway: Operation cannot be fulfilled on endpoints \"consul-ingress-gateway\": the object has been modified; please apply your changes to the latest version and try again","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d58f02a8d","involvedObject":{"name":"Endpoints/consul-ingress-gateway"},"reason":"FailedToUpdateEndpoint","source.component":"endpoint-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}}

In one occurrence, I saw only the EndpointSlice error. In a subsequent attempt I got these two errors.
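As a side note, the same warning events can be pulled programmatically rather than tailed. A rough sketch with client-go follows, assuming in-cluster credentials; the "consul" namespace and the FailedToUpdateEndpointSlices reason string come from the log excerpt above, and the field selector can be adjusted to taste.

```go
// list_events.go: list recent Warning events matching the reason seen in the
// log excerpt above. A rough sketch only, assuming in-cluster credentials.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Events support field selectors on type and reason, so this pulls only
	// the FailedToUpdateEndpointSlices warnings from the consul namespace.
	events, err := clientset.CoreV1().Events("consul").List(context.Background(),
		metav1.ListOptions{FieldSelector: "type=Warning,reason=FailedToUpdateEndpointSlices"})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s: %s\n", e.LastTimestamp, e.InvolvedObject.Name, e.Message)
	}
}
```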

I also observe that the endpoint-controller doesn't receive the event about the service to deregister - so it doesn't seem like the problem is in the service catalog.

Last of all, I picked through the Helm chart and noticed the connect-injector pod doesn't have any anti-affinity or spreadConstraints by default. Does that make it possible, in some circumstances, for there to be nothing running to pick up the events?

How long a backlog of events will the controller receive when it becomes leader, if the original leader is brutally killed?

mr-miles commented 1 year ago

@nrichu-hcp - have you tried to reproduce the issue? Do you need any more information? There are a few tickets across the various repositories relating to orphans in the service catalog that all implicate node removal.
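For reference, a stale entry like this can be removed by hand through the catalog API while the root cause is investigated. Below is a minimal sketch using the Consul Go API client; the node name and service ID are hypothetical placeholders standing in for whatever orphan instance you actually find in the catalog.

```go
// deregister_orphan.go: manually remove a stale service instance from the
// Consul catalog. A minimal sketch only; the Node and ServiceID values are
// hypothetical placeholders, not taken from this issue.
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Deregistering by Node + ServiceID removes just that service instance;
	// omitting ServiceID would deregister the entire node.
	_, err = client.Catalog().Deregister(&api.CatalogDeregistration{
		Node:      "ip-10-0-0-1.ec2.internal",
		ServiceID: "consul-ingress-gateway-787c754bf9-xbrbz-ingress-gateway",
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("deregistered stale catalog entry")
}
```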

david-yu commented 1 year ago

The following PR, which may address your issues, is now merged: https://github.com/hashicorp/consul-k8s/pull/2571. This should be released in consul-k8s 1.2.x, 1.1.x, and 1.0.x in the mid-August timeframe with our next set of patch releases. I will go ahead and close this, as it is also a duplicate of https://github.com/hashicorp/consul-k8s/issues/2491 and https://github.com/hashicorp/consul-k8s/issues/1817.