@mr-miles How best can we contact you? This issue is very interesting and we would love to be able to touch base with you and get a little bit more detail from you! Thanks.
@nrichu-hcp I'll mail you directly now
Hi everyone, it seems we've hit the same issue. May I know if you have any clues on this? Thanks :) Consul 1.14.3 servers with 14 nodes, in an AWS EKS 1.24 cluster in a single region
I chatted with @nrichu-hcp last week about it, hoping that HashiCorp can give me some more pointers on where to look for the smoking gun. When it happens to a terminating gateway it is a killer for us.
We had another occurrence over the weekend. Looking at the logs and comparing them with a normal cycle due to a deployment, I noted:
Currently I'm suspecting there's some event-ordering dependence in the controller, but I couldn't work out exactly what it is subscribed to in order to investigate further.
We also use Karpenter and that seems to be implicated somehow (it certainly tries to consolidate instances in the middle of the night, which is also when this occurs quite often). Maybe it is shutting the node down rather than draining the pods, so not all the events make it out ... or something.
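One thing I may try to test that theory is marking the gateway pods as non-evictable so Karpenter's consolidation can't remove the node underneath them. Rough sketch only: the label selector is a guess, and the annotation name depends on the Karpenter version (the alpha APIs use karpenter.sh/do-not-evict, v1beta1+ renamed it to karpenter.sh/do-not-disrupt):

```sh
# Quick test only: annotate the running gateway pods directly so Karpenter
# will not consolidate their node away. For anything durable this would go
# into the gateway's pod annotations in the Helm values instead.
# The label selector is a guess - check `kubectl get pods --show-labels`.
kubectl -n consul annotate pods -l component=ingress-gateway \
  "karpenter.sh/do-not-evict=true"
```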
@vorobiovv can you give us more context on how this issue starts?
@mr-miles unfortunately, for us to be able to help you further we need you to be able to reproduce the issue; that way we can take a crack at it ourselves and see where that leads us.
I've had some success reproducing this at last. The steps with the greatest chance of success seem to be:
I'm tailing events from the cluster and see things like this:
{"level":"info","ts":1674767502.7706947,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Error updating Endpoint Slices for Service consul/consul-ingress-gateway: skipping Pod consul-ingress-gateway-787c754bf9-xbrbz for Service consul/consul-ingress-gateway: Node ip-xxx Not Found","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d57f3d267","involvedObject":{"name":"Service/consul-ingress-gateway"},"reason":"FailedToUpdateEndpointSlices","source.component":"endpoint-slice-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}} {"level":"info","ts":1674767502.770735,"logger":"event","caller":"kube-event-tail/main.go:98","msg":"Failed to update endpoint consul/consul-ingress-gateway: Operation cannot be fulfilled on endpoints \"consul-ingress-gateway\": the object has been modified; please apply your changes to the latest version and try again","event":{"namespace":"consul","name":"consul-ingress-gateway.173df76d58f02a8d","involvedObject":{"name":"Endpoints/consul-ingress-gateway"},"reason":"FailedToUpdateEndpoint","source.component":"endpoint-controller","firstTimestamp":1674766685,"lastTimestamp":1674766685,"count":1,"type":"Warning"}}
In one occurrence, I saw only the EndpointSlice error. In a subsequent attempt I got these two errors.
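For anyone else trying to catch this, the same warnings can also be watched with kubectl directly rather than a separate event tailer (namespace assumed to be consul, as in the logs above):

```sh
# Watch warning events in the consul namespace; the reasons seen above are
# FailedToUpdateEndpointSlices and FailedToUpdateEndpoint.
kubectl -n consul get events --watch --field-selector type=Warning
```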
I also observe that the endpoints controller doesn't receive the event about the service to deregister, so it doesn't seem like the problem is in the service catalog.
Last of all, I picked through the Helm chart and noticed the connect-injector pod doesn't have any anti-affinity or topologySpreadConstraints by default. Does that make it possible that, in some circumstances, there is nothing running to pick the events up?
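As an experiment, a spread constraint could be added so that a single node consolidation can't take out every injector replica at once. This is only a sketch: the deployment name assumes a Helm release called consul, the label selector is a guess, and a durable change would go through the Helm values rather than a patch (which Helm will overwrite on the next upgrade):

```sh
# Spread connect-injector replicas across nodes. Patching the deployment
# directly is only for testing - Helm reverts it on the next upgrade.
kubectl -n consul patch deployment consul-connect-injector -p '
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: consul
              component: connect-injector
'
```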
How long a backlog of events will the controller receive when it becomes leader, if the original leader is brutally killed?
@nrichu-hcp - have you tried to reproduce the issue? Do you need any more information? There are a few tickets across the various repositories relating to orphans in the service catalog that all implicate node removal.
The following PR, which may address your issues, is now merged: https://github.com/hashicorp/consul-k8s/pull/2571. This should be released in consul-k8s 1.2.x, 1.1.x, and 1.0.x by mid-August in our next set of patch releases. I'll go ahead and close this, as it is also a duplicate of https://github.com/hashicorp/consul-k8s/issues/2491 and https://github.com/hashicorp/consul-k8s/issues/1817
Overview of the Issue
We are running a Consul service mesh on Kubernetes (EKS) with external servers and ingress/terminating gateways. Most of the time this works great!
I've noticed over time that the pods in the mesh are not always deregistering themselves from Consul cleanly when they are shut down. I am seeing:
The cluster is in our dev environment, and I thought it was down to the level of churn as we were iterating on a few things. However, I am still seeing it every few days even with limited deployments. I suspect a race condition, maybe due to a leadership change or some timeout. I also suspect it may be related to Karpenter rearranging the nodes periodically, but I don't have any evidence for that specifically.
It has quite an impact, since Envoy still attempts to route to the now-non-existent pods in a round-robin way, so every other request gets a 503 response, which makes things very broken.
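For reference, one way to spot the stragglers is to compare what Consul has registered with the pods that actually exist. The service name, label selector, and address/token environment variables below are placeholders:

```sh
# List the instances Consul still has registered for a service...
curl -s "$CONSUL_HTTP_ADDR/v1/health/service/my-upstream" \
  -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  | jq -r '.[].Service | "\(.ID)\t\(.Address)"'

# ...and compare against the pods that actually exist for it.
kubectl get pods -o wide -l app=my-upstream
```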
Questions
I have looked through the server and pod logs but haven't found anything useful. Is there a particular namespace or phrase that would be useful to search for in the logs?
What are the mechanics of deregistration when using the consul dataplane in a pod? How does it guard against stragglers when there's an unclean shutdown of the pod/node?
Are there any other areas where you think deregistration might not complete? This would help guide some more specific testing.
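In the meantime, is manually deregistering the orphaned instances via the catalog API a reasonable stopgap? Something like the sketch below, where the node name and service ID are placeholders taken from the health endpoint output:

```sh
# One-off cleanup of an orphaned instance straight from the catalog.
# Node and ServiceID come from the /v1/health/service/<name> output.
curl -s -X PUT "$CONSUL_HTTP_ADDR/v1/catalog/deregister" \
  -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
  -d '{"Node": "ip-10-0-0-12.ec2.internal", "ServiceID": "my-upstream-787c754bf9-xbrbz-my-upstream"}'
```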
Consul info for both Client and Server
Consul 1.14.3 servers, 5 nodes, in an AWS EKS 1.21 cluster in the same region.
Consul clients installed via Helm chart 1.0.2.