Closed: rlnrln closed this issue 1 year ago
There have been several changes merged recently that address destination controller memory leaks that could be caused by high Pod churn: #10013 and #10201. I'd encourage you to try the latest edge release; these fixes will also probably be included in a 2.12 patch release. I'm going to close this for now since there has been little activity, but please reopen if you still experience these issues on a more recent version of Linkerd.
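For anyone landing here, a sketch of trying the latest edge release as suggested above (the install-script URL and upgrade flow are per the standard Linkerd docs; verify against the current documentation before running):

```shell
# Install the latest edge-channel CLI via the official install script
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh
export PATH=$HOME/.linkerd2/bin:$PATH
linkerd version --client

# Render the upgraded control-plane manifests and apply them
linkerd upgrade | kubectl apply -f -

# Confirm the control plane is healthy afterwards
linkerd check
```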
What is the issue?
A little over a month ago, we ran over the memory limit for linkerd-destination, and all three pods in the deployment began crashlooping, with bad results for the cluster as a whole. We increased the memory limit to 500MiB, then, after another alert, to 750MiB, and yesterday to 1GiB.
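For reference, raising the destination container's limit can be done through Helm values along these lines (a sketch: the `destinationResources` value name is an assumption based on the linkerd2 chart, so check `helm show values` for your chart version; the numbers shown are the ones from this report):

```shell
# Sketch: raise the destination container's memory limit via Helm values.
# NOTE: the destinationResources key is an assumption based on the linkerd2
# chart layout; verify with `helm show values linkerd/linkerd2`.
cat > destination-resources.yaml <<'EOF'
destinationResources:
  memory:
    request: 512Mi
    limit: 1Gi
EOF

helm upgrade linkerd linkerd/linkerd2 \
  --namespace linkerd \
  --reuse-values \
  -f destination-resources.yaml
```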
Each line in the image is a separate instance of the `destination` container in the `linkerd-destination` deployment. Until October 26, we ran 6 instances; on Oct 6 we reduced it to 3, which increased the memory usage per pod slightly, which was expected.

On Nov 7 and Nov 10, something happened that triggered eviction for two out of the three pods. In both cases this used up a lot of memory in the `destination` container, which wasn't reclaimed even when the other two pods came back up.

I suspect the trigger in our case is related to Cluster Autoscaling. Basically, one pod is (was) co-located on a node that "never" goes down, so it's been running for >30 days. The two other pods get evicted occasionally, and while they're down, memory on the remaining pod increases by a lot and never goes down.
Technical details:
Some workarounds we've considered but not yet implemented:
I've also found it hard to find information about working around the problem. I started looking for recommendations on cpu/memory resource allocations and found... none. Not even in the Linkerd Production Runbook.
I then started looking for information on what drives memory usage in linkerd-destination, which was also rather limited. I have no idea if it scales up with:
I also have no idea why memory usage should increase when the number of `linkerd-destination` replicas is reduced; my expectation is that all pods hold the same data, so that any one of them could answer any query, but that obviously isn't the case.

How can it be reproduced?
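A sketch of how one might try to reproduce the eviction pattern on a test cluster (the churn deployment and timings are illustrative; the pod label selector is the one Linkerd puts on control-plane pods):

```shell
# Reproduction sketch: evict 2 of 3 destination pods while churning workloads.
# Assumes a test cluster with Linkerd installed and 3 destination replicas.

# 1. Generate Pod churn so the destination controller has state to track
kubectl create deployment churn --image=nginx --replicas=20
for i in $(seq 1 30); do
  kubectl rollout restart deployment/churn
  sleep 30
done &

# 2. Delete two of the three destination pods, simulating node scale-down
kubectl -n linkerd get pods \
  -l linkerd.io/control-plane-component=destination \
  -o name | head -n 2 | xargs kubectl -n linkerd delete

# 3. Watch memory on the surviving pod; in this report it grew while the
#    others were down and was never reclaimed after they came back
kubectl -n linkerd top pod \
  -l linkerd.io/control-plane-component=destination --containers
```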
Logs, error output, etc
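To see where the memory is actually going, a Go heap profile from the destination container could help (a sketch: the admin port and the availability of the pprof endpoints are assumptions — check the container's admin-addr flag, and note that some versions gate pprof behind an `--enable-pprof` flag):

```shell
# Sketch: pull a heap profile from the destination controller's admin server.
# Port 9996 is an assumption; check the destination container spec.
kubectl -n linkerd port-forward deploy/linkerd-destination 9996:9996 &
PF_PID=$!
sleep 2

curl -s http://localhost:9996/debug/pprof/heap -o heap.pb.gz

# Inspect the top allocation sites
go tool pprof -top heap.pb.gz

kill "$PF_PID"
```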
Output of `linkerd check -o short`:
Note: linkerd-multicluster is running fine, but it's not in the default namespace and I don't remember the command line flag for adding it off the top of my head.
```
Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.11.4 but the latest stable version is 2.12.2
    see https://linkerd.io/2.11/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.11.4 but the latest stable version is 2.12.2
    see https://linkerd.io/2.11/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:

Linkerd extensions checks
=========================

linkerd-multicluster
--------------------
× remote cluster access credentials are valid

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:

Status check results are ×
```
Environment
Possible solution
Some workarounds are suggested above.
Other than that, I'm looking for more predictable behaviour on pod restart, and a way to predict memory growth ahead of time.
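As a stopgap until the leak itself is fixed, one blunt workaround is a scheduled restart so no single replica accumulates state for weeks (a sketch; run it from cron or a CronJob with suitable RBAC):

```shell
# Workaround sketch: periodically restart linkerd-destination so memory
# never accumulates indefinitely on a long-lived replica.
kubectl -n linkerd rollout restart deployment/linkerd-destination
kubectl -n linkerd rollout status deployment/linkerd-destination
```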
Additional context
No response
Would you like to work on fixing this bug?
No response