`linkerd-destination` OOMKilled due to discovery spike in linkerd P2P multicluster, renders cluster inoperable

Sierra1011 commented 4 months ago

What is the issue?

As requested by Flynn on Slack.

Setup: running edge-24.3.2, 2 clusters, mirroring some services.

Cluster A shares some mirrored services with Cluster B.
Cluster B had some bad config rolled out to it, relating to the Kube 1.29/proxy native sidecar changes
Cluster B started creating thousands upon thousands (literally, like 10k+) of linkerd-proxy-injector pods, all in the same ReplicaSet, all with the same error (I forget the exact error but it was words to the effect of NonDefaultRestartPolicy, so it was clearly related to that change we'd made)
This massively spiked the linkerd-destination pods and resulted in being OOMKilled continuously, taking down the services in the cluster.
This also spikes the resources (and caused OOMKill) all of the linkerd-destination pods in cluster A
This causes a denial of service on all services in cluster A.

How can it be reproduced?

Take 2 clusters (A and B) that have pod-to-pod multicluster set up, with at least one service mirrored from A to B. Linkerd deployment will need reasonable resource limits to exhibit the OOMKill and DoS effect. On cluster A, scale a linkerd-injected deployment to something unreasonable, like 50,000 replicas. This should then cause cluster B to attempt discovery of the endpoints. This should cause a spike in resources of the Linkerd control plane in cluster B, especially linkerd-destination pods. If the linkerd-destination resource limits are exceeded, this will result in a failure of the control plane in cluster B, stopping all meshed traffic.

Logs, error output, etc

This is being written retrospectively, so I do not have an output of the destination pods; however they were also being OOMKilled continuously until the number of pods in cluster A reduced to normal levels.

output of `linkerd check -o short`

Again, this is historic and we have since upgraded from edge-24.3.2 to edge-24.5.1, but nothing else has changed in our setup.

linkerd-version
---------------
‼ cli is up-to-date
    is running version 24.3.2 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 24.5.1 but the latest edge version is 24.5.3
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-6cfb9689f6-7mj9t (edge-24.5.1)
    * linkerd-destination-6cfb9689f6-mnvnz (edge-24.5.1)
    * linkerd-destination-6cfb9689f6-n5w6l (edge-24.5.1)
    * linkerd-identity-85c5896467-7v82j (edge-24.5.1)
    * linkerd-identity-85c5896467-n6znn (edge-24.5.1)
    * linkerd-identity-85c5896467-r7qgd (edge-24.5.1)
    * linkerd-proxy-injector-589b5cc587-8pz5g (edge-24.5.1)
    * linkerd-proxy-injector-589b5cc587-8w96c (edge-24.5.1)
    * linkerd-proxy-injector-589b5cc587-bjh9l (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-6cfb9689f6-7mj9t running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
    * collector-6c98b7c975-w5lmd (edge-24.5.1)
    * jaeger-7f489d75f7-nqxzv (edge-24.5.1)
    * jaeger-injector-567d6756dc-s8lrx (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
    collector-6c98b7c975-w5lmd running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * metrics-api-548778dd4c-9z6tf (edge-24.5.1)
    * metrics-api-548778dd4c-hvn88 (edge-24.5.1)
    * metrics-api-548778dd4c-ltxbm (edge-24.5.1)
    * tap-5f846bb67b-bprgk (edge-24.5.1)
    * tap-5f846bb67b-cjngm (edge-24.5.1)
    * tap-5f846bb67b-qkmxl (edge-24.5.1)
    * tap-injector-58db76686f-jdb6b (edge-24.5.1)
    * tap-injector-58db76686f-kdwp4 (edge-24.5.1)
    * tap-injector-58db76686f-sqv5t (edge-24.5.1)
    * web-6f486c9d84-5gfqs (edge-24.5.1)
    * web-6f486c9d84-c6p9d (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-548778dd4c-9z6tf running edge-24.5.1 but cli running edge-24.3.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes v1.29.3
EKS cluster
Bottlerocket nodes
Cilium CNI in AWS VPC replacement mode

Possible solution

Very much spitballing here, but an option could include a try/fail where indexing destination endpoints stops and operates in a "service mode" if the discovery exceeds an amount of resource usage (this already sounds horribly like a JVM heap argument so take it with a pinch of salt)

Alternatively calculating a spike of pods based on typical discovered pod numbers and incrementing more slowly.

Sharding the destination service could also mitigate this, by breaking up the resources that each pod tries to index... but I'm not sure how reasonable that is as an approach, as the point of HA is that each pod holds all state.

Additional context

No response

Would you like to work on fixing this bug?

None

olix0r commented 4 months ago

This also spikes the resources (and caused OOMKill) all of the linkerd-destination pods in cluster A

@Sierra1011 Do you know which container in the pod is being OOMKilled? kubectl describe should include some relevant information. Or perhaps you have per-container logs or metrics from the incident somewhere?

olix0r commented 4 months ago

@Sierra1011 This sounds somewhat similar to another report we had recently that we believe is fixed by https://github.com/linkerd/linkerd2/pull/12598. Updating to the latest edge release should resolve that class of issue. Without logs or metrics from your incident, it will be hard for us to know for sure whether this is the same problem you observed.

Sierra1011 commented 4 months ago

This also spikes the resources (and caused OOMKill) all of the linkerd-destination pods in cluster A

@Sierra1011 Do you know which container in the pod is being OOMKilled? kubectl describe should include some relevant information. Or perhaps you have per-container logs or metrics from the incident somewhere?

The pod being killed was the linkerd-destination pod, and unfortunately I don't have metrics from this any more.

It's entirely possible that this was brought up internally before I could raise the issue here; in which case I'm happy to let it be closed :+1:

olix0r commented 4 months ago

The pod being killed was the linkerd-destination pod, and unfortunately I don't have metrics from this any more.

There are multiple containers in the destination pod... It sounds like you're assuming that the destination container was killed, but I suspect that it was the proxy container. If that's the case, the PR I referenced should help to improve the situation.

We are working on doing additional testing, but I'd recommend updating to the latest edge.

olix0r commented 4 months ago

On cluster A, scale a linkerd-injected deployment... This should cause a spike in resources of the Linkerd control plane in cluster B, especially linkerd-destination pods.

Thinking about this more, it is unlikely that this could be caused by proxy OOMKilling (for the peer cluster to be impacted). We plan to do some more investigation here.

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

linkerd / linkerd2