Closed Sierra1011 closed 1 month ago
This also spikes the resources of (and caused OOMKills in) all of the linkerd-destination pods in cluster A
@Sierra1011 Do you know which container in the pod is being OOMKilled? kubectl describe should include some relevant information. Or perhaps you have per-container logs or metrics from the incident somewhere?
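For anyone retracing this, the OOMKilled container can be read off the pod's container statuses. A minimal sketch, assuming a hypothetical pod name and the default linkerd control-plane namespace:

```shell
# Sketch with a placeholder pod name; "linkerd" is the default
# control-plane namespace.
NS=linkerd
POD=linkerd-destination-6c9f8d7b4-abcde   # hypothetical name

# Per-container last terminated reason; "OOMKilled" marks the culprit:
kubectl -n "$NS" get pod "$POD" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Fuller picture, including restart counts and recent events:
kubectl -n "$NS" describe pod "$POD"
```

The destination pod typically runs several containers (e.g. destination, policy, sp-validator, and the injected linkerd-proxy), so the per-container reason distinguishes which one is actually being killed.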
@Sierra1011 This sounds somewhat similar to another report we had recently that we believe is fixed by https://github.com/linkerd/linkerd2/pull/12598. Updating to the latest edge release should resolve that class of issue. Without logs or metrics from your incident, it will be hard for us to know for sure whether this is the same problem you observed.
The pod being killed was the linkerd-destination pod, and unfortunately I don't have metrics from this any more.
It's entirely possible that this was brought up internally before I could raise the issue here; in which case I'm happy to let it be closed :+1:
The pod being killed was the linkerd-destination pod, and unfortunately I don't have metrics from this any more.
There are multiple containers in the destination pod... It sounds like you're assuming that the destination container was killed, but I suspect that it was the proxy container. If that's the case, the PR I referenced should help to improve the situation.
We are working on doing additional testing, but I'd recommend updating to the latest edge.
On cluster A, scale a linkerd-injected deployment... This should cause a spike in resources of the Linkerd control plane in cluster B, especially linkerd-destination pods.
Thinking about this more, proxy OOMKills alone seem unlikely to explain this, since the peer cluster was also impacted. We plan to do some more investigation here.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
As requested by Flynn on Slack.
Setup: running edge-24.3.2, 2 clusters, mirroring some services.
How can it be reproduced?
Take 2 clusters (A and B) that have pod-to-pod multicluster set up, with at least one service mirrored from A to B. The Linkerd deployment will need reasonable resource limits to exhibit the OOMKill and DoS effect. On cluster A, scale a linkerd-injected deployment to something unreasonable, like 50,000 replicas. This should then cause cluster B to attempt discovery of the endpoints, spiking the resource usage of the Linkerd control plane in cluster B, especially the linkerd-destination pods. If the linkerd-destination resource limits are exceeded, this results in a failure of the control plane in cluster B, stopping all meshed traffic.
Logs, error output, etc
This is being written retrospectively, so I do not have output from the destination pods; however, they were also being OOMKilled continuously until the number of pods in cluster A returned to normal levels.
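Should it recur, a capture along these lines would give the per-container picture the thread asks about. This is a sketch: the label selector is the one Linkerd sets on its control-plane pods, the pod name is a placeholder, and the container is assumed to be named destination:

```shell
NS=linkerd
SEL=linkerd.io/control-plane-component=destination

# Per-container resource usage (requires metrics-server):
kubectl -n "$NS" top pod -l "$SEL" --containers

# Restart counts at a glance:
kubectl -n "$NS" get pods -l "$SEL"

# Logs from the previous (killed) instance of the destination container;
# the pod name below is hypothetical:
kubectl -n "$NS" logs linkerd-destination-6c9f8d7b4-abcde -c destination --previous
```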
output of linkerd check -o short
Again, this is historic and we have since upgraded from edge-24.3.2 to edge-24.5.1, but nothing else has changed in our setup.
Environment
Possible solution
Very much spitballing here, but one option could be a try/fail mechanism where indexing of destination endpoints stops and falls back to a "service mode" if discovery exceeds some amount of resource usage (this already sounds horribly like a JVM heap argument, so take it with a pinch of salt).
Alternatively, the controller could detect a spike relative to the typical number of discovered pods and index the new endpoints more gradually.
Sharding the destination service could also mitigate this, by breaking up the resources that each pod tries to index... but I'm not sure how reasonable that is as an approach, as the point of HA is that each pod holds all state.
Additional context
No response
Would you like to work on fixing this bug?
None