linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

Linkerd sidecars maintain higher memory usage after a traffic burst on ARM #11430

Open · michel-gleeson opened this issue 1 year ago

michel-gleeson commented 1 year ago

What is the issue?

Looking at my Datadog dashboards, I see that the Linkerd proxies hold onto a sustained higher level of memory after a burst of traffic hits them.

How can it be reproduced?

In our infrastructure, whenever traffic sits at normal levels and then sees a sudden 100x burst, we also see higher memory use from the sidecar proxy that persists after the burst. I'd be curious to see whether this is reproducible in other environments.
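For anyone who wants to try reproducing this outside our infrastructure, here's a rough sketch of the traffic pattern (the nginx backend and the hey load generator are stand-ins I picked for illustration, not our actual stack; the load needs to run from inside the cluster):

    # Deploy and mesh a throwaway backend
    kubectl create deployment web --image=nginx --port=80
    kubectl get deploy web -o yaml | linkerd inject - | kubectl apply -f -
    kubectl expose deployment web --port=80

    # Baseline load, then a sudden ~100x burst
    # (hey: -z duration, -c concurrent workers, -q rate limit per worker)
    hey -z 10m -c 1 -q 10 http://web.default.svc.cluster.local &
    sleep 300
    hey -z 2m -c 100 -q 10 http://web.default.svc.cluster.local

    # Watch the proxy container's memory before, during, and well after the burst
    watch kubectl top pod -l app=web --containers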

Logs, error output, etc

[screenshots: Datadog graphs of Linkerd proxy memory usage]

output of linkerd check -o short

➜  ~ linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-09-29T20:27:08Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-09-28T20:17:03Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-09-28T20:17:05Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-09-28T20:17:03Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.14.0 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.14.0 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
    * linkerd-destination-666975cb44-2hhrr (stable-2.14.0)
    * linkerd-destination-666975cb44-gdjnx (stable-2.14.0)
    * linkerd-destination-666975cb44-xp2qf (stable-2.14.0)
    * linkerd-identity-6c4767f949-c64db (stable-2.14.0)
    * linkerd-identity-6c4767f949-jz8j2 (stable-2.14.0)
    * linkerd-identity-6c4767f949-r4bc9 (stable-2.14.0)
    * linkerd-proxy-injector-5d4688c558-4qwmb (stable-2.14.0)
    * linkerd-proxy-injector-5d4688c558-fll4r (stable-2.14.0)
    * linkerd-proxy-injector-5d4688c558-q285x (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2023-09-28T20:17:06Z
    see https://linkerd.io/2.14/checks/#l5d-tap-cert-not-expiring-soon for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
    * metrics-api-65c578c57f-bc258 (stable-2.14.0)
    * tap-7864997c5d-gwcrf (stable-2.14.0)
    * tap-injector-655957d5c6-sqq6r (stable-2.14.0)
    * web-7b6ddc8cc-r4xjf (stable-2.14.0)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

Possible solution

No response

Additional context

The application this is a sidecar to is our API gateway, which is built on Koa.

Would you like to work on fixing this bug?

None

hawkw commented 1 year ago

Is the "Linkerd Memory Usage" graph in the screenshot you've attached showing the total virtual memory of the Linkerd proxy process, or its resident set size (RSS)?
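If it helps, a couple of quick ways to check (a sketch; the pod and namespace names are placeholders, and the PromQL assumes Prometheus is scraping cAdvisor):

    # Point-in-time working set per container
    kubectl top pod <your-gateway-pod> -n <namespace> --containers

    # cAdvisor series for the proxy container, to compare RSS vs. working set over time:
    #   container_memory_rss{container="linkerd-proxy", pod="<your-gateway-pod>"}
    #   container_memory_working_set_bytes{container="linkerd-proxy", pod="<your-gateway-pod>"}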

wmorgan commented 1 year ago

We're going to need a lot more details to make this actionable. What you are describing sounds like normal behavior for any proxy. Processing more traffic requires more memory. And in general, it is difficult for a Linux application to relinquish memory back to the OS once it has requested it.

Is there something out of the ordinary that made you file this issue? E.g. did this behavior change between Linkerd versions? Have you observed other proxies handle similar traffic loads with dramatically different behaviors? Please help us understand what is different here. (Also, please answer @hawkw's question above.)

hatfarm commented 1 year ago

Our graph looks pretty similar: [screenshot: proxy memory usage graph] Ours is scaled to the requested memory (we moved to using a limit higher than the request because we had seen this behavior before).

alpeb commented 1 year ago

Thanks for the feedback; we were actually able to reproduce the issue on ARM nodes. @hatfarm is that your architecture as well? If not, can you share more details about your setup (cloud provider, CNI, Linkerd version, etc.)? Also, what metric exactly are you using to measure memory usage?

hatfarm commented 1 year ago

We are using AMD nodes, Linkerd 2.12.4, Azure CNI in AKS. The metric we're looking at is container_memory_working_set_bytes for actual memory reporting.

michel-gleeson commented 1 year ago

Apologies for the delayed reply: I opened this issue and then went on a two-week vacation. Glad to see there's some movement here. Catching up:

Is there something out of the ordinary that made you file this issue? E.g. did this behavior change between Linkerd versions? Have you observed other proxies handle similar traffic loads with dramatically different behaviors?

  • Nothing out of the ordinary except for the sustained higher memory usage
  • Currently this is our only proxy, but we plan to add more as we grow
  • @hawkw this is the kubernetes.memory.usage metric from Datadog, so whatever that metric tracks is what's represented in this graph :)

Please let me know if there's anything else I can provide!

Abrishges commented 1 year ago

To provide further context, in addition to what @hatfarm mentioned, we've examined the memory utilization data in our Grafana dashboards over the past 30 days. A clear trend emerges when our system encounters unexpected surges in user traffic: we would expect a roughly proportional rise in memory usage for both our service and proxy containers. Typically, for the majority of our services under normal operating conditions, the memory utilization (as a percentage of the request) of the Linkerd proxy container is notably lower than that of our application container.

However, during traffic surges or exceptionally high request loads, a significant shift occurs. We've noticed that the memory utilization of the Linkerd proxy triples, reaching levels as high as 150% of its request, while our application container's memory utilization only increases by around 20%.

For instance, consider one of our services as depicted in the Grafana dashboard below: during routine traffic, the Linkerd proxy sits at roughly 32% memory utilization of its request, whereas our service registers around 50% of its request. However, when the volume of requests to our system increases substantially, the Linkerd proxy's memory utilization surges to 104% of its request, while our service's rises to approximately 64% of its request.

[screenshots: Grafana dashboards of proxy and application memory utilization]

Environment
  • K8s version: 1.25
  • K8s type: Azure AKS
  • OS: Linux
  • Linkerd version: 2.12.4

Let me know if you need further information or any assistance.

kflynn commented 1 year ago

Quick update: yup, we're still working on this!

alpeb commented 1 year ago

The original post in this issue was about the proxy not being able to release memory after a traffic burst passes. We've acknowledged this is an issue in our ARM builds that we still need to address.

@Abrishges am I right to conclude, given your graph, that the high-traffic scenario was sustained from 9/16 to 10/10, and that afterwards the memory was reclaimed? In that case this is different from the original issue. Note that the proxy's memory consumption doesn't necessarily grow linearly with traffic. I have some difficulty, however, interpreting your data. Can you clarify what "memory utilization [of the request]" is and how it's calculated? It would be clearer for us to reason about absolute memory consumption by those containers, and it would also be handy to know their k8s resource config. It would be great to also know the RPS during that time window, as well as the kind of traffic (protocol, latency distribution, etc.), so we can check whether this is expected behavior.
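For example, something along these lines would give us the numbers we're after (deployment, namespace, and container names are placeholders):

    # RPS, success rate, and latency percentiles for the workload (needs the viz extension)
    linkerd viz stat deploy/<api-gateway> -n <namespace>

    # Absolute memory in bytes for the proxy and app containers, rather than percent-of-request
    # (PromQL against cAdvisor metrics):
    #   container_memory_working_set_bytes{pod=~"<api-gateway>-.*", container=~"linkerd-proxy|<app-container>"}

    # The configured requests/limits for both containers
    kubectl get deploy <api-gateway> -n <namespace> -o jsonpath='{.spec.template.spec.containers[*].resources}'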

hatfarm commented 1 year ago

Hi @alpeb ,

I'm on @Abrishges's team, and that's not quite right. We had a huge influx of requests on 9/16 (we went from ~10k requests per minute to about 110k RPM). This was sustained for about two hours, after which it went away. The problem is that the memory usage stayed high for that whole period. The 10/10 dip is when we restarted or re-deployed (too much work to figure out which 😄, and it doesn't really matter which). This would have been entirely HTTP/HTTPS traffic over TCP.

As for the memory, here's a graph of the actual values: [screenshot: absolute memory usage of the linkerd-proxy container over time]

128MB was our request (though we've since updated it to 256MB). The memory usage itself isn't as much of a concern, but if we were to see several of these bursts within a small window, it seems we would run out of memory, since it's never freed.
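For what it's worth, one way to set the proxy's memory request/limit per workload is via Linkerd's proxy resource annotations; a sketch (we may have set ours through Helm values instead, and the names and numbers below are just illustrative):

    # Annotating the pod template triggers a rollout, and the proxy injector
    # applies the new request/limit to the re-created pods.
    kubectl -n <namespace> patch deploy <api-gateway> --type merge -p '
    {"spec": {"template": {"metadata": {"annotations": {
      "config.linkerd.io/proxy-memory-request": "256Mi",
      "config.linkerd.io/proxy-memory-limit": "512Mi"
    }}}}}'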

kflynn commented 1 year ago

@hatfarm and @Abrishges, probably best to split your case into its own discussion so we can separate issues of scale from issues of memory...