michel-gleeson opened this issue 1 year ago
Is the "Linkerd Memory Usage" graph in the screenshot you've attached showing the total virtual memory of the Linkerd proxy process, or its resident set size (RSS)?
We're going to need a lot more details to make this actionable. What you are describing sounds like normal behavior for any proxy. Processing more traffic requires more memory. And in general, it is difficult for a Linux application to relinquish memory back to the OS once it has requested it.
Is there something out of the ordinary that made you file this issue? E.g. did this behavior change between Linkerd versions? Have you observed other proxies handle similar traffic loads with dramatically different behaviors? Please help us understand what is different here. (Also, please answer @hawkw's question above.)
Our graph looks pretty similar; ours is scaled to the requested memory (we moved to using a limit higher than the request because we had seen this behavior before).
Thanks for the feedback; we were actually able to reproduce the issue on ARM nodes. @hatfarm, is that your architecture as well? If not, can you share more details about your setup (cloud provider, CNI, Linkerd version, etc.)? Also, what metric are you using exactly to measure memory usage?
We are using AMD nodes, Linkerd 2.12.4, and Azure CNI in AKS. The metric we're looking at for actual memory reporting is `container_memory_working_set_bytes`.
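As a quick cross-check outside the dashboard, the same per-container working-set figure can be pulled from metrics-server; a small sketch, with the namespace and label selector as placeholders:

```sh
# Per-container memory (working set) as reported by metrics-server.
# Namespace and label selector are placeholders for the affected workload.
kubectl top pod -n <namespace> -l app=<your-app> --containers
```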
Apologies for the delayed reply - I opened this issue then went on a 2 week vacation. Glad to see there's some movement here. Catching up:
Is there something out of the ordinary that made you file this issue? E.g. did this behavior change between Linkerd versions? Have you observed other proxies handle similar traffic loads with dramatically different behaviors?
- Nothing out of the ordinary except for the sustained higher memory usage
- Currently this is our only existing proxy, but we plan to add more as we grow
- @hawkw this metric is Datadog's `kubernetes.memory.usage` metric, so whatever that metric tracks is what's represented in this graph :)
Please let me know if there's anything else I can provide!
To provide further context, in addition to what @hatfarm mentioned, we've examined the memory utilization data in our Grafana dashboards over the past 30 days. A clear trend emerges when our system encounters unexpected surges in user traffic: we expect a proportional rise in memory usage for both our service and proxy containers. Typically, for the majority of our services under normal operating conditions, the memory utilization (as a percentage of the memory request) of the Linkerd proxy container is notably lower than that of our application container.
However, during traffic surges or exceptionally high request loads, a significant shift occurs. We've noticed that the memory utilization of the Linkerd proxy triples, reaching levels as high as 150% of its request, while our application container's memory utilization only increases by around 20%.
For instance, consider one of our services as depicted in the Grafana dashboard below: during routine traffic, the Linkerd proxy sits at roughly 32% of its memory request, whereas our service sits at around 50% of its request. However, when the volume of requests to our system increases substantially, the Linkerd proxy's memory utilization surges to 104% of its request, while our service's rises only to approximately 64% of its request.
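For reference, a "percent of the memory request" figure like this is typically the container's working set divided by its configured request; a rough sketch against the Prometheus HTTP API, assuming default cAdvisor and kube-state-metrics metric names (the Prometheus address and pod regex are placeholders):

```sh
# Working set as a percentage of each container's memory request.
# Assumes default cAdvisor + kube-state-metrics metrics; address and selectors are placeholders.
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=
  100 * container_memory_working_set_bytes{namespace="<ns>", pod=~"<deploy>-.*", container!=""}
    / on(namespace, pod, container)
  kube_pod_container_resource_requests{resource="memory"}
'
```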
Environment
- k8s version: 1.25
- K8s type: Azure AKS
- OS: Linux
- Linkerd version: 2.12.4
Let me know if you need further information or need any assistance.
Quick update: yup, we're still working on this!
The original post in this issue was about the proxy not being able to release memory after a traffic burst passes. We acknowledged this as an issue we reproduced on our ARM builds, which we still need to address.
@Abrishges, am I right to conclude, given your graph, that the high-traffic scenario was sustained from 9/16 to 10/10? Afterwards it appears the memory was reclaimed, in which case this is different from the original issue. Note that the proxy's memory consumption doesn't necessarily grow linearly with traffic. However, I have some difficulty interpreting your data. Can you clarify what "memory utilization [of the request]" is, or how it's calculated? It would be clearer for us to reason about the absolute memory consumed by those containers, and it would also be handy to know their k8s resource config. Even further, it would be great to know the RPS during that time lapse, as well as the kind of traffic (protocol, latency distribution, etc.), so we can check whether this is expected behavior.
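On the resource-config question, one way to dump exactly what the injected proxy container is running with (the pod name and namespace are placeholders):

```sh
# Print the injected linkerd-proxy container's resource requests/limits.
# Pod name and namespace are placeholders.
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[?(@.name=="linkerd-proxy")].resources}'
```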
Hi @alpeb ,
I'm on @Abrishges's team, and that's not quite right. We had a HUGE influx of requests on 9/16 (we went from ~10k requests per minute to about 110k RPM). This was sustained for about 2 hours, after which it went away. The problem is that the memory usage stayed high for that entire period. The 10/10 dip is when we restarted or re-deployed (too much work to figure out which 😄, and it doesn't really matter which one). This would have been entirely HTTP/HTTPS traffic over TCP.
As for the memory, here's a graph of the actual values:
128MB was our request (though we've updated that to 256MB). The memory usage itself isn't as much of a concern, but if we were to see multiple of these bursts over some small window, it seems that we would run out of memory, since it's never freed.
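For what it's worth, the proxy's memory request/limit can be raised per workload with the standard `config.linkerd.io/proxy-memory-request` and `config.linkerd.io/proxy-memory-limit` annotations on the pod template; a sketch (deployment name, namespace, and sizes are placeholders, and the patch rolls the pods so the injector applies the new values):

```sh
# Bump the injected proxy's memory request/limit for this workload only.
# Deployment, namespace, and values are placeholders; the rollout re-injects the pods.
kubectl patch deploy <deployment> -n <namespace> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{
     "config.linkerd.io/proxy-memory-request":"256Mi",
     "config.linkerd.io/proxy-memory-limit":"512Mi"}}}}}'
```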
@hatfarm and @Abrishges, probably best to split your case into its own discussion so we can separate issues of scale from issues of memory...
What is the issue?
Looking at my Datadog dashboards, I see that the Linkerd proxies consume a sustained higher level of memory after a burst of traffic hits them, and that memory is not released once the burst passes.
How can it be reproduced?
In our infrastructure, whenever traffic at normal levels sees a sudden 100x burst, we also see larger memory use from the sidecar proxy that persists after the burst subsides. Would be curious to see if this is reproducible in other infrastructures.
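A reproduction sketch along those lines, assuming a meshed HTTP service reachable in-cluster and using fortio purely as an example load generator (the URL, rates, and durations are placeholders):

```sh
# Baseline load, then a ~100x burst, then check whether the proxy's working set drops back.
# Service URL, QPS values, and durations are placeholders.
fortio load -qps 20 -t 10m http://<service>.<namespace>.svc.cluster.local/   # steady baseline
fortio load -qps 2000 -t 2h http://<service>.<namespace>.svc.cluster.local/  # 100x burst
kubectl top pod -n <namespace> -l app=<service> --containers                 # observe after the burst
```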
Logs, error output, etc
output of `linkerd check -o short`
Environment
Possible solution
No response
Additional context
The application this is a sidecar to is our API gateway, which uses Koa.
Would you like to work on fixing this bug?
None