MiGrandjean opened this issue 1 year ago
Thanks for your report! Could you share a memory dump? You can grab one in the following way:
-server.profiling_enabled
http://localhost:9080/debug/pprof/heap?duration=15s
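For anyone running the Helm chart, here is a minimal (untested) sketch of how that profiling flag could be passed, assuming the chart exposes an `extraArgs` value and that your Promtail version supports `-server.profiling_enabled`; the heap profile can then be pulled from the `/debug/pprof/heap` endpoint above:
# values.yaml override (sketch only; flag support depends on the Promtail version)
extraArgs:
  - -server.profiling_enabled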
Since that is still using less than 150MiB, maybe this isn't a real leak, but anyway, it's always good to have a look at things :smile:
edit: ok, looking at my own cluster it looks like there's definitely a leak going on lol. Thanks for the report.
Also seeing a "leaky" pattern and OOM kills across our Promtail DaemonSets (Promtail v2.8.3):
sum by (node) (container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+",container!=""})
We could correlate the leak with extremely long lines, or with no line breaks at all, in certain pods' logs. We experimented with adding line breaks and the leak disappeared.
For reference, we are using max_line_size = 0 (no limit).
> We could correlate the leak with extremely long lines, or with no line breaks at all, in certain pods' logs. We experimented with adding line breaks and the leak disappeared.
> Using max_line_size = 0
hi @grandich, do you mean adding these lines?
config:
  snippets:
    extraLimitsConfig: |
      max_line_size: 0
      max_line_size_truncate: true
or can you send an example of what you mean? :)
hi @Gakhramanzode
> or can you send an example of what you mean? :)
I meant we are using v2.5 defaults (no override/config): https://grafana.com/docs/loki/v2.5.x/configuration/#limits_config
max_line_size: 0
max_line_size_truncate: false
These are our Promtail pods as of today (one week). The "big leakers" are gone (they were apps with gigantic log lines), but the leaky pattern remains; I guess it is related to the "very long lines" apps:
sum by (node) (container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+",container!=""})
@grandich thank you! I'll try again tomorrow to fix the memory leak :)
@grandich thank you for sharing your insights on the memory consumption issue with Promtail. We have implemented your suggestion but decided to set max_line_size to 16384 instead of 0, to ensure better control over the log line size. All changes have been applied, and we will observe the system over the next few days to monitor the memory usage. I appreciate your help!
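For illustration, a sketch of what such an override might look like via the chart's `extraLimitsConfig` snippet (the structure follows the snippet quoted earlier in this thread; the 16384 value is the one mentioned above, and the truncate flag is optional, not a recommendation from this thread):
config:
  snippets:
    extraLimitsConfig: |
      max_line_size: 16384
      max_line_size_truncate: true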
@grandich do you know why setting max_line_size: 0 helped remove the leak? According to the docs, by setting this we essentially say there's no limit on the log line length. To me it's counter-intuitive.
@sajithnedunmullage so, I don't understand 😀 I set max_line_size to 16384. Is that wrong? :)
These are my changes:
@Gakhramanzode I'm also confused. Seeking insights and advice :)
My Promtail pods have been slowly but steadily increasing their memory consumption for the last 30 days. I'm trying to find a proper fix for this.
@sajithnedunmullage I understand you bro 😄
> @grandich do you know why setting max_line_size: 0 helped remove the leak? According to the docs, by setting this we essentially say there's no limit on the log line length. To me it's counter-intuitive.
Hi @sajithnedunmullage, I didn't say that. In https://github.com/grafana/loki/issues/8054#issuecomment-1888178237 I mentioned that the leak is correlated with the length of the lines. We had lines on the order of ~500KB / ~1MB which produced huge leaks. We introduced line breaks, and in certain cases eliminated such lines, and the leaks improved.
I only mentioned max_line_size as a reference.
In theory, if we set max_line_size to something != 0, the leak should improve, but we didn't test it.
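As a concrete (untested) sketch of "something != 0", the cap could be expressed directly in the limits_config block discussed above; the values here are illustrative only, not something verified in this thread:
limits_config:
  # cap line length instead of leaving it unlimited (0); illustrative value
  max_line_size: 256kb
  # truncate over-long lines instead of dropping them
  max_line_size_truncate: true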
I'm seeing the same behavior over a 90-day period:
@maudrid Welcome to the club 🤝
Hello everyone 👋 @grandich, I think your suggestion helped us:
...
# -- Section for crafting Promtail's config file. The only directly relevant value is `config.file`,
# which is a templated string that references the other values and snippets below this key.
# @default -- See `values.yaml`
config:
  ...
  # -- A section of reusable snippets that can be referenced in `config.file`.
  # Custom snippets may be added in order to reduce redundancy.
  # This is especially helpful when multiple `kubernetes_sd_configs` are used, which usually have large parts in common.
  # @default -- See `values.yaml`
  snippets:
    ...
    # -- You can put here any keys that will be directly added to the config file's 'limits_config' block.
    # @default -- empty
    extraLimitsConfig: |
      max_line_size: 0
      max_line_size_truncate: false
    ...
This issue should be renamed to "Promtail: memory leak" as it was confirmed that this is really a memory leak. Also, a fix would be nice 🙏🏼.
It works 🤔
Hi, we are facing a memory leak issue in Promtail. My Promtail pods have been slowly but steadily increasing their memory consumption for the last 30 days. Kindly provide a solution for this.
Any solution for this issue? Still facing this for v3.0.0
This is happening even when Promtail is used only for ingesting logs from the journal.
> This is happening even when Promtail is used only for ingesting logs from the journal.
We have a few installations of Promtail scraping logs from regular files and the systemd journal. Historically, these have been running separately, each instance with its own config. We noticed that only the ones scraping the journal are consistently leaking, while the others (those scraping regular files) are doing just fine, e.g.:
So far we confirmed the leak on v2.8.2 and v3.0.1.
We also tried to migrate from Promtail to Alloy (for this particular log scraping task), but Alloy was consuming almost 2x more RSS doing exactly the same thing as Promtail, so we had to postpone the migration for now.
For these particular workloads we do not use any k8s or containers; the Promtail instances are running on regular VMs.
I'd be happy to provide more context if it helps to finally resolve the issue.
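For context, here is a minimal sketch of the journal-scraping side of such a setup (the shape follows Promtail's documented journal support; the age, labels, and relabel rule are placeholders, not the actual config from this installation):
scrape_configs:
  - job_name: journal
    journal:
      # only read back a limited window of the journal on startup (placeholder value)
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      # expose the systemd unit as a label
      - source_labels: ['__journal__systemd_unit']
        target_label: unit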
Hello,
On my side, I had the same memory leak problem, but only on a few Promtail pods of the DaemonSet (Helm chart 6.16.4 with Promtail 3.0.0).
I've tried some settings under extraLimitsConfig, without any improvement.
Finally, the best thing I've found is to reduce the number of files being tracked. In our context, there is no need to keep every container discovered by kubernetes_sd_configs under watch for Loki.
As an example, some exclusions (slightly modified for the post):
extraRelabelConfigs:
  - action: drop
    source_labels:
      - namespace
    regex: '(kube-system|kubernetes-dashboard|postgres|observability)'
  - action: drop
    source_labels:
      - pod
    regex: '(loki|promtail|.*botkube|kafka-lag-exporter)-.*'
  - action: drop
    source_labels:
      - container
    regex: "(vault-agent|jaeger-agent|envoy-sidecar)"
This workaround fixed our recurring Promtail OOMs in various environments.
Describe the bug
We are seeing a slow but steady increase in memory usage for our Promtail pods. IMHO this looks very typical for a memory leak. Sooner or later we experience OOM kills for Promtail.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I would expect memory consumption to stay roughly flat under regular operation, or, if there are spikes, that the memory is freed after the increased demand.
Environment:
Screenshots, Promtail config, or terminal output
We are using the default values of the Helm chart (with the exception of the Loki URL and some podAnnotations and tolerations).
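For reference, a sketch of the kind of minimal override described above (the URL, annotation, and toleration values are placeholders, not the actual configuration in use):
config:
  clients:
    # push endpoint of the Loki installation (placeholder URL)
    - url: http://loki-gateway.monitoring.svc/loki/api/v1/push
podAnnotations:
  # placeholder annotation
  example.com/some-annotation: "enabled"
tolerations:
  # placeholder toleration so the DaemonSet also runs on tainted nodes
  - operator: Exists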
I'm also happy to investigate this further and e.g. track down what is actually driving the memory consumption, if someone can point me in the right direction on how to do this.