grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail: possible memory leak #8054

Open MiGrandjean opened 1 year ago

MiGrandjean commented 1 year ago

Describe the bug
We are seeing a slow but steady increase in memory usage for our Promtail pods. IMHO this looks very typical of a memory leak. Sooner or later we experience OOM kills for Promtail.

To Reproduce
Steps to reproduce the behavior:

  1. Running Loki (2.6.1), deployed on EKS via Helm Chart
  2. Running Promtail (2.7.0), deployed on EKS via Helm Chart
  3. Running some Pods (Ingress, Prometheus, Grafana, Thanos ... ) on the Node

Expected behavior
I would expect memory consumption to stay roughly flat under regular operation, or, if there are spikes, for the memory to be freed again once the increased demand passes.

Environment:

Screenshots, Promtail config, or terminal output

(screenshot: Promtail pod memory usage climbing steadily over time)

We are using the default values of the Helm Chart (with the exception of the Loki URL and some podAnnotations and tolerations).

I'm also happy to investigate this further, e.g. track down what is actually driving the memory consumption, if someone can point me in the right direction.

DylanGuedes commented 1 year ago

Thanks for your report! Could you share a memory dump? You can grab one as follows:

  1. Run promtail with -server.profiling_enabled
  2. Access http://localhost:9080/debug/pprof/heap?duration=15s
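
For anyone following along, here is a rough sketch of pulling and inspecting that heap profile with the standard Go pprof tooling (it assumes the Go toolchain is installed locally and that Promtail is listening on its default HTTP port 9080, as in the URL above):

    # save a heap snapshot from the running Promtail (profiling must be enabled)
    curl -s -o heap.pprof "http://localhost:9080/debug/pprof/heap"

    # inspect it interactively; `top` lists the largest in-use allocations
    go tool pprof heap.pprof

    # or point pprof at the endpoint directly
    go tool pprof http://localhost:9080/debug/pprof/heap

Comparing two snapshots taken a few hours apart (go tool pprof -base old.pprof new.pprof) makes a slow leak much easier to spot.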

Since that is still using less than 150 MiB, maybe this isn't a real leak, but it's always good to take a look at these things :smile:

Edit: OK, looking at my own cluster, it looks like there's definitely a leak going on. Thanks for the report.

grandich commented 10 months ago

We're also seeing a "leaky" pattern and OOM kills across our Promtail DaemonSets (Promtail v2.8.3).

(screenshot: per-node Promtail working-set memory climbing over time)

    sum by (node) (container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+",container!=""})

grandich commented 10 months ago

We could correlate the leak with extremely long lines (or a complete absence of line breaks) in certain pods' logs. We experimented with adding line breaks and the leak disappeared.

For reference, we are using max_line_size = 0 (no limit)

Gakhramanzode commented 7 months ago

We could correlate the leak with extremely long lines (or a complete absence of line breaks) in certain pods' logs. We experimented with adding line breaks and the leak disappeared.

Using max_line_size = 0

Hi @grandich, do you mean adding these lines?

config:
  snippets:
    extraLimitsConfig: |
      max_line_size: 0
      max_line_size_truncate: true

Gakhramanzode commented 7 months ago

Or can you send an example of what you mean? :)

grandich commented 7 months ago

hi @Gakhramanzode

Or can you send an example of what you mean? :)

I meant we are using v2.5 defaults (no override/config): https://grafana.com/docs/loki/v2.5.x/configuration/#limits_config

    max_line_size: 0
    max_line_size_truncate: false

grandich commented 7 months ago

These are our Promtail pods as of today (one week). The "big leakers" are gone; they were apps with gigantic log lines. But the leaky pattern remains, so I guess it is related to apps with very long lines:

    sum by (node) (container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+",container!=""})

(screenshot: per-node Promtail working-set memory, one-week view)

Gakhramanzode commented 7 months ago

@grandich thank you! I'll try to fix the memory leak again tomorrow :)

Gakhramanzode commented 7 months ago

@grandich thank you for sharing your insights on the memory consumption issue with Promtail. We have implemented your suggestion but decided to set max_line_size to 16384 instead of 0, to keep better control over log line size. All changes have been applied, and we will observe the system over the next few days to monitor memory usage. I appreciate your help!
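
For reference, that change would look something like the following with the Helm chart's extraLimitsConfig snippet shown earlier in this thread (a sketch only; enabling truncation as well is an assumption, not something stated in the comment above):

    config:
      snippets:
        extraLimitsConfig: |
          max_line_size: 16384
          max_line_size_truncate: true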

sajithnedunmullage commented 7 months ago

@grandich do you know why setting max_line_size: 0 helped to remove the leak? According to the docs, setting this to 0 means there is no limit on log line length, which seems counter-intuitive to me.

Gakhramanzode commented 7 months ago

@sajithnedunmullage so now I don't understand 😀 I set max_line_size to 16384. Is that wrong? :)

Gakhramanzode commented 7 months ago

These are my changes: (screenshot of the applied config)

sajithnedunmullage commented 7 months ago

@Gakhramanzode I'm also confused, and seeking insights and advice :)

My Promtail pods have been slowly but steadily increasing their memory consumption for the last 30 days. I'm trying to find a proper fix for this.

(screenshot: Promtail memory consumption rising over 30 days)

Gakhramanzode commented 7 months ago

@sajithnedunmullage I understand you, bro 😄

grandich commented 7 months ago

@grandich do you know why setting max_line_size: 0 helped to remove the leak? According to the docs, setting this to 0 means there is no limit on log line length, which seems counter-intuitive to me.

Hi @sajithnedunmullage, I didn't say that. In https://github.com/grafana/loki/issues/8054#issuecomment-1888178237 I mentioned that the leak is correlated with line length. We had lines on the order of ~500 KB to ~1 MB which produced huge leaks. We introduced line breaks, and in certain cases eliminated such lines entirely, and the leaks improved.

I only mentioned max_line_size as a reference.

In theory, if we set max_line_size to something other than 0 the leak should improve, but we didn't test it.
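
A quick way to check whether a given workload actually emits lines of that size is to scan its log file directly on the node (a sketch; /var/log/containers/... is the usual kubelet symlink location, adjust the path and threshold to your setup):

    # count log lines longer than 64 KiB in a container's log file
    awk 'length($0) > 65536 { n++ } END { print n+0, "lines over 64 KiB" }' \
        /var/log/containers/<pod>_<namespace>_<container>-*.log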

maudrid commented 6 months ago

I'm seeing the same behavior over a 90-day period: (screenshot: memory usage climbing over 90 days)

Gakhramanzode commented 6 months ago

@maudrid Welcome to the club 🤝

Gakhramanzode commented 6 months ago

Hello everyone 👋 @grandich, I think your change helped us: (screenshot: Grafana memory dashboard after the change)

...
# -- Section for crafting Promtails config file. The only directly relevant value is `config.file`
# which is a templated string that references the other values and snippets below this key.
# @default -- See `values.yaml`
config:
...
  # -- A section of reusable snippets that can be reference in `config.file`.
  # Custom snippets may be added in order to reduce redundancy.
  # This is especially helpful when multiple `kubernetes_sd_configs` are use which usually have large parts in common.
  # @default -- See `values.yaml`
  snippets:
...
    # -- You can put here any keys that will be directly added to the config file's 'limits_config' block.
    # @default -- empty
    extraLimitsConfig: |
      max_line_size: 0
      max_line_size_truncate: false
...
Rohlik commented 5 months ago

This issue should be renamed to "Promtail: memory leak", as it has been confirmed that this really is a memory leak. Also, a fix would be nice 🙏🏼.

Gakhramanzode commented 5 months ago

It works 🤔 (screenshot: memory graph after the change)

naveen2112 commented 4 months ago

Hi, we're facing a memory leak issue in Promtail. My Promtail pods have been slowly but steadily increasing their memory consumption for the last 30 days. Kindly provide a solution for this.

(screenshot: memory usage over 30 days)

AniketTendulkar2510 commented 3 months ago

Any solution for this issue? Still facing this on v3.0.0.

VtG242 commented 2 months ago

This is happening even when Promtail is used only for ingesting logs from the systemd journal. (screenshot: memory usage graph)

defanator commented 2 months ago

This is happening even when Promtail is used only for ingesting logs from the journal

We have a few installations of Promtail scraping logs from regular files and from the systemd journal. Historically these run separately, each instance with its own config. We noticed that only the ones scraping the journal are consistently leaking, while the others (those scraping regular files) are doing just fine, e.g.:

(screenshots: memory usage of the journal-scraping instances steadily climbing vs. the file-scraping instances staying flat)

So far we have confirmed the leak on v2.8.2 and v3.0.1.

We also tried to migrate from Promtail to Alloy (for this particular log scraping task), but Alloy was consuming almost 2x more RSS doing exactly the same thing as Promtail, so we had to postpone the migration for now.

For these particular workloads we do not use any k8s or containers; the Promtail instances run on regular VMs.

I'd be happy to provide more context if it helps to finally resolve the issue.
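
For readers who haven't used this mode: a journal-only setup like the ones described above boils down to a scrape_configs entry with the documented journal stanza, roughly like this (labels and paths are illustrative, not this user's actual config):

    scrape_configs:
      - job_name: journal
        journal:
          # read at most the last 12h of journal entries on startup
          max_age: 12h
          # typical persistent-journal location; omit to use the default
          path: /var/log/journal
          labels:
            job: systemd-journal
        relabel_configs:
          - source_labels: ["__journal__systemd_unit"]
            target_label: "unit"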

M4teo commented 1 month ago

Hello, on my side I had the same memory leak problem, but only on a few Promtail pods of the DaemonSet (Helm chart 6.16.4 with Promtail 3.0.0). I tried some settings under extraLimitsConfig without any improvement. In the end, the best thing I found was to reduce the number of files being tracked: in our context there is no need to keep every container discovered by kubernetes_sd_configs under watch in Loki. As an example, some exclusions (slightly modified for the post):

    extraRelabelConfigs:
      - action: drop
        source_labels:
          - namespace
        regex: '(kube-system|kubernetes-dashboard|postgres|observability)'
      - action: drop
        source_labels:
          - pod
        regex: '(loki|promtail|.*botkube|kafka-lag-exporter)-.*'
      - action: drop
        source_labels:
          - container
        regex: "(vault-agent|jaeger-agent|envoy-sidecar)"

This workaround fixed our recurring Promtail OOMs in various environments. (screenshot: memory graph after applying the exclusions)