grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail: possible memory leak #8054

Open MiGrandjean opened 1 year ago

MiGrandjean commented 1 year ago

**Describe the bug**

We are seeing a slow but steady increase in memory usage for our Promtail pods. IMHO this looks very typical of a memory leak. Sooner or later we experience OOM kills for Promtail.

**To Reproduce**

Steps to reproduce the behavior:

  1. Running Loki (2.6.1), deployed on EKS via Helm Chart
  2. Running Promtail (2.7.0), deployed on EKS via Helm Chart
  3. Running some Pods (Ingress, Prometheus, Grafana, Thanos ... ) on the Node

**Expected behavior**

I would expect memory consumption to stay roughly flat under regular operation, or, if there are spikes, for the memory to be freed once the increased demand passes.

Environment:

Screenshots, Promtail config, or terminal output

(screenshot: Promtail pod memory usage slowly but steadily increasing)

We are using the default values of the Helm Chart (with the exception of the Loki URL and some podAnnotations and tolerations).

I'm also happy to investigate this further and e.g. track down what is actually driving the memory consumption, if someone can point me in the right direction on how to do this.

DylanGuedes commented 1 year ago

Thanks for your report! Could you share a memory dump? You can grab one as follows:

  1. Run promtail with -server.profiling_enabled
  2. Access http://localhost:9080/debug/pprof/heap?duration=15s
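
The two steps above can be sketched as a shell session (the port and paths assume the default Promtail server config; `go tool pprof` requires a local Go toolchain, and the config path is illustrative):

```shell
# 1. Start promtail with the profiling endpoints enabled:
promtail -config.file=/etc/promtail/config.yml -server.profiling_enabled

# For a pod running in-cluster, port-forward first:
#   kubectl port-forward <promtail-pod> 9080:9080

# 2. In another terminal, capture the heap profile to a file:
curl -o heap.out "http://localhost:9080/debug/pprof/heap"

# 3. Inspect the top allocators:
go tool pprof -top heap.out
```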

Since that is still using less than 150MiB, maybe this isn't a real leak, but either way it's always good to have a look at things :smile:

edit: ok, looking at my own cluster it looks like there's definitely a leak going on lol. Thanks for the report.

grandich commented 5 months ago

Also seeing a "leaky" pattern and OOM kills across our promtail DaemonSets (promtail v2.8.3).

(screenshot: per-node Promtail memory usage, from the query below)

```promql
sum by (node) (container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+",container!=""})
```
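
A per-pod variant of the same query (identical label assumptions) can help single out the worst offenders:

```promql
topk(5, container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+", container!=""})
```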

grandich commented 5 months ago

We could correlate the leak with extremely long lines, or a complete absence of line breaks, in certain pods' logs. We experimented with adding line breaks and the leak disappeared.
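
One way to spot the "gigantic line" apps described above is to measure the longest line each pod emits (a sketch; in practice you would pipe `kubectl logs <pod> --tail=1000` into the `awk` filter, here a `printf` stands in so the snippet is self-contained):

```shell
# Print the length of the longest line in a log stream.
printf 'short\na-much-longer-line\n' \
  | awk '{ if (length($0) > max) max = length($0) } END { print max }'
# prints 18 (the length of "a-much-longer-line")
```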

For reference, we are using `max_line_size: 0` (no limit).

Gakhramanzode commented 2 months ago

> We could correlate the leak with extremely long lines or absolutely no line breaks in certain pods logs. We experimented adding line breaks and the leak disappeared.
>
> Using max_line_size = 0

hi @grandich, do you mean adding these lines?

```yaml
config:
  snippets:
    extraLimitsConfig: |
      max_line_size: 0
      max_line_size_truncate: true
```

Gakhramanzode commented 2 months ago

or can you send an example of what you mean? :)

grandich commented 2 months ago

hi @Gakhramanzode

> or can you send example what you mean ? :)

I meant we are using v2.5 defaults (no override/config): https://grafana.com/docs/loki/v2.5.x/configuration/#limits_config

```yaml
max_line_size: 0
max_line_size_truncate: false
```

grandich commented 2 months ago

These are our promtail pods as of today (one week). The "big leakers", which were apps with gigantic log lines, are gone, but the leaky pattern remains; I guess it is related to apps with very long lines.

```promql
sum by (node) (container_memory_working_set_bytes{job="cadvisor/cadvisor", pod=~"promtail.+",container!=""})
```

(screenshot: per-node Promtail memory usage over one week)

Gakhramanzode commented 2 months ago

@grandich thank you! I'll try again tomorrow to fix the memory leak))

Gakhramanzode commented 2 months ago

@grandich thank you for sharing your insights on the memory consumption issue with promtail. We have implemented your suggestion, but decided to set `max_line_size` to 16384 instead of 0, to ensure better control over log line size. All changes have been applied, and we will observe the system over the next few days to monitor memory usage. I appreciate your help!
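
For reference, a sketch of what that override looks like in the promtail Helm values, following the `extraLimitsConfig` snippet shown earlier in the thread (the 16384 value and `max_line_size_truncate: true` are this commenter's choice, not a project recommendation):

```yaml
config:
  snippets:
    extraLimitsConfig: |
      # Truncate lines longer than 16 KiB instead of shipping them whole.
      max_line_size: 16384
      max_line_size_truncate: true
```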

sajithnedunmullage commented 2 months ago

@grandich do you know why setting `max_line_size: 0` helped to remove the leak? According to the docs, by setting this we essentially say there's no limit on log line length. To me it's counter-intuitive.

Gakhramanzode commented 2 months ago

@sajithnedunmullage so, I don't understand 😀 I set `max_line_size` to 16384. Is that wrong?))

Gakhramanzode commented 2 months ago

these are my changes: (screenshot of the applied config change)

sajithnedunmullage commented 2 months ago

@Gakhramanzode I'm also confused. Seeking insights and advice :)

My Promtail pods have been slowly but steadily increasing their memory consumption for the last 30 days. Trying to find a proper fix for this.

(screenshot: Promtail memory usage over 30 days)

Gakhramanzode commented 2 months ago

@sajithnedunmullage I understand you bro 😄 (screenshot)

grandich commented 2 months ago

> @grandich do you know why setting max_line_size: 0 helped to remove the leak? According to the doc, by setting this we essentially say there's no limit on the log line length. For me it's counter-intuitive.

Hi @sajithnedunmullage, I didn't say that. In https://github.com/grafana/loki/issues/8054#issuecomment-1888178237 I mentioned that the leak is correlated with the length of the lines. We had lines on the order of ~500KB / ~1MB which produced huge leaks. We introduced line breaks, and in certain cases eliminated such lines entirely, and the leaks improved.

I only mentioned max_line_size as a reference.

In theory, if we set `max_line_size` to something != 0, the leak should improve, but we didn't test it.

maudrid commented 1 month ago

I'm seeing the same behavior over a 90-day period: (screenshot)

Gakhramanzode commented 1 month ago

@maudrid Welcome to the club 🤝

Gakhramanzode commented 1 month ago

hello everyone 👋 @grandich I think your revision helped us (screenshot: Grafana memory dashboard, 30-5-2024)

```yaml
...
# -- Section for crafting Promtail's config file. The only directly relevant value is `config.file`,
# which is a templated string that references the other values and snippets below this key.
# @default -- See `values.yaml`
config:
...
  # -- A section of reusable snippets that can be referenced in `config.file`.
  # Custom snippets may be added in order to reduce redundancy.
  # This is especially helpful when multiple `kubernetes_sd_configs` are used, which usually have large parts in common.
  # @default -- See `values.yaml`
  snippets:
...
    # -- You can put here any keys that will be directly added to the config file's `limits_config` block.
    # @default -- empty
    extraLimitsConfig: |
      max_line_size: 0
      max_line_size_truncate: false
...
```

Rohlik commented 2 weeks ago

This issue should be renamed to "Promtail: memory leak" as it was confirmed that this is really a memory leak. Also, a fix would be nice 🙏🏼.

Gakhramanzode commented 1 week ago

it works 🤔 (screenshot)