grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail consumes all available RAM when accidentally scraping a "large" sparse file #11229

Open timwsuqld opened 7 months ago

timwsuqld commented 7 months ago

Describe the bug
Promtail consumes all available RAM (it doesn't start swapping) and causes the VM to freeze; the OOM killer doesn't appear to kick in. This happens when /var/log/lastlog ends up in the pattern match: it is a massively sparse file with almost no actual data in it. Promtail shouldn't consume all memory just to read a large file.

To Reproduce
Steps to reproduce the behavior:

  1. Start promtail 2.9.2 via docker with /var/log/lastlog as part of the included list (a sparse test file works too; see the sketch below)
  2. Watch memory usage skyrocket to 100%, at which point the VM crashes
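
If you don't have a multi-terabyte lastlog handy, an equivalent sparse file can be created for testing. Rough Go sketch (the path and size are arbitrary; truncate -s 1T from coreutils does the same thing):

// Creates a sparse file: huge apparent size, almost no allocated blocks,
// similar to /var/log/lastlog.
package main

import (
    "log"
    "os"
)

func main() {
    f, err := os.Create("/var/log/sparse-test.log") // example path only
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Truncate extends the file to 1 TiB without writing any data,
    // so the filesystem allocates (almost) no blocks for it.
    if err := f.Truncate(1 << 40); err != nil {
        log.Fatal(err)
    }
}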

Expected behavior
Promtail should limit its memory usage (even at startup) so it can't consume everything on the machine and crash it. Yes, this particular case can be avoided by excluding the lastlog file from being processed, but there should still be limits on memory usage. I'm guessing we try to mmap the whole file?
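
The kind of failsafe I have in mind is a cheap pre-check before tailing: compare a file's apparent size with the blocks actually allocated, and skip (or at least warn about) files that are mostly holes. Rough Linux-only sketch, not Promtail's actual code, and the 10x threshold is made up:

// Detects files whose apparent size is far larger than the space actually
// allocated on disk, i.e. files that are mostly sparse. Linux-only.
package main

import (
    "fmt"
    "os"
    "syscall"
)

// isMostlySparse reports whether the file looks mostly sparse.
func isMostlySparse(path string) (bool, error) {
    fi, err := os.Stat(path)
    if err != nil {
        return false, err
    }
    st, ok := fi.Sys().(*syscall.Stat_t)
    if !ok {
        return false, nil // not a Unix stat; can't tell
    }
    allocated := st.Blocks * 512 // st_blocks is counted in 512-byte units
    return fi.Size() > 10*allocated, nil // 10x threshold is arbitrary
}

func main() {
    sparse, err := isMostlySparse("/var/log/lastlog")
    if err != nil {
        fmt.Println("stat failed:", err)
        return
    }
    fmt.Println("mostly sparse:", sparse)
}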

Environment:

Screenshots, Promtail config, or terminal output
If applicable, add any output to help explain your problem.

config.yaml

server:
  http_listen_port: 9080
  grpc_listen_port: 0

limits_config:
  readline_rate_enabled: true
  readline_rate: 100
  readline_burst: 100
  readline_rate_drop: false
  max_streams: 10

positions:
  filename: /etc/promtail/data/positions.yaml

clients:
  - url: https://loki-server.internal/loki/api/v1/push
    basic_auth:
      username: loki
      password_file: /etc/secrets/loki.auth

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log/*log
      hostname: ${HOST_HOSTNAME}

docker-compose.yml

version: '3.6'
# This file is managed by salt salt://suqld/docker/generic/docker-compose.yml.jinja
services:

    promtail:
        image: grafana/promtail:2.9.2
        network_mode: host
        restart: no
        command: -config.file=/etc/promtail/config.yaml -config.expand-env=true --dry-run --log.level=debug
        labels:
            - au.org.suqld.docker.type="Promtail"
            - com.centurylinklabs.watchtower.enable=true
        volumes:
            - ./promtail-config.yaml:/etc/promtail/config.yaml
            - ./promtail-data/:/etc/promtail/data/
            - ./secrets/:/etc/secrets/
            - /var/log:/var/log
            - /var/run/docker.sock:/var/run/docker.sock:ro
        environment:
            HOST_HOSTNAME: test3
        logging:
            options:
                max-size: "256m"

du shows lastlog as small, but with --apparent-size we can see that lastlog is actually massive.

$ du -hcs /var/log/*log
12K     /var/log/alternatives.log
468K    /var/log/auth.log
108K    /var/log/cloud-init-output.log
3.6M    /var/log/cloud-init.log
16K     /var/log/cron.log
72K     /var/log/dpkg.log
716K    /var/log/kern.log
28K     /var/log/lastlog
52K     /var/log/mail.log
0       /var/log/php_error.log
2.8M    /var/log/syslog
8.0K    /var/log/ubuntu-advantage-daemon.log
0       /var/log/ubuntu-advantage-timer.log
60K     /var/log/ubuntu-advantage.log
7.9M    total

$ du -h --apparent-size /var/log/lastlog 
1.1T    /var/log/lastlog

Yes, this can be mitigated with __path_exclude__, but we really shouldn't crash systems over config mistakes like this (and there are plenty of users for whom this exact config hasn't crashed their system yet, but one day it will):

     labels:
       job: varlogs
       __path__: /var/log/*log
       __path_exclude__: /var/log/lastlog

wkwan4096 commented 7 months ago

I have the same issue. On a VM with 16G RAM, I have a 188GB /var/log/lastlog. During the test phase we also included /var/log/*log, and Promtail was killed by the OOM killer.

Should there be some sort of failsafe to prevent Promtail from taking all the memory?


  build user:       root@21ab03f17324
  build date:       2023-09-14T16:24:53Z
  go version:       go1.20.6
  platform:         linux/amd64
  tags:             promtail_journal_enabled

Skaronator commented 2 months ago

We just ran into the same issue. Our config worked fine on all but 3 of our ~30 VMs; those 3 have a huge /var/log/lastlog of 330G.

$ ls -lsah /var/log/lastlog
44K -rw-rw-r-- 1 root utmp 330G Apr 16 14:52 /var/log/lastlog

Interestingly, only our Ubuntu 20.04 VMs are affected; on Ubuntu 22.04 the file has a normal size of around 288K.