grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

pattern-ingester msg="unknown log level while observing stream" level=informational severity=informational (messages from promtail syslog) #14213

Closed: candlerb closed this issue 2 months ago

candlerb commented 2 months ago

(I'm not sure if this is user error, but I'm raising it in bug format anyway just to capture the details.)

Describe the bug

I am trying out Loki 3.2.0 and attempting to use the Explore Logs / pattern ingester functionality.

I am getting a very large volume of these messages logged to loki's stdout, which I wasn't getting before:

Sep 21 20:07:22 loki loki-linux-amd64[5197]: level=warn ts=2024-09-21T20:07:22.76097582Z caller=instance.go:274 component=pattern-ingester msg="unknown log level while observing stream" level=informational stream="{app=\"loki-linux-amd64\", facility=\"daemon\", host=\"loki\", ip=\"127.0.0.1\", service_name=\"loki-linux-amd64\", severity=\"informational\"}"
Sep 21 20:07:22 loki loki-linux-amd64[5197]: level=warn ts=2024-09-21T20:07:22.760986596Z caller=instance.go:274 component=pattern-ingester msg="unknown log level while observing stream" level=informational stream="{app=\"loki-linux-amd64\", facility=\"daemon\", host=\"loki\", ip=\"127.0.0.1\", service_name=\"loki-linux-amd64\", severity=\"informational\"}"
Sep 21 20:07:22 loki loki-linux-amd64[5197]: level=warn ts=2024-09-21T20:07:22.760997349Z caller=instance.go:274 component=pattern-ingester msg="unknown log level while observing stream" level=informational stream="{app=\"loki-linux-amd64\", facility=\"daemon\", host=\"loki\", ip=\"127.0.0.1\", service_name=\"loki-linux-amd64\", severity=\"informational\"}"
Sep 21 20:07:22 loki loki-linux-amd64[5197]: level=warn ts=2024-09-21T20:07:22.761009328Z caller=instance.go:274 component=pattern-ingester msg="unknown log level while observing stream" level=informational stream="{app=\"loki-linux-amd64\", facility=\"daemon\", host=\"loki\", ip=\"127.0.0.1\", service_name=\"loki-linux-amd64\", severity=\"informational\"}"
Sep 21 20:07:22 loki loki-linux-amd64[5197]: level=warn ts=2024-09-21T20:07:22.761020505Z caller=instance.go:274 component=pattern-ingester msg="unknown log level while observing stream" level=informational stream="{app=\"loki-linux-amd64\", facility=\"daemon\", host=\"loki\", ip=\"127.0.0.1\", service_name=\"loki-linux-amd64\", severity=\"informational\"}"

I have promtail (2.9.1) receiving syslog messages and forwarding them to loki, and I believe this is where severity="informational" is coming from - see syslog_message.go
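
For reference, those names are just the RFC 5424 severity keywords spelled out in full. A minimal Go sketch of that mapping (illustrative only, not the actual code in syslog_message.go) is:

package main

import "fmt"

// severityNames follows the RFC 5424 severity table, which is where
// values like "informational" and "warning" come from. Illustrative
// sketch only; not copied from promtail's syslog_message.go.
var severityNames = map[int]string{
	0: "emergency",
	1: "alert",
	2: "critical",
	3: "error",
	4: "warning",
	5: "notice",
	6: "informational",
	7: "debug",
}

func main() {
	// A daemon-facility (3) message at severity 6 has PRI = 3*8 + 6 = 30.
	pri := 30
	fmt.Println(severityNames[pri%8]) // "informational"
}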

However, the fact that these messages have {app="loki-linux-amd64",service_name="loki-linux-amd64",severity="informational"} suggests that either there is a loop (loki is eating its own messages), or loki is generating them with severity="informational" even though the same line says level=warn earlier on. I don't know enough about the provenance of these labels to diagnose further.

It seems to me that:

  1. pattern-ingester should accept the severity levels which promtail generates from parsing syslog priority, including "informational" and "warning". Possibly relates to #12645. (You could argue that promtail should generate "info" rather than "informational", but that would be a backwards-incompatible change);
  2. pattern-ingester could be more forgiving about the severity/level names it receives (a minimal sketch of such a mapping follows this list); and/or
  3. pattern-ingester could rate-limit the warnings it sends about unknown log levels; and/or
  4. pattern-ingester's warning log messages could be clearer, as they currently give little information about the original message which triggered the warning. (However I can search using logcli query '{severity="informational",app!="loki-linux-amd64"}' --since=1h --limit=10)
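
To illustrate point 2 above, the forgiving mapping could be as small as the following Go sketch (hypothetical names only, not code from the Loki repository):

package levels

import "strings"

// normalizeLevel folds syslog-style severity names into the short level
// names the pattern ingester expects. Hypothetical sketch of proposal 2
// above, not the actual Loki implementation.
func normalizeLevel(lvl string) string {
	switch strings.ToLower(lvl) {
	case "informational", "info", "notice":
		return "info"
	case "warning", "warn":
		return "warn"
	case "error", "err":
		return "error"
	case "critical", "crit", "alert", "emergency", "fatal":
		return "critical"
	case "debug":
		return "debug"
	default:
		return "unknown"
	}
}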

I note that the Grafana dashboard Explore Logs plugin (which is the whole point of this exercise) appears to be happy with "informational" as a level:

[Screenshot: the Explore Logs plugin displaying "informational" as a log level]

To Reproduce

/etc/loki/loki.yaml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

ingester_rf1:
  enabled: false

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2023-09-15
      store: tsdb
      object_store: filesystem
      schema: v12
      index:
        period: 24h
        prefix: index_
    - from: 2024-09-21
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    enabled: true
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://prometheus:9093/alertmanager

frontend:
  encoding: protobuf

compactor:
  retention_enabled: true
  delete_request_store: filesystem

limits_config:
  retention_period: 91d
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  # https://notes.nicolevanderhoeven.com/How+to+enable+Explore+Logs+for+Loki
  allow_structured_metadata: true
  volume_enabled: true

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
analytics:
  reporting_enabled: false

/etc/systemd/system/loki.service

[Unit]
Description=Loki logging server
Documentation=https://grafana.com/oss/loki/
After=network-online.target

[Service]
User=syslog
Group=syslog
Restart=on-failure
RestartSec=60
WorkingDirectory=/var/lib/loki
EnvironmentFile=/etc/default/loki
ExecStart=/opt/loki/loki-linux-amd64 $OPTIONS

[Install]
WantedBy=multi-user.target

/etc/default/loki

OPTIONS='-config.file=/etc/loki/loki.yaml -log.level=warn'

/etc/loki/promtail.yaml

server:
  http_listen_address: 127.0.0.1
  http_listen_port: 9080
  grpc_listen_port: 0

# How to find the loki server(s)
clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push

# Logs to read and forward
scrape_configs:
  - job_name: syslog
    syslog:
      listen_address: 127.0.0.1:5140
    relabel_configs:
      - source_labels: [__syslog_message_severity]
        target_label: severity
      - source_labels: [__syslog_message_facility]
        target_label: facility
      - source_labels: [__syslog_message_hostname]
        regex: '([^/]*)'
        target_label: host
      - source_labels: [__syslog_message_hostname]
        regex: '(.*)/(.*)'
        replacement: '$1'
        target_label: host
      - source_labels: [__syslog_message_hostname]
        regex: '(.*)/(.*)'
        replacement: '$2'
        target_label: ip
      - source_labels: [__syslog_message_app_name]
        target_label: app

/etc/systemd/system/promtail.service

[Unit]
Description=Promtail
Documentation=https://github.com/grafana/loki/tree/master/docs/clients/promtail
After=network-online.target
# Fix shutdown delays: if loki is running on the same host,
# Loki should start first, and shutdown after.
After=loki.service

[Service]
User=syslog
Group=syslog
Restart=on-failure
RestartSec=5
WorkingDirectory=/var/log
EnvironmentFile=/etc/default/promtail
ExecStart=/opt/loki/promtail-linux-amd64 $OPTIONS

[Install]
WantedBy=multi-user.target

/etc/default/promtail

OPTIONS='-config.file=/etc/loki/promtail.yaml'

/etc/rsyslog.d/25-promtail.conf

module(load="mmutf8fix")
action(type="mmutf8fix" replacementChar="?")

template(name="Custom_SyslogProtocol23Format" type="string"
     string="<%PRI%>1 %TIMESTAMP:::date-rfc3339% %HOSTNAME%/%FROMHOST-IP% %APP-NAME% %PROCID% %MSGID% %STRUCTURED-DATA% %msg%\n")

*.*     action(type="omfwd" protocol="tcp"
               target="127.0.0.1" port="5140"
               Template="Custom_SyslogProtocol23Format"
               TCP_Framing="octet-counted")

Expected behavior

Messages to be processed successfully.

Environment:

# /opt/loki/loki-linux-amd64 --version
loki, version 3.2.0 (branch: k218, revision: 659f5421)
  build user:       root@003ce357cdf4
  build date:       2024-09-18T16:21:52Z
  go version:       go1.22.6
  platform:         linux/amd64
  tags:             netgo
candlerb commented 2 months ago

I found this in the source code at pkg/pattern/instance.go:

if _, ok := streamMetrics[lvl]; !ok {
	level.Warn(i.logger).Log(
		"msg", "unknown log level while observing stream",
		"level", lvl,
		"stream", stream,
	)

	lvl = constants.LogLevelUnknown
}

Clearly, large numbers of these messages could be generated. Simply logging them at debug level might be one solution. Keeping a map of unknown levels and sending a warning only once for each one seen (with an overall size limit on the map) might be another.
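
A once-per-level guard with a bounded map could look something like this (a hypothetical sketch of the suggestion above, not code from the Loki repository):

package pattern

import "sync"

// unknownLevelWarner remembers which unknown levels have already been
// warned about, with a cap on how many distinct levels it will track.
// Hypothetical sketch; names and types are not from the Loki codebase.
type unknownLevelWarner struct {
	mu    sync.Mutex
	seen  map[string]struct{}
	limit int
}

func newUnknownLevelWarner(limit int) *unknownLevelWarner {
	return &unknownLevelWarner{seen: make(map[string]struct{}), limit: limit}
}

// shouldWarn reports whether this level has not been warned about yet
// and there is still room to remember it.
func (w *unknownLevelWarner) shouldWarn(lvl string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, ok := w.seen[lvl]; ok {
		return false // already warned once for this level
	}
	if len(w.seen) >= w.limit {
		return false // map is full: stay silent rather than grow unbounded
	}
	w.seen[lvl] = struct{}{}
	return true
}

The existing level.Warn call in instance.go could then be wrapped in an if warner.shouldWarn(lvl) check.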

candlerb commented 2 months ago

According to Grafana, the volume of logs generated per minute keeps growing, so I will need to turn this pattern ingester off. (EDIT: it's happy after doing that.)

[Screenshot: Grafana graph showing log volume per minute increasing]

svennergr commented 2 months ago

@trevorwhitney is there a reason why we would only support these configured log levels? https://github.com/grafana/loki/blob/913e9f93477b5b811fbcf44d0e750f600c9ded69/pkg/util/constants/levels.go

trevorwhitney commented 2 months ago

The reason for hardcoding the levels we support was for performance.

We read the level from structured metadata to avoid re-parsing / re-detecting the log level. My mistake in reading this code was assuming we'd only be populating structured metadata with one of these constant levels; in the case where we get it from a label, it looks like it could be anything.

The performance gain was just to pre-allocate the map, so I can change that.
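
In other words, the per-level map can still be pre-allocated for the constant levels while quietly absorbing anything else, roughly along these lines (a sketch with assumed names and types, not the actual change):

package pattern

// newStreamMetrics pre-allocates counters for the constant levels (the
// original performance goal). The names and value type here are
// assumptions for illustration, not the real Loki types.
func newStreamMetrics() map[string]int64 {
	known := []string{"debug", "info", "warn", "error", "unknown"}
	m := make(map[string]int64, len(known))
	for _, lvl := range known {
		m[lvl] = 0
	}
	return m
}

// observe folds any level that arrived via a label (e.g. "informational")
// into "unknown" instead of warning about it.
func observe(streamMetrics map[string]int64, lvl string) {
	if _, ok := streamMetrics[lvl]; !ok {
		lvl = "unknown"
	}
	streamMetrics[lvl]++
}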

trevorwhitney commented 2 months ago

This should fix it: https://github.com/grafana/loki/pull/14255

candlerb commented 2 months ago

Thank you.

On my side, I realised I should also ensure that there's no chance loki can ever ingest its own logs. I'm thinking of running loki and promtail in separate containers, so that rsyslog in the promtail container never receives any journald or /dev/log messages from loki.

carlopalacio commented 2 weeks ago

I've been keeping track of this bug. Has this already been merged into the Loki 3.2.1 public Docker image?

StefanSa commented 2 weeks ago

@carlopalacio Apparently not; I have the same problems with the Docker image v3.2.1.

trevorwhitney commented 2 weeks ago

It will definitely be in 3.3: https://github.com/grafana/loki/pull/14750, which is planned to go out tomorrow.