grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

bug: OOM when filtering lines after pattern #13427

Open Kamilcuk opened 3 months ago

Kamilcuk commented 3 months ago

Describe the bug

I have log lines generated by the fabio proxy.

The following query runs fine, completes in a few milliseconds, and returns 2139 lines:

# RUNS OK
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`

However, the following query makes Loki allocate over 60 GB (!) of memory and fail with an out-of-memory (OOM) error:

# RESULTS IN OUT OF MEMORY
{job=~"fabio.*"}
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`

Today Loki was restarted several times because of this query while I was debugging the issue. Nothing else is happening on our Loki instance. The initial Loki memory usage after startup is around 500 MB; memory rose sharply every time right after the query was sent.

Placing |= `jenkins` before the pattern stage runs completely fine and finishes in milliseconds.

# RUNS OK
{job=~"fabio.*"}
|= `" "`
|= `jenkins`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`

Removing the initial |= `" "` filter also works.

# RUNS OK
{job=~"fabio.*"}
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`
|= `jenkins`

Changing the order also works:

# RUNS OK
{job=~"fabio.*"}
|= `jenkins`
|= `" "`
| pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>`

Shortening the pattern still results in OOM:

# also causes OOM
{job=~"fabio.*"}
|= `" "`
| pattern `<remote>`
|= `jenkins`
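
For context, the pattern above expects fabio HTTP access-log lines shaped roughly like the line below. This is a made-up illustration, not a real entry from the environment; the field values, including the "jenkins" service name and the trailing quoted fields (which is where the `" "` substring the first filter matches would come from), are invented:

```
10.1.2.3:51234 [2024/07/03 10:15:00 +0000] jenkins.service.consul 12.345ms 200 4096 http://10.1.2.4:8080 "GET /job/example/build HTTP/1.1" "-" "curl/8.0"
```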

To Reproduce

  1. Run the failing query in our environment (a stand-in replay via the HTTP API is sketched below).
  2. Observe the Loki process OOM.
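
Since the steps above assume access to the reporter's environment, here is a hedged sketch of replaying the failing query directly against Loki's `query_range` HTTP endpoint; the host/port, time window, and limit are placeholders, not values from the report:

```sh
# hypothetical replay of the failing query via Loki's range-query API
# (address and time range are placeholders for the reporter's environment)
curl -G -s "http://127.0.0.1:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job=~"fabio.*"} |= `" "` | pattern `<remote> [<time>] <service> <response_time> <status> <body_size> <scheme>://<dest> "<method> <path>` |= `jenkins`' \
  --data-urlencode 'start=2024-07-01T00:00:00Z' \
  --data-urlencode 'end=2024-07-02T00:00:00Z' \
  --data-urlencode 'limit=1000'
```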
Loki config:

```yaml
# loki.yml
# https://community.grafana.com/t/how-to-improve-loki-performance/42800/2
# https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
# https://grafana.com/docs/loki/latest/configuration/query-frontend/
---
auth_enabled: false

server:
  http_listen_port: ${NOMAD_PORT_http:-3100}
  grpc_listen_port: ${NOMAD_PORT_grpc:-9095}
  log_level: warn
  # https://github.com/grafana/loki/issues/2271
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  # https://community.grafana.com/t/loki-error-on-port-9095-error-contacting-scheduler/67263/2
  instance_interface_names:
    - "lo"
  ring:
    instance_addr: 127.0.0.1
    instance_interface_names:
      - "lo"
    kvstore:
      # store: inmemory
      store: consul
      prefix: loki-collectors/
      consul:
        host: ...:8500
        #host: 127.0.0.1:8500

# https://grafana.com/docs/loki/latest/operations/storage/retention/
# https://github.com/grafana/loki/issues/6300#issuecomment-1431887039
compactor:
  working_directory: /loki/retention
  delete_request_store: filesystem
  retention_enabled: true

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2023-08-22
      store: tsdb
      object_store: filesystem
      schema: v12
      index:
        prefix: index_
        period: 24h
    # https://grafana.com/docs/loki/latest/storage/
    # https://grafana.com/docs/loki/latest/operations/storage/tsdb/
    # https://grafana.com/docs/loki/latest/setup/upgrade/#structured-metadata-open-telemetry-schemas-and-indexes
    - from: 2024-05-07
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache

query_range:
  parallelise_shardable_queries: false
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 5min intervals
  cache_results: true
  results_cache:
    cache:
      # We're going to use the in-process "FIFO" cache
      #enable_fifocache: true
      #fifocache:
      #  max_size_bytes: 2GB
      #  validity: 24h
      # https://github.com/grafana/loki/issues/7677
      embedded_cache:
        enabled: true
        max_size_mb: 2048
        ttl: 24h

querier:
  max_concurrent: 2048

frontend:
  max_outstanding_per_tenant: 32768
  #log_queries_longer_than: 15s
  #address: 0.0.0.0
  #compress_responses: true

limits_config:
  # The time to keep the logs.
  retention_period: 60d
  # allow backfilling
  reject_old_samples_max_age: 60d
  #split_queries_by_interval: 5m
  # Fix maximum active stream limit exceeded. https://github.com/grafana/loki/issues/3335
  max_global_streams_per_user: 100000
  #max_cache_freshness_per_query: 10m
  #max_query_parallelism: 120
  ## for big logs tune
  # We typically recommend setting per_stream_rate_limit no higher than 5MB,
  # and per_stream_rate_limit_burst no higher than 20MB.
  per_stream_rate_limit: 5M
  per_stream_rate_limit_burst: 20M
  # https://github.com/grafana/loki/issues/4204#issuecomment-1260868356
  #cardinality_limit: 200000
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 10000
  #max_entries_limit_per_query: 1000000
  #max_label_value_length: 20480
  #max_label_name_length: 10240
  #max_label_names_per_series: 300

analytics:
  reporting_enabled: false
```

Loki job Nomad config:

```hcl
job "loki" {
  ...
  group "loki" {
    count = 1

    restart {
      interval = "15s"
      attempts = 1
      mode     = "delay"
    }

    update {
      auto_revert = true
    }

    network {
      port "http" {
        static = 3100
      }
      port "grpc" {
        static = 9095
      }
    }

    task "loki" {
      driver = "docker"

      template {
        destination = "local/loki.yml"
        data        = file("loki.yml")
      }

      user = "3044335:10269"

      config {
        image        = "grafana/loki:3.0.0"
        ports        = ["http", "grpc"]
        network_mode = "host"
        args = [
          "-config.file=/local/loki.yml",
          "-config.expand-env=true",
          "-print-config-stderr",
        ]

        mount {
          type     = "bind"
          source   = ".../loki"
          target   = "/loki"
          readonly = false
        }
      }

      kill_timeout = "1m"

      resources {
        memory = 60000
      }

      service {
        name = "loki"
        tags = [
          "urlprefix-...",
          "metrics",
          "metricspath-/metrics",
        ]
        port = "http"

        check {
          type           = "http"
          path           = "/ready"
          interval       = "2s"
          timeout        = "2s"
          initial_status = "passing"
        }
      }
    }
  }
}
```

Expected behavior

I do not understand why adding one more |= line filter over 2139 lines causes Loki to allocate so much memory.
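
While the root cause is investigated, query-cost guardrails can be set in `limits_config`. This is only a sketch with illustrative values (only `max_entries_limit_per_query` already appears, commented out, in the config above; the other keys are standard Loki limits), and it may not stop the specific allocation spike described here, since the memory seems to be consumed while the query is being evaluated:

```yaml
# hypothetical guardrails; values are illustrative, not a fix for this bug
limits_config:
  query_timeout: 2m                    # cancel runaway queries instead of letting them run
  max_chunks_per_query: 2000000        # bound how many chunks a single query may touch
  max_entries_limit_per_query: 100000  # cap the number of log lines returned per query
```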

Environment:

Screenshots, Promtail config, or terminal output

Loki job memory usage as monitored by Nomad + Prometheus:

[screenshot]

Grafana view of the query. Adding |= 'jenkins' would result in OOM.

[screenshot]

Nomad job view of the loki job:

[screenshot]

Thanks!

ASatanicPickle commented 2 months ago

Hi, I think this is related: #13277