grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki crashes with 'cannot allocate memory' #8686

Open suikast42 opened 1 year ago

suikast42 commented 1 year ago

Loki version 2.7.8

If I try to start Loki, it crashes every time with the log below:

level=warn ts=2023-03-02T11:01:17.896646246Z caller=loki.go:251 msg="per-tenant timeout not configured, using default engine timeout (\"5m0s\"). This behavior will change in the next major to always use the default per-tenant timeout (\"5m\")."
level=warn ts=2023-03-02T11:01:17.899715857Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - store.index-cache-read.embedded-cache"
level=warn ts=2023-03-02T11:01:17.899805942Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - store.index-cache-write.embedded-cache"
level=warn ts=2023-03-02T11:01:17.899880446Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - chunksembedded-cache"
cannot allocate memory
error initialising module: compactor
github.com/grafana/dskit/modules.(*Manager).initModule
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92
github.com/grafana/loki/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:422
main.main
    /src/loki/cmd/loki/main.go:105
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598

With that config:

auth_enabled: false

server:
  #default 3100
  http_listen_port: 3100
  #default 9005
  #grpc_listen_port: 9005
  # Max gRPC message size that can be received
  # CLI flag: -server.grpc-max-recv-msg-size-bytes
  #default 4194304 -> 4MB
  grpc_server_max_recv_msg_size: 419430400

  # Max gRPC message size that can be sent
  # CLI flag: -server.grpc-max-send-msg-size-bytes
  #default 4194304 -> 4MB
  grpc_server_max_send_msg_size:  419430400

  # Limit on the number of concurrent streams for gRPC calls (0 = unlimited)
  # CLI flag: -server.grpc-max-concurrent-streams
  grpc_server_max_concurrent_streams:  100

  # Log only messages with the given severity or above. Supported values [debug,
  # info, warn, error]
  # CLI flag: -log.level
  log_level: "warn"

ingester:
  wal:
    enabled: true
    dir: /data/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  chunk_encoding: snappy

ruler:
  storage:
    type: local
    local:
      directory: /data/rules
  rule_path: /data/scratch
  alertmanager_url: http://mimir.service.consul:9009/alertmanager

  ring:
    kvstore:
      store: memberlist
  enable_api: true

compactor:
  working_directory: /data/retention
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/index-cache
    shared_store: filesystem
  filesystem:
    directory: /data/chunks
  index_queries_cache_config:
    enable_fifocache: false
    embedded_cache:
      enabled: true

querier:
  multi_tenant_queries_enabled: false
  max_concurrent: 4096
  query_store_only: false

query_scheduler:
  max_outstanding_requests_per_tenant: 10000

query_range:
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: false
      embedded_cache:
        enabled: true

chunk_store_config:
  chunk_cache_config:
    enable_fifocache: false
    embedded_cache:
      enabled: true
  write_dedupe_cache_config:
    enable_fifocache: false
    embedded_cache:
      enabled: true

distributor:
  ring:
    kvstore:
      store: memberlist

limits_config:
  ingestion_rate_mb: 64
  ingestion_burst_size_mb: 8
  max_label_name_length: 4096
  max_label_value_length: 8092
  enforce_metric_name: false
  # Reject log lines older than reject_old_samples_max_age instead of indexing them
  reject_old_samples: true
  # 7d
  reject_old_samples_max_age: 168h
  # The limit to length of chunk store queries. 0 to disable.
  max_query_length: 0
  # Maximum number of log entries that will be returned for a query.
  max_entries_limit_per_query: 20000
  # Limit the maximum of unique series that is returned by a metric query.
  max_query_series: 100000
  # Maximum number of queries that will be scheduled in parallel by the frontend.
  max_query_parallelism: 64
  split_queries_by_interval: 24h
  # Alter the log line timestamp during ingestion when the timestamp is the same as the
  # previous entry for the same stream. When enabled, if a log line in a push request has
  # the same timestamp as the previous line for the same stream, one nanosecond is added
  # to the log line. This will preserve the received order of log lines with the exact
  # same timestamp when they are queried, by slightly altering their stored timestamp.
  # NOTE: This is imperfect, because Loki accepts out of order writes, and another push
  # request for the same stream could contain duplicate timestamps to existing
  # entries and they will not be incremented.
  # CLI flag: -validation.increment-duplicate-timestamps
  increment_duplicate_timestamp: true
  #Log data retention for all
  retention_period: 24h
  # Comment this out for fine grained retention
#  retention_stream:
#  - selector: '{namespace="dev"}'
#    priority: 1
#    period: 24h
  # Comment this out for having overrides
#  per_tenant_override_config: /etc/overrides.yaml

suikast42 commented 1 year ago

I think I figured out what is happening, but I don't know how to solve it.

My local homelab cluster was down for a week.

At startup the Loki compactor tries to delete the old indexes and can't, because the memory allocation fails. If I delete the persisted data, Loki comes up as expected.
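
The only compactor setting in my config that looks related is the delete worker count. As a guess (not verified), a variant with less parallel delete work would look like this:

compactor:
  working_directory: /data/retention
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  # guess: far fewer parallel retention delete workers than the 150 above,
  # so cleaning up a week of old data allocates less memory at once
  retention_delete_worker_count: 10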

The same happens with Mimir as well.

Any suggestions?

sourcehawk commented 2 weeks ago

I ran into similar issues with the chunks cache.

The problem is probably that at some point someone decided to set the default allocated memory of the chunks cache to a baffling 8GB of RAM. Your machine probably doesn't have that much to spare.

In the loki helm chart there is a configuration option:

chunksCache:
  allocatedMemory: 8192

Maybe you can try changing that to something more reasonable for your cluster size, like 2GB.
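
For example, in your Helm values (the value is in MB, so 2048 means 2GB; pick whatever your nodes can actually spare):

chunksCache:
  # override the 8GB default with something the node can hold
  allocatedMemory: 2048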