index-gateway: panic - invalid memory address or nil pointer dereference

beatkind commented 1 month ago

Describe the bug

We run Loki on GKE in GCP, deployed via the community Helm chart in distributed mode. We are seeing odd behaviour: queries covering up to the last 24 hours work as expected, but as soon as I query a specific day in the past (23-09-2024), everything stops working. After that, Loki no longer responds to any query.

To Reproduce

Steps to reproduce the behavior:

Start Loki (2.9.8)
Query a time range covering the specific day

Expected behavior

Data is returned and Loki keeps working.

Environment:

Infrastructure: GKE, Google Cloud
Deployment tool: helm

Screenshots, Promtail config, or terminal output

    auth_enabled: true

    server:
      log_format: json
      log_level: info
      grpc_server_max_recv_msg_size: 104857600 # Default 4MiB, changed to 100MiB to be in line with rest of internal comms: https://grafana.com/docs/loki/latest/configure/#grpc_client
      grpc_server_max_send_msg_size: 104857600 # Also this is implicitly the max amount of data a loki query can return to grafana.

    memberlist:
      cluster_label: "loki-cluster"

    storage_config:
      gcs:
        bucket_name: ${var.loki_chunks_storage}
        chunk_buffer_size: 0
        request_timeout: "0s"
        enable_http2: true
      boltdb_shipper:
        shared_store: gcs
        cache_ttl: 24h
      index_queries_cache_config:
        memcached:
          batch_size: 100
          parallelism: 100
        memcached_client:
          host: index-cache-memcached.loki.svc.cluster.local
          service: memcache
          consistent_hash: true

    chunk_store_config:
      chunk_cache_config:
        memcached:
          batch_size: 256
          parallelism: 10
        memcached_client:
          host: chunk-cache-memcached.loki.svc.cluster.local
          service: memcache

    common:
      replication_factor: 3
      storage:
        gcs:
          bucket_name: ${var.loki_chunks_storage}

    index_gateway:
      mode: ring
      ring:
        kvstore: 
          store: memberlist

    query_scheduler:
      max_outstanding_requests_per_tenant: 10000

    frontend:
      max_outstanding_per_tenant: 1024
      log_queries_longer_than: 5s

    querier:
      query_ingesters_within: 30m

    query_range:
      parallelise_shardable_queries: true
      cache_results: true
      results_cache:
        cache:
          memcached_client:
            consistent_hash: true
            host: results-cache-memcached.loki.svc.cluster.local
            service: memcache
            max_idle_conns: 16
            timeout: 500ms
            update_interval: 1m

    ingester:
      lifecycler:
        ring:
          replication_factor: 3
      chunk_idle_period: 10m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 30s
      max_transfer_retries: 0
      max_chunk_age: 30m

    ruler:
      storage:
        type: gcs
        gcs:
          bucket_name: ${var.loki_ruler_storage}

    compactor:
      shared_store: gcs
      working_directory: "/var/loki/compactor"
      retention_enabled: true
      delete_request_store: gcs

    schema_config:
      configs:
        - from: "2022-01-11"
          index:
            period: 24h
            prefix: loki_index_
          object_store: gcs
          schema: v12
          store: boltdb-shipper
        - from: "2023-02-20"
          index:
            period: 24h
            prefix: loki_dis_index_
          object_store: gcs
          schema: v12
          store: boltdb-shipper

    limits_config:
      retention_period: 2160h
      max_streams_per_user: 100000000
      max_global_streams_per_user: 2000000
      split_queries_by_interval: 15m
      max_query_parallelism: 32
      ingestion_rate_mb: 10
      per_stream_rate_limit: 5MB
      per_stream_rate_limit_burst: 20MB
      max_query_series: 1000
      allow_structured_metadata: false

index-gateway-2.log index-gateway-3.log index-gateway-4.log

chaudum commented 1 month ago

Hi @beatkind, we strongly advise upgrading your Loki installation to 3.2 and using TSDB as the index type. boltdb-shipper is deprecated.

It seems a cached boltdb file is the offender here:

"msg":"failed to open existing index file /var/loki/cache/loki_dis_index_19986/compactor-1726985323, removing the file and continuing without it to let the sync operation catch up"

You can try to empty the cache directory to see whether the problem persists.
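
To see which day the offending file belongs to, the path in the log line can be decoded. The following is a minimal Python sketch, not part of Loki. It assumes that with `period: 24h` the numeric suffix of the table name is the number of whole days since the Unix epoch, and that the number after `compactor-` is a Unix timestamp in seconds. The helper name `decode_index_path` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: with `period: 24h`, the table-name suffix counts whole
# 24h periods since the Unix epoch.
PERIOD = timedelta(hours=24)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_index_path(path: str, prefix: str = "loki_dis_index_"):
    """Return (table_start, table_end, file_time) for a cached index file path."""
    match = re.search(re.escape(prefix) + r"(\d+)/[^/]*?-(\d+)$", path)
    if match is None:
        raise ValueError(f"unrecognised index path: {path}")
    table_number = int(match.group(1))
    file_epoch = int(match.group(2))  # assumed to be seconds since the epoch
    table_start = EPOCH + table_number * PERIOD
    file_time = datetime.fromtimestamp(file_epoch, tz=timezone.utc)
    return table_start, table_start + PERIOD, file_time


start, end, written = decode_index_path(
    "/var/loki/cache/loki_dis_index_19986/compactor-1726985323"
)
print(start.date(), end.date(), written.isoformat())
```

Under those assumptions, table 19986 covers 2024-09-20 (UTC) and the file name carries the timestamp 2024-09-22T06:08:43+00:00.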

beatkind commented 2 weeks ago

Hi @chaudum, thanks for the reply. Deleting the whole PVC (which is usually easier) brings a short-term improvement, until it happens again...
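
A narrower alternative to deleting the whole PVC is to empty only the cache directory, as suggested earlier in the thread. The following is a rough Python sketch of that step. The path `/var/loki/cache` is taken from the log line above; the helper name `empty_directory` is hypothetical, and it is an assumption that this should only be run while the index-gateway is stopped. It defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def empty_directory(cache_dir: Path, dry_run: bool = True) -> list[Path]:
    """Remove every entry inside cache_dir, keeping the directory itself.

    Returns the entries that were (or, in a dry run, would be) removed.
    """
    removed = []
    for entry in sorted(cache_dir.iterdir()):
        if not dry_run:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # per-table subdirectory
            else:
                entry.unlink()  # plain file or symlink
        removed.append(entry)
    return removed


if __name__ == "__main__":
    cache = Path("/var/loki/cache")  # path from the log message
    if cache.is_dir():
        for path in empty_directory(cache, dry_run=True):
            print("would remove", path)
```

Passing `dry_run=False` performs the deletion. Since the index files live in object storage, the gateway should be able to download them again on the next sync, but that is not verified here.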

grafana / loki

index-gateway: panic - invalid memory address or nil pointer dereference #14295