grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki query-frontend pod crashes when running queries with large intervals or queries with huge results. #14749

Open ahsifer opened 3 weeks ago

ahsifer commented 3 weeks ago

Hi, I have a Kubernetes cluster with 1 master and 9 workers; each node has 4 CPU cores and 4GB RAM. When the query frontend executes a query with a very large interval or with a large limit such as 400000 logs, the loki-query-frontend pod consumes a huge amount of memory and crashes when the memory limit is reached. Are there any suggestions to overcome this issue? I think it is related to the behavior of the query frontend, which merges the responses returned from the split queries in memory (please correct me if I am wrong). I have read about the following parameter and think it might be useful; I am considering setting it based on the memory available to the pod.

-frontend.max-query-bytes-read value
        Max number of bytes a query can fetch. Enforced in log and metric queries only when TSDB is used. The default value of 0 disables this limit.

In the Helm chart:

limits_config:
  max_query_bytes_read: xxxB
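
For example, a minimal sketch of what I have in mind (the 2GB value is only a placeholder derived from the memory available to the query-frontend pod, not a tested recommendation):

limits_config:
  # placeholder: roughly half the memory available to the query-frontend pod
  max_query_bytes_read: 2GB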

My current Helm chart deployment:

# Set up the DNS service used in the cluster (CoreDNS in our case)
  global:
    dnsService: "kube-dns"
    dnsNamespace: "kube-system"

  # Distributed mode is preferred for more than 1TB logs per day
  deploymentMode: Distributed

# Begin Loki Configuration of the system
  loki:
    # 100MB memory ballast used to reduce the CPU spent by the Go garbage collector during its cycles
    ballast_bytes: 104857600
    # disable multi-tenancy because it is not required
    auth_enabled: false
    # The schema config used to define how data is stored in Loki storage
    schemaConfig:
      configs:
        # From specifies the date from which the following schema configuration will apply
        # It tells Loki to use the TSDB indexing format for logs written from this date onward
        # This configuration also helps maintain backward compatibility with older storage formats
        - from: "2024-10-25"
          # Specifies which indexing format/schema will be used (tsdb is the same format used in Prometheus)
          store: tsdb
          # Indicates the object storage backend where chunks of log data will be stored
          object_store: s3
          # Selects the version of the schema being used
          schema: v13
          index:
            # Index files will be prefixed with loki_index_ when stored in the S3 storage
            prefix: loki_index_
            # Specifies the time period for creating new index tables and storing them in the S3 storage (every 24 hours)
            # Never change this value; the compactor only works when the period is 24h
            period: 24h
    # The following configuration will be used to send all TSDB index operations to the index gateway
    storage_config:
      tsdb_shipper:
        # The location that the ingesters will use to write the tsdb index files before flushing them to the storage
        active_index_directory: /var/loki/tsdb-shipper-active
        # The location where the index files are cached by the index gateway. This is a very critical component of the cluster and must be configured properly.
        cache_location: /var/loki/tsdb-shipper-cache        
        # Both the previous parameters need to be configured with persistent storage
        # how long the index files will be cached in the index gateway 
        cache_ttl: 48h0m0s
        # time interval at which cached index files are re-synced. The compactor periodically merges index files and chunks, so the updated index files need to be pulled again
        resync_interval: 10m
        # index gateways client configuration
        index_gateway_client:
          # the service name of the index gateway will be taken from the helm deployment
          server_address: '{{ include "loki.indexGatewayAddress" . }}'
      # HTTP hedging sends multiple requests to the storage for read operations. If any of them succeeds, the operation is considered successful and the remaining requests are canceled. Hedging works around requests that respond slowly or are dropped because of network issues or high-latency routes by sending additional requests after a given delay.
      hedging:
        # if the first request has not succeeded after 300ms, start sending additional requests
        at: "300ms"
        # send at most two additional requests when no response has been returned after 300ms
        up_to: 2
        # cap the total rate of hedged requests sent to the storage so it is not overwhelmed
        max_per_second: 15
    server:
      http_listen_port: 3100
      grpc_listen_port: 9095
      # sets the maximum duration for reading the entire request from the client.
      http_server_read_timeout: 5m
      # defines the maximum duration for writing a response to the client. If the server takes too long to send the response back, the connection will be closed.
      http_server_write_timeout: 5m

      # allow unlimited TCP connections from clients (0 = no limit)
      http_listen_conn_limit: 0
      # expose instrumentation (metrics) endpoints
      register_instrumentation: true

      # close idle connections after this timeout
      http_server_idle_timeout: 5m

      # Loki's own logs: debug level, JSON format, with source IPs included
      log_level: debug
      log_format: "json"
      log_source_ips_enabled: true
      # Maximum gRPC message sizes between cluster services and the maximum number of concurrent streams.
      grpc_server_max_recv_msg_size: 104857600 # 100MB
      grpc_server_max_send_msg_size: 104857600 #100MB
      grpc_server_max_concurrent_streams: 256

    # define cluster limits
    limits_config:
      # disable automatic service name discovery
      discover_service_name: []
      # reject outdated logs
      reject_old_samples: true
      # only reject logs that are older than this limit (1 day here) relative to the newest log timestamp received for that unique stream
      # for example, with a 1h limit, if the last log received for a stream arrived at 10 PM, any new log line with a timestamp before 9 PM is dropped
      # always set this parameter to half of max_chunk_age
      # reject_old_samples_max_age: 1h
      reject_old_samples_max_age: 1d
      # query results are cached only when the newest log entry in the result is older than 5 minutes
      max_cache_freshness_per_query: 5m
      # each query is split into smaller sub-queries. For instance, a query covering a 1 hour interval is split into 2 sub-queries of 30 minutes each
      split_queries_by_interval: 30m
      # maximum number of split sub-queries that can be scheduled in parallel for a single query when TSDB is used (with a 30m split interval, 256 sub-queries cover up to 128 hours of a query's range at once)
      tsdb_max_query_parallelism: 256
      # how long to wait before a query is considered dead and timed out
      query_timeout: 5m
      # enable volume tracking (size of the logs in the storage)
      volume_enabled: true
      # Maximum number of log lines returned per query. This limit is reported to the user when they set a higher limit in Grafana
      max_entries_limit_per_query: 300000
      # How long logs will stay in the storage before deleting them (180 days = 4320 hours)
      retention_period: 4320h0m0s
      # Limit the query lookback to 180 days so queries cannot search for logs older than that
      max_query_lookback: 4320h0m0s
      # if you want to set different retention per stream
      # retention_stream:
      # - selector: '{host="haproxy1"}'
      #   priority: 1
      #   period: 24h

      # total number of active log streams a user may have across all ingester instances (0 means unlimited)
      max_global_streams_per_user: 0
      # total number of active log streams a user may have on a single ingester instance (0 means unlimited)
      max_streams_per_user: 0
      # per second rate limit per unique stream
      per_stream_rate_limit: 16MB
      # per-stream burst limit that accommodates pushes with a huge number of logs (e.g. bursts from Promtail)
      per_stream_rate_limit_burst: 64MB
      # do not execute queries that match more than 300 log streams (the client needs to rewrite the query to reduce the number of matched unique streams)
      max_streams_matchers_per_query: 300
      # shard queries based on the expected bytes per shard
      tsdb_max_bytes_per_shard: 600MB
      # the maximum allowed ingestion burst size in MB (default 6)
      # useful when a client sends a very large number of log lines at once in a single request
      ingestion_burst_size_mb: 100
      # the maximum ingestion rate in MB/s allowed to the ingesters per user (default 4)
      # if Promtail pulls data from Apache Kafka, the ingesters see only a single user sending data
      ingestion_rate_mb: 100000
      ingestion_rate_strategy: local
      # maximum allowed size of a single received log line
      max_line_size: 256KB
      # if a received log line is larger than 256KB, truncate it instead of dropping it (disabled here)
      max_line_size_truncate: false

    commonConfig:
      # No need for ingester replication
      replication_factor: 1

    ingester:
      # enable snappy encoding for data in the chunks (preferred)
      chunk_encoding: snappy
      # How long chunks should be retained in-memory after they've been flushed.
      chunk_retain_period: 10s
      # How long chunks should sit in-memory with no updates (no log messages received for them) before being flushed if they don't hit the max block size. 
      chunk_idle_period: 1h
      # the maximum age a chunk may reach in memory before it is flushed
      max_chunk_age: 2h0m0s

      wal:
        # enable write ahead log (WAL)
        enabled: true
        # interval at which checkpoints of the WAL are created.
        # Loki writes incoming data to the WAL directory as small segment files of about 32KB. When the checkpoint duration is reached, these files are merged into a single checkpoint file, together with any checkpoint left over from a previous cycle. At the same time, Loki deletes the WAL entries for chunks that have already been flushed to the S3 storage.
        checkpoint_duration: 5m0s
        # the location where write ahead log (WAL) will be written. This needs to be persistent across restarts
        dir: /var/loki/wal
        # flush in-memory chunks to storage when the ingester shuts down
        flush_on_shutdown: true
        # after an incident, the amount of memory the ingester may use to replay the WAL files. The preferred value is 75% of the available memory
        replay_memory_ceiling: 3GB

    querier:
        # number of concurrent queries each querier pod executes
        # this parameter should be set to the number of CPU cores reserved for the pod
        max_concurrent: 16

    compactor:
      # Enable retention to compact both index and chunks.
      retention_enabled: true
      # Temporary working directory for the compactor. It holds the index and chunk files being compacted and the compacted output before it is uploaded to the storage
      working_directory: /tmp/loki/compactor
      # Object store used to persist delete requests (for retention and log deletion).
      delete_request_store: s3
      # Frequency of applying compaction and retention.
      compaction_interval: 10m
      # Delay before deleting marked-for-deletion chunks.
      retention_delete_delay: 1h
      # Number of workers to handle chunk deletions.
      retention_delete_worker_count: 32

    tracing:
      enabled: true

    storage:
      type: s3
      bucketNames:
        chunks: "loki-test"
        # The following chooses the bucket that the ruler will store its data in.
        # ruler: "loki-test"
      s3:
        # s3 URL can be used to specify the endpoint, access key, secret key, and bucket name
        endpoint: s3.net
        # AWS secret access key
        secretAccessKey: ****
        # AWS access key ID
        accessKeyId: ****
        # AWS signature version (e.g., v2 or v4)
        signatureVersion: v4
        # Forces the path style for S3 (true/false)
        s3ForcePathStyle: true
        # Allows insecure (HTTP) connections (true/false)
        insecure: true
        # HTTP configuration settings
        http_config:
          timeout: 15s
        backoff_config:
          # minimum backoff time for S3 GetObject retries
          min_period: 100ms
          # maximum backoff time for S3 GetObject retries
          max_period: 3s
          # maximum number of retries for S3 GetObject requests
          max_retries: 5

  ingester:
    replicas: 3
    maxUnavailable: 1
    zoneAwareReplication:
      enabled: false
    persistence:
      enabled: true
      claims:
        - name: data
          size: 25G
          mountPath: /var/loki/  
          storageClass: local-storage
    # extraArgs:
      #   # print the entire YAML configuration to stderr
      # - "-print-config-stderr"

  querier:
    replicas: 3
    maxUnavailable: 2
    autoscaling:
      enabled: true
      minReplicas: 3
      maxReplicas: 6
      targetCPUUtilizationPercentage: 70
      targetMemoryUtilizationPercentage: 50
    # extraArgs:
        # - "-print-config-stderr"
  gateway:
    enabled: true
    replicas: 1
    basicAuth:
      # -- Enables basic authentication for the gateway
      enabled: true
      # -- The basic auth username for the gateway
      username: ahsifer
      # -- The basic auth password for the gateway
      password: Master12345
    service:
      type: NodePort
      nodePort: 31000
      port: 80
  queryFrontend:
    replicas: 3
    maxUnavailable: 1
  frontend:
    # log queries that take more than 10 seconds to execute
    log_queries_longer_than: 10s
  queryScheduler:
    replicas: 2
    maxUnavailable: 1
  distributor:
    replicas: 3
    maxUnavailable: 2
  compactor:
    replicas: 1
  indexGateway:
    replicas: 2
    maxUnavailable: 1
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Use emptyDir with ramdisk for storage. **Please note that all data in indexGateway will be lost on pod restart**
      inMemory: false
      storageClass: local-storage
      size: 24G
  test:
    enabled: false
  lokiCanary:
    enabled: false
    extraArgs:
      - "-user=ahsifer"
      - "-pass=Master12345"
  resultsCache:
    # -- Specifies whether memcached based results-cache should be enabled
    enabled: true
    # -- Specify how long cached results should be stored in the results-cache before being expired
    defaultValidity: 12h
    # -- Memcached operation timeout
    timeout: 500ms
    # -- Total number of results-cache replicas
    replicas: 2
    # -- Port of the results-cache service
    port: 11211
    # -- Amount of memory allocated to results-cache for object storage (in MB).
    allocatedMemory: 1024
    # -- Maximum item results-cache for memcached (in MB).
    maxItemMemory: 20
    # -- Maximum number of connections allowed
    connectionLimit: 16384
    # -- Max memory to use for cache write back
    writebackSizeLimit: 500MB
    # -- Max number of objects to use for cache write back
    writebackBuffer: 500000
    # -- Number of parallel threads for cache write back
    writebackParallelism: 8

  chunksCache:
    # -- Specifies whether memcached based chunks-cache should be enabled
    enabled: true
    # -- Batchsize for sending and receiving chunks from chunks cache
    batchSize: 8
    # -- Parallel threads for sending and receiving chunks from chunks cache
    parallelism: 5
    # -- Memcached operation timeout
    timeout: 2000ms
    # -- Specify how long cached chunks should be stored in the chunks-cache before being expired
    defaultValidity: 0s
    # -- Total number of chunks-cache replicas
    replicas: 2
    # -- Port of the chunks-cache service
    port: 11211
    # -- Amount of memory allocated to chunks-cache for object storage (in MB).
    allocatedMemory: 2048
    # -- Maximum item memory for chunks-cache (in MB).
    maxItemMemory: 20
    # -- Maximum number of connections allowed
    connectionLimit: 16384
    # -- Max memory to use for cache write back
    writebackSizeLimit: 500MB
    # -- Max number of objects to use for cache write back
    writebackBuffer: 500000
    # -- Number of parallel threads for cache write back
    writebackParallelism: 8
    # extraVolumes:
    # - name: my-emptydir
    #   emptyDir: {}
    # extraVolumeMounts:
    # - name: my-emptydir
    #   mountPath: /data
    # The following enables on-disk caching for memcached (extstore); it is only used when items are evicted from memory. Configured properly, it can boost cluster read performance. I think an SSD could also be used for this cache.
    # extraArgs: 
    #    memcached.ext_path: "/data/file:8G"
    # persistence:
    #   # -- Enable creating PVCs for the results-cache
    #   enabled: true
    #   # -- Size of persistent disk, must be in G or Gi
    #   storageClass: local-storage
    #   storageSize: 10G
    #   # -- Volume mount path
    #   mountPath: /data
  bloomCompactor:
    replicas: 0
  bloomGateway:
    replicas: 0
  ruler:
    replicas: 0
  backend:
    replicas: 0
  read:
    replicas: 0
  write:
    replicas: 0

  singleBinary:
    replicas: 0
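
Separately from the Loki limits, I have not set Kubernetes resource requests/limits on the query-frontend pods above. A minimal sketch of what that could look like, assuming the chart's queryFrontend section accepts a standard Kubernetes resources block (the numbers are placeholders sized for my 4GB nodes, not recommendations):

  queryFrontend:
    replicas: 3
    maxUnavailable: 1
    # placeholder ceiling so a pod merging a huge result is OOM-killed at a known point instead of starving the node
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        memory: 2Gi
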
bos-hieu commented 2 weeks ago

I had the same issue. I started Loki yesterday on a Google Cloud instance with 4GB of memory. Today my pod crashed because it ran out of memory. There are approximately 800,000 logs in my instance today.

I am attempting to add the following configuration to limits_config:

max_query_bytes_read: 67108864 # 64MB

This is my loki-config.yaml without max_query_bytes_read

auth_enabled: false

common:
  path_prefix: /loki  # Specifies a base directory for Loki's storage files

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_chunk_age: 1h
  chunk_target_size: 1048576 # 1MB

schema_config:
  configs:
    - from: 2024-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb/index  # Index directory for tsdb
    cache_location: /loki/tsdb/cache          # Cache directory for tsdb
  filesystem:
    directory: /loki/chunks  # Chunk storage directory

compactor:
  working_directory: /loki/compactor

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  allow_structured_metadata: true
  volume_enabled: true
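
With the parameter added, my limits_config would look roughly like this (67108864 is just the 64MB value quoted above, not a tested recommendation):

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  allow_structured_metadata: true
  volume_enabled: true
  # experimental addition: cap the number of bytes a single query may fetch
  max_query_bytes_read: 67108864 # 64MB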