ingesters' Persistent Volume Claims (PVCs) filling up almost immediately in loki 2.9.4

We are experiencing significant issues with our current Loki distributed setup, specifically with the ingesters' Persistent Volume Claims (PVCs) filling up almost immediately. Additionally, we are encountering frequent "SlowDown" errors from our S3 backend, indicating excessive request rates. Below is a detailed description of our setup and the observed errors, along with a request for suggestions on improving the configuration.

Setup Details Loki Version: 2.9.4 chart version 0.79.1 Deployment Type: Loki Distributed Number of Ingesters: 5 PVC Size per Ingester: 90GB Configuration

auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: false
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-distributed-memcached-chunks.loki-distributed.svc.cluster.local
      consistent_hash: true
  max_look_back_period: 0s
  write_dedupe_cache_config:
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-distributed-memcached-index-writes.loki-distributed.svc.cluster.local
      consistent_hash: true
common:
  compactor_address: http://loki-distributed-compactor:3100
compactor:
  compaction_interval: 10m
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 999999999999
    max_send_msg_size: 999999999999
  log_queries_longer_than: 5s
  max_outstanding_per_tenant: 2048
  scheduler_address: loki-distributed-query-scheduler:9095
  scheduler_worker_concurrency: 20
  tail_proxy_url: http://loki-distributed-querier:3100
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 999999999999
    max_send_msg_size: 999999999999
  scheduler_address: loki-distributed-query-scheduler:9095
ingester:
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 1h
  chunk_retain_period: 1m
  chunk_target_size: 1536000
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  max_chunk_age: 1h
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
ingester_client:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 999999999999
    max_send_msg_size: 999999999999
limits_config:
  cardinality_limit: 50000
  enforce_metric_name: false
  ingestion_rate_mb: 150
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 100000
  max_query_series: 100000
  per_stream_rate_limit: 5MB
  per_stream_rate_limit_burst: 10MB
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 30d
  split_queries_by_interval: 6h
  tsdb_max_query_parallelism: 2048
memberlist:
  join_members:
  - loki-distributed-memberlist
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  parallelise_shardable_queries: false
  results_cache:
    cache:
      memcached_client:
        addresses: dnssrv+_memcached-client._tcp.loki-distributed-memcached-frontend.loki-distributed.svc.cluster.local
        consistent_hash: true
query_scheduler:
  max_outstanding_requests_per_tenant: 4096
ruler:
  alertmanager_url: https://alertmanager.xx
  external_url: https://alertmanager.xx
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
runtime_config:
  file: /var/loki-distributed-runtime/runtime.yaml
schema_config:
  configs:
  - from: "2020-09-07"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 999999999999
  grpc_server_max_send_msg_size: 999999999999
  http_listen_port: 3100
  http_server_idle_timeout: 1200s
  http_server_read_timeout: 1200s
  http_server_write_timeout: 1200s
storage_config:
  aws:
    bucketnames: loki-Prod
    http_config:
      response_header_timeout: 5s
    region: us-east-1
    s3: s3://us-east-1/loki-prod
    s3forcepathstyle: false
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 72h
    index_gateway_client:
      server_address: dns:///loki-distributed-index-gateway:9095
    shared_store: s3
  filesystem:
    directory: /var/loki/chunks
  index_queries_cache_config:
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-distributed-memcached-index-queries.loki-distributed.svc.cluster.local
      consistent_hash: true
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Observed Errors

level=error ts=2024-07-11T17:50:00.996031395Z caller=flush.go:143 org_id=fake msg="failed to flush" err="failed to flush chunks: store put chunk: SlowDown: Please reduce your request rate.\n\tstatus code: 503, request id: XXXXXX, host id: XXXXXXXXXXXXXXXX, num_chunks: 3, labels: {app=\"Abc\", cluster=\"prod\", component=\"ingester\", container=\"ingester\", filename=\"/var/log/pods/mimir-xxxxxxx-3_XXXXXX/ingester/0.log\", instance=\"mimir-production\", job=\"mimir-production/mimir\", namespace=\"xxxx\", node_name=\"ip-10-XX-XX-XX.ec2.internal\", pod=\"mimir-xxxx-xxxxxx\", stream=\"stderr\"}" Current Issues Immediate filling of ingesters' PVCs: This leads to storage issues and potential data loss. Frequent "SlowDown" errors from the S3 backend: These errors indicate that our request rate is too high for the S3 service.

is there any issue with loki-distributed 2.9.4 ? in which version this issue got fixed

following issues seem to be similar for loki-distributed 2.9.4 https://github.com/grafana/loki/pull/11776 https://github.com/grafana/loki/pull/12456 https://github.com/grafana/loki/pull/12456

https://grafana.slack.com/archives/CEPJRLQNL/p1715165059349319

is latest loki-distributed with 2.9.8 will fix this issue ?

grafana / loki

ingesters' Persistent Volume Claims (PVCs) filling up almost immediately in loki 2.9.4 #13675