grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Increased Discarded Lines from Large Datasource since Upgrade to Helm 5.15.0 #10430

Open seanocca opened 1 year ago

seanocca commented 1 year ago

Describe the bug
I recently upgraded from Loki Helm chart 5.8.9 to 5.15.0 (skipping that many versions at once is admittedly not good practice, but the CHANGELOG.md did not indicate any breaking changes for us). Since then we have seen a large number of logs being "discarded" from one of our tenants. This tenant peaks at roughly 3.5 million log lines every minute, though it only bursts that high through the middle of the day. The version we were originally on handled this load perfectly. Since the upgrade, however, the loki_discarded_samples_total metric reliably shows over a third of that tenant's lines being discarded. We have other tenants in this Loki deployment, but they only ingest about 1 million lines per minute combined.
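
For anyone triaging the same symptom: loki_discarded_samples_total carries a reason label (values like rate_limited, per_stream_rate_limit, line_too_long), so splitting the counter by reason and tenant shows which limit is actually dropping lines. A minimal Prometheus recording-rule sketch of the query we use, with a hypothetical rule name:

      groups:
        - name: loki-discards
          rules:
            # hypothetical recording rule: per-tenant discard rate, split by reason
            - record: tenant:loki_discarded_lines:rate5m
              expr: sum by (tenant, reason) (rate(loki_discarded_samples_total[5m]))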

To Reproduce
Steps to reproduce the behavior:

  1. Start Loki on a Kubernetes cluster with the Helm chart in simple scalable mode on version 5.15.0 (with 6 readers, 9 writers, 6 backend, 3 gateway)
  2. Ingest 3.5 million log lines per minute from a data source using a Vector agent running in stateless aggregator mode (a sketch of the Vector sink side is shown after this list)
  3. Check the discarded metrics from Prometheus
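
For context, the Vector side looks roughly like the following loki sink (the endpoint, tenant ID, input name, and labels below are placeholders, not our real values):

      sinks:
        loki:
          type: loki
          inputs:
            - kubernetes_logs                         # assumed upstream source/transform name
          endpoint: http://loki-gateway.loki.svc:80   # placeholder gateway address
          encoding:
            codec: json
          tenant_id: tenant-a                         # placeholder; sent as the X-Scope-OrgID header
          labels:
            app: "{{ kubernetes.pod_labels.app }}"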

Expected behavior
We expect that Loki (possibly with some scaling) will handle this number of logs, as it is described as being able to do so; some of the published examples ingest over 1TB of logs a day, while we reach roughly 300-400GB a day uncompressed.

Environment:

Screenshots, Promtail config, or terminal output
CONFIG

      auth_enabled: true

      server:
        log_level: "error"
        http_server_read_timeout: 610s
        http_server_write_timeout: 610s
        http_listen_port: 3100
        grpc_listen_port: 9095
        grpc_server_max_recv_msg_size: 268435456    # 256mb
        grpc_server_max_send_msg_size: 268435456    # 256mb
        grpc_server_max_concurrent_streams: 0 # Unlimited GRPC streams

      memberlist:
        join_members:
          - loki-memberlist

      {{- if .Values.loki.commonConfig}}
      common:
      {{- toYaml .Values.loki.commonConfig | nindent 2}}
        storage:
        {{- include "loki.commonStorageConfig" . | nindent 4}}
      {{- end}}

      limits_config:
        ingestion_rate_mb: 20
        ingestion_burst_size_mb: 30
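        # note: these are per-tenant limits; lines rejected here are counted in
        # loki_discarded_samples_total with reason="rate_limited"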
        max_global_streams_per_user: 20000
        max_entries_limit_per_query: 100000 # number of log lines returned per query
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        max_cache_freshness_per_query: 20m
        split_queries_by_interval: 30m
        per_stream_rate_limit: "20MB"
        per_stream_rate_limit_burst: "60MB"
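        # per-stream limits; violations show up with reason="per_stream_rate_limit"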
        max_query_series: 100000 # unique series returned by a metric query
        max_label_names_per_series: 30 # default
        creation_grace_period: 10m # default
        max_line_size: 0 # default
        max_line_size_truncate: false # default 
        max_chunks_per_query: 2000000 # default
        max_query_parallelism: 256 # default
        cardinality_limit: 500000 # default
        retention_period: 2160h # 90 days
        max_query_lookback: 0 # default
        allow_deletes: true
        deletion_mode: "filter-and-delete"

      runtime_config:
        file: /etc/loki/runtime-config/runtime-config.yaml

      chunk_store_config:
        chunk_cache_config:
          redis:
            endpoint: ${LOKI_CACHE_URL}
            password: ${LOKI_CACHE_PASSWORD}
            tls_enabled: true
            tls_insecure_skip_verify: true
            timeout: 610s

      schema_config:
        configs:
        - from: "2022-07-21"
          store: boltdb-shipper
          object_store: s3
          schema: v11
          index:
            period: 24h
            prefix: loki_index_
        - from: "2022-10-19"
          store: boltdb-shipper
          object_store: s3
          schema: v12
          index:
            period: 24h
            prefix: loki_index_
        - from: "2023-04-17"
          store: tsdb
          object_store: s3
          schema: v12
          index:
            period: 24h
            prefix: loki_index_

      {{- if or .Values.minio.enabled (eq .Values.loki.storage.type "s3") (eq .Values.loki.storage.type "gcs") }}
      ruler:
        storage:
        {{- include "loki.rulerStorageConfig" . | nindent 4}}
      {{- end }}

      table_manager:
        retention_deletes_enabled: false
        retention_period: 0

      query_range:
        align_queries_with_step: true
        max_retries: 5
        cache_results: true
        results_cache:
          cache:
            enable_fifocache: false
            redis:
              endpoint: ${LOKI_CACHE_URL}
              password: ${LOKI_CACHE_PASSWORD}
              tls_enabled: true
              tls_insecure_skip_verify: true
              timeout: 610s

      storage_config:
        index_queries_cache_config:
          redis:
            endpoint: ${LOKI_CACHE_URL}
            password: ${LOKI_CACHE_PASSWORD}
            tls_enabled: true
            tls_insecure_skip_verify: true
            timeout: 610s

        # hedge slow object-store requests: fire another request 250ms after
        # the original (up to 3, at most 20 hedged requests per second)
        hedging:
          at: "250ms"
          max_per_second: 20
          up_to: 3

      query_scheduler:
        max_outstanding_requests_per_tenant: 2048
        grpc_client_config:
          max_recv_msg_size: 268435456    # 256mb
          max_send_msg_size: 268435456    # 256mb

      querier:
        multi_tenant_queries_enabled: true
        max_concurrent: 2000
        engine:
          timeout: 600s

      compactor:
        retention_enabled: true
        retention_delete_delay: 1m
        retention_delete_worker_count: 500
        delete_request_cancel_period: 1m

      ingester:
        chunk_idle_period: 8h     # flush a chunk after 8h with no new data
        max_chunk_age: 8h         # flush chunks older than 8h regardless of activity
        chunk_target_size: 3e+6   # ~3 MB target chunk size
        chunk_encoding: snappy

      ingester_client:
        grpc_client_config:
          max_recv_msg_size: 268435456    # 256mb

      frontend_worker:
        grpc_client_config:
          max_recv_msg_size: 268435456    # 256mb
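
Since limits_config applies the same limits to every tenant, the runtime-config file referenced above is where per-tenant raises would go. A minimal sketch of /etc/loki/runtime-config/runtime-config.yaml, with a hypothetical tenant ID and limit values chosen only for illustration:

      # hypothetical override for the heavy tenant; numbers are illustrative
      overrides:
        big-tenant:
          ingestion_rate_mb: 60
          ingestion_burst_size_mb: 90
          per_stream_rate_limit: 40MB
          per_stream_rate_limit_burst: 80MB

If the reason label on the discard metric points at rate_limited or per_stream_rate_limit, raising these for that tenant (or scaling the write path) would be the first thing we try.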
seanocca commented 1 year ago

@JStickler is there any update on this issue?