grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

loki_distributor_bytes_received_total statistic error #9967

Open wkshare opened 1 year ago

wkshare commented 1 year ago

Describe the bug: I have a three-node Loki cluster, but on one node this metric's stats always look inaccurate, as you can see in the screenshot below. The total amount of logs received keeps increasing, while in reality there is not such a large amount of data. Of the three nodes, only one shows this; the other two are normal. How should I check where the data traffic is coming from? Which IP?

To Reproduce: loki, version 2.6.1 (branch: HEAD, revision: https://github.com/grafana/loki/commit/6bd05c9a4399805b7951f8a3fb59f4aebf60e6cd), build user: root@af90ed01061f, build date: 2022-07-18T08:41:09Z, go version: go1.17.6, platform: linux/amd64

Query: rate(loki_distributor_bytes_received_total[3m])
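To narrow down which client or node the traffic is actually coming from, a per-label breakdown of the same counter is usually the first step. A minimal sketch (the tenant label depends on your Loki version, the instance label comes from your Prometheus scrape config, and promtail_sent_bytes_total is only available if your Promtail agents are also scraped):

# bytes received per Loki node and per tenant
sum by (instance, tenant) (rate(loki_distributor_bytes_received_total[5m]))

# sender-side cross-check: top 10 Promtail instances by bytes pushed
topk(10, sum by (instance) (rate(promtail_sent_bytes_total[5m])))

If one sender dominates the second query, that points at the IP to investigate.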

Expected behavior: the metric should reflect the actual traffic received; at the moment I am unable to correctly determine Loki's data traffic from it.

Environment: Oracle Linux 7

Screenshots, Promtail config, or terminal output: my configs:

auth_enabled: true

common:
  replication_factor: 1
  ring:
    kvstore:
      store: memberlist
    instance_interface_names:
      - ens160

server:
  log_level: warn
  http_listen_port: 3100
  grpc_listen_port: 13100
  grpc_server_max_recv_msg_size: 114857600
  grpc_server_max_send_msg_size: 114857600
  grpc_server_max_concurrent_streams: 1024
  http_server_write_timeout: 60s
  http_server_read_timeout: 60s

  http_tls_config:
    cert_file: /mon/app/loki/certs/server.crt
    key_file: /mon/app/loki/certs/server.key

ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    ring:
      kvstore:
        store: memberlist
    final_sleep: 0s
  autoforget_unhealthy: true
  chunk_idle_period: 1h
  wal:
    enabled: true
    dir: /mon/data/loki/wal
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  chunk_target_size: 0
  chunk_block_size: 262144

memberlist:
  abort_if_cluster_join_fails: false

  bind_port: 7946

  join_members:
    - 10.157.159.170
    - 10.157.135.233
    - 10.157.159.218

  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s

  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s

storage_config:
  max_chunk_batch_size: 1024
  boltdb_shipper:
    active_index_directory: /mon/data/loki/boltdb-shipper-active
    cache_location: /mon/data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    resync_interval: 5s
    shared_store: s3
  aws:
    bucketnames: loki
    endpoint: https://10.157.135.235:9299
    access_key_id: ${ACCESS_KEY_ID}
    secret_access_key: ${SECRET_ACCESS_KEY}
    insecure: true
    http_config:
      idle_conn_timeout: 90s
      response_header_timeout: 60s
      insecure_skip_verify: true
    s3forcepathstyle: true

schema_config:
  configs:
    - from: 2022-03-22
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  max_query_series: 1000
  max_cache_freshness_per_query: '10m'
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  per_stream_rate_limit: 20MB
  ingestion_rate_mb: 300
  ingestion_burst_size_mb: 500
  split_queries_by_interval: 15m
  retention_period: 720h

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

query_range:
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        size: 1024
        validity: 24h

compactor:
  working_directory: /mon/data/loki/compactor
  shared_store: s3
  compaction_interval: 5m
  retention_enabled: true
  retention_delete_delay: 1h
  retention_delete_worker_count: 150

ruler:
  enable_api: true
  enable_sharding: true
  rule_path: /mon/data/loki/scratch
  storage:
    type: s3
    s3:
      bucketnames: loki
      endpoint: https://10.157.135.235:9299
      access_key_id: ${ACCESS_KEY_ID}
      secret_access_key: ${SECRET_ACCESS_KEY}
      insecure: true
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 60s
        insecure_skip_verify: true
      s3forcepathstyle: true

  remote_write:
    enabled: true
    client:
      url: https://10.157.135.235:9443/api/v1/write
      tls_config:
        insecure_skip_verify: true
  wal:
    dir: /mon/data/loki/ruler-wal

querier:
  max_concurrent: 2048
  engine:
    timeout: 5m
  query_timeout: 5m
  query_ingesters_within: 2h

frontend:
  address: 10.157.159.170
wkshare commented 1 year ago

[screenshot IMG20230718095125] this is a trend graph

JeffCT0216 commented 1 year ago

Loki version: 2.8.0

Hi there,

We are also seeing similar issues when trying to gather Loki ingest volume metrics.

We are currently in a transition period, cutting logs over completely from another log collector (which we've been using for years without any complaints about missing logs) to Loki, so we are now sending logs to both our existing log collector and Loki.

We have a good idea of the total log volume ingested by our old collector, but when trying to compare it with Loki's ingest volume from loki_distributor_bytes_received_total, the metrics seem inconsistent.

For example:

Our log ingest volume metric from the existing log collector stays consistent:

7/31 - 8/6: 6.50 TB
8/7 - 8/16: 7.28 TB
8/17 - 8/26: 7.07 TB

Whereas the numbers I gathered from loki_distributor_bytes_received_total vary quite a lot. I've tried a few different aggregation methods for 7/31 - 8/6:

running_sum(loki_distributor_bytes_received_total{job="loki-prod"}[$__interval]) 552968862335 bytes = 0.553 TB


sum(max_over_time(loki_distributor_bytes_received_total{job="loki-prod"}[$__range])) 895442033191 = 0.895TB


increase(loki_distributor_bytes_received_total{job="loki-prod"}[$__range]) 760847954266571 = 760 TB


avg(rate(loki_distributor_bytes_received_total[$__range])) * 7 * 24 * 60 * 60 380423977149437 = 380 TB


Note that we do have pod restarts from time to time, and the resulting counter resets are visible in the metric itself.
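For reference, a minimal sketch of the usual counter idiom for "total bytes over a window", assuming the Grafana $__range variable and that job="loki-prod" matches every distributor (this is the generic Prometheus counter approach, not an official Loki recommendation):

# total bytes received across all series over the dashboard range;
# increase() compensates for counter resets caused by pod restarts
sum(increase(loki_distributor_bytes_received_total{job="loki-prod"}[$__range]))

# sanity check: number of counter resets (restarts) per series within the range
resets(loki_distributor_bytes_received_total{job="loki-prod"}[$__range])

The differences between the attempts above may explain part of the disagreement: max_over_time of a cumulative counter reflects everything since each process started (and a restart drops what was accumulated before it), increase() without an outer sum() returns one value per series rather than a cluster-wide total, and avg(rate()) averages across series where a sum is needed.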


We are really puzzled by this; any pointers on how to get accurate data would be much appreciated!

JeffCT0216 commented 12 months ago

bump, this is still an issue

JeffCT0216 commented 11 months ago

Can anyone please provide guidance on how to get accurate data on Loki ingestion volume?