grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki Logs Not Consistently Showing Up On Query #11239

Open charris-ca opened 11 months ago

charris-ca commented 11 months ago

Describe the bug
Logs are ingested correctly and uploaded to our GCP bucket correctly, but at query time the results are inconsistent. Generally only about ~50% of the logs show up for the first 25 minutes after ingestion. After that, the same query returns 100% of the logs. In the window roughly 15-25 minutes after ingestion, a query varies between the ~50% and 100% result on each execution.

I have increased all of the timeout/query-related values that I could find. I have also reduced chunk_idle_period to 1m, but I am still seeing the issue. I have also turned off all memcached instances and disabled the results cache for testing.

If I restart all Loki pods, the logs disappear completely, but they re-appear 15-25 minutes after the log was ingested.

I have tracing set up, and things appear to be breaking at /logproto.Querier/QuerySample on the querier component. I'm getting a context cancelled error (screenshot below).

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 2.9.2
  2. Started Promtail 2.9.2
  3. Query: {log_component_id="msserver"} |= `requestedCardBrand_seq14`

Expected behavior
Logs should be directly queryable right after they are ingested.

Environment:

Screenshots, Promtail config, or terminal output

Log sample of subsequent queries returning different results below: the first metrics line reports post_filter_lines=3, the second reports total_entries=10, and the third reports total_entries=3 again.

level=info ts=2023-11-15T19:19:18.75410031Z caller=metrics.go:159 component=frontend org_id=fake traceID=334d501327e3bf0d sampled=true latency=fast query="sum by (level) (count_over_time({log_component_id=\"msserver\"} |= `requestedCardBrand_seq14`[1s]))" query_hash=893980325 query_type=metric range_type=range length=30m0s start_delta=30m0.428090436s end_delta=428.090598ms step=1s duration=178.039939ms status=200 limit=10000 returned_lines=0 throughput=964MB total_bytes=172MB total_bytes_structured_metadata=0B lines_per_second=584913 total_lines=104138 post_filter_lines=3 total_entries=1 store_chunks_download_time=13.084811ms queue_time=3.554180995s splits=2 shards=32 cache_chunk_req=856 cache_chunk_hit=856 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=23734146 cache_chunk_download_time=1.032203ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s source=logvolhist
level=info ts=2023-11-15T19:19:20.939715791Z caller=roundtrip.go:241 org_id=fake traceID=513fb9e770c8b91f sampled=true msg="executing query" type=range query="{log_component_id=\"msserver\"} |= `requestedCardBrand_seq14`" length=30m0s step=500ms query_hash=3562538396
level=info ts=2023-11-15T19:19:21.076735353Z caller=metrics.go:159 component=frontend org_id=fake traceID=513fb9e770c8b91f sampled=true latency=fast query="{log_component_id=\"msserver\"} |= `requestedCardBrand_seq14`" query_hash=3562538396 query_type=filter range_type=range length=30m0s start_delta=30m0.384727968s end_delta=384.728163ms step=500ms duration=135.825604ms status=200 limit=10000 returned_lines=0 throughput=1.3GB total_bytes=170MB total_bytes_structured_metadata=0B lines_per_second=762661 total_lines=103589 post_filter_lines=10 total_entries=10 store_chunks_download_time=338.288945ms queue_time=1.403924995s splits=2 shards=32 cache_chunk_req=864 cache_chunk_hit=855 cache_chunk_bytes_stored=10907 cache_chunk_bytes_fetched=23808155 cache_chunk_download_time=980.285µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
level=info ts=2023-11-15T19:19:23.219091967Z caller=roundtrip.go:241 org_id=fake traceID=5f035daad71594ed sampled=true msg="executing query" type=range query="{log_component_id=\"msserver\"} |= `requestedCardBrand_seq14`" length=30m0s step=500ms query_hash=3562538396
level=info ts=2023-11-15T19:19:23.22302903Z caller=roundtrip.go:241 org_id=fake traceID=5fa4b44b95d08e8f sampled=true msg="executing query" type=range query="sum by (level) (count_over_time({log_component_id=\"msserver\"} |= `requestedCardBrand_seq14`[1s]))" length=30m0s step=1s query_hash=893980325
level=info ts=2023-11-15T19:19:23.328879764Z caller=metrics.go:159 component=frontend org_id=fake traceID=5f035daad71594ed sampled=true latency=fast query="{log_component_id=\"msserver\"} |= `requestedCardBrand_seq14`" query_hash=3562538396 query_type=filter range_type=range length=30m0s start_delta=30m0.354870733s end_delta=354.870896ms step=500ms duration=108.997926ms status=200 limit=10000 returned_lines=0 throughput=1.5GB total_bytes=167MB total_bytes_structured_metadata=0B lines_per_second=933641 total_lines=101765 post_filter_lines=3 total_entries=3 store_chunks_download_time=11.146325ms queue_time=1.052435998s splits=2 shards=32 cache_chunk_req=870 cache_chunk_hit=870 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=20933561 cache_chunk_download_time=916.194µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s

Loki config:

    auth_enabled: false
    chunk_store_config:
      max_look_back_period: 0s
    common:
      compactor_address: http://loki-loki-distributed-compactor:3100
      storage:
        gcs:
          bucket_name: loki-storage-psec-uat
          chunk_buffer_size: 0
          enable_http2: true
          request_timeout: 0s
    compactor:
      compaction_interval: 1m
      retention_delete_delay: 2h
      retention_enabled: true
      shared_store: gcs
    distributor:
      ring:
        kvstore:
          store: memberlist
    frontend:
      compress_responses: true
      log_queries_longer_than: 5s
      scheduler_address: loki-loki-distributed-query-scheduler:9095
      tail_proxy_url: http://loki-loki-distributed-querier:3100
    frontend_worker:
      scheduler_address: loki-loki-distributed-query-scheduler:9095
    ingester:
      autoforget_unhealthy: true
      chunk_block_size: 262144
      chunk_encoding: gzip
      chunk_idle_period: 1m
      chunk_retain_period: 0s
      concurrent_flushes: 64
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal
        enabled: false
    ingester_client:
      grpc_client_config:
        connect_timeout: 15s
        max_recv_msg_size: 1048576000
        max_send_msg_size: 1048576000
      pool_config:
        remote_timeout: 5s
      remote_timeout: 15s
    limits_config:
      cardinality_limit: 1000000
      enforce_metric_name: false
      ingestion_burst_size_mb: 75
      ingestion_rate_mb: 60
      max_cache_freshness_per_query: 10m
      max_entries_limit_per_query: 1000000
      max_global_streams_per_user: 0
      max_label_name_length: 10240
      max_label_names_per_series: 300
      max_label_value_length: 20480
      max_query_parallelism: 32
      max_query_series: 100000
      max_streams_per_user: 0
      per_stream_rate_limit: 30MB
      per_stream_rate_limit_burst: 45MB
      query_timeout: 5m
      reject_old_samples: true
      reject_old_samples_max_age: 2w
      retention_period: 14d
      split_queries_by_interval: 1h
    memberlist:
      dead_node_reclaim_time: 10s
      gossip_interval: 200ms
      gossip_nodes: 6
      gossip_to_dead_nodes_time: 5s
      join_members:
      - loki-loki-distributed-memberlist
      packet_dial_timeout: 6s
      packet_write_timeout: 10s
      pull_push_interval: 60s
      stream_timeout: 20s
    query_range:
      align_queries_with_step: true
      cache_results: false
      max_retries: 5
    query_scheduler:
      max_outstanding_requests_per_tenant: 4096
    ruler:
      alertmanager_url: http://prometheus-mgmt-kube-prome-alertmanager.prometheus:9093
      enable_alertmanager_v2: true
      enable_api: true
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      storage:
        local:
          directory: /var/loki/rules
        type: local
    runtime_config:
      file: /var/loki-distributed-runtime/runtime.yaml
    schema_config:
      configs:
      - from: "2020-05-15"
        index:
          period: 24h
          prefix: loki_
        object_store: gcs
        schema: v11
        store: boltdb-shipper
    server:
      grpc_server_max_concurrent_streams: 0
      grpc_server_max_recv_msg_size: 100000000
      grpc_server_max_send_msg_size: 100000000
      http_listen_port: 3100
      http_server_read_timeout: 300s
      http_server_write_timeout: 300s
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 24h
        index_gateway_client:
          server_address: dns:///loki-loki-distributed-index-gateway:9095
        shared_store: gcs
      filesystem:
        directory: /var/loki/chunks
    table_manager:
      retention_deletes_enabled: true
      retention_period: 14d
    tracing:
      enabled: true

Showing 6 entries:

image

Same query 20 minutes later, showing 11 entries:

image

image

tonyswu commented 11 months ago

Sounds like your ingesters aren't part of the memberlist ring. What does /ring look like from your ingesters?

charris-ca commented 11 months ago

/ring looks fine,

image

tonyswu commented 11 months ago

Another thing I noticed: query_ingesters_within is set to 3h by default, so you will probably want to adjust chunk_idle_period accordingly. In general you want chunks to be flushed closer to their target size, so align chunk_idle_period with query_ingesters_within at a reasonable value (I'd say the default 3h is fine for most use cases).
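Roughly what I mean, as a sketch (illustrative values only, not a tested recommendation):

    querier:
      query_ingesters_within: 3h   # default; queriers ask ingesters for anything newer than this
    ingester:
      chunk_idle_period: 30m       # flush streams that have received no new data for 30m
      max_chunk_age: 2h            # force-flush chunks older than this, comfortably inside query_ingesters_within

That way anything too new to have been flushed to GCS is still guaranteed to be served by the ingesters.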

charris-ca commented 11 months ago

I had chunk_idle_period at the default (30m) and at 2 hours previously during testing and was seeing the same errors. Could I set query_ingesters_within to a very low number so that queries go to the object store sooner? It still doesn't explain why the ingesters don't return the same data or why their results aren't combined at query time.

tonyswu commented 11 months ago

Perhaps someone with more experience can comment on this as well, but I try to line up query_ingesters_within with both chunk_idle_period and max_chunk_age. You can set the querier to go to object storage sooner (you'd want to adjust both the idle period and max age accordingly, of course), but you generally shouldn't, because it's both faster and more efficient not to query the object store too early.
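If you did want queries to fall back to the object store sooner, the sketch would be to lower all three together (again, hypothetical values):

    querier:
      query_ingesters_within: 1h   # queriers stop asking ingesters for data older than this
    ingester:
      chunk_idle_period: 15m       # idle streams flush sooner...
      max_chunk_age: 45m           # ...so everything older than 1h should already be in GCS

but as I said, I wouldn't: hitting the object store earlier is usually slower and more expensive.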

And it kind of explains why your ingesters would return inconsistent logs: you have 5 ingesters, and each of them flushes idle chunks at a different time, so it's conceivable that results would differ every time you query as well. I'd say give it a try by lining up those configurations and see if it helps.

charris-ca commented 11 months ago

I tested setting those to 3 hours, but I'm still getting the same problem. I also tested setting query_store_only=true, and the logs were not showing up at all for 10-15 minutes, then popped up.
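For reference, that test was roughly this querier setting (a sketch of what I mean, not my full config):

    querier:
      query_store_only: true   # bypass ingesters entirely and read only from object storage

If I'm reading that experiment right, the 10-15 minute gap would be the time between a chunk being flushed and the boltdb-shipper index update becoming visible to the queriers.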

I think you're on to something with the ring communication though. Is it possible that the queries are hitting the memberlist for a specific ingester but being load-balanced back to the same ingester, while the querier thinks it is collecting the logs from all ingesters?

tonyswu commented 11 months ago

What does /ring look like from your querier?

charris-ca commented 11 months ago

Looks normal, those are all the correct internal IPs,

image

mysterious-ben commented 11 months ago

@charris-ca I had a possibly related issue with Loki 2.9 + Promtail 2.9 on local filesystem storage. After a version upgrade, logs from all feeds started to disappear in the window roughly between 2 hours and 2 days in the past. I tried various config tweaks - alas, to no avail. Eventually I reverted to Loki 2.6 and the issue disappeared.