
Loki v3.0.0 quietly gives incomplete results for some queries including a line_format #12798

Open kilo-gi opened 4 months ago

kilo-gi commented 4 months ago

Describe the bug
In v3.0.0, Loki will quietly drop some log lines when the query includes a line_format stage.

To Reproduce
Steps to reproduce the behavior:

  1. Start Loki v3.0.0 (and v2.7.5 for comparison)
  2. Start Promtail v2.7.5
  3. Ingest a significant number of logs
  4. Make queries for a specified time range like the following: '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= "saved checkpoint at batch 75" | json | line_format .'
  5. Make queries for the same time range like the following: '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= "saved checkpoint at batch 75" | json'
  6. Observe that for the exact same time range and logs ingested:
     - v2.7.5 with and without the line_format gives 1800 logs
     - v3.0.0 without the line_format gives 1800 logs
     - v3.0.0 with the line_format gives only 1793 logs
     (a script for reproducing these counts outside Grafana is sketched after this list)
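
Here is a rough sketch of how those counts can be reproduced against Loki's query_range API without going through Grafana. It assumes Loki is reachable at http://localhost:3100; the line_format template "{{.message}}" and the 24-hour time range are placeholders, since the exact template and range aren't shown above:

import time
import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"  # assumption: local single-binary Loki
BASE_QUERY = '{image="standalone/llms:c38c6aa8f223b1aa86e1f0726362b15432c8697f", job="container_logs"} |= "saved checkpoint at batch 75" | json'

def count_lines(query, start_ns, end_ns):
    # Ask for more lines than we expect so the limit never masks the discrepancy.
    resp = requests.get(LOKI_URL, params={
        "query": query,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": 5000,
        "direction": "backward",
    })
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    return sum(len(stream["values"]) for stream in streams)

if __name__ == "__main__":
    end = int(time.time() * 1e9)            # now, in nanoseconds
    start = end - 24 * 3600 * 10**9         # last 24 hours (placeholder range)
    plain = count_lines(BASE_QUERY, start, end)
    formatted = count_lines(BASE_QUERY + ' | line_format "{{.message}}"', start, end)
    print(f"without line_format: {plain}, with line_format: {formatted}")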

Expected behavior
Neither the Loki version nor the line_format pipeline stage should change the count of log lines returned.

Environment:

Screenshots, Promtail config, or terminal output
I've posted screenshots and dumps of logs in the community Slack: https://grafana.slack.com/archives/C06RUQJHTHQ/p1713891805440339

Here's the Loki config we're using for v3.0.0:

schema_config:
  configs:
  - from: 2023-08-01
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: loki_
      period: 24h
  - from: 2024-04-23
    store: tsdb
    object_store: s3
    schema: v13
    index:
      prefix: index_
      period: 24h

common:
  path_prefix: "/loki"
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: "inmemory"
  instance_addr: 127.0.0.1

storage_config:
  tsdb_shipper:
    active_index_directory: "/loki/tsdb-index"
    cache_location: "/loki/tsdb-cache"
    cache_ttl: 168h
  boltdb_shipper:
    active_index_directory: "/loki/index"
    cache_location: "/loki/boltdb-cache"
    cache_ttl: 168h
  aws:
    s3: "s3://us-west-2/test-int8-logs"
    # TODO: we will want to replace these with a dedicated set at some point
    access_key_id: {{ lookup('ansible.builtin.env', 'AWS_ACCESS_KEY_ID', default=Undefined) }}
    secret_access_key: {{ lookup('ansible.builtin.env', 'AWS_SECRET_ACCESS_KEY', default=Undefined) }}
  index_queries_cache_config: &general_cache_config
    default_validity: 168h
    embedded_cache: &general_embedded_cache_config
      enabled: true
      max_size_mb: 4096 # 4GB should be plenty with our ~2.5GB of logs per day and still manageable, even duplicated 5 places.
      ttl: 168h

# Here, Auth refers to specifically the X-Scope-OrgID header, which we don't use.
auth_enabled: false

server:
    http_tls_config:
        cert_file: "/loki/certs/server.crt"
        key_file: "/loki/certs/server.key"
        client_ca_file: "/loki/certs/root.crt"
        # For debugging, we can set this to "VerifyClientCertIfGiven" temporarily to get at the metrics and api in your browser.
        client_auth_type: "VerifyClientCertIfGiven"
    grpc_server_max_recv_msg_size: 67108864 # 64mb, default 4mb
    grpc_server_max_send_msg_size: 67108864
    http_server_read_timeout: 2m
    http_server_write_timeout: 2m
    log_level: debug

memberlist:
  node_name: singleton

limits_config:
  # We're actually pretty good about being <1kb per log line, so this corresponds to 100mb, which is large but not massive.
  # We specifically want a higher value than the default because grafana doesn't really do pagination, so this is the max that can be displayed there. Most queries should probably be limited to ~10k lines
  max_entries_limit_per_query: 100000
  # The default of 32 is quite low - Grafana serves queries for everyone
  max_query_parallelism: 512
  # Default of 1h shards things far too thin (4k requests for a single month long query). 6h is much more reasonable
  split_queries_by_interval: 6h
  # We hit the default of 10k streams per user :(
  max_streams_per_user: 0
  max_global_streams_per_user: 0
  allow_structured_metadata: false
  ingestion_rate_mb: 12
  ingestion_burst_size_mb: 48
  max_line_size: 16KB
  max_line_size_truncate: true

frontend:
  # Default of 100 is far too low
  max_outstanding_per_tenant: 2048
  log_queries_longer_than: 20s

frontend_worker:
    grpc_client_config: &grpc_client_config
      max_send_msg_size: 67108864 # 64mb, default 16mb
      max_recv_msg_size: 67108864

querier:
  # Default of 10 is quite low. We've got beefy machines
  max_concurrent: 20

query_scheduler:
  max_outstanding_requests_per_tenant: 10000
  grpc_client_config: *grpc_client_config

query_range:
  # https://github.com/grafana/loki/issues/4613#issuecomment-1021421653
  parallelise_shardable_queries: false
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache: *general_cache_config
  cache_instant_metric_results: true
  instant_metric_query_split_align: true
  instant_metric_results_cache:
    cache: *general_cache_config

chunk_store_config:
  chunk_cache_config:
    default_validity: 168h
    embedded_cache:
      <<: *general_embedded_cache_config
      max_size_mb: 40960 # pretty big 40GB so we can just keep a whole week's worth of logs in memory. Why not?
  write_dedupe_cache_config: *general_cache_config
  # It's really unclear from docs what this number is actually supposed to be. Just the point at which we're confident logs have been ingested?
  cache_lookups_older_than: 1h

Querying v3.0.0 without the line format gives 1800 lines: https://files.slack.com/files-pri/T05675Y01-F06V8NGCVB9/download/screenshot_2024-04-23_at_12.03.56___pm.png?origin_team=T05675Y01 https://files.slack.com/files-pri/T05675Y01-F0706C4AS85/download/explore-v3.0.0-no-line-format.json?origin_team=T05675Y01

v3.0.0 with the line format gives 1793 lines: https://grafana.slack.com/files/U06V4UJKA7N/F0706CB2A8M/screenshot_2024-04-23_at_12.05.42___pm.png?origin_team=T05675Y01&origin_channel=C05675Y4F https://files.slack.com/files-pri/T05675Y01-F070Y3GHSDN/download/explore-v3.0.0-with-line-format.json?origin_team=T05675Y01

v2.7.5 without the line format is 1800 lines: https://grafana.slack.com/files/U06V4UJKA7N/F0709A4M34J/screenshot_2024-04-23_at_12.06.52___pm.png?origin_team=T05675Y01&origin_channel=C05675Y4F https://files.slack.com/files-pri/T05675Y01-F070Y3NP4U8/download/explore-v2.7.5-no-line-format.json?origin_team=T05675Y01

And v2.7.5 with the line format is 1800 lines: https://grafana.slack.com/files/U06V4UJKA7N/F0706CTN38V/screenshot_2024-04-23_at_12.08.55___pm.png?origin_team=T05675Y01&origin_channel=C05675Y4F https://files.slack.com/files-pri/T05675Y01-F070Y3ZBERW/download/explore-v2.7.5-with-line-format.json?origin_team=T05675Y01

To get good diffs you have to remove all the "service_name" fields and format the JSON. With that done:

⟐:Downloads ▶ wc -l Explore-v*
   72362 Explore-v2.7.5-no-line-format.json
   72362 Explore-v2.7.5-with-line-format.json
   72362 Explore-v3.0.0-no-line-format.json
   72075 Explore-v3.0.0-with-line-format.json
  289161 total
⟐:Downloads ▶
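
To normalize the exports before diffing, a small script along these lines works. This is only a sketch: it assumes nothing about the Explore export format beyond it being JSON, and simply strips every "service_name" field before pretty-printing:

import json
from pathlib import Path

def strip_key(node, key="service_name"):
    # Recursively drop every occurrence of `key` so the exports diff cleanly.
    if isinstance(node, dict):
        return {k: strip_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_key(v, key) for v in node]
    return node

for path in sorted(Path(".").glob("Explore-v*.json")):
    if ".normalized" in path.name:
        continue  # skip output from a previous run
    cleaned = strip_key(json.loads(path.read_text()))
    out = path.with_name(path.stem + ".normalized.json")
    # Pretty-print with sorted keys so line-by-line diffs are stable.
    out.write_text(json.dumps(cleaned, indent=2, sort_keys=True) + "\n")
    print(path.name, "->", out.name)
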
kilo-gi commented 4 months ago

This has been observed by others on the community forums:

https://community.grafana.com/t/line-format-reduces-log-count-to-only-a-few/119730

praccu commented 2 months ago

Hi.

Thanks @kilo-gi for providing specific inputs that trigger this behavior.

I believe I have identified the commit that introduced this behavior: https://github.com/grafana/loki/commit/793a689d1ffd78333fe651d81addfaa180ae8886

I crafted a set of 6 records with the following behavior: three of them share the same timestamp and the same command key, differing only in the injected thread id.

The way I was able to trigger this was to ingest the synthetic records into the Loki images linked at [0] and [1] below (built around the suspect commit) and compare query results with and without a line_format stage.

Example data: synthetic.15.json

Script that generates the synthetic data:

import time

_TEMPLATE = """{"text": "2024-06-06 14:48:12.183 |INFO     | 9500__2024_06_06_21_43_32_358250__P000_W0000/C0000__user_100.74.76.3_LINE_REPLACE  - _lsh_central_server.score_and_build_index_step logged_runtime: 0.4835s", "record": {"elapsed": {"repr": "0:00:13.271696", "seconds": 13.271696}, "exception": null, "extra": {"command_key": "COMMAND_KEY", "machine": "user_100.74.76.3_5057", "container": "fcfa9e6dd7792bd76d9fc777dabf828e281f8969cfd357bd62b14688b5d7cb7f", "user": "buildbot", "image": "SYNTHETIC", "run": "buildbot_job", "command_label": "computronium", "hammer_id": null, "local_rank": null, "global_rank": null, "relaunch_name": null, "full_location": ""}, "file": {"name": "log_utils.py", "path": "/opt/projects/generally_intelligent/computronium/computronium/common/log_utils.py"}, "function": "log_runtime", "level": {"icon": "ℹ️", "name": "INFO", "no": 20}, "line": 449, "message": "_lsh_central_server.score_and_build_index_step logged_runtime: 0.4835s", "module": "log_utils", "name": "computronium.common.log_utils", "process": {"id": 9271, "name": "MainProcess"}, "thread": {"id": THREAD_ID, "name": "MainThread"}, "time": {"repr": "2024-06-06 14:48:12.183338-07:00", "timestamp": TIMESTAMP}}}"""

def fill_record(synthetic, timestamp, command_key, line_diff):
    record = _TEMPLATE
    record = record.replace("THREAD_ID", str(line_diff))
    record = record.replace("SYNTHETIC", f"synthetic_{synthetic}")
    record = record.replace("COMMAND_KEY", f"9500__2024_06_06_21_43_32_358250__P000_W0000/C0000__user_100.74.76.3_{command_key}")
    record = record.replace("TIMESTAMP", str(timestamp))
    #record = record.replace('LINE_REPLACE', str(line_diff))
    return record

if __name__ == "__main__":
    ts = time.time()
    ts2 = ts + 1.0
    ck = "ck0"
    ck2 = "ck1"

    synthetic = 15
    with open(f"example_logs/synthetic.{synthetic}.json", "w") as out:
        # Note: (ts2, ck) appears three times, so those records share a timestamp and command key.
        for i, (ts_out, ck_out) in enumerate([(ts, ck), (ts, ck2), (ts2, ck), (ts2, ck2), (ts2, ck), (ts2, ck)]):
            print(fill_record(synthetic, ts_out, ck_out, i), file=out)
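
For completeness, here is a rough sketch of pushing those synthetic records straight into Loki, bypassing Promtail entirely. It assumes a local Loki with auth_enabled: false at http://localhost:3100, and the label set is arbitrary:

import json
import requests

LOKI_PUSH_URL = "http://localhost:3100/loki/api/v1/push"  # assumption: local single-binary Loki

def push_file(path, labels):
    values = []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # The template above stores time.time() (seconds) under record.time.timestamp;
            # Loki's push API wants nanosecond epoch timestamps as strings.
            ts = json.loads(line)["record"]["time"]["timestamp"]
            values.append([str(int(ts * 1e9)), line])
    payload = {"streams": [{"stream": labels, "values": values}]}
    requests.post(LOKI_PUSH_URL, json=payload).raise_for_status()

push_file("example_logs/synthetic.15.json",
          {"job": "container_logs", "image": "synthetic_15"})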

I think the following block of code is the most likely culprit: https://github.com/grafana/loki/commit/793a689d1ffd78333fe651d81addfaa180ae8886#diff-89016a3dad2dc806b1c9db8e6232914701a99a7908ba0ac37742b27d03e8f618L233

[0] https://hub.docker.com/layers/grafana/loki/main-d24fe3e-amd64/images/sha256-71016ac17acb8f5d8d706a94d5405e05728d038d65a857b1ca77aaa027a8e59d?context=explore
[1] https://hub.docker.com/layers/grafana/loki/main-793a689-amd64/images/sha256-3a20c3e72dbd4734e19ee32910811512e2d027387f6fc6079f32e37fb4f86558?context=explore

praccu commented 2 months ago

To clarify why I generated the data I generated: in the data @kilo-gi shared, the 7 dropped records all share identical timestamps and, AFAIK, effectively identical lines in this case.

Studying this, I am wondering if this isn't a bug but rather a bugfix, and whether head is working as intended (WAI). Is (timestamp, line) considered sufficient to uniquely identify a record?
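
For illustration, here is a toy model (not Loki's actual code) of what keying results on (timestamp, line) would do: any records that share both values survive only once, which matches the pattern of the dropped entries above.

def dedupe_by_timestamp_and_line(entries):
    # Keep the first occurrence of each (timestamp, line) pair, drop the rest.
    seen = set()
    kept = []
    for ts, line in entries:
        if (ts, line) in seen:
            continue
        seen.add((ts, line))
        kept.append((ts, line))
    return kept

entries = [
    (1717710000, "saved checkpoint at batch 75"),
    (1717710000, "saved checkpoint at batch 75"),   # duplicate (timestamp, line): dropped
    (1717710000, "saved checkpoint at batch 76"),
    (1717710001, "saved checkpoint at batch 75"),
]
print(len(entries), "->", len(dedupe_by_timestamp_and_line(entries)))  # prints: 4 -> 3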