grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Duplicate logs in case of using json parser in loki #7482

Open 17billion opened 2 years ago

17billion commented 2 years ago

Describe the bug: When the json parser is used, logs appear to be duplicated.

To Reproduce: It only happens with certain labels. The difference between the affected label and the others is that its log volume is large (about 10k logs processed per second).

Expected behavior: Each log should be shown only once.

Environment:

Screenshots, Promtail config, or terminal output

loki config

server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000
  http_server_write_timeout: 2m
  http_server_read_timeout: 2m

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - loki-memberlist

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
  remote_timeout: 1s

querier:
  query_ingesters_within: 4h # twice the max-chunk age for safety buffer
  max_concurrent: 8

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  chunk_idle_period: 30m
  chunk_target_size: 2097152
  chunk_encoding: snappy
  chunk_retain_period: 15m
  max_chunk_age: 2h
  max_transfer_retries: 0
  autoforget_unhealthy: true
  wal:
    enabled: true
    dir: /var/loki/wal
    replay_memory_ceiling: 4GB # should be set up to ~50% of available memory

limits_config:
  ingestion_rate_mb: 2048
  ingestion_burst_size_mb: 2048
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  max_streams_per_user: 0
  max_global_streams_per_user: 50000
  per_stream_rate_limit: 60MB
  per_stream_rate_limit_burst: 60MB
  split_queries_by_interval: 15m

schema_config:
  configs:
    - from: 2020-09-07
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
    - from: 2022-06-29
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: loki_index_v12_
        period: 24h

storage_config:
  index_queries_cache_config:
    memcached:
      expiration: 15m
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: 'memcached-index-queries'
      service: memcache
      timeout: 1s
  boltdb_shipper:
    shared_store: s3
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
    query_ready_num_days: 7
  aws:
    s3: s3://ap-northeast-2/loki-data-production

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    memcached:
      expiration: 30m
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: 'memcached-chunks'
      service: memcache
      timeout: 1s

query_range:
  align_queries_with_step: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 30m
      memcached_client:
        consistent_hash: true
        host: 'memcached-frontend'
        service: memcache
        max_idle_conns: 16
        timeout: 1s
        update_interval: 1m

frontend_worker:
  parallelism: 4
  frontend_address: loki-query-frontend-headless:9095
  match_max_concurrent: false
  grpc_client_config:
    max_send_msg_size: 104857600

frontend:
  log_queries_longer_than: 30s
  compress_responses: true
  max_outstanding_per_tenant: 2048

compactor:
  shared_store: s3
  compaction_interval: 15m

analytics:
  reporting_enabled: false

logql: {role="api",service="odr"} returns normal logs

logql: {role="api",service="odr"} | json returns duplicated logs (json parser)
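One way to quantify the duplication at query time is to compare the entry counts with and without the parser (a sketch using the same selector as above; the 5m range is illustrative):

sum(count_over_time({role="api",service="odr"} [5m]))

sum(count_over_time({role="api",service="odr"} | json [5m]))

If the json parser only added extracted labels, both counts would match; a higher second count indicates duplicated result lines.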

dannykopping commented 2 years ago

@17billion thanks for this; could you please provide an example of those logs that are duplicated? This will help us to build a test to replicate the issue. Feel free to redact the information in the logs; I'm just interested in the structure.

17billion commented 2 years ago

I have JSON-filtered the log below with fluentd. (Log flow: fluent-bit -> fluentd -> loki)

100.1.87.80 - - [26/Oct/2022:10:15:10 +0900] "GET /v1/region/rgn3?lat=37.31131439&lng=126.82345896 HTTP/1.1" 200 140 "https://p....block/" "Mozilla/5.0 (Linux; Android 12; SM-G981N Build/SP1A.210812
.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/106.0.5249.126 Mobile Safari/537.36" 0.004 0.004 "127.0.0.1:8080" "111.11.11.111, 100.1.111.9,127.0.0.1, 100.11.111.111" -

fluentd filter

    # nginx
    <filter tail.ec2.**.nginx>
      @type parser
      key_name message
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time_local
          time_type string
          time_format %d/%b/%Y:%H:%M:%S %z
        </pattern>
        <pattern>
          format regexp
          expression /^(?<remote_addr>[^\s]*)\s-\s(?<remote_user>[^\s]*)\s\[(?<time_local>[^\]]*)\]\s"(?<request_method>\S+)(?: +(?<request_path>[^\"]*?)(?: +\S*)?)?"\s(?<status>\d*)\s(?<body_bytes_sent>\d*)\s"(?<http_referer>[^\s]*)"\s"(?<http_user_agent>[^\"]*)"\s(?<request_time>[\d.]+)\s(?<upstream_response_time>[\d.]+)\s"(?<upstream_addr>[^\"]*)"\s"(?<http_x_forwarded_for>[^\"]*)"\s(?<upstream_cache_status>[^\"]*)/
          time_format %d/%b/%Y:%H:%M:%S %z
          time_key time_local
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>
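For context, the regexp pattern above is the one that matches the sample access log line; a sketch of the JSON record it would produce is shown below. Field names come from the regexp capture groups and values from the sample line; time_local is used as the event time and, by default, is not kept in the record, the truncated referer is left as-is, and any extra fields retained by reserve_data true are omitted:

{
  "remote_addr": "100.1.87.80",
  "remote_user": "-",
  "request_method": "GET",
  "request_path": "/v1/region/rgn3?lat=37.31131439&lng=126.82345896",
  "status": "200",
  "body_bytes_sent": "140",
  "http_referer": "https://p....block/",
  "http_user_agent": "Mozilla/5.0 (Linux; Android 12; SM-G981N Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/106.0.5249.126 Mobile Safari/537.36",
  "request_time": "0.004",
  "upstream_response_time": "0.004",
  "upstream_addr": "127.0.0.1:8080",
  "http_x_forwarded_for": "111.11.11.111, 100.1.111.9,127.0.0.1, 100.11.111.111",
  "upstream_cache_status": "-"
}

This is the shape of line that the | json stage in the queries above is parsing.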

Grafana query result (some values obscured for security): normal query result (screenshot)

Query result using | json: appears to contain duplicates (screenshot)

17billion commented 2 years ago

Reopening (sorry, I clicked the wrong button).

17billion commented 2 years ago

If I send the same logs to a Loki instance in my local environment, the issue does not occur, but it is still happening in the production environment right now. A Grafana lookup returns more logs when using | json with the same query (meaning many duplicate logs).

(Screenshot: 2022-10-26 3:49:31 PM)

dannykopping commented 2 years ago

Thank you for the detail :+1:

o-TvT-o commented 1 year ago

I solved this problem by changing the row_shards value.

https://grafana.com/docs/loki/latest/configuration/

chunks:
  [prefix: <string>]
  [period: <duration> | default = 168h]
  [tags: <map of string to string>]

[row_shards: <int> | default = 16]

rozhao2 commented 1 year ago

Hi @o-TvT-o ,

Thank you for sharing this great news; could you please provide more information?

Which version are you on, and what value did you change row_shards to in order to solve the issue?

Thanks.

o-TvT-o commented 1 year ago


The version is 2.7.0, with the following schema config:


schema_config:
  configs:
  - from: 2020-08-01
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: index_
      period: 24h
    chunks:
      prefix: chunks_
      period: 24h
    row_shards: 1

It seems that when S3 storage is enabled, duplicate logs are stored, and this is associated with the row_shards configuration; after changing it to 1, there are no more duplicate logs.

This configuration has been proven in my production environment.

rucciva commented 1 year ago

I faced the same problem, and setting row_shards to 1 as shown by @o-TvT-o solved it.

I'm just wondering whether it will have any negative impact when using the S3 backend.

rucciva commented 1 year ago

I've found out that setting query_range.parallelise_shardable_queries to false while keeping row_shards at its default (16) also avoids the duplicate logs. For context, I'm using the simple scalable deployment with additional query-frontend instances. A sketch of the corresponding config change is shown below.
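For reference, in the Loki config from the original report this amounts to adding one setting under the existing query_range block (a minimal sketch; parallelise_shardable_queries defaults to true, and disabling it trades away query-time sharding):

query_range:
  parallelise_shardable_queries: false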

17billion commented 1 year ago

Changing row_shards to 1 seems to have fixed the problem. @rucciva, is it correct to set parallelise_shardable_queries to false? I wonder if it only looks that way because the previous data had already been processed with row_shards: 1 set.

Also, setting row_shards to 1 seems to affect performance, so I suspect I haven't found a fundamental solution yet.

rucciva commented 1 year ago

is it correct to set parallelise_shardable_queries to false

Hi @17billion, I'm not sure whether it is correct or not, but that is the workaround I'm currently using. I guess it might have an impact on query performance, though.

I wonder if it looks like that because the previous data has already been processed with row_shards: 1 set.

I don't think so, since I have created a new cluster with the default row_shards and parallelise_shardable_queries set to false, and no duplicate logs are returned from search.

17billion commented 1 year ago

Similar issues are still occurring (Loki version: 2.8.4). The logs are all identical (the billing number in the log is a value that cannot be duplicated); the only difference is the fluentd_thread label, flush_thread_x. (See the "dup issue" screenshot.)

The strange thing is that when Loki's load is high and fluentd retries a lot, a lot of duplication occurs. (If Loki is stable and there are no retries from fluentd, duplication rarely occurs.)

Currently, row_shards and parallelise_shardable_queries have not been changed, since I want to solve the fundamental problem. I think Fluentd, which sits in front of Loki, has an influence, similar to the issues below.

Is there a way to tune the settings of the fluentd agent described below? (Our flow: Fluent-bit -> Fluentd -> Loki) https://grafana.com/docs/loki/latest/send-data/fluentd/
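For what it's worth, a hedged sketch of the kind of output-side tuning the linked fluentd plugin page allows, aimed at reducing flush threads and bounding retries. The match tag, endpoint URL, and all values here are assumptions for illustration, not recommendations; the buffer options are standard fluentd settings:

    <match tail.ec2.**>
      @type loki
      url "http://loki-distributor:3100"   # assumed Loki endpoint
      extra_labels {"role":"api","service":"odr"}
      <buffer>
        flush_thread_count 1               # a single flush thread avoids multiple fluentd_thread label values
        flush_interval 5s
        retry_type exponential_backoff
        retry_wait 2s
        retry_max_times 5                  # bound retries so a struggling Loki is not re-sent the same chunks indefinitely
        chunk_limit_size 1m
        overflow_action block
      </buffer>
    </match>

Whether bounding retries like this is acceptable depends on how much log loss is tolerable, since chunks that exhaust their retries are discarded; it trades duplicates for potential gaps.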

lake2 commented 3 months ago

same issue + 1