grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Migrated data never becomes queryable. #6905

Open bcarlock-emerge opened 2 years ago

bcarlock-emerge commented 2 years ago

Describe the bug
Boltdb-shipper data migrated between storage accounts never becomes queryable.

To Reproduce
Steps to reproduce the behavior:

  1. Create a new Azure storage account
  2. Start Loki 2.6.1, pointing at the new storage account
  3. Copy the data over (using the Loki migrate tool or azcopy; see the sketch below)
  4. Query any time range prior to application start
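
As a sketch of step 3 with azcopy (account names, container, and SAS tokens are placeholders), a server-to-server copy of the whole container looks roughly like:

azcopy copy \
  'https://<old-account>.blob.core.windows.net/<container>?<SAS>' \
  'https://<new-account>.blob.core.windows.net/<container>?<SAS>' \
  --recursive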

Expected behavior
Data is accessible and can be queried.
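
A concrete version of step 4 against the query_range API (the times and label selector here are illustrative; any range ending before the Loki process started behaves the same way):

curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={cluster="aks.dev"}' \
  --data-urlencode 'start=2022-07-01T00:00:00Z' \
  --data-urlencode 'end=2022-07-02T00:00:00Z' \
  --data-urlencode 'limit=10'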

Environment:

Screenshots, Promtail config, or terminal output
[two screenshots attached in the original issue]

Loki config:

auth_enabled: false

tracing:
  enabled: true

server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 52428800
  grpc_server_max_send_msg_size: 52428800

distributor:
  ring:
    kvstore:
      store: memberlist

memberlist:
  join_members:
    - loki-grafana-loki-gossip-ring
  randomize_node_name: true
  gossip_to_dead_nodes_time: 200s
  dead_node_reclaim_time: 300s
  rejoin_interval: 330s

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  max_chunk_age: 2h
  chunk_idle_period: 2h
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_retain_period: 10m
  max_transfer_retries: 0
  wal:
    dir: /tmp/loki/wal

limits_config:
  ingestion_rate_strategy: local
  ingestion_burst_size_mb: 100
  max_streams_per_user: 100000
  ingestion_rate_mb: 500
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 336h
  max_cache_freshness_per_query: 1m
  split_queries_by_interval: 24h
  max_query_series: 100000
  max_query_parallelism: 400
  max_query_lookback: 336h
  retention_period: 336h
  retention_stream:
    - selector: '{cluster="aks.dev"}'
      priority: 1
      period: 336h
    - selector: '{cluster="aks.demo"}'
      priority: 1
      period: 300h
    - selector: '{cluster="aks.prod"}'
      priority: 1
      period: 336h

schema_config:
  configs:
  - from: 2022-06-19
    store: boltdb-shipper
    object_store: azure
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  azure:
    account_name: <STORAGE ACCOUNT>
    container_name: <CONTAINER>
    use_managed_identity: true
  boltdb_shipper:
    shared_store: azure
    active_index_directory: /tmp/loki/loki/index
    cache_location: /tmp/loki/loki/cache
    cache_ttl: 336h
    index_gateway_client:
      server_address: dns:///loki-grafana-loki-index-gateway:9095
  filesystem:
    directory: /tmp/loki/chunks
  index_queries_cache_config:
    memcached:
      expiration: 168h
      batch_size: 2048
      parallelism: 100
    memcached_client:
      consistent_hash: true
      addresses: dns+loki-memcachedindexqueries:11211
      service: http

chunk_store_config:
  max_look_back_period: 336h
  chunk_cache_config:
    memcached:
      expiration: 168h
      batch_size: 2048
      parallelism: 100
    memcached_client:
      consistent_hash: true
      addresses: dns+loki-memcachedchunks:11211
  write_dedupe_cache_config:
    memcached:
      expiration: 168h
      batch_size: 2048
      parallelism: 100
    memcached_client:
      consistent_hash: true
      addresses: dns+loki-memcachedindexwrites:11211

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

query_range:
  align_queries_with_step: true
  parallelise_shardable_queries: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        consistent_hash: true
        addresses: dns+loki-memcachedfrontend:11211
        max_idle_conns: 16
        timeout: 500ms
        update_interval: 1m
        max_item_size: 134217728

querier:
  max_concurrent: 20480
  query_ingesters_within: 3h
  query_timeout: 10m
  engine:
    timeout: 10m

frontend_worker:
  grpc_client_config:
      max_send_msg_size: 104857600
      grpc_compression: 'snappy'
  # parallelism: 40
  match_max_concurrent: true

query_scheduler:
  max_outstanding_requests_per_tenant: 20480
  grpc_client_config:
      max_send_msg_size: 104857600
      grpc_compression: 'snappy'

frontend:
  max_outstanding_per_tenant: 20480
  log_queries_longer_than: 15s
  compress_responses: true
  tail_proxy_url: http://loki-grafana-loki-querier:3100

compactor:
  shared_store: azure
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

ruler:
  enable_api: true
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  ring:
    kvstore:
      store: memberlist
  wal:
    dir: /tmp/loki/wal
  rule_path: /tmp/loki/scratch
  alertmanager_url: http://prometheus-alertmanager.monitoring:9093
  enable_alertmanager_v2: true
  remote_write:
    enabled: true
    client:
      url: http://prometheus-prometheus.monitoring:9090/api/v1/write

jqnote commented 2 years ago

I have been working on this for two days, and there's a strange issue: Loki saved log data to S3 under a path that includes the bucket name, like s3://bucket/index/index_12345, which is unexpected. The querier still queries data under the root path, like s3://index/index_12345.
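
For anyone comparing paths, a minimal sketch of the relevant S3 settings (region and bucket name here are hypothetical); the path that writers upload to and the path the querier reads from are both derived from this block, so every component needs identical values:

storage_config:
  aws:
    region: us-east-1     # hypothetical
    bucketnames: bucket   # queriers and ingesters must agree on this
  boltdb_shipper:
    shared_store: s3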

bcarlock-emerge commented 2 years ago

In our case the issue was probably just unclear documentation. In the bit of our config below, setting reject_old_samples to false allowed the migrated logs to be queried. I'm not sure you should have to disable reject_old_samples when all of the samples are newer than reject_old_samples_max_age... but in our case, we did.

limits_config:
  ingestion_rate_strategy: local
  ingestion_burst_size_mb: 100
  max_streams_per_user: 100000
  ingestion_rate_mb: 500
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 336h
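
For reference, the change that made the migrated data queryable was flipping that one flag:

limits_config:
  reject_old_samples: false          # was true; migrated data became queryable
  reject_old_samples_max_age: 336h   # unchanged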