grafana / helm-charts

[loki-distributed] loki configuration with aws s3 in the example doesn't work #564

Open ilovemysillybanana opened 3 years ago

ilovemysillybanana commented 3 years ago

The example at this location is incomplete and does not work. According to this question on Stack Overflow, it is missing a compactor section.

Even with a compactor added, I am not able to get it to work. Loki comes up in microservices mode; however, it is not pushing anything to my S3 bucket.

This is what my configuration looks like:

    auth_enabled: false

    server:
      log_level: info
      # Must be set to 3100
      http_listen_port: 3100

    distributor:
      ring:
        kvstore:
          store: memberlist

    ingester:
      # Disable chunk transfer which is not possible with statefulsets
      # and unnecessary for boltdb-shipper
      max_transfer_retries: 0
      chunk_idle_period: 1h
      chunk_target_size: 1536000
      max_chunk_age: 1h
      lifecycler:
        join_after: 0s
        ring:
          kvstore:
            store: memberlist

    memberlist:
      join_members:
        - {{ include "loki.fullname" . }}-memberlist

    limits_config:
      ingestion_rate_mb: 10
      ingestion_burst_size_mb: 20
      max_concurrent_tail_requests: 20
      max_cache_freshness_per_query: 10m

    schema_config:
      configs:
        - from: 2020-09-07
          store: boltdb-shipper
          object_store: aws
          schema: v11
          index:
            prefix: loki_index_
            period: 24h

    storage_config:
      aws:
        s3: s3://${s3_bucket_region}
        bucketnames: ${s3_bucket_name}
        access_key_id: ${access_key_id}
        secret_access_key: ${secret_access_key}
      boltdb_shipper:
        active_index_directory: /var/loki/index
        shared_store: s3
        cache_location: /var/loki/cache

    query_range:
      # make queries more cache-able by aligning them with their step intervals
      align_queries_with_step: true
      max_retries: 5
      # parallelize queries in 15min intervals
      split_queries_by_interval: 15m
      cache_results: true

      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h

    frontend_worker:
      frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095

    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100

    compactor:
      working_directory: /var/loki/boltdb-shipper-compactor
      shared_store: aws

The compactor logs don't say much:

level=info ts=2021-07-15T01:58:19.79084079Z caller=main.go:130 msg="Starting Loki" version="(version=2.2.1, branch=HEAD, revision=babea82ef)"
level=info ts=2021-07-15T01:58:19.791254715Z caller=server.go:229 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-07-15T01:58:19.792409718Z caller=module_service.go:59 msg=initialising module=server
level=info ts=2021-07-15T01:58:19.793146855Z caller=module_service.go:59 msg=initialising module=compactor
level=info ts=2021-07-15T01:58:19.794245258Z caller=loki.go:248 msg="Loki started"

The querier responds with this when I try to test my configuration via Grafana:

level=warn ts=2021-07-15T02:02:25.281232377Z caller=logging.go:71 traceID=5aef29e44bdd988a msg="GET /loki/api/v1/labels?end=1626314545277009321&start=1626313944983000000 (500) 63.894µs Response: \"too many unhealthy instances in the ring\\n\" ws: false; X-Scope-Orgid: fake; uber-trace-id: 5aef29e44bdd988a:0b792c2379068184:60f9956a8e6da4ee:0; "
level=info ts=2021-07-15T02:03:29.653260267Z caller=table_manager.go:208 msg="syncing tables"

I've noticed that all of the components seem to take a while to connect to the "ring", and they usually report finding only 1 or 2 instances, as the following example from the loki-distributed-ingester service shows:

level=info ts=2021-07-15T01:59:02.868091883Z caller=memberlist_client.go:521 msg="joined memberlist cluster" reached_nodes=1

Could this be because they're doing everything in memory instead of through AWS/S3? Any guidance would be appreciated.
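
A note on the question above: with kvstore: memberlist, the ring state is gossiped in memory by design, so that part is expected; chunks only reach S3 once the ingesters flush them (after chunk_idle_period or max_chunk_age). On the storage side, Loki's aws client also accepts the region as an explicit region field instead of packing it into the s3 URL, which is harder to get wrong. A minimal sketch of an equivalent storage_config, keeping the same placeholder variables (this alone will not clear the ring error):

    storage_config:
      aws:
        # placeholders as above - substitute your bucket's region and name
        region: ${s3_bucket_region}
        bucketnames: ${s3_bucket_name}
        access_key_id: ${access_key_id}
        secret_access_key: ${secret_access_key}
      boltdb_shipper:
        active_index_directory: /var/loki/index
        shared_store: s3
        cache_location: /var/loki/cache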

grvhi commented 3 years ago

@ilovemysillybanana - did you manage to find a solution?

simonwh commented 3 years ago

Same issue here - any input? Did you find a solution, @ilovemysillybanana and @grvhi?

jdziat commented 3 years ago

@grvhi @simonwh @ilovemysillybanana

I used the config below with some success:

    memberlist:
      randomize_node_name: false
      join_members:
        - {{ include "loki.fullname" . }}-memberlist.{{ .Release.Namespace }}.svc.cluster.local
    ingester:
      # Disable chunk transfer which is not possible with statefulsets
      # and unnecessary for boltdb-shipper
      max_transfer_retries: 0
      chunk_idle_period: 1h
      chunk_target_size: 1536000
      max_chunk_age: 1h
      # set autoforget_unhealthy to true for troubleshooting purposes
      autoforget_unhealthy: false
      lifecycler:
        join_after: 0s
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
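
For anyone hitting the same "too many unhealthy instances in the ring" error: the key line above is likely replication_factor: 1. With the default replication factor of 3, reads and writes need a quorum of healthy ingesters, so a small deployment (or stale ring entries left behind by restarted pods) fails that check. autoforget_unhealthy tells the ingester to drop unhealthy ring entries automatically (check that your Loki version supports it; the 2.2.1 build in the logs above predates it). In the loki-distributed chart this YAML goes under the loki.config value, which the chart renders through tpl - that is what makes the {{ include ... }} helpers resolve. A sketch of the values.yaml placement, assuming that chart layout:

    # values.yaml for the loki-distributed chart (sketch)
    loki:
      config: |
        memberlist:
          randomize_node_name: false
          join_members:
            # fully-qualified service name so members resolve across namespaces
            - {{ include "loki.fullname" . }}-memberlist.{{ .Release.Namespace }}.svc.cluster.local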