grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

The documentation says `use_boltdb_shipper_as_backup` defaults to `false` for `tsdb_shipper`, however `tsdb_shipper` won't work unless there is either a `boltdb_shipper` config or `use_boltdb_shipper_as_backup` is explicitly declared and set to `false` #9603

Closed steveannett closed 1 year ago

steveannett commented 1 year ago

I'm running Loki v2.8.2, using Helm Chart version 5.5.12, deploying onto Kubernetes via the Kustomize tool.

Describe the bug

When using tsdb_shipper without the value tsdb_shipper.use_boltdb_shipper_as_backup set to false, or without a boltdb_shipper configuration as a backup, the loki-read pods fail with the following error:

Unrecognized storage client , choose one of: aws, s3, gcs, azure, filesystem
error initialising module: index-gateway
github.com/grafana/dskit/modules.(*Manager).initModule
 /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:122
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
 /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:92
github.com/grafana/loki/pkg/loki.(*Loki).Run
 /src/loki/pkg/loki/loki.go:422
main.main
 /src/loki/cmd/loki/main.go:105
runtime.main
 /usr/local/go/src/runtime/proc.go:250
runtime.goexit
 /usr/local/go/src/runtime/asm_amd64.s:1594

To Reproduce

Steps to reproduce the behavior:

Use the following config:

      loki:
        enabled: true
        auth_enabled: false
        commonConfig:
          path_prefix: /var/loki
          replication_factor: 1
        storage:
          bucketNames:
            chunks: chunks-bucket-name
          type: s3
          s3:
            s3: s3://us-east-1/chunks-bucket-name
            region: us-east-1
            insecure: false
            s3ForcePathStyle: true
        storage_config:
          tsdb_shipper:
            active_index_directory: /var/loki/tsdb-index
            shared_store: s3
            cache_location: /var/loki/tsdb-cache
            cache_ttl: 24h
        schemaConfig:
          configs:
            - from: "2023-01-01"
              store: tsdb
              object_store: s3
              schema: v12
              index:
                prefix: loki_index_
                period: 24h
        rulerConfig:
          storage:
            type: local
            local:
              directory: /var/loki/rules
        compactor:
          working_directory: /var/loki/compactor
          shared_store: s3
        index_gateway:
          mode: simple
        query_scheduler:
          max_outstanding_requests_per_tenant: 32768
        querier:
          max_concurrent: 16

Expected behavior

In the documentation at https://grafana.com/docs/loki/latest/configuration/ it states that use_boltdb_shipper_as_backup defaults to false, so the expected behavior is that everything starts correctly. However, Loki won't start unless the value tsdb_shipper.use_boltdb_shipper_as_backup is explicitly set to false, or a boltdb_shipper configuration has been added.

  # Use boltdb-shipper index store as backup for indexing chunks. When enabled,
  # boltdb-shipper needs to be configured under storage_config
  # CLI flag: -tsdb.shipper.use-boltdb-shipper-as-backup
  [use_boltdb_shipper_as_backup: <boolean> | default = false]

Environment: Loki v2.8.2, using Helm Chart version 5.5.12, deploying onto EKS Kubernetes 1.23 via the Kustomize tool.

Workaround

Either add a boltdb_shipper configuration to storage_config and schemaConfig, or set tsdb_shipper.use_boltdb_shipper_as_backup to false. This allows the loki-read pods to run correctly. A minimal sketch of the second option is shown below.
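
A minimal sketch of the workaround, reusing the storage_config from the reproduction above; the only addition is the last line, which sets the documented flag explicitly:

storage_config:
  tsdb_shipper:
    active_index_directory: /var/loki/tsdb-index
    shared_store: s3
    cache_location: /var/loki/tsdb-cache
    cache_ttl: 24h
    use_boltdb_shipper_as_backup: false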

chaudum commented 1 year ago

Hey @steveannett, as you correctly stated, use_boltdb_shipper_as_backup should default to false and therefore not require any boltdb_shipper storage config.

I quickly tested -target=read with this config:

schema_config:
  configs:
  - from: 2022-02-08
    schema: v12
    store: tsdb
    object_store: filesystem
    index:
      prefix: index_tsdb_
      period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/cache
    shared_store: filesystem
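
(For reference, a run along these lines exercises the read target against that config; the config file path here is just an assumed example.)

loki -config.file=/tmp/loki/config.yaml -target=read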

However, I could not reproduce the issue.

Could you please also post the contents of your generated config.yaml (ConfigMap)?

steveannett commented 1 year ago

Hi @chaudum, thanks for looking into this - here is the generated config.yaml:

apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false
    common:
      compactor_address: 'loki-read'
      path_prefix: /var/loki
      replication_factor: 1
      storage:
        s3:
          bucketnames: ${LOKI_S3_BUCKET}
          insecure: false
          region: ap-east-1
          s3: s3://ap-east-1/${LOKI_S3_BUCKET}
          s3forcepathstyle: true
    compactor:
      shared_store: s3
      working_directory: /var/loki/compactor
    ingester:
      chunk_idle_period: 3m
      chunk_retain_period: 1m
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    querier:
      engine:
        timeout: 5m
      max_concurrent: 16
      query_timeout: 5m
    query_range:
      align_queries_with_step: true
    query_scheduler:
      max_outstanding_requests_per_tenant: 32768
    ruler:
      alertmanager_url: http://_http-web._tcp.alertmanager-operated.monitoring.svc.cluster.local:9093
      enable_alertmanager_v2: true
      enable_api: true
      ring:
        kvstore:
          store: inmemory
      rule_path: /var/loki/rules-temp
      storage:
        local:
          directory: /var/loki/rules
        type: local
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2023-01-01"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: tsdb
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
    storage_config:
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      tsdb_shipper:
        active_index_directory: /var/loki/tsdb-index
        cache_location: /var/loki/tsdb-cache
        cache_ttl: 24h
        shared_store: s3
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/instance: loki-instance
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: loki
    app.kubernetes.io/version: 2.7.3
    helm.sh/chart: loki-4.8.0
  name: loki
  namespace: monitoring

This was generated using the following kustomization.yaml:

namespace: monitoring
helmCharts:
  # Loki Logging (see https://github.com/grafana/loki/tree/main/production/helm/loki also very useful guide at https://rtfm.co.ua/en/grafana-loki-architecture-and-running-in-kubernetes-with-aws-s3-storage-and-boltdb-shipper/)
  - name: loki
    version: 5.10.0 # App version 2.8.3
    namespace: monitoring
    releaseName: loki-instance
    repo: https://grafana.github.io/helm-charts
    valuesInline:
      loki:
        enabled: true
        auth_enabled: false
        isDefault: false # So it doesn't override Prometheus
        commonConfig:
          path_prefix: /var/loki
          replication_factor: 1
        storage:
          bucketNames:
            chunks: ${LOKI_S3_BUCKET}
          type: s3
          s3:
            s3: s3://ap-east-1/${LOKI_S3_BUCKET}
            region: ap-east-1
            insecure: false
            s3ForcePathStyle: true
            sse_encryption: true
            sse:
              type: "SSE-S3"
        storage_config:
          tsdb_shipper:
            active_index_directory: /var/loki/tsdb-index
            shared_store: s3
            cache_location: /var/loki/tsdb-cache
            cache_ttl: 24h
        schemaConfig:
          configs:
            - from: "2023-01-01"
              store: tsdb
              object_store: s3
              schema: v12
              index:
                prefix: loki_index_
                period: 24h
        rulerConfig:
          storage:
            type: local
            local:
              directory: /var/loki/rules
          rule_path: "/var/loki/rules-temp"
          ring:
            kvstore:
              store: inmemory
          alertmanager_url: http://_http-web._tcp.alertmanager-operated.monitoring.svc.cluster.local:9093
          enable_alertmanager_v2: true
          enable_api: true
        compactor:
          working_directory: /var/loki/compactor
          shared_store: s3
        index_gateway:
          mode: simple
        query_scheduler:
          # TSDB sends more requests, so increase the pending request queue sizes (https://grafana.com/docs/loki/latest/operations/storage/tsdb/)
          max_outstanding_requests_per_tenant: 32768
        querier:
          # Each `querier` component process runs a number of parallel workers to process queries simultaneously.
          # but we find the most success running at around `16` with tsdb (https://grafana.com/docs/loki/latest/operations/storage/tsdb/)
          max_concurrent: 16
          engine:
              timeout: 5m
          query_timeout: 5m
        ingester:
          # Flush chunks that don't receive new data
          chunk_idle_period: 3m
          # Keep flushed chunks in memory for a duration
          chunk_retain_period: 1m
      monitoring:
        dashboards: # Grafana Dashboards
          enabled: true
        rules:
          enabled: true
          alerting: true
        serviceMonitor: # For alerts etc
          enabled: true
        selfMonitoring:
          enabled: false
          grafanaAgent:
            installOperator: false
        lokiCanary:
          enabled: false
      test:
        enabled: false
      write:
        replicas: 2
        extraArgs:
          - -config.expand-env=true
        extraEnvFrom:
          - configMapRef:
                name: loki-s3-storage
        resources:
          limits:
            memory: 1.5Gi
          requests:
            memory: 1.5Gi
            cpu: "0.1"
      read:
        replicas: 1
        extraArgs:
          - -config.expand-env=true
        extraEnvFrom:
          - configMapRef:
                name: loki-s3-storage
        resources:
          limits:
            memory: 3Gi
          requests:
            memory: 3Gi
            cpu: "0.5"
      backend:
        replicas: 1
        extraVolumeMounts:
          - name: rules
            mountPath: "/var/loki/rules/fake"
        extraVolumes:
          - name: rules
            configMap:
              name: loki-alerting-rules
      memberlist:
        service:
          publishNotReadyAddresses: false
      gateway:
        replicas: 1
      # This service account is already created by eksctl
      serviceAccount:
        create: false
        name: loki-sa
        annotations:
          eks.amazonaws.com/role-arn: "arn:aws:iam::00000000000:role/loki_s3_role"

chaudum commented 1 year ago

Thanks @steveannett. I tried the resulting config with Loki 2.8.2 (outside of Kubernetes) and could not reproduce the error either, so I thought there might be an issue with Helm.

However, I also tested your values (taken from kustomization.yaml) and installed the chart with the following command:

helm --kube-context k3d upgrade loki grafana/loki --namespace lokitest --version 5.5.12 --values values.yaml

The loki-read pod starts up correctly, so I am a bit clueless about what the problem could be.

Any chance you could try to upgrade to a later Helm chart version?
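
For example, something along these lines, using the same command as above but pointing at the 5.10.0 chart already referenced in your kustomization.yaml (the version is only an illustration):

helm --kube-context k3d upgrade loki grafana/loki --namespace lokitest --version 5.10.0 --values values.yaml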