grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
23.92k stars 3.45k forks

Clean installation documentation. #12773

Open mvtab opened 6 months ago

mvtab commented 6 months ago

Is your feature request related to a problem? Please describe. I stumbled upon Loki while looking for a centralized logging solution for Kubernetes. I already have a Grafana UI in place, so I needed only Promtail and Loki. Installing Promtail? No problem whatsoever. Installing Loki? Surprisingly complicated, and not a single piece of documentation out there gives the impression its authors know what they are saying.

I tried the official way: using Helm (the recommended way). As I was only trying it out, I went with the monolithic installation. Error:

Error: INSTALLATION FAILED: execution error at (loki/templates/validate.yaml:31:4): You have more than zero replicas configured for both the single binary and simple scalable targets. If this was intentional change the deploymentMode to the transitional 'SingleBinary<->SimpleScalable' mode

I added deploymentMode: SingleBinary<->SimpleScalable as suggested, and got another error:

Error: INSTALLATION FAILED: execution error at (loki/templates/validate.yaml:40:4): You must provide a schema_config for Loki, one is not provided as this will be individual for every Loki cluster. See https://grafana.com/docs/loki/latest/operations/storage/schema/ for schema information. For quick testing (with no persistence) add `--set loki.useTestSchema=true`

I added loki.useTestSchema: true as suggested, and this time helm install succeeded, but then:

2024/04/24 10:47:36 [crit] 13#13: *54 connect() to 10.215.155.137:3100 failed (1: Operation not permitted) while connecting to upstream, client: 10.214.3.37, server: , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-8tfvx%22%7D+ HTTP/1.1", upstream: "http://10.215.155.137:3100/loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-8tfvx%22%7D+", host: "loki-gateway.default.svc.cluster.local.:80"
10.214.3.37 - self-monitoring [24/Apr/2024:10:47:36 +0000]  502 "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-8tfvx%22%7D+ HTTP/1.1" 157 "-" "Go-http-client/1.1" "-"

I tried the scalable mode and hit similar errors. After a few troubleshooting steps like the ones above, I again came to a dead end.

In the end, thanks to the helpful communities out there, I discovered grafana/loki-stack, and this seems to do the job, but I am being blasted with errors that I assume come from a poorly configured installation, due to the lack of real official documentation on the subject. (The GitHub loki-stack documentation is minimal at best.)

Describe the solution you'd like I would like official documentation describing how to simply install Loki and Promtail. Ideally there would be both a native Kubernetes YAML way and a Helm way.
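For anyone landing here, a minimal monolithic values.yaml pieced together from later comments in this thread looks roughly like this. This is a sketch only, assuming Helm chart 6.x key names; the schema date and store types are illustrative, not an official recommendation:

```yaml
# Sketch of a minimal single-binary install, compiled from this thread.
deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: 2024-04-01      # any date before your first ingested log
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: index_
          period: 24h
singleBinary:
  replicas: 1
# The chart defaults populate the simple scalable targets, so zero them out.
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
```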

elchenberg commented 6 months ago

I think the documentation that you referred to is missing this value:

deploymentMode: SingleBinary
mvtab commented 6 months ago

Hi @elchenberg, I also considered that. However, the result does not change.

values.yaml

deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
  storage:
    type: 'filesystem'
#  useTestSchema: true
singleBinary:
  replicas: 1

result

Error: INSTALLATION FAILED: execution error at (loki/templates/validate.yaml:31:4): You have more than zero replicas configured for both the single binary and simple scalable targets. If this was intentional change the deploymentMode to the transitional 'SingleBinary<->SimpleScalable' mode
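A plausible explanation for this error (an assumption based on the chart's default values, not something stated in the docs): the chart ships with nonzero read/write/backend replica counts, so setting singleBinary.replicas: 1 alone leaves both target groups populated. Explicitly zeroing the simple scalable targets in values.yaml may get past the validation:

```yaml
# Hypothetical fix: zero out the simple scalable targets
# that the chart enables by default.
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
```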

In the meantime, the grafana/loki-stack has these values:

promtail:
  enabled: true

(I know this is not even required; I just have a system that requires passing in values.)

Simply helm install, and I instantly get what I need:

pod/loki-0                                                1/1     Running   0          82s
pod/loki-promtail-58v4d                                   1/1     Running   0          82s
pod/loki-promtail-6dmpn                                   1/1     Running   0          82s
pod/loki-promtail-fvwnn                                   1/1     Running   0          82s
pod/loki-promtail-hnwtk                                   1/1     Running   0          82s
pod/loki-promtail-hrcrm                                   1/1     Running   0          82s
pod/loki-promtail-k2b8g                                   1/1     Running   0          82s
service/loki                                   ClusterIP      10.215.54.78     <none>           3100/TCP                        82s
service/loki-headless                          ClusterIP      None             <none>           3100/TCP                        82s
service/loki-memberlist                        ClusterIP      None             <none>           7946/TCP                        82s
daemonset.apps/loki-promtail   6         6         6       6            6           <none>          82s
statefulset.apps/loki   1/1     82s
adthonb commented 6 months ago

@elchenberg Also, the logic in validate.yaml is not correct. It doesn't check deploymentMode: SingleBinary against the other values.

You can use this values.yaml for a SingleBinary deployment:

deploymentMode: SingleBinary
loki:
  schemaConfig:
    configs:
      - from: 2024-04-01
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: index_
          period: 24h
singleBinary:
  replicas: 1
# Zero out replica counts of the Simple Scalable deployment mode
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
mvtab commented 6 months ago

Hi @adthonb I really don't know; there must be something wrong with the whole setup. I ran the install with your values and got this:

Error: INSTALLATION FAILED: template: loki/templates/single-binary/statefulset.yaml:44:28: executing "loki/templates/single-binary/statefulset.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "{{- if .Values.enterprise.enabled}}\n{{- tpl .Values.enterprise.config . }}\n{{- else }}\nauth_enabled: {{ .Values.loki.auth_enabled }}\n{{- end }}\n\n{{- with .Values.loki.server }}\nserver:\n  {{- toYaml . | nindent 2}}\n{{- end}}\n\npattern_ingester:\n  enabled: {{ .Values.loki.pattern_ingester.enabled }}\n\nmemberlist:\n{{- if .Values.loki.memberlistConfig }}\n  {{- toYaml .Values.loki.memberlistConfig | nindent 2 }}\n{{- else }}\n{{- if .Values.loki.extraMemberlistConfig}}\n{{- toYaml .Values.loki.extraMemberlistConfig | nindent 2}}\n{{- end }}\n  join_members:\n    - {{ include \"loki.memberlist\" . }}\n    {{- with .Values.migrate.fromDistributed }}\n    {{- if .enabled }}\n    - {{ .memberlistService }}\n    {{- end }}\n    {{- end }}\n{{- end }}\n\n{{- with .Values.loki.ingester }}\ningester:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- if .Values.loki.commonConfig}}\ncommon:\n{{- toYaml .Values.loki.commonConfig | nindent 2}}\n  storage:\n  {{- include \"loki.commonStorageConfig\" . | nindent 4}}\n{{- end}}\n\n{{- with .Values.loki.limits_config }}\nlimits_config:\n  {{- tpl (. 
| toYaml) $ | nindent 4 }}\n{{- end }}\n\nruntime_config:\n  file: /etc/loki/runtime-config/runtime-config.yaml\n\n{{- with .Values.chunksCache }}\n{{- if .enabled }}\nchunk_store_config:\n  chunk_cache_config:\n    default_validity: {{ .defaultValidity }}\n    background:\n      writeback_goroutines: {{ .writebackParallelism }}\n      writeback_buffer: {{ .writebackBuffer }}\n      writeback_size_limit: {{ .writebackSizeLimit }}\n    memcached:\n      batch_size: {{ .batchSize }}\n      parallelism: {{ .parallelism }}\n    memcached_client:\n      addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-chunks-cache.{{ $.Release.Namespace }}.svc\n      consistent_hash: true\n      timeout: {{ .timeout }}\n      max_idle_conns: 72\n{{- end }}\n{{- end }}\n\n{{- if .Values.loki.schemaConfig }}\nschema_config:\n{{- toYaml .Values.loki.schemaConfig | nindent 2}}\n{{- end }}\n\n{{- if .Values.loki.useTestSchema }}\nschema_config:\n{{- toYaml .Values.loki.testSchemaConfig | nindent 2}}\n{{- end }}\n\n{{ include \"loki.rulerConfig\" . }}\n\n{{- if or .Values.tableManager.retention_deletes_enabled .Values.tableManager.retention_period }}\ntable_manager:\n  retention_deletes_enabled: {{ .Values.tableManager.retention_deletes_enabled }}\n  retention_period: {{ .Values.tableManager.retention_period }}\n{{- end }}\n\nquery_range:\n  align_queries_with_step: true\n  {{- with .Values.loki.query_range }}\n  {{- tpl (. 
| toYaml) $ | nindent 4 }}\n  {{- end }}\n  {{- if .Values.resultsCache.enabled }}\n  {{- with .Values.resultsCache }}\n  cache_results: true\n  results_cache:\n    cache:\n      default_validity: {{ .defaultValidity }}\n      background:\n        writeback_goroutines: {{ .writebackParallelism }}\n        writeback_buffer: {{ .writebackBuffer }}\n        writeback_size_limit: {{ .writebackSizeLimit }}\n      memcached_client:\n        consistent_hash: true\n        addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-results-cache.{{ $.Release.Namespace }}.svc\n        timeout: {{ .timeout }}\n        update_interval: 1m\n  {{- end }}\n  {{- end }}\n\n{{- with .Values.loki.storage_config }}\nstorage_config:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.query_scheduler }}\nquery_scheduler:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.compactor }}\ncompactor:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.analytics }}\nanalytics:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.querier }}\nquerier:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.index_gateway }}\nindex_gateway:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend }}\nfrontend:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend_worker }}\nfrontend_worker:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.distributor }}\ndistributor:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\ntracing:\n  enabled: {{ .Values.loki.tracing.enabled }}\n": template: gotpl:40:6: executing "gotpl" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

I added loki.storage.type: filesystem and the install went through, but then I was back at the initial problem:

Canary:

Connecting to loki at ws://loki-gateway.tester.svc.cluster.local.:80/loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-c2xg9%22%7D+, querying for label 'pod' with value 'loki-canary-c2xg9'
failed to connect to ws://loki-gateway.tester.svc.cluster.local.:80/loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-c2xg9%22%7D+ with err websocket: bad handshake

Gateway:

10.214.1.247 - - [25/Apr/2024:08:52:43 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.29" "-"
2024/04/25 08:52:44 [crit] 9#9: *166 connect() to 10.215.192.193:3100 failed (1: Operation not permitted) while connecting to upstream, client: 10.214.1.111, server: , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-tr6ll%22%7D+ HTTP/1.1", upstream: "http://10.215.192.193:3100/loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-tr6ll%22%7D+", host: "loki-gateway.tester.svc.cluster.local.:80"
10.214.1.111 - self-monitoring [25/Apr/2024:08:52:44 +0000]  502 "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-tr6ll%22%7D+ HTTP/1.1" 157 "-" "Go-http-client/1.1" "-"

And just to be clear, in the meantime I got my setup working the way I wanted with loki-stack, but I think it's important that new users have at least one way to try out the software by following the official documentation.

ts-sean-foley commented 6 months ago

I think the documentation that you referred to is missing this value:

deploymentMode: SingleBinary

I wanted to provide some additional related information for anyone attempting "simple scalable" mode with the example Helm deployment:

The example provided for "simple scalable" mode is also broken. The docs say this is the default mode, and while the chart does not complain about the mode itself, it does complain about a missing schema_config, which is not mentioned in the simple scalable example values.yaml.

Dragotic commented 6 months ago

Yep, the documentation is quite broken. I also tried to install it in simple scalable mode and failed validation multiple times.

It turns out that schema_config should actually be passed in values.yaml as schemaConfig, as shown in the Helm chart's default values.
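In other words, the key difference is the chart-level camelCase key versus the raw Loki config key (a sketch; the schema values are illustrative):

```yaml
# Wrong: the raw Loki config key is not recognized as a chart value.
# schema_config: ...

# Right: the chart expects the camelCase key under loki.
loki:
  schemaConfig:
    configs:
      - from: 2024-04-01
        object_store: filesystem
        store: tsdb
        schema: v13
        index:
          prefix: index_
          period: 24h
```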

a-h commented 6 months ago

Step 3 of the documentation at https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/ lists:

mode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1

Note that it should be deploymentMode, based on the Helm chart values.
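That is, the documented snippet should presumably read:

```yaml
deploymentMode: SingleBinary
loki:
  commonConfig:
    replication_factor: 1
```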

sunidhi271 commented 6 months ago

@a-h @Dragotic @adthonb @ts-sean-foley Could you please help me here? I already have an issue open: https://github.com/grafana/loki/issues/12972

With the values below, my simple scalable deployment is not working; all the pods are going into CrashLoopBackOff:

loki:
  global:
    image:
      registry: registry.xyz.com
    fullnameOverride: loki
    imagePullSecrets: [dacsecret]
  test:
    enabled: false
  gateway:
    enabled: false
  lokiCanary:
    enabled: false
  monitoring:
    selfMonitoring:
      grafanaAgent:
        installOperator: false

# SimpleScalable Mode related values
  deploymentMode: SimpleScalable
  sidecar:
    image:
      repository: registry.xyz.com/public/kiwigrid/k8s-sidecar
      tag: 1.24.3
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 50m
        memory: 50Mi
    rules:
      enabled: true
      label: loki_rule
      labelValue: ""
      folder: /rules
  memcached:
    image:
      repository: registry.xyz.com/public/memcached
#default#      tag: 1.6.23-alpine 
      tag: 1.6.25
  memcachedExporter:
    image: 
      repository: registry.xyz.com/public/prom/memcached-exporter
      tag: v0.14.2
  minio:
    enabled: true
    image:
      repository: registry.xyz.com/public/minio
    mcImage:
      repository: registry.xyz.com/public/quay.io/minio/mc
  backend:
    replicas: 3
    autoscaling:
      enabled: false
      minReplicas: 3
      maxReplicas: 6
    persistence:
      volumeClaimsEnabled: true
      # -- Parameters used for the `data` volume when volumeClaimEnabled if false
      dataVolumeParameters:
        emptyDir: {}
      # -- Enable StatefulSetAutoDeletePVC feature
      enableStatefulSetAutoDeletePVC: false
      size: 10Gi
      storageClass: "rook-block"
      # -- Selector for persistent disk
      selector: null
    resources:
      limits:
        memory: 50Gi
      requests:
        memory: 1Gi
  read:
    replicas: 3
    autoscaling:
      enabled: false
      minReplicas: 3
      maxReplicas: 6
      targetCPUUtilizationPercentage: 60
    persistence:
      volumeClaimsEnabled: true
      size: 10Gi
      storageClass: rook-block
    resources:
      limits:
        memory: 50Gi
      requests:
        memory: 1Gi
  write:
    replicas: 3
    autoscaling:
      enabled: false
      minReplicas: 3
      maxReplicas: 6
      targetCPUUtilizationPercentage: 60
      resources: {}
    persistence:
      volumeClaimsEnabled: true
      # -- Parameters used for the `data` volume when volumeClaimEnabled if false
      dataVolumeParameters:
        emptyDir: {}
      enableStatefulSetAutoDeletePVC: false
      size: 10Gi
      storageClass: "rook-block"
      selector: null
    resources:
      limits:
        memory: 50Gi
      requests:
        memory: 1Gi
  tableManager:
    enabled: false
    extraVolumes:
      - name: data
        emptyDir: {}
    extraVolumeMounts:
      - name: data
        mountPath: /var/loki
    retention_period: 24h

  loki:
    image:
      registry: registry.dac.nokia.com
      repository: public/grafana/loki
#    schemaConfig:
#      configs:
#        - from: 2024-04-01
#          store: tsdb
#          object_store: s3
#          schema: v13
#          index:
#            prefix: loki_index_
#            period: 24h
    ingester:
      chunk_encoding: snappy
      replicas: 0
    tracing:
      enabled: true
    querier:
      # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
      max_concurrent: 4
      replicas: 0
    minio:
      enabled: true
    singleBinary:
      replicas: 0
    queryFrontend:
      replicas: 0
    queryScheduler:
      replicas: 0
    distributor:
      replicas: 0
    compactor:
      replicas: 0
    indexGateway:
      replicas: 0
    bloomCompactor:
      replicas: 0
    bloomGateway:
      replicas: 0
#SimpleScalable Mode related values ends here

#    structuredConfig:
    config: |
      auth_enabled: false
      limits_config:
        ingestion_rate_strategy: local
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        ingestion_rate_mb: 400
        ingestion_burst_size_mb: 600
        max_global_streams_per_user: 10000
        max_query_length: 72h
        max_query_parallelism: 64
        cardinality_limit: 200000
        split_queries_by_interval: 30m
    schemaConfig:
      configs:
        - from: 2024-04-01
          object_store: s3
          store: tsdb
          schema: v13
          index:
            prefix: index_
            period: 24h
    auth_enabled: false
    commonConfig:
      replication_factor: 1
tuoxiebushuijiao commented 3 months ago

I also encountered the following issue:

[error] 10#10: *57484 connect() failed (111: Connection refused) while connecting to upstream, client: 10.200.1.39, server: , request: "GET /loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-rp6rq%22%7D+ HTTP/1.1", upstream: "http://10.96.227.238:3100/loki/api/v1/tail?query=%7Bstream%3D%22stdout%22%2Cpod%3D%22loki-canary-rp6rq%22%7D+", host: "loki-gateway.loki.svc.cluster.local.:80"
mvtab commented 3 months ago

I have no idea what's going on with the project, but I think Loki is being completely restructured. The official Helm chart values page is empty, the official Loki Helm chart redirects to some other source code, and there are basically multiple versions of the documentation, none of which actually works.

Parkour, I guess.

Dragotic commented 3 months ago

@mvtab I don't think it's empty; it's more a case of bad UI on the website (for sure). But yeah, the documentation is very bad.

vdovhanych commented 4 weeks ago

I tried updating the chart from 5.48.0 to 6.16.0 without success. I tried pretty much everything I found in various versions of the docs, but nothing seems to be working.

My installation is the SingleBinary type with two replicas and two replication targets. According to the official documentation this setup is supported and should work fine. But it does not even pass the values.yaml validation, failing with:

Error: execution error at (loki/templates/validate.yaml:31:4): You have more than zero replicas configured for both the single binary and simple scalable targets. If this was intentional change the deploymentMode to the transitional 'SingleBinary<->SimpleScalable' mode
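Based on the earlier comments in this thread (untested against 6.16.0, so treat this as a sketch), a two-replica SingleBinary setup probably also needs the simple scalable targets zeroed explicitly, since the chart defaults leave them populated:

```yaml
deploymentMode: SingleBinary
singleBinary:
  replicas: 2
# Zero out the simple scalable targets so validate.yaml passes.
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
```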

Have any of you guys managed to find a solution?