grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Read and Write pods fail with "failed services" message #8703

Open · rwlove opened this issue 1 year ago

rwlove commented 1 year ago

Describe the bug: The Loki deployment fails because the Read and Write pods crash with an unexplained "failed services" message.

I'd be happy to learn that this is a configuration error, but I'm not sure what next steps to take to debug this.
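
Typical next steps for a CrashLoopBackOff like this might look as follows; this is only a sketch using standard kubectl, with pod names taken from the output further down:

```sh
# Events and last termination state for one of the crashing pods
kubectl -n monitoring describe pod loki-read-0

# Full log of the previously crashed container, not just the tail
kubectl -n monitoring logs loki-read-0 --previous

# Recent events in the namespace, oldest first
kubectl -n monitoring get events --sort-by=.lastTimestamp
```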

To Reproduce: Deploy Loki via the Helm chart, with a Rook Ceph bucket for object storage; the Read and Write pods then crash with the cryptic message shown below.
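
For context, the Rook Ceph bucket is presumably provisioned by an ObjectBucketClaim along these lines; the claim name matches the loki-chunks-bucket-v1 ConfigMap/Secret referenced later, but the storage class and bucket naming here are assumptions, not taken from the report:

```yaml
# Hypothetical ObjectBucketClaim backing the loki-chunks-bucket-v1 ConfigMap/Secret.
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: loki-chunks-bucket-v1
  namespace: monitoring
spec:
  bucketName: loki-chunks-bucket-v1   # or generateBucketName
  storageClassName: ceph-bucket       # assumed Rook object-store StorageClass
```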

Expected behavior: The Read and Write pods don't crash and the deployment succeeds.

Environment:

Screenshots, Promtail config, or terminal output:

➜  fleet-infra git:(main) ✗ kubectl -n monitoring get all,pvc,secret | grep loki
pod/loki-gateway-896978574-klvhg                1/1     Running            0              47s
pod/loki-gateway-896978574-zchfv                1/1     Running            0              47s
pod/loki-read-0                                 0/1     Error              2 (25s ago)    47s
pod/loki-read-1                                 0/1     CrashLoopBackOff   2 (4s ago)     47s
pod/loki-write-0                                0/1     CrashLoopBackOff   2 (4s ago)     47s
pod/loki-write-1                                0/1     Error              2 (25s ago)    47s
service/loki-gateway                  ClusterIP      11.108.192.193   <none>         80/TCP                       2d20h
service/loki-memberlist               ClusterIP      None             <none>         7946/TCP                     2d20h
service/loki-read                     ClusterIP      11.97.72.159     <none>         3100/TCP,9095/TCP            2d20h
service/loki-read-headless            ClusterIP      None             <none>         3100/TCP,9095/TCP            2d20h
service/loki-write                    ClusterIP      11.103.163.254   <none>         3100/TCP,9095/TCP            2d20h
service/loki-write-headless           ClusterIP      None             <none>         3100/TCP,9095/TCP            2d20h
deployment.apps/loki-gateway               2/2     2            2           47s
replicaset.apps/loki-gateway-896978574                2         2         2       47s
statefulset.apps/loki-read                              0/2     47s
statefulset.apps/loki-write                             0/2     47s
persistentvolumeclaim/data-loki-read-0                                                                 Bound    pvc-4018bd5b-0d79-46f9-8cde-46a5c5390fa7   10Gi       RWO            ceph-block                    16h
persistentvolumeclaim/data-loki-read-1                                                                 Bound    pvc-a8c66e22-9fff-499a-8a70-a94e70da1ee4   10Gi       RWO            ceph-block                    16h
persistentvolumeclaim/data-loki-write-0                                                                Bound    pvc-f980053f-0add-4495-87e9-c3f1602cce8d   10Gi       RWO            ceph-block                    16h
persistentvolumeclaim/data-loki-write-1                                                                Bound    pvc-1eb22486-966b-46c7-9377-60ea97627909   10Gi       RWO            ceph-block                    16h
secret/loki-chunks-bucket-v1                               Opaque                     2      16h
secret/sh.helm.release.v1.loki.v17                         helm.sh/release.v1         1      2m29s
secret/sh.helm.release.v1.loki.v18                         helm.sh/release.v1         1      47s
secret/sh.helm.release.v1.loki.v9                          helm.sh/release.v1         1      16h
➜  fleet-infra git:(main) ✗ kubectl -n monitoring logs pod/loki-read-0                
failed services
github.com/grafana/loki/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:508
main.main
    /src/loki/cmd/loki/main.go:105
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
➜  fleet-infra git:(main) ✗ kubectl -n monitoring logs loki-read-1    
failed services
github.com/grafana/loki/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:508
main.main
    /src/loki/cmd/loki/main.go:105
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
➜  fleet-infra git:(main) ✗ kubectl -n monitoring logs loki-write-0
failed services
github.com/grafana/loki/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:508
main.main
    /src/loki/cmd/loki/main.go:105
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
➜  fleet-infra git:(main) ✗ kubectl -n monitoring logs loki-write-1
failed services
github.com/grafana/loki/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:508
main.main
    /src/loki/cmd/loki/main.go:105
runtime.main
    /usr/local/go/src/runtime/proc.go:250
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1594
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: loki
  namespace: monitoring
spec:
  interval: 15m
  chart:
    spec:
      chart: loki
      version: 4.8.0
      sourceRef:
        kind: HelmRepository
        name: grafana-charts
        namespace: flux-system

  maxHistory: 3

  install:
    remediation:
      retries: 3

  upgrade:
    cleanupOnFail: true
    remediation:
      retries: 3

  uninstall:
    keepHistory: false

  dependsOn:
    - name: rook-ceph-cluster
      namespace: rook-ceph
    - name: kube-prometheus-stack
      namespace: monitoring

  values:
    loki:
      readinessProbe:
        initialDelaySeconds: 120

      structuredConfig:
        auth_enabled: false

        server:
          log_level: debug
          http_listen_port: 3100
          grpc_listen_port: 9095

        memberlist:
          join_members: ["loki-memberlist"]

        limits_config:
          retention_period: 14d
          enforce_metric_name: false
          reject_old_samples: true
          reject_old_samples_max_age: 168h
          max_cache_freshness_per_query: 10m
          split_queries_by_interval: 15m
          ingestion_rate_mb: 16
          ingestion_burst_size_mb: 32
          shard_streams:
            enabled: true

        schema_config:
          configs:
            - from: "2021-08-01"
              store: boltdb-shipper
              object_store: s3
              schema: v11
              index:
                prefix: loki_index_
                period: 24h

        common:
          path_prefix: /var/loki
          replication_factor: 2
          storage:
            s3:
              s3: null
              insecure: true
              s3forcepathstyle: true
          ring:
            kvstore:
              store: memberlist

        ruler:
          enable_api: true
          enable_alertmanager_v2: true
          alertmanager_url: http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093
          storage:
            type: local
            local:
              directory: /rules
          rule_path: /tmp/scratch
          ring:
            kvstore:
              store: memberlist

        distributor:
          ring:
            kvstore:
              store: memberlist

        compactor:
          working_directory: /var/loki/boltdb-shipper-compactor
          shared_store: s3
          compaction_interval: 10m
          retention_enabled: true
          retention_delete_delay: 2h
          retention_delete_worker_count: 150

        ingester:
          max_chunk_age: 1h
          lifecycler:
            ring:
              kvstore:
                store: memberlist

        analytics:
          reporting_enabled: false

    gateway:
      replicas: 2
      ingress:
        enabled: true
        ingressClassName: nginx
        hosts:
          - host: &host "loki.${SECRET_DOMAIN}"
            paths:
              - path: /
                pathType: Prefix
        tls:
          - hosts:
              - *host
      resources:
        requests:
          cpu: 50m
          memory: 64Mi

    read:
      replicas: 2
      persistence:
        storageClass: ceph-block
      extraVolumeMounts:
        - name: rules
          mountPath: /rules
      extraVolumes:
        - name: rules
          emptyDir: {}
      resources:
        requests:
          cpu: 100m
          memory: 500M

    write:
      replicas: 2
      persistence:
        storageClass: ceph-block
      resources:
        requests:
          cpu: 100m
          memory: 500M

    backend:
      replicas: 2
      persistence:
        storageClass: ceph-block
      extraVolumeMounts:
        - name: rules
          mountPath: /rules/fake
        - name: scratch
          mountPath: /tmp/scratch
      extraVolumes:
        - name: rules
          configMap:
            name: loki-alerting-rules
        - name: scratch
          emptyDir: {}
      resources:
        requests:
          cpu: 100m
          memory: 500M

    monitoring:
      serviceMonitor:
        enabled: false
        metricsInstance:
          enabled: false
      selfMonitoring:
        enabled: false
        grafanaAgent:
          installOperator: false
      lokiCanary:
        enabled: false

    test:
      enabled: false

  valuesFrom:
    - targetPath: loki.structuredConfig.common.storage.s3.bucketnames
      kind: ConfigMap
      name: loki-chunks-bucket-v1
      valuesKey: BUCKET_NAME
    - targetPath: loki.structuredConfig.common.storage.s3.endpoint
      kind: ConfigMap
      name: loki-chunks-bucket-v1
      valuesKey: BUCKET_HOST
    - targetPath: loki.structuredConfig.common.storage.s3.access_key_id
      kind: Secret
      name: loki-chunks-bucket-v1
      valuesKey: AWS_ACCESS_KEY_ID
    - targetPath: loki.structuredConfig.common.storage.s3.secret_access_key
      kind: Secret
      name: loki-chunks-bucket-v1
      valuesKey: AWS_SECRET_ACCESS_KEY
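
One quick way to sanity-check this storage wiring (not from the original report) is to inspect the ConfigMap and Secret that the valuesFrom entries above pull from; a sketch using plain kubectl:

```sh
# Bucket endpoint/name injected into loki.structuredConfig.common.storage.s3
kubectl -n monitoring get configmap loki-chunks-bucket-v1 -o yaml   # BUCKET_HOST, BUCKET_NAME

# Credentials (values are base64-encoded)
kubectl -n monitoring get secret loki-chunks-bucket-v1 -o yaml      # AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
```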
DylanGuedes commented 1 year ago

Hey, thank you for your report. Could you share how you ran/deployed the project, so I can replicate it myself? Suggestions:

  • Try running with replicas=1, ring store=inmemory and with replication_factor: 1. If that works, it means you have a network issue

rwlove commented 1 year ago

Hey, thank you for your report. Could you share how you ran/deployed the project, so I can replicate it myself? Suggestions:

I deployed it via Flux2 (GitOps infrastructure), so I'm not sure how easily you could reproduce. If there's something else I can share, I'd be glad to.

rwlove commented 1 year ago

Hey, thank you for your report. Could you share how you ran/deployed the project, so I can replicate it myself? Suggestions:

  • Try running with replicas=1, ring store=inmemory and with replication_factor: 1. If that works, it means you have a network issue

Read and Write pods start just fine with the above configuration.
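
Roughly, that sanity check corresponds to overrides like these in the HelmRelease values; this is a sketch, the exact values used are not shown in the thread, and the other ring blocks (ruler, distributor, ingester) would be switched to inmemory the same way:

```yaml
# Single-replica / in-memory-ring test configuration (sketch)
loki:
  structuredConfig:
    common:
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory
read:
  replicas: 1
write:
  replicas: 1
backend:
  replicas: 1
```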

As far as I can tell, my network is fine (cilium status output below). Any suggestions on what to look for?


KVStore:                 Ok   Disabled
Kubernetes:              Ok   1.25 (v1.25.5) [linux/amd64]
Kubernetes APIs:         ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:    Strict   [enp1s0 192.168.4.8 (Direct Routing)]
Host firewall:           Disabled
CNI Chaining:            none
CNI Config file:         CNI configuration file management disabled
Cilium:                  Ok   1.13.0 (v1.13.0-c9723a8d)
NodeMonitor:             Listening for events on 6 CPUs with 64x4096 of shared memory
Cilium health daemon:    Ok   
IPAM:                    IPv4: 31/254 allocated from 11.0.5.0/24, 
IPv6 BIG TCP:            Disabled
BandwidthManager:        Disabled
Host Routing:            Legacy
Masquerading:            IPTables [IPv4: Enabled, IPv6: Disabled]
Controller Status:       149/149 healthy
Proxy Status:            OK, ip 11.0.5.151, 0 redirects active on ports 10000-20000
Global Identity Range:   min 256, max 65535
Hubble:                  Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 90.27   Metrics: Ok
Encryption:              Disabled
Cluster health:          12/12 reachable   (2023-03-06T22:22:09Z)
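
Given that the single-replica, in-memory-ring configuration works, a couple of quick checks on the memberlist path might be worth running; a sketch using only kubectl, with resource names taken from the output earlier in the thread:

```sh
# The headless memberlist service should list every Loki read/write (and backend, if deployed) pod on port 7946
kubectl -n monitoring get endpoints loki-memberlist

# Any NetworkPolicy / CiliumNetworkPolicy that could restrict port 7946 between the Loki pods?
kubectl -n monitoring get networkpolicy,ciliumnetworkpolicy
```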