grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Deleting a corrupted directory for boltdb #6582

Closed: firepro20 closed this issue 2 years ago

firepro20 commented 2 years ago

I currently have a down pod that was created through a Helm install of Loki. The Helm install created a StatefulSet that manages the loki pod instances. The pod is stuck in a back-off-restarting-failed-container state. This is the log from the loki-0 pod:

level=info ts=2022-07-05T19:10:11.52783126Z caller=main.go:94 msg="Starting Loki" version="(version=2.4.2, branch=HEAD, revision=525040a32)"
level=info ts=2022-07-05T19:10:11.528004697Z caller=modules.go:573 msg="RulerStorage is not configured in single binary mode and will not be started."
level=info ts=2022-07-05T19:10:11.528533942Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2022-07-05T19:10:11.52905735Z caller=experimental.go:19 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-07-05T19:10:11.529660842Z caller=table_manager.go:241 msg="loading table index_19128"
level=error ts=2022-07-05T19:10:11.529969171Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19128/1652736600. Please fix or remove this file." err="file size too small"
unexpected fault address 0x7f9b138b7008
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f9b138b7008 pc=0x17d347e]

goroutine 1 [running]:
runtime.throw({0x223d9b7, 0x7f9b13b54468})
    /usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc00068bd70 sp=0xc00068bd40 pc=0x435851
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:732 +0x125 fp=0xc00068bdc0 sp=0xc00068bd70 pc=0x44bc05
go.etcd.io/bbolt.(*Cursor).search(0xc00068bf08, {0x38d92b0, 0x5, 0x5}, 0xc00068bea0)
    /src/loki/vendor/go.etcd.io/bbolt/cursor.go:249 +0x5e fp=0xc00068be58 sp=0xc00068bdc0 pc=0x17d347e
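The "file size too small" error above is bbolt refusing to open a truncated index file: bbolt keeps its meta pages at the start of the file, so a file cut short below that can't be validated. As a rough illustration of that kind of pre-open check (the 2-page threshold here is an assumption for the sketch, not Loki's exact logic):

```python
import os
import tempfile

# bbolt stores two meta pages at the start of each database file
# (one OS page each, typically 4 KiB). A file truncated below that
# cannot be opened. The threshold is an assumption for illustration.
PAGE_SIZE = 4096

def looks_truncated(path: str) -> bool:
    """Return True if a boltdb file is too small to hold its meta pages."""
    return os.path.getsize(path) < 2 * PAGE_SIZE

# Simulate a corrupted index file like the one in the log: a tiny stub.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 100)
    stub = f.name

print(looks_truncated(stub))  # a 100-byte file is clearly truncated
os.unlink(stub)
```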

This is the loki-values.yaml file:

loki:
  image:
    pullPolicy: IfNotPresent
    repository: grafana/loki
    tag: 2.4.2
  persistence:
    enabled: true
    size: 12Gi
    storageClassName: do-block-storage
  config:
    ingester:
      wal:
        enabled: false # After upgrade to 2.4.1, we encountered a problem "mkdir wal: read-only file system".  Tried creating /data/loki/wal directory manually but it still did not work.
    compactor:
      working_directory: /data/loki/retention
      shared_store: filesystem
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
    limits_config:
      retention_period: 336h # 2 weeks
    schema_config:
      configs:
      - from: "2021-11-26"
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          period: 24h
          prefix: index_
    storage_config:
      boltdb:
        directory: /data/loki/index
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
  tolerations:
  - effect: NoExecute
    key: environment
    operator: Equal
    value: prod
promtail:
  tolerations:
  - effect: NoExecute
    key: environment
    operator: Equal
    value: prod

I tried to delete the affected directory /data/loki/boltdb-shipper-active/index_19128/ by adding command arguments to the container in the pod YAML, but the edit was rejected with an error: a pod's container spec cannot be modified in place.

loki-0 pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: loki-0
  generateName: loki-
  namespace: monitoring
  uid: 30d23988-b124-4a5a-910b-d7b494ecfb34
  resourceVersion: '176526416'
  creationTimestamp: '2022-07-04T20:51:50Z'
  labels:
    app: loki
    controller-revision-hash: loki-687c55fc5
    name: loki
    release: loki
    statefulset.kubernetes.io/pod-name: loki-0
  annotations:
    checksum/config: f658e8a0ef515ab2e874b194df8f08c7fd5fc3e8f9f6128943b577fe5d503628
    prometheus.io/port: http-metrics
    prometheus.io/scrape: 'true'
  ownerReferences:
    - apiVersion: apps/v1
      kind: StatefulSet
      name: loki
      uid: 729aeecb-2495-4e1a-b0a3-2a7cdfe5ebdd
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-07-04T20:51:50Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:checksum/config: {}
            f:prometheus.io/port: {}
            f:prometheus.io/scrape: {}
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:controller-revision-hash: {}
            f:name: {}
            f:release: {}
            f:statefulset.kubernetes.io/pod-name: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"729aeecb-2495-4e1a-b0a3-2a7cdfe5ebdd"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
        f:spec:
          f:affinity: {}
          f:containers:
            k:{"name":"loki"}:
              .: {}
              f:args: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:livenessProbe:
                .: {}
                f:failureThreshold: {}
                f:httpGet:
                  .: {}
                  f:path: {}
                  f:port: {}
                  f:scheme: {}
                f:initialDelaySeconds: {}
                f:periodSeconds: {}
                f:successThreshold: {}
                f:timeoutSeconds: {}
              f:name: {}
              f:ports:
                .: {}
                k:{"containerPort":3100,"protocol":"TCP"}:
                  .: {}
                  f:containerPort: {}
                  f:name: {}
                  f:protocol: {}
              f:readinessProbe:
                .: {}
                f:failureThreshold: {}
                f:httpGet:
                  .: {}
                  f:path: {}
                  f:port: {}
                  f:scheme: {}
                f:initialDelaySeconds: {}
                f:periodSeconds: {}
                f:successThreshold: {}
                f:timeoutSeconds: {}
              f:resources: {}
              f:securityContext:
                .: {}
                f:readOnlyRootFilesystem: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
                k:{"mountPath":"/etc/loki"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:hostname: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext:
            .: {}
            f:fsGroup: {}
            f:runAsGroup: {}
            f:runAsNonRoot: {}
            f:runAsUser: {}
          f:serviceAccount: {}
          f:serviceAccountName: {}
          f:subdomain: {}
          f:terminationGracePeriodSeconds: {}
          f:tolerations: {}
          f:volumes:
            .: {}
            k:{"name":"config"}:
              .: {}
              f:name: {}
              f:secret:
                .: {}
                f:defaultMode: {}
                f:secretName: {}
            k:{"name":"storage"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2022-07-04T20:52:20Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            k:{"type":"ContainersReady"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Initialized"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"Ready"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:containerStatuses: {}
          f:hostIP: {}
          f:phase: {}
          f:podIP: {}
          f:podIPs:
            .: {}
            k:{"ip":"10.244.1.222"}:
              .: {}
              f:ip: {}
          f:startTime: {}
  selfLink: /api/v1/namespaces/monitoring/pods/loki-0
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [loki]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [loki]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
  hostIP: 10.114.0.2
  podIP: 10.244.1.222
  podIPs:
    - ip: 10.244.1.222
  startTime: '2022-07-04T20:51:50Z'
  containerStatuses:
    - name: loki
      state:
        waiting:
          reason: CrashLoopBackOff
          message: >-
            back-off 5m0s restarting failed container=loki
            pod=loki-0_monitoring(30d23988-b124-4a5a-910b-d7b494ecfb34)
      lastState:
        terminated:
          exitCode: 2
          reason: Error
          startedAt: '2022-07-05T19:15:19Z'
          finishedAt: '2022-07-05T19:15:19Z'
          containerID: >-
            containerd://7545732da1d82ef04a34de13edbf3d512b256751d63fdb9b4889316bac09ffda
      ready: false
      restartCount: 267
      image: docker.io/grafana/loki:2.4.2
      imageID: >-
        docker.io/grafana/loki@sha256:b3af8ead67d7e80fec05029f783784df897e92b6dba31fe4b33ab4ea3e989573
      containerID: >-
        containerd://7545732da1d82ef04a34de13edbf3d512b256751d63fdb9b4889316bac09ffda
      started: false
  qosClass: BestEffort
spec:
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
    - name: config
      secret:
        secretName: loki
        defaultMode: 420
    - name: kube-api-access-r2rz6
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: loki
      image: grafana/loki:2.4.2
      args:
        - '-config.file=/etc/loki/loki.yaml'
      ports:
        - name: http-metrics
          containerPort: 3100
          protocol: TCP
      resources: {}
      volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /data
        - name: kube-api-access-r2rz6
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        httpGet:
          path: /ready
          port: http-metrics
          scheme: HTTP
        initialDelaySeconds: 45
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: http-metrics
          scheme: HTTP
        initialDelaySeconds: 45
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        readOnlyRootFilesystem: true
  restartPolicy: Always
  terminationGracePeriodSeconds: 4800
  dnsPolicy: ClusterFirst
  serviceAccountName: loki
  serviceAccount: loki
  nodeName: dev2-us7j8
  securityContext:
    runAsUser: 10001
    runAsGroup: 10001
    runAsNonRoot: true
    fsGroup: 10001
  hostname: loki-0
  subdomain: loki-headless
  affinity: {}
  schedulerName: default-scheduler
  tolerations:
    - key: environment
      operator: Equal
      value: prod
      effect: NoExecute
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority
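The in-place edit fails because the pod is owned by the StatefulSet, and most of a running pod's spec is immutable. One common workaround (an assumption, not something from this thread) is to scale the StatefulSet down to zero and mount the same PVC from a throwaway pod to delete the corrupted index. A sketch, using the `storage-loki-0` claim and the index path from the log; the pod name and busybox image are hypothetical:

```yaml
# Hypothetical one-off debug pod. Scale the StatefulSet down first:
#   kubectl scale statefulset loki --replicas=0 -n monitoring
# then apply this, let it clean up, and scale back to 1.
apiVersion: v1
kind: Pod
metadata:
  name: loki-debug
  namespace: monitoring
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: busybox:1.36  # runs as root by default, so it can remove
                           # files owned by the loki user (uid 10001)
      command:
        - sh
        - -c
        - rm -rf /data/loki/boltdb-shipper-active/index_19128 && sleep 3600
      volumeMounts:
        - name: storage
          mountPath: /data
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
```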

Disclaimer: I am quite new at Kubernetes and YAML configuration but I am learning!

dannykopping commented 2 years ago

Thank you for your question / support request. We try to keep GitHub issues strictly for bug reports and feature requests.

You may submit questions and support requests in any of the following ways:

I'm closing this issue, but please feel free to reach out in any of the channels listed above.

Abdalla-Alzahabi commented 2 years ago

same issue with me

firepro20 commented 2 years ago

https://community.grafana.com/t/deleting-a-corrupted-directory-for-boltdb-file-too-small/68289

firepro20 commented 2 years ago

I managed to resolve the issue by pointing loki-values.yaml at temporary directories, bypassing the affected directory so the container could boot and I could manually clean up the corrupted file.

I also moved the schema_config `from` date forward to a recent date.

schema_config:
  configs:
    - from: "2022-08-18"
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        period: 24h
        prefix: index_
storage_config:
  boltdb:
    directory: /data/loki/index
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active/temp
    cache_location: /data/loki/boltdb-shipper-cache/temp
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks

After I cleaned the offending file/directory, I checked the mounted storage with the `df` command and noticed it was full, so I cleaned up the /data directory as well. A recurring cleanup for the /data mount has been proposed and is currently in review.

Once the container was back up and grafana/loki was running again, I set the directory paths back to their original values, removing /temp, and redeployed.
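Before deleting anything on a full volume, it helps to see where the space actually went; `du -sh /data/loki/*` does this from a shell, and the same idea can be sketched in Python (an illustration only, not part of the fix above; the directory names below are made up for the demo):

```python
import os
import tempfile

def dir_sizes(root: str) -> dict[str, int]:
    """Total bytes under each immediate subdirectory of root, largest first."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir():
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))

# Demo on a throwaway tree standing in for /data/loki.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "chunks"))
os.makedirs(os.path.join(root, "index"))
with open(os.path.join(root, "chunks", "c1"), "wb") as f:
    f.write(b"x" * 5000)
with open(os.path.join(root, "index", "i1"), "wb") as f:
    f.write(b"x" * 100)

print(dir_sizes(root))  # → {'chunks': 5000, 'index': 100}
```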