longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] engine-image-ei with noexec /var filesystem #9083

Open Bonn93 opened 1 month ago

Bonn93 commented 1 month ago

Describe the bug

Installing Longhorn 1.6.2 on a k3s cluster fails with the engine-image-ei pods in CrashLoopBackOff. The cause appears to be that these pods always use /var/lib/longhorn. The underlying hosts have CIS L1 hardening applied, so the filesystem containing /var/lib is mounted with the noexec option.

Modifying the installation by setting default-data-path and editing every mount that points to /var/lib does not help. Longhorn can use, and seems to create, some data and folders under a desired path such as /data; however, the engine-image-ei pods still appear to use /var/lib/longhorn.
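To confirm on a node which mount covers /var/lib/longhorn and whether it carries noexec, a minimal Python sketch along these lines may help. The function name `mount_options_for` and the sample mount table are hypothetical and only illustrate the idea; the parsing assumes the usual /proc/mounts layout (device, mount point, filesystem type, options) and ignores escaped characters in mount points.

```python
def mount_options_for(path, mounts_text):
    """Return (mount_point, options) for the deepest mount point containing path."""
    best_point, best_opts = "", set()
    target = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, opts = fields[1], set(fields[3].split(","))
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point override an earlier one
        if target.startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_opts = mount_point, opts
    return best_point, best_opts


# Hypothetical mount table resembling a hardened host (not taken from the report).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sda3 /var ext4 rw,nosuid,nodev,noexec,relatime 0 0
/dev/sdb1 /data ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    for candidate in ("/var/lib/longhorn", "/data/longhorn"):
        point, opts = mount_options_for(candidate, SAMPLE_MOUNTS)
        print(candidate, "is on", point, "noexec" if "noexec" in opts else "exec allowed")
```

On a real node, the second argument would be the contents of /proc/mounts, for example `mount_options_for("/var/lib/longhorn", open("/proc/mounts").read())`.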

To Reproduce

longhorn-system   csi-attacher-74bdc4bff-4nhg2                        1/1     Running            0                64m
longhorn-system   csi-attacher-74bdc4bff-rmqss                        1/1     Running            0                64m
longhorn-system   csi-attacher-74bdc4bff-s2qx4                        1/1     Running            0                64m
longhorn-system   csi-provisioner-58cc84b487-2wzqd                    1/1     Running            0                64m
longhorn-system   csi-provisioner-58cc84b487-krkrx                    1/1     Running            0                64m
longhorn-system   csi-provisioner-58cc84b487-wxb72                    1/1     Running            0                64m
longhorn-system   csi-resizer-786bccbfd9-7smn9                        1/1     Running            0                64m
longhorn-system   csi-resizer-786bccbfd9-gxskc                        1/1     Running            0                64m
longhorn-system   csi-resizer-786bccbfd9-v8hhc                        1/1     Running            0                64m
longhorn-system   csi-snapshotter-68b686dc4-gcn2v                     1/1     Running            0                64m
longhorn-system   csi-snapshotter-68b686dc4-hlhl4                     1/1     Running            0                64m
longhorn-system   csi-snapshotter-68b686dc4-nkw4h                     1/1     Running            0                64m
longhorn-system   engine-image-ei-b907910b-2bz7r                      0/1     CrashLoopBackOff   23 (2m30s ago)   65m
longhorn-system   engine-image-ei-b907910b-2s9jf                      0/1     CrashLoopBackOff   23 (2m5s ago)    65m
longhorn-system   engine-image-ei-b907910b-cnnh4                      0/1     CrashLoopBackOff   23 (110s ago)    65m
longhorn-system   engine-image-ei-b907910b-fgz7l                      0/1     CrashLoopBackOff   23 (2m25s ago)   65m
longhorn-system   engine-image-ei-b907910b-g6qcd                      0/1     CrashLoopBackOff   23 (2m25s ago)   65m
longhorn-system   engine-image-ei-b907910b-qjh2p                      0/1     CrashLoopBackOff   23 (2m20s ago)   65m
longhorn-system   engine-image-ei-b907910b-x8c8l                      0/1     CrashLoopBackOff   23 (110s ago)    65m
longhorn-system   instance-manager-1fcb4a0df0fc14b2edb52c5aa1cffa1a   1/1     Running            0                65m
longhorn-system   instance-manager-3a79544966305408c40bce3658873f7f   1/1     Running            0                65m
longhorn-system   instance-manager-3eafe182640072b0f9c19d0ca1e500d5   1/1     Running            0                65m
longhorn-system   instance-manager-52f790faf3725d782e8773df46875a5a   1/1     Running            0                65m
longhorn-system   instance-manager-9109a8c190f4b5c44d698bb66b8daaca   1/1     Running            0                64m
longhorn-system   instance-manager-9d1b98d75ec7614eca8a433fba366ec1   1/1     Running            0                65m
longhorn-system   instance-manager-d3df5f1e2b88999472290d2844e377e5   1/1     Running            0                65m
longhorn-system   longhorn-csi-plugin-52bcp                           3/3     Running            0                64m
longhorn-system   longhorn-csi-plugin-5jvw8                           3/3     Running            0                64m
longhorn-system   longhorn-csi-plugin-fkvnw                           3/3     Running            0                64m
longhorn-system   longhorn-csi-plugin-gcvtz                           3/3     Running            0                64m
longhorn-system   longhorn-csi-plugin-jr66b                           3/3     Running            0                64m
longhorn-system   longhorn-csi-plugin-lt46m                           3/3     Running            0                64m
longhorn-system   longhorn-csi-plugin-ttgww                           3/3     Running            0                64m
longhorn-system   longhorn-driver-deployer-759d5d7c9f-bjmpr           1/1     Running            0                65m
longhorn-system   longhorn-manager-4r6xg                              2/2     Running            1 (65m ago)      65m
longhorn-system   longhorn-manager-76mrw                              2/2     Running            2 (65m ago)      65m
longhorn-system   longhorn-manager-8cmc6                              2/2     Running            1 (65m ago)      65m
longhorn-system   longhorn-manager-gblpx                              2/2     Running            1 (65m ago)      65m
longhorn-system   longhorn-manager-j5j8t                              2/2     Running            1 (65m ago)      65m
longhorn-system   longhorn-manager-z4jlc                              2/2     Running            1 (65m ago)      65m
longhorn-system   longhorn-manager-zvggj                              2/2     Running            1 (65m ago)      65m
longhorn-system   longhorn-ui-578c78f46-7cqr9                         1/1     Running            1 (65m ago)      65m
longhorn-system   longhorn-ui-578c78f46-hh74w                         1/1     Running            1 (65m ago)      65m

Longhorn manager reports:

time="2024-07-19T03:47:01Z" level=warning msg="Failed to clean up all mount points" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:305" controller=longhorn-backup-target cred= error="error clean up all mount points: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup cleanup-all-mounts], output , stderr : fork/exec /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn
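The log line above is cut off after fork/exec, so the exact error is not shown. On Linux, execve(2) is documented to fail with EACCES when the file lives on a filesystem mounted noexec, and the same errno appears when the file has no execute bit. A hedged sketch for probing a binary from Python follows; `probe_exec` is a hypothetical helper, and it really does start the binary once, so it should only be pointed at something harmless to run.

```python
import errno
import subprocess


def probe_exec(binary):
    """Try to start binary once and return 'ok' or the symbolic errno name."""
    try:
        # Output is discarded; only whether the exec itself succeeds matters here.
        subprocess.run(
            [binary],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError as exc:
        # exec failures surface in the parent as OSError with the child's errno
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"
```

A result of `EACCES` for a binary that does have its execute bit set would be consistent with a noexec mount, although other causes (for example a security module denying the exec) cannot be ruled out from the errno alone.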

Expected behavior

The Longhorn system is installed and stable, and the engine-image-ei pods are not in the CrashLoopBackOff state.

Support bundle for troubleshooting

The UI is not currently deployed or accessible, but I am working on this.

Environment

Additional context

I have tried installing from the YAML manifest and setting some variables:

# Source: longhorn/templates/default-setting.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: longhorn-default-setting
  namespace: longhorn-system
  labels:
    app.kubernetes.io/name: longhorn
    app.kubernetes.io/instance: longhorn
    app.kubernetes.io/version: v1.6.2
data:
  default-setting.yaml: |-
    priority-class: longhorn-critical
    default-data-path: /data/longhorn

I also tried changing the mounts on the longhorn-manager DaemonSet:

# Source: longhorn/templates/daemonset-sa.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app.kubernetes.io/name: longhorn
    app.kubernetes.io/instance: longhorn
    app.kubernetes.io/version: v1.6.2
    app: longhorn-manager
  name: longhorn-manager
  namespace: longhorn-system
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  template:
    metadata:
      labels:
        app.kubernetes.io/name: longhorn
        app.kubernetes.io/instance: longhorn
        app.kubernetes.io/version: v1.6.2
        app: longhorn-manager
    spec:
      containers:
      - name: longhorn-manager
        image: longhornio/longhorn-manager:v1.6.2
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command:
        - longhorn-manager
        - -d
        - daemon
        - --engine-image
        - "longhornio/longhorn-engine:v1.6.2"
        - --instance-manager-image
        - "longhornio/longhorn-instance-manager:v1.6.2"
        - --share-manager-image
        - "longhornio/longhorn-share-manager:v1.6.2"
        - --backing-image-manager-image
        - "longhornio/backing-image-manager:v1.6.2"
        - --support-bundle-manager-image
        - "longhornio/support-bundle-kit:v0.0.37"
        - --manager-image
        - "longhornio/longhorn-manager:v1.6.2"
        - --service-account
        - longhorn-service-account
        - --upgrade-version-check
        ports:
        - containerPort: 9500
          name: manager
        - containerPort: 9501
          name: conversion-wh
        - containerPort: 9502
          name: admission-wh
        - containerPort: 9503
          name: recov-backend
        readinessProbe:
          httpGet:
            path: /v1/healthz
            port: 9501
            scheme: HTTPS
        volumeMounts:
        - name: dev
          mountPath: /host/dev/
        - name: proc
          mountPath: /host/proc/
        - name: longhorn
          mountPath: /var/lib/longhorn/ # <<<<<<<<<<<<<<<<< /data/longhorn does not appear to work and still uses /var/lib
          mountPropagation: Bidirectional
        - name: longhorn-grpc-tls
          mountPath: /tls-files/

derekbit commented 1 month ago

A workaround is to remove the noexec option from the /var mount. Have you tried it?

ejweber commented 1 month ago

Longhorn can use and seems to create some data/folders under a desired path such as /data, however the engine-image-ei pods still appear to use /var/lib/longhorn

I wasn't sure how this was working, so I did a quick check. /var/lib/longhorn is hardcoded as EngineBinaryDirectoryOnHost in longhorn-manager. Longhorn-manager DOES use this directory to execute commands directly, and the path is also used to populate the specs of other components:

https://github.com/longhorn/longhorn-manager/blob/7e0957a220c4718c3280ac3ee67dd22a9a8dcd1b/types/types.go#L96-L97

Currently, there is no way for Longhorn to avoid using /var/lib/longhorn.
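To see where the path shows up in a rendered manifest, a small sketch such as the following could list every string value that starts with /var/lib/longhorn. The helper `find_path_refs` is hypothetical, and the sample dictionary is a trimmed stand-in for the DaemonSet quoted earlier in this issue. As noted above, editing these references does not relocate the engine binaries, because the directory is hardcoded in longhorn-manager.

```python
def find_path_refs(obj, needle="/var/lib/longhorn", trail=""):
    """Yield (location, value) for each string in nested dicts/lists starting with needle."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{trail}.{key}" if trail else str(key)
            yield from find_path_refs(value, needle, child)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_path_refs(value, needle, f"{trail}[{index}]")
    elif isinstance(obj, str) and obj.startswith(needle):
        yield trail, obj


# Trimmed, illustrative stand-in for the longhorn-manager DaemonSet shown in the report.
SAMPLE_MANIFEST = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "longhorn-manager",
        "volumeMounts": [
            {"name": "dev", "mountPath": "/host/dev/"},
            {"name": "longhorn", "mountPath": "/var/lib/longhorn/"},
        ],
    }]}}}
}

if __name__ == "__main__":
    for location, value in find_path_refs(SAMPLE_MANIFEST):
        print(location, "->", value)
```

The function works on any already-parsed manifest (for instance one loaded from JSON output of the cluster), so it does not depend on a YAML library.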

Bonn93 commented 1 month ago

Dropping noexec from the mount does indeed work, but it is quite a hard conversation to have in a secured environment with CIS and STIG baselines in place. It's something I can work with, but I think a way to use a different directory would be ideal.

derekbit commented 1 month ago

I see. We need to improve this. IIRC, we've seen this issue in user feedback several times. cc @innobead