metal-stack / csi-driver-lvm


Add support for Talos OS #87

Closed · gerhard closed this 1 year ago

gerhard commented 1 year ago

/etc on Talos OS is read-only, which means that creating new PVs fails: by default, LVM writes its metadata archives and backups under /etc/lvm.

This change makes the hard-coded /etc/lvm configurable via the new --hostwritepath flag.

This is a requirement for:


⚠️ This also changes the lock directory from the current /run/lock/lvm to /etc/lvm/lock, i.e. it now lives under the host write path.
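
For illustration, the flag would be set on the node plugin, e.g. in its DaemonSet (a minimal sketch; the container name and the other args are assumptions, only --hostwritepath comes from this PR):

containers:
  - name: csi-lvmplugin            # name assumed for illustration
    args:
      - --drivername=lvm.csi.metal-stack.io   # assumed; matches the provisioner name seen in the PV below
      - --hostwritepath=/var/etc/lvm          # the new flag; must point at a host path that is writable on Talos OS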

gerhard commented 1 year ago

Given the following local diff:

diff --git a/tests/bats/test.bats b/tests/bats/test.bats
index 6f4d684..bc4e003 100644
--- a/tests/bats/test.bats
+++ b/tests/bats/test.bats
@@ -1,7 +1,7 @@
 #!/usr/bin/env bats -p

 @test "deploy csi-lvm-controller" {
-    run helm upgrade --install --repo ${HELM_REPO} csi-driver-lvm csi-driver-lvm --values values.yaml --wait --timeout=120s
+    run helm upgrade --install csi-driver-lvm helm-charts/charts/csi-driver-lvm --values values.yaml --wait --timeout=120s
     [ "$status" -eq 0 ]
 }

diff --git a/tests/values.yaml b/tests/values.yaml
index e9703d7..2888828 100644
--- a/tests/values.yaml
+++ b/tests/values.yaml
@@ -11,3 +11,4 @@ provisionerImage:

 lvm:
   devicePattern: /dev/loop100,/dev/loop101
+  hostWritePath: /var/etc/lvm
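
(On Talos OS, /var is one of the few host paths that is writable and persists across reboots, which is why hostWritePath points at /var/etc/lvm instead of /etc/lvm.)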

With a local tests/helm-charts checkout containing my changes from https://github.com/metal-stack/helm-charts/pull/64, make test passed locally:

[screenshot: passing make test run]

FWIW, the first integration test run failed 👇 (a flaky test, I assume), but the second one passed 👆. I refactored all tests so that they no longer use hard-coded sleeps, each test checks a single thing, and they lean on kubectl wait --for=jsonpath='<JSONPATH>'=<VALUE> --timeout=10s as much as possible (see the sketch after the screenshot below).

[screenshot: failed first integration test run]
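
A test in that style could look like this (a sketch; the PVC name is illustrative, not from the suite):

@test "pvc is bound" {
    run kubectl wait pvc/csi-pvc --for=jsonpath='{.status.phase}'=Bound --timeout=10s
    [ "$status" -eq 0 ]
}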

Next, I am going to deploy all these changes in my Talos OS cluster and check that PVs are now created as expected. That will be the test that matters the most to me.

I am unable to check if the GitHub workflow still works. Will need one of the maintainers to approve this PR before it runs.

Is there anything that needs to change before this gets merged?


After this goes in, I can finish https://github.com/metal-stack/helm-charts/pull/64.

gerhard commented 1 year ago

I deployed https://github.com/users/gerhard/packages/container/csi-driver-lvm/87639324?tag=2023-04-23.2 & https://github.com/users/gerhard/packages/container/csi-driver-lvm-provisioner/87639357?tag=2023-04-23.2 on Talos OS v1.4.0 running K8s v1.26.4, and I am able to provision, resize & delete PVs as expected.
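
For context, a storage class with that behaviour could look roughly like this (a sketch reconstructed from the PV below; the chart's actual manifest may differ):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-driver-lvm-linear-xfs
provisioner: lvm.csi.metal-stack.io
parameters:
  type: linear                     # matches "type: linear" in the PV's volumeAttributes
  csi.storage.k8s.io/fstype: xfs   # standard CSI parameter key; assumed
allowVolumeExpansion: true         # needed for the resize exercised above
reclaimPolicy: Delete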

This is what the PV looks like:

# k get pv/pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: lvm.csi.metal-stack.io
  creationTimestamp: "2023-04-23T18:22:25Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
  resourceVersion: "957454"
  uid: a01ef0a6-1253-4c8a-bb9d-8340cd1d5581
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1G
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mariadb
    namespace: wp
    resourceVersion: "957400"
    uid: 9833f5f8-c33b-4722-9dd2-2dad4b948f45
  csi:
    driver: lvm.csi.metal-stack.io
    fsType: xfs
    volumeAttributes:
      RequiredBytes: "1000000000"
      fsType: xfs
      storage.kubernetes.io/csiProvisionerIdentity: 1682259423049-8081-lvm.csi.metal-stack.io
      type: linear
    volumeHandle: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.lvm.csi/node
          operator: In
          values:
          - h12-london
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-driver-lvm-linear-xfs
  volumeMode: Filesystem
status:
  phase: Bound

And this is the related PVC:

# k get pvc/mariadb -n wp -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"labels":{"app":"wordpress"},"name":"mariadb","namespace":"wp"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1G"}},"storageClassName":"csi-driver-lvm-linear-xfs"}}
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: lvm.csi.metal-stack.io
    volume.kubernetes.io/storage-provisioner: lvm.csi.metal-stack.io
  creationTimestamp: "2023-04-23T18:22:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: wordpress
  name: mariadb
  namespace: wp
  resourceVersion: "957456"
  uid: 9833f5f8-c33b-4722-9dd2-2dad4b948f45
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1G
  storageClassName: csi-driver-lvm-linear-xfs
  volumeMode: Filesystem
  volumeName: pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1G
  phase: Bound
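
The resize mentioned above then amounts to bumping the PVC request, for example (assuming the storage class allows volume expansion):

kubectl patch pvc/mariadb -n wp --type merge -p '{"spec":{"resources":{"requests":{"storage":"2G"}}}}'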

This is what vgdisplay returns (run via nsenter into the host namespaces, since Talos OS ships no host shell):

nsenter --uts --ipc --mount --pid --target 1 vgdisplay

  --- Volume group ---
  VG Name               csi-lvm
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  20
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <465.76 GiB
  PE Size               4.00 MiB
  Total PE              119234
  Alloc PE / Size       478 / <1.87 GiB
  Free  PE / Size       118756 / 463.89 GiB
  VG UUID               5XscWv-q1Ba-xg2a-XvQz-Jvuz-cgiC-9wzMEN

And here are pvs & lvs:

nsenter --uts --ipc --mount --pid --target 1 pvs
  PV         VG      Fmt  Attr PSize    PFree
  /dev/sda   csi-lvm lvm2 a--  <465.76g 463.89g

nsenter --uts --ipc --mount --pid --target 1 lvs
  LV                                       VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  pvc-9833f5f8-c33b-4722-9dd2-2dad4b948f45 csi-lvm -wi-ao---- 956.00m
  pvc-99830868-a06a-4dff-904f-483e0dfb5121 csi-lvm -wi-ao---- 956.00m
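
(The 956.00m LSize is expected: the requested 1G is 1,000,000,000 bytes, which LVM rounds up to 239 whole physical extents of 4.00 MiB, i.e. 956 MiB.)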

Docker Build GitHub Actions workflow passed on my fork: https://github.com/gerhard/csi-driver-lvm/actions/runs/4775196008

What else needs to happen before this gets merged?

gerhard commented 1 year ago

Thanks for the approval @majst01! I improved the commit messages; all code changes remained the same.

Just waiting for your 👀 & 👍 @mwennrich

gerhard commented 1 year ago

WDYT @mwennrich ?

jleeh commented 1 year ago

Thanks for making this change @gerhard. I wanted to use this on a Talos cluster, hit the same issue, and stumbled across this PR. I can confirm that both your chart and this change are working on our cluster as well.

gerhard commented 1 year ago

So glad that this helped @jleeh.

Thanks for the 👍 @mwennrich. Can this be merged now & a new release cut?

Once that is done, I can finish https://github.com/metal-stack/csi-driver-lvm/issues/86 so that a new chart version can be made available.

FWIW, I deployed this to production last weekend. So far, so good 🤞

[screenshot]

gerhard commented 1 year ago

⭐️⭐️⭐️⭐️⭐️ @majst01

Gerrit91 commented 1 year ago

Awesome additions! Thank you!