Open inviscid opened 1 year ago
After a little more investigation I see that even for the nodes that come back up with the same name, the `linstor-csi-node` daemonset never enters a Ready state because the `linstor-wait-node-online` container keeps thinking the node is not online.
```
time="2023-03-10T15:04:03Z" level=info msg="not ready" error="satellite gke-gp-test2-greenplum-165811ea-3g2x is not ONLINE: OFFLINE" version=refs/tags/v0.2.1
time="2023-03-10T15:04:13Z" level=info msg="not ready" error="satellite gke-gp-test2-greenplum-165811ea-3g2x is not ONLINE: OFFLINE" version=refs/tags/v0.2.1
```
Without the `linstor-csi-node` daemonset running, I suspect this is why Piraeus can't recover the StoragePool properly after node losses. I'm not sure why the `linstor-wait-node-online` container thinks the node is offline when, as far as I can tell, it appears to be online.
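For anyone hitting the same mismatch, the satellite state that the wait container is polling can be inspected directly from the LINSTOR controller. This is only an illustrative check (the `deploy/linstor-controller` name matches the one used later in this thread; adjust to your deployment):

```shell
# List satellites as the controller sees them; a healthy node should show ONLINE.
kubectl exec deploy/linstor-controller -- linstor node list
```

If the controller itself reports the satellite OFFLINE, the wait container is merely reflecting the controller's view rather than misbehaving.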
I discovered the problem: when the LinstorSatellite pod fails, there is no automated recovery. The pod remains in a failed state until manual intervention. Since the LinstorSatellite's `linstor-satellite` container isn't online, the `linstor-csi-node` remains in a pending state.
Should the LinstorSatellite try to self-heal so that manual intervention is not required?
As an alternative we could always set up a CronJob that runs every minute to delete failed pods. That feels a bit heavy-handed, though.
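A sketch of that heavy-handed workaround, purely as an assumption of how it could look (the namespace and ServiceAccount name are hypothetical, and the ServiceAccount would need RBAC permission to list and delete pods):

```yaml
# Hypothetical CronJob that reaps failed pods every minute.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: delete-failed-pods
  namespace: piraeus-datastore   # assumed namespace
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper   # assumed SA with pod delete rights
          restartPolicy: Never
          containers:
            - name: reaper
              image: bitnami/kubectl
              command:
                - kubectl
                - delete
                - pod
                - --field-selector=status.phase=Failed
```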
Besides the self-healing option above when pods fail, I also discovered that pods are not being rescheduled to the available nodes that hold a replica of the data. Is there an additional scheduler that must be specified to allow scheduling a pod on a node with one of its replicas?
I then forced the pod to schedule on the node with its replica by temporarily cordoning the other nodes, and it fails to mount the volume with the following error:

```
MountVolume.SetUp failed for volume "pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1: failed to set source device readwrite: exit status 1
```
It feels like there must be more configuration required to allow pods to reschedule on nodes holding their replica when the original leader node goes down. Any insight would be greatly appreciated.
Strange, the Operator should actually try to recreate the Satellite Pods if they enter a failed state. Perhaps there is a bug in the logic there. What's the state of the Pods after rotating nodes?
And the last error looks like DRBD refuses the mount because it thinks there is more-up-to-date data somewhere else. What storage class parameters did you use? I think this might be because 2 nodes were unreachable, so you potentially lost 2 out of 3 copies of the data, which for DRBD would make the last remaining copy outdated without manual intervention.
I want to emphasize that Piraeus is fantastic when it is fully operational. The PVs are bound to pods quickly (a few seconds) and the local disk speed is very near raw speed.
If I can just get it to remain stable throughout node lifecycles it is going to be a perfect solution.
I can consistently get the bad behavior by following these steps:
I get the same bad behavior if I don't expand the node pool and just keep the original 3 nodes. It seems like there is no signal to detach the original volume: the pod successfully gets rescheduled to another node, but the volume is never ready to attach because it is still attached on the original (now cordoned) node.
More details from my testing:
My storage class looks like:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-double
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  autoPlace: "2"
  storagePool: lvm-thin
  csi.storage.k8s.io/fstype: xfs
  property.linstor.csi.linbit.com/DrbdOptions/auto-quorum: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-no-data-accessible: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  property.linstor.csi.linbit.com/DrbdOptions/Net/rr-conflict: retry-connect
```
The storage cluster seems to be in some strange state. I can observe that K8s is reporting 4 bound PVC/PV combinations but when I look at the Piraeus utilities, I only observe two volumes in use. The two volumes showing in the "Unused" state are the ones that the two pods can't bind to.
```
NAME                           STATUS   VOLUME                                     CAPACITY     ACCESS MODES   STORAGECLASS     AGE
gp-test01-pgdata-master-0      Bound    pvc-553a298a-4541-49d4-8f67-f2b28ff76abe   19531250Ki   RWO            piraeus-double   2d17h
gp-test01-pgdata-segment-a-0   Bound    pvc-a134464b-43d1-4612-bb48-4cd01ba1216c   19531250Ki   RWO            piraeus-double   2d17h
gp-test01-pgdata-segment-a-1   Bound    pvc-0b41b448-3206-48f6-9ea2-01a73110f820   19531250Ki   RWO            piraeus-double   2d17h
gp-test01-pgdata-segment-a-2   Bound    pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1   19531250Ki   RWO            piraeus-double   2d17h
```
```
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node                                 | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  | State    |
|==========================================================================================================================================================================|
| gke-gp-test2-greenplum-165811ea-8b5r | pvc-0b41b448-3206-48f6-9ea2-01a73110f820 | lvm-thin             |     0 |    1003 | /dev/drbd1003 | 19.08 MiB | InUse  | UpToDate |
| gke-gp-test2-greenplum-165811ea-8b5r | pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1 | DfltDisklessStorPool |     0 |    1004 | /dev/drbd1004 |           | Unused | Diskless |
| gke-gp-test2-greenplum-165811ea-vd4m | pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1 | lvm-thin             |     0 |    1004 | /dev/drbd1004 |  3.82 MiB | Unused | UpToDate |
| gke-gp-test2-greenplum-165811ea-0p7r | pvc-553a298a-4541-49d4-8f67-f2b28ff76abe | lvm-thin             |     0 |    1000 | /dev/drbd1000 | 19.08 MiB | InUse  | UpToDate |
| gke-gp-test2-greenplum-165811ea-0p7r | pvc-a134464b-43d1-4612-bb48-4cd01ba1216c | DfltDisklessStorPool |     0 |    1001 | /dev/drbd1001 |           | Unused | Diskless |
| gke-gp-test2-greenplum-165811ea-vd4m | pvc-a134464b-43d1-4612-bb48-4cd01ba1216c | lvm-thin             |     0 |    1001 | /dev/drbd1001 |  3.82 MiB | Unused | UpToDate |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
I expanded the node pool to confirm a new node comes up normally and it does.
```
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node                                    | Driver   | PoolName               | FreeCapacity | TotalCapacity | CanSnapshots | State | SharedName |
|=======================================================================================================================================================================|
| DfltDisklessStorPool | gke-gp-test2-default-pool-529cc55a-b4hl | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-default-pool-529cc55a-dvs9 | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-default-pool-529cc55a-eqrj | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-0p7r    | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-3zvm    | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-8b5r    | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-vd4m    | DISKLESS |                        |              |               | False        | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-0p7r    | LVM_THIN | vg_local_ssds/thinpool |   374.77 GiB |    374.81 GiB | True         | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-3zvm    | LVM_THIN | vg_local_ssds/thinpool |   374.81 GiB |    374.81 GiB | True         | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-8b5r    | LVM_THIN | vg_local_ssds/thinpool |   374.77 GiB |    374.81 GiB | True         | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-vd4m    | LVM_THIN | vg_local_ssds/thinpool |   374.77 GiB |    374.81 GiB | True         | Ok    |            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
I wonder what's going on here. What I find particularly strange is that your storage class has `autoPlace: "2"`, but it looks like there is actually only one replica of the data?
What is the exact message when you `kubectl describe` the Pod? It seems to have created diskless resources, which I assume is done by CSI during the attach process. It then runs into the "failed to set source device readwrite: exit status 1" failure, but the resource looks normal enough.

Perhaps the easiest option would be to collect an SOS report in LINSTOR:

```
kubectl exec deploy/linstor-controller -- linstor sos-report create
```

and then copy the resulting tar.gz from the pod and attach it here for analysis.
The events from the `kubectl describe` are:
```
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               16m                default-scheduler        Successfully assigned gptest/segment-a-1 to gke-gp-test2-greenplum-165811ea-wl2c
  Warning  FailedAttachVolume      16m                attachdetach-controller  Multi-Attach error for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount             15m                kubelet                  MountVolume.SetUp failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e: mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t xfs -o _netdev,nouuid /dev/drbd1006 /var/lib/kubelet/pods/a5a248a0-d736-4bf6-b245-7c356fe158b9/volumes/kubernetes.io~csi/pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e/mount
Output: mount: /var/lib/kubelet/pods/a5a248a0-d736-4bf6-b245-7c356fe158b9/volumes/kubernetes.io~csi/pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e/mount: can't read superblock on /dev/drbd1006.
  Warning  FailedMount             14m (x2 over 15m)  kubelet                  MountVolume.WaitForAttach failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : volume pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e has GET error for volume attachment csi-39e86b8dc8d5799c51edfe1fd17a6b5ca5a8c9925d4a744ea25f7d377965691f: volumeattachments.storage.k8s.io "csi-39e86b8dc8d5799c51edfe1fd17a6b5ca5a8c9925d4a744ea25f7d377965691f" is forbidden: User "system:node:gke-gp-test2-greenplum-165811ea-wl2c" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'gke-gp-test2-greenplum-165811ea-wl2c' and this object
  Normal   SuccessfulAttachVolume  14m (x2 over 15m)  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e"
  Warning  FailedMount             11m                kubelet                  Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[kube-api-access-zxbkg ssh-key-volume config-volume gp-test01-pgdata cgroups podinfo]: timed out waiting for the condition
  Warning  FailedMount             9m26s              kubelet                  Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[config-volume gp-test01-pgdata cgroups podinfo kube-api-access-zxbkg ssh-key-volume]: timed out waiting for the condition
  Warning  FailedMount             2m36s (x2 over 13m) kubelet                 Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[cgroups podinfo kube-api-access-zxbkg ssh-key-volume config-volume gp-test01-pgdata]: timed out waiting for the condition
  Warning  FailedMount             79s (x12 over 15m) kubelet                  MountVolume.SetUp failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e: failed to set source device readwrite: exit status 1
  Warning  FailedMount             22s (x3 over 7m9s) kubelet                  Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[ssh-key-volume config-volume gp-test01-pgdata cgroups podinfo kube-api-access-zxbkg]: timed out waiting for the condition
```
The SOS report is attached.
A quick update...
```
MountVolume.SetUp failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e: failed to set source device readwrite: exit status 1
```
I see a specific error in DRBD:
```
[ +0.000250] sd 1:0:1:0: [sdb] tag#381 request not aligned to the logical block size
[ +0.007850] blk_update_request: I/O error, dev sdb, sector 242176 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
```
This happens on both remaining diskful nodes. I believe this is a bug that was recently identified in DRBD, but no fix has been released yet.
The issue is that DRBD on the new node reports a hardware sector size of 512 bytes, but your disks seem to use a different size, probably 4k. When the diskless node tries to mount the volume, the mount command tries to read in blocks of 512 bytes, which triggers unaligned reads on the diskful node. The diskful nodes then detach the disk, because the lower layer complains about the unaligned reads.
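To make the failure mode concrete, here is a minimal numeric sketch (not DRBD code, just an illustration of the alignment rule a 4k-sector device enforces): a request must start and end on a logical-block boundary, so a 512-byte read is rejected by a device whose logical block size is 4096 bytes.

```python
LOGICAL_BLOCK_SIZE = 4096  # what the physical disk actually uses

def request_is_aligned(offset_bytes: int, length_bytes: int,
                       block_size: int = LOGICAL_BLOCK_SIZE) -> bool:
    """A block-device request must start and end on a logical-block boundary."""
    return offset_bytes % block_size == 0 and length_bytes % block_size == 0

# DRBD advertising 512-byte sectors lets the upper layer issue 512-byte reads:
print(request_is_aligned(0, 512))    # False: shorter than one 4k block
print(request_is_aligned(0, 4096))   # True: exactly one 4k block
```

Any 512-byte request that is not a multiple of 4096 bytes fails this check, which matches the "request not aligned to the logical block size" messages above.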
I don't know of a workaround other than to wait for a fixed DRBD version :/
I did confirm the sector size is 4K on all the local disks. I tried to find the DRBD issue related to this but was unsuccessful. Do you happen to have the issue number so I can monitor it?
```
root@gke-gp-test2-greenplum-165811ea-qqjq:/# LC_ALL=C fdisk -l /dev/sdb
Disk /dev/sdb: 375 GiB, 402653184000 bytes, 98304000 sectors
Disk model: EphemeralDisk
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

root@gke-gp-test2-greenplum-165811ea-rz6s:/# LC_ALL=C fdisk -l /dev/sdb
Disk /dev/sdb: 375 GiB, 402653184000 bytes, 98304000 sectors
Disk model: EphemeralDisk
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

root@gke-gp-test2-greenplum-165811ea-wl2c:/# LC_ALL=C fdisk -l /dev/sdb
Disk /dev/sdb: 375 GiB, 402653184000 bytes, 98304000 sectors
Disk model: EphemeralDisk
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
```
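As a cross-check (illustrative only, run on the node itself), the kernel's view of the block sizes can also be read from sysfs without `fdisk`:

```shell
# Logical and physical block size as the kernel reports them for sdb.
cat /sys/block/sdb/queue/logical_block_size
cat /sys/block/sdb/queue/physical_block_size
```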
I think the issue is not on the public GitHub page. I think there will be a new RC next week which fixes the issue. I'll keep you posted.
@WanzenBug Just checking back in to see if the DRBD RC was released.
No. You can subscribe here: https://lists.linbit.com/mailman/listinfo/drbd-announce to get the announcement directly.
@WanzenBug I did subscribe to the drbd-announce list but have not seen any information on a bug fix release related to a sector size detection bug. Sorry to be the squeaky wheel on this, but I'm really looking forward to testing and using this as soon as possible.
A secondary question... Will I need to specify something different in the Piraeus operator to have it grab the latest version of DRBD when it builds it in the daemonset? Currently it is building v9.2.2, but I don't see that version specified in the operator images, so I'm assuming it must have some way to grab the latest.
Well, my timing was perfect. Within an hour of posting the above I received notice that v9.2.3 RC1 is out.
How do I tell the daemonset to use that v9.2.3 RC1 instead of v9.2.2? Thx...
Since we usually don't build images for RCs, you'll want to:

1. Edit `VERSION-9.2.env` to reference the RC.
2. Run `make update upload REGISTRY=<some-registry>`. We use `quay.io/piraeusdatastore` for `<some-registry>`, but you will need to use your own. You can also set `PLATFORMS=linux/amd64` and `DF=Dockerfile.jammy` if you only need to build for one OS.
3. Unload the old module with `rmmod drbd_transport_tcp drbd`.

I couldn't get the process described above to load the correct image, so I waited until v9.2.3 was released to try again.
I specified v9.2.3 in the Piraeus configmap, but when I view the logs of the DRBD loader process it looks like it is continuing to load v9.2.2.
```yaml
components:
  linstor-controller:
    tag: v1.20.3
    image: piraeus-server
  linstor-satellite:
    tag: v1.20.3
    image: piraeus-server
  linstor-csi:
    tag: v0.22.1
    image: piraeus-csi
  drbd-reactor:
    tag: v1.0.0
    image: drbd-reactor
  ha-controller:
    tag: v1.1.2
    image: piraeus-ha-controller
  drbd-shutdown-guard:
    tag: v1.0.0
    image: drbd-shutdown-guard
  drbd-module-loader:
    tag: v9.2.3
```
This results in:

```
DRBD version loaded:
version: 9.2.2 (api:2/proto:86-121)
```
I thought this might just be a messaging problem and v9.2.3 was actually in place. However, I lost quorum pretty soon after forming the test cluster. How can I force it to use the new v9.2.3 DRBD version without trying all the overrides above?
Thx...
Does the Pod use the expected `v9.2.3` image? If so, you need to unload 9.2.2 first. The injector does not do that automatically. Run `rmmod drbd_transport_tcp drbd` on every affected node.
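A sketch of that per-node procedure, assuming shell access on the node (or a privileged debug pod with host access); the module names are taken from the comment above:

```shell
# Check which DRBD version is currently loaded on this node.
cat /proc/drbd

# Unload the old module so the injector can load the new build.
# This fails if any DRBD resource is still active on the node, so drain first.
rmmod drbd_transport_tcp drbd
```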
It seems like it is completely ignoring the `v9.2.3` specification in the configmap. When I view the logs of the `drbd-module-loader` container, it appears to immediately launch the v9.2.2 build.
I've scaled the node pool to zero and back up. Reloaded all the configmaps, restarted pods, etc... but have not been able to get the loader to use v9.2.3.
This is likely, as you asked above, because the image specified for the `drbd-module-loader` in the pod manifest is:
```yaml
initContainers:
  - name: drbd-module-loader
    image: quay.io/piraeusdatastore/drbd9-jammy:v9.2.2
    resources: {}
    volumeMounts:
      - name: lib-modules
        readOnly: true
        mountPath: /lib/modules
      - name: usr-src
        readOnly: true
        mountPath: /usr/src
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add:
          - SYS_MODULE
        drop:
          - ALL
```
I'm not sure how to tell it to use the v9.2.3 image at this point.
Quick follow-up. I decided to completely remove the Piraeus installation and then reinstall/redeploy it. The good news is that now it is picking up the `v9.2.3` specification in the configmap. The bad news is that I'm not sure how well we can upgrade things in place if changes like this require a full tear-down.
Admittedly I'm still just getting my feet under me with this but it felt like config changes were being ignored even with the usual Kubernetes hard restart processes. Anyway, hopefully getting closer to testing performance and failover now.
I think you may have needed to restart the `piraeus-operator-controller-manager` Pod. It only loads the config map once, on start-up. In normal operation the file only changes when a new operator version is also deployed, so the Pod gets restarted automatically.
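For the record, forcing such a restart is a one-liner; illustrative only, and the namespace here is an assumption (use wherever the operator is installed):

```shell
# Restart the operator so it re-reads its config map.
kubectl -n piraeus-datastore rollout restart deployment piraeus-operator-controller-manager
```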
Pretty sure I missed restarting the operator pod as you said. I'm still not able to get a stable cluster running for more than a few minutes. I really think this could be the solution we are looking for but I just can't get a test cluster set up.
I'm not ready to throw in the towel yet but is there support (paid?) that could help with getting a test rig running so I can validate performance, stability, failover, node loss recovery, snapshots, snapshot recovery, etc...?
I have scheduled some time with Linbit SDS support to review things. Hopefully, we can get a test rig running.
I'm not sure I have everything configured correctly, since my StoragePools do not recover when nodes are lost. Since a node has no guarantee of coming back with the same name after a loss (GKE node creation), the StoragePool keeps looking for the original node name and enters a state where other nodes (the replacements) are now available, but it continues to think the original nodes will come back online.
Is there a way to configure Piraeus to self-heal the StoragePools using whatever available nodes meet the NodeSelector criteria? If nodes are offline for some period of time, can they be automatically removed from the StoragePool, since they are likely gone forever?
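For what it's worth, LINSTOR itself documents an auto-evict mechanism for satellites that stay offline too long. A sketch of the relevant controller-level knobs, with property names taken from the LINSTOR user guide (verify them against your LINSTOR version before relying on this):

```shell
# Run inside the linstor-controller pod. AutoEvictAfterTime is in minutes.
linstor controller set-property DrbdOptions/AutoEvictAllowEviction true
linstor controller set-property DrbdOptions/AutoEvictAfterTime 60

# A node that is known to be permanently gone can also be removed by hand:
linstor node lost <node-name>
```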
Original test storage pools:
After doing some chaos engineering and simulating the loss of nodes: