Open inviscid opened 1 year ago
After a little more investigation I see that even for the nodes that come back up with the same name, the `linstor-csi-node` daemonset never enters a Ready state because the `linstor-wait-node-online` container keeps thinking the node is not online.
```
time="2023-03-10T15:04:03Z" level=info msg="not ready" error="satellite gke-gp-test2-greenplum-165811ea-3g2x is not ONLINE: OFFLINE" version=refs/tags/v0.2.1
time="2023-03-10T15:04:13Z" level=info msg="not ready" error="satellite gke-gp-test2-greenplum-165811ea-3g2x is not ONLINE: OFFLINE" version=refs/tags/v0.2.1
```
Without the `linstor-csi-node` daemonset running, I suspect this is why Piraeus can't recover the StoragePool properly after node losses. I'm not sure why the `linstor-wait-node-online` container thinks the node is offline when, as far as I can tell, it appears to be online.
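For anyone hitting the same mismatch, the satellite state that the wait container is polling can be inspected directly from the LINSTOR controller. This is only an illustrative check (the `deploy/linstor-controller` name matches the one used later in this thread; adjust to your deployment):

```shell
# List satellites as the controller sees them; a healthy node should show ONLINE.
kubectl exec deploy/linstor-controller -- linstor node list
```

If the controller itself reports the satellite OFFLINE, the wait container is merely reflecting the controller's view rather than misbehaving.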
I discovered the problem: when the LinstorSatellite pod fails, there is no automated recovery. The pod remains in a failed state until manual intervention. Since the LinstorSatellite's `linstor-satellite` container isn't online, the `linstor-csi-node` remains in a pending state.
Should the LinstorSatellite try to self-heal so that manual intervention is not required?
As an alternative we could always set up a CronJob that runs every minute to delete failed pods. That feels a bit heavy-handed, though.
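A sketch of that heavy-handed workaround, purely as an assumption of how it could look (the namespace and ServiceAccount name are hypothetical, and the ServiceAccount would need RBAC permission to list and delete pods):

```yaml
# Hypothetical CronJob that reaps failed pods every minute.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: delete-failed-pods
  namespace: piraeus-datastore   # assumed namespace
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-reaper   # assumed SA with pod delete rights
          restartPolicy: Never
          containers:
            - name: reaper
              image: bitnami/kubectl
              command:
                - kubectl
                - delete
                - pod
                - --field-selector=status.phase=Failed
```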
Besides the self-healing option above when pods fail, I also discovered that pods are not being rescheduled to the available nodes that hold a replica of the data. Is there an additional scheduler that must be specified to allow scheduling a pod on a node with one of its replicas?
I then forced the pod to schedule on the node with its replica by temporarily cordoning the other nodes, and it fails to mount the volume with the following error:

```
MountVolume.SetUp failed for volume "pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1: failed to set source device readwrite: exit status 1
```
It feels like there must be more configuration required to allow pods to reschedule on nodes holding their replica when the original leader node goes down. Any insight would be greatly appreciated.
Strange, the Operator should actually try to recreate the Satellite Pods if they enter a failed state. Perhaps there is a bug in the logic there. What's the state of the Pods after rotating nodes?
And the last error looks like DRBD refuses the mount because it thinks there is more-up-to-date data somewhere else. What storage class parameters did you use? I think this might be because 2 nodes were unreachable, so you potentially lost 2 out of 3 copies of the data, which for DRBD would make the last remaining copy outdated without manual intervention.
I want to emphasize that Piraeus is fantastic when it is fully operational. The PVs are bound to pods quickly (a few seconds) and the local disk speed is very near raw speed.
If I can just get it to remain stable throughout node lifecycles it is going to be a perfect solution.
I can consistently get the bad behavior by following these steps:
I get the same bad behavior if I don't expand the node pool and just keep the original 3 nodes. It seems like there is no signal to detach the original volume: the pod successfully gets rescheduled to another node, but the volume is never ready to attach because it is still attached on the original (now cordoned) node.
More details from my testing:
My storage class looks like:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-double
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  autoPlace: "2"
  storagePool: lvm-thin
  csi.storage.k8s.io/fstype: xfs
  property.linstor.csi.linbit.com/DrbdOptions/auto-quorum: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-no-data-accessible: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  property.linstor.csi.linbit.com/DrbdOptions/Net/rr-conflict: retry-connect
```
The storage cluster seems to be in some strange state. I can observe that K8s is reporting 4 bound PVC/PV combinations but when I look at the Piraeus utilities, I only observe two volumes in use. The two volumes showing in the "Unused" state are the ones that the two pods can't bind to.
```
NAME                           STATUS   VOLUME                                     CAPACITY     ACCESS MODES   STORAGECLASS     AGE
gp-test01-pgdata-master-0      Bound    pvc-553a298a-4541-49d4-8f67-f2b28ff76abe   19531250Ki   RWO            piraeus-double   2d17h
gp-test01-pgdata-segment-a-0   Bound    pvc-a134464b-43d1-4612-bb48-4cd01ba1216c   19531250Ki   RWO            piraeus-double   2d17h
gp-test01-pgdata-segment-a-1   Bound    pvc-0b41b448-3206-48f6-9ea2-01a73110f820   19531250Ki   RWO            piraeus-double   2d17h
gp-test01-pgdata-segment-a-2   Bound    pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1   19531250Ki   RWO            piraeus-double   2d17h
```
```
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Node                                 | Resource                                 | StoragePool          | VolNr | MinorNr | DeviceName    | Allocated | InUse  | State    |
|==========================================================================================================================================================================|
| gke-gp-test2-greenplum-165811ea-8b5r | pvc-0b41b448-3206-48f6-9ea2-01a73110f820 | lvm-thin             |     0 |    1003 | /dev/drbd1003 | 19.08 MiB | InUse  | UpToDate |
| gke-gp-test2-greenplum-165811ea-8b5r | pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1 | DfltDisklessStorPool |     0 |    1004 | /dev/drbd1004 |           | Unused | Diskless |
| gke-gp-test2-greenplum-165811ea-vd4m | pvc-40b7b35a-6d9c-425a-ac5e-40e46e59dee1 | lvm-thin             |     0 |    1004 | /dev/drbd1004 |  3.82 MiB | Unused | UpToDate |
| gke-gp-test2-greenplum-165811ea-0p7r | pvc-553a298a-4541-49d4-8f67-f2b28ff76abe | lvm-thin             |     0 |    1000 | /dev/drbd1000 | 19.08 MiB | InUse  | UpToDate |
| gke-gp-test2-greenplum-165811ea-0p7r | pvc-a134464b-43d1-4612-bb48-4cd01ba1216c | DfltDisklessStorPool |     0 |    1001 | /dev/drbd1001 |           | Unused | Diskless |
| gke-gp-test2-greenplum-165811ea-vd4m | pvc-a134464b-43d1-4612-bb48-4cd01ba1216c | lvm-thin             |     0 |    1001 | /dev/drbd1001 |  3.82 MiB | Unused | UpToDate |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
I expanded the node pool to confirm a new node comes up normally and it does.
```
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| StoragePool          | Node                                    | Driver   | PoolName               | FreeCapacity | TotalCapacity | CanSnapshots | State | SharedName |
|=======================================================================================================================================================================|
| DfltDisklessStorPool | gke-gp-test2-default-pool-529cc55a-b4hl | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-default-pool-529cc55a-dvs9 | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-default-pool-529cc55a-eqrj | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-0p7r    | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-3zvm    | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-8b5r    | DISKLESS |                        |              |               | False        | Ok    |            |
| DfltDisklessStorPool | gke-gp-test2-greenplum-165811ea-vd4m    | DISKLESS |                        |              |               | False        | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-0p7r    | LVM_THIN | vg_local_ssds/thinpool |   374.77 GiB |    374.81 GiB | True         | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-3zvm    | LVM_THIN | vg_local_ssds/thinpool |   374.81 GiB |    374.81 GiB | True         | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-8b5r    | LVM_THIN | vg_local_ssds/thinpool |   374.77 GiB |    374.81 GiB | True         | Ok    |            |
| lvm-thin             | gke-gp-test2-greenplum-165811ea-vd4m    | LVM_THIN | vg_local_ssds/thinpool |   374.77 GiB |    374.81 GiB | True         | Ok    |            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
I wonder what's going on here. What I find particularly strange is that your storage class has `autoPlace: "2"`, but it looks like there is actually only one replica of the data?
What is the exact message when you `kubectl describe` the Pod? It seems to have created diskless resources, which I assume is done by CSI during the attach process. It then runs into the "failed to set source device readwrite: exit status 1" failure, but the resource looks normal enough.

Perhaps the easiest option would be to collect an SOS report in LINSTOR:

```
kubectl exec deploy/linstor-controller -- linstor sos-report create
```

and then copy the resulting tar.gz from the pod and attach it here for analysis.
The events from the `kubectl describe` are:
```
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               16m                default-scheduler        Successfully assigned gptest/segment-a-1 to gke-gp-test2-greenplum-165811ea-wl2c
  Warning  FailedAttachVolume      16m                attachdetach-controller  Multi-Attach error for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount             15m                kubelet                  MountVolume.SetUp failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e: mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t xfs -o _netdev,nouuid /dev/drbd1006 /var/lib/kubelet/pods/a5a248a0-d736-4bf6-b245-7c356fe158b9/volumes/kubernetes.io~csi/pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e/mount
Output: mount: /var/lib/kubelet/pods/a5a248a0-d736-4bf6-b245-7c356fe158b9/volumes/kubernetes.io~csi/pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e/mount: can't read superblock on /dev/drbd1006.
  Warning  FailedMount             14m (x2 over 15m)  kubelet                  MountVolume.WaitForAttach failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : volume pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e has GET error for volume attachment csi-39e86b8dc8d5799c51edfe1fd17a6b5ca5a8c9925d4a744ea25f7d377965691f: volumeattachments.storage.k8s.io "csi-39e86b8dc8d5799c51edfe1fd17a6b5ca5a8c9925d4a744ea25f7d377965691f" is forbidden: User "system:node:gke-gp-test2-greenplum-165811ea-wl2c" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'gke-gp-test2-greenplum-165811ea-wl2c' and this object
  Normal   SuccessfulAttachVolume  14m (x2 over 15m)  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e"
  Warning  FailedMount             11m                kubelet                  Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[kube-api-access-zxbkg ssh-key-volume config-volume gp-test01-pgdata cgroups podinfo]: timed out waiting for the condition
  Warning  FailedMount             9m26s              kubelet                  Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[config-volume gp-test01-pgdata cgroups podinfo kube-api-access-zxbkg ssh-key-volume]: timed out waiting for the condition
  Warning  FailedMount             2m36s (x2 over 13m) kubelet                 Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[cgroups podinfo kube-api-access-zxbkg ssh-key-volume config-volume gp-test01-pgdata]: timed out waiting for the condition
  Warning  FailedMount             79s (x12 over 15m) kubelet                  MountVolume.SetUp failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e: failed to set source device readwrite: exit status 1
  Warning  FailedMount             22s (x3 over 7m9s) kubelet                  Unable to attach or mount volumes: unmounted volumes=[gp-test01-pgdata], unattached volumes=[ssh-key-volume config-volume gp-test01-pgdata cgroups podinfo kube-api-access-zxbkg]: timed out waiting for the condition
```
The SOS report is attached.
A quick update...
```
MountVolume.SetUp failed for volume "pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-a3dab82f-60d7-4696-9bd4-feeaaf50272e: failed to set source device readwrite: exit status 1
```
I see a specific error in DRBD:
```
[ +0.000250] sd 1:0:1:0: [sdb] tag#381 request not aligned to the logical block size
[ +0.007850] blk_update_request: I/O error, dev sdb, sector 242176 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
```
This happens on both remaining diskful nodes. I believe this is a bug that was recently identified in DRBD, but no fix has been released yet.
The issue is that DRBD on the new node reports a hardware sector size of 512 bytes, but your disks seem to use a different size, probably 4k. When the diskless node tries to mount the volume, the mount command tries to read in blocks of 512 bytes, which triggers unaligned reads on the diskful node. The diskful nodes then detach the disk, because the lower layer complains about the unaligned reads.
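To make the failure mode concrete, here is a minimal numeric sketch (not DRBD code, just an illustration of the alignment rule a 4k-sector device enforces): a request must start and end on a logical-block boundary, so a 512-byte read is rejected by a device whose logical block size is 4096 bytes.

```python
LOGICAL_BLOCK_SIZE = 4096  # what the physical disk actually uses

def request_is_aligned(offset_bytes: int, length_bytes: int,
                       block_size: int = LOGICAL_BLOCK_SIZE) -> bool:
    """A block-device request must start and end on a logical-block boundary."""
    return offset_bytes % block_size == 0 and length_bytes % block_size == 0

# DRBD advertising 512-byte sectors lets the upper layer issue 512-byte reads:
print(request_is_aligned(0, 512))    # False: shorter than one 4k block
print(request_is_aligned(0, 4096))   # True: exactly one 4k block
```

Any 512-byte request that is not a multiple of 4096 bytes fails this check, which matches the "request not aligned to the logical block size" messages above.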
I don't know of a workaround other than to wait for a fixed DRBD version :/
I did confirm the sector size is 4K on all the local disks. I tried to find the DRBD issue related to this but was unsuccessful. Do you happen to have the issue number so I can monitor it?
```
root@gke-gp-test2-greenplum-165811ea-qqjq:/# LC_ALL=C fdisk -l /dev/sdb
Disk /dev/sdb: 375 GiB, 402653184000 bytes, 98304000 sectors
Disk model: EphemeralDisk
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

root@gke-gp-test2-greenplum-165811ea-rz6s:/# LC_ALL=C fdisk -l /dev/sdb
Disk /dev/sdb: 375 GiB, 402653184000 bytes, 98304000 sectors
Disk model: EphemeralDisk
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

root@gke-gp-test2-greenplum-165811ea-wl2c:/# LC_ALL=C fdisk -l /dev/sdb
Disk /dev/sdb: 375 GiB, 402653184000 bytes, 98304000 sectors
Disk model: EphemeralDisk
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
```
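As a cross-check (illustrative only, run on the node itself), the kernel's view of the block sizes can also be read from sysfs without `fdisk`:

```shell
# Logical and physical block size as the kernel reports them for sdb.
cat /sys/block/sdb/queue/logical_block_size
cat /sys/block/sdb/queue/physical_block_size
```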
I think the issue is not on the public GitHub page. I think there will be a new RC next week which fixes the issue. I'll keep you posted.
@WanzenBug Just checking back in to see if the DRBD RC was released.
No. You can subscribe here: https://lists.linbit.com/mailman/listinfo/drbd-announce to get the announcement directly.
@WanzenBug I did subscribe to the drbd-announce list but have not seen any information on a bug fix release related to a sector size detection bug. Sorry to be the squeaky wheel on this, but I'm really looking forward to testing and using this as soon as possible.
A secondary question... Will I need to specify something different in the Piraeus operator to have it grab the latest version of DRBD when it builds it in the daemonset? Currently it is building v9.2.2, but I don't see that version specified in the operator images, so I'm assuming it must have some way to grab the latest.
Well, my timing was perfect. Within an hour of posting the above I received notice that v9.2.3 RC1 is out.
How do I tell the daemonset to use that v9.2.3 RC1 instead of v9.2.2? Thx...
Since we usually don't build images for RCs, you'll want to:

1. Edit `VERSION-9.2.env` to reference the RC.
2. Run `make update upload REGISTRY=<some-registry>`. We use `quay.io/piraeusdatastore` for `<some-registry>`, but you will need to use your own. You can also set `PLATFORMS=linux/amd64` and `DF=Dockerfile.jammy` if you only need to build for one OS.
3. Unload the old module with `rmmod drbd_transport_tcp drbd`.

I couldn't get the process described above to load the correct image, so I waited until v9.2.3 was released to try again.
I specified v9.2.3 in the Piraeus configmap, but when I view the logs of the DRBD loader process it looks like it is continuing to load v9.2.2.
```yaml
components:
  linstor-controller:
    tag: v1.20.3
    image: piraeus-server
  linstor-satellite:
    tag: v1.20.3
    image: piraeus-server
  linstor-csi:
    tag: v0.22.1
    image: piraeus-csi
  drbd-reactor:
    tag: v1.0.0
    image: drbd-reactor
  ha-controller:
    tag: v1.1.2
    image: piraeus-ha-controller
  drbd-shutdown-guard:
    tag: v1.0.0
    image: drbd-shutdown-guard
  drbd-module-loader:
    tag: v9.2.3
```
This results in:

```
DRBD version loaded:
version: 9.2.2 (api:2/proto:86-121)
```
I thought this might just be a messaging problem and v9.2.3 was actually in place. However, I lost quorum pretty soon after forming the test cluster. How can I force it to use the new v9.2.3 DRBD version without trying all the overrides above?
Thx...
Does the Pod use the expected `v9.2.3` image? If so, you need to unload 9.2.2 first. The injector does not do that automatically. Run `rmmod drbd_transport_tcp drbd` on every affected node.
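A sketch of that per-node procedure, assuming shell access on the node (or a privileged debug pod with host access); the module names are taken from the comment above:

```shell
# Check which DRBD version is currently loaded on this node.
cat /proc/drbd

# Unload the old module so the injector can load the new build.
# This fails if any DRBD resource is still active on the node, so drain first.
rmmod drbd_transport_tcp drbd
```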
It seems like it is completely ignoring the `v9.2.3` specification in the configmap. When I view the logs of the `drbd-module-loader` container, it appears to immediately launch the v9.2.2 build.
I've scaled the node pool to zero and back up. Reloaded all the configmaps, restarted pods, etc... but have not been able to get the loader to use v9.2.3.
This is likely, as you asked above, because the image specified for the `drbd-module-loader` in the pod manifest is:
```yaml
initContainers:
  - name: drbd-module-loader
    image: quay.io/piraeusdatastore/drbd9-jammy:v9.2.2
    resources: {}
    volumeMounts:
      - name: lib-modules
        readOnly: true
        mountPath: /lib/modules
      - name: usr-src
        readOnly: true
        mountPath: /usr/src
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add:
          - SYS_MODULE
        drop:
          - ALL
```
I'm not sure how to tell it to use the v9.2.3 image at this point.
Quick follow-up. I decided to completely remove the Piraeus installation and then reinstall/redeploy it. The good news is that now it is picking up the `v9.2.3` specification in the configmap. The bad news is that I'm not sure how well we can upgrade things in place if changes like this require a full tear-down.
Admittedly I'm still just getting my feet under me with this but it felt like config changes were being ignored even with the usual Kubernetes hard restart processes. Anyway, hopefully getting closer to testing performance and failover now.
I think you may have needed to restart the `piraeus-operator-controller-manager` Pod. It only loads the config map once, on start-up. In normal operation the file only changes when a new operator version is also deployed, so the Pod gets restarted automatically.
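For the record, forcing such a restart is a one-liner; illustrative only, and the namespace here is an assumption (use wherever the operator is installed):

```shell
# Restart the operator so it re-reads its config map.
kubectl -n piraeus-datastore rollout restart deployment piraeus-operator-controller-manager
```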
Pretty sure I missed restarting the operator pod as you said. I'm still not able to get a stable cluster running for more than a few minutes. I really think this could be the solution we are looking for but I just can't get a test cluster set up.
I'm not ready to throw in the towel yet but is there support (paid?) that could help with getting a test rig running so I can validate performance, stability, failover, node loss recovery, snapshots, snapshot recovery, etc...?
I have scheduled some time with Linbit SDS support to review things. Hopefully, we can get a test rig running.
I'm not sure I have everything configured correctly, since my StoragePools do not recover when nodes are lost. Since a node has no guarantee of coming back with the same name after a loss (GKE node creation), the StoragePool keeps looking for the original node name and enters a state where other nodes (the replacements) are now available, but it continues to think the original nodes will come back online.
Is there a way to configure Piraeus to self-heal the StoragePools using whatever available nodes meet the NodeSelector criteria? If nodes are offline for some period of time, can they be automatically removed from the StoragePool, since they are likely gone forever?
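For what it's worth, LINSTOR itself documents an auto-evict mechanism for satellites that stay offline too long. A sketch of the relevant controller-level knobs, with property names taken from the LINSTOR user guide (verify them against your LINSTOR version before relying on this):

```shell
# Run inside the linstor-controller pod. AutoEvictAfterTime is in minutes.
linstor controller set-property DrbdOptions/AutoEvictAllowEviction true
linstor controller set-property DrbdOptions/AutoEvictAfterTime 60

# A node that is known to be permanently gone can also be removed by hand:
linstor node lost <node-name>
```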
Original test storage pools:
After doing some chaos engineering and simulating the loss of nodes: