piraeusdatastore / piraeus-ha-controller

High Availability Controller for stateful workloads using storage provisioned by Piraeus
Apache License 2.0

PV not associated to a PVC, nothing to do #7

Closed (kvaps closed this issue 2 years ago)

kvaps commented 3 years ago

Hi, we are facing a problem where piraeus-ha-controller does not reconcile failing volume attachments, reporting "PV not associated to a PVC, nothing to do":

time="2021-06-04T15:09:03Z" level=trace msg=update name=nextcloud-nfs-server-provisioner-0 namespace=nextcloud-nfs-server-provisioner-0 resource=Pod
time="2021-06-04T15:09:05Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:09:05Z" level=trace msg="finished reconciling failing volume attachments"
time="2021-06-04T15:09:09Z" level=trace msg="Pod watch resource version updated" resource-version=78535013
time="2021-06-04T15:09:13Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://linstor-controller:3371/v1/resource-definitions/pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca/resources'"
time="2021-06-04T15:09:13Z" level=trace msg="lost pv" lostPV=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:09:13Z" level=info msg="processing failing pv" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=debug msg="PV not associated to a PVC, nothing to do" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:09:13Z" level=trace msg="finished reconciling failing volume attachments"
# kubectl get pvc -n nfs data-nextcloud-nfs-server-provisioner-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: linstor.csi.linbit.com
  creationTimestamp: "2021-04-08T12:32:41Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: nfs-server-provisioner
    release: nextcloud
  name: data-nextcloud-nfs-server-provisioner-0
  namespace: nfs
  resourceVersion: "27438361"
  uid: efb31302-5feb-4dbe-93f5-8994eb08c6ca
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: linstor-1
  volumeMode: Filesystem
  volumeName: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound
# kubectl get pv -o yaml pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: linstor.csi.linbit.com
  creationTimestamp: "2021-04-08T12:32:43Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/linstor-csi-linbit-com
  name: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
  resourceVersion: "27438405"
  uid: 5064a61a-88a1-47a7-a0bd-80669bf857f8
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: data-nextcloud-nfs-server-provisioner-0
    namespace: nfs
    resourceVersion: "27438312"
    uid: efb31302-5feb-4dbe-93f5-8994eb08c6ca
  csi:
    driver: linstor.csi.linbit.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1617814042512-8081-linstor.csi.linbit.com
    volumeHandle: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
  mountOptions:
  - errors=remount-ro
  persistentVolumeReclaimPolicy: Delete
  storageClassName: linstor-1
  volumeMode: Filesystem
status:
  phase: Bound
# kubectl get volumeattachments.storage.k8s.io -o yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: m1c29
  creationTimestamp: "2021-06-04T15:01:46Z"
  finalizers:
  - external-attacher/linstor-csi-linbit-com
  name: csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206
  resourceVersion: "78528835"
  uid: 94b6f060-5b73-49ed-a948-584f7c25e137
spec:
  attacher: linstor.csi.linbit.com
  nodeName: m1c29
  source:
    persistentVolumeName: pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
status:
  attached: true
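For reference, the PV above does carry a claimRef pointing at the PVC, so the "PV not associated to a PVC" message is surprising. A quick manual check of that binding (just an illustrative kubectl query, not necessarily the lookup the HA controller performs internally):

# kubectl get pv pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'

Given the claimRef shown above, this should print nfs/data-nextcloud-nfs-server-provisioner-0.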

any ideas?

kvaps commented 3 years ago

Sometimes I also see a different message, "PV without volume attachment, nothing to do":

time="2021-06-04T15:26:33Z" level=trace msg=update name=nextcloud-nfs-server-provisioner-0 namespace=nextcloud-nfs-server-provisioner-0 resource=Pod
time="2021-06-04T15:26:33Z" level=trace msg=remove name=nextcloud-nfs-server-provisioner-0 namespace=nextcloud-nfs-server-provisioner-0 resource=Pod
time="2021-06-04T15:26:33Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://linstor-controller:3371/v1/resource-definitions/pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca/resources'"
time="2021-06-04T15:26:33Z" level=trace msg="lost pv" lostPV=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:26:33Z" level=trace msg="start reconciling failing volume attachments"
time="2021-06-04T15:26:33Z" level=info msg="processing failing pv" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:26:33Z" level=debug msg="PV without volume attachment, nothing to do" pv=pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
time="2021-06-04T15:26:33Z" level=trace msg="finished reconciling failing volume attachments"
time="2021-06-04T15:26:33Z" level=trace msg=update name=csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206 resource=VA
time="2021-06-04T15:26:33Z" level=trace msg=update name=csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206 resource=VA
time="2021-06-04T15:26:33Z" level=trace msg=update name=csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206 resource=VA
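The VolumeAttachment for this PV clearly exists (see the output in the first comment), so a lookup by persistentVolumeName should find it. For illustration, a manual version of that lookup (an ad-hoc kubectl query, not the controller's actual code path):

# kubectl get volumeattachments.storage.k8s.io -o jsonpath='{.items[?(@.spec.source.persistentVolumeName=="pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca")].metadata.name}{"\n"}'

which should return csi-0c36bf3aa3e14cee55d5e4f944e16a3e408d87aaad9ab86cc0255fdb08f40206.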
kvaps commented 3 years ago

# linstor -m rd l -r pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca
[
  {
    "rsc_dfns": [
      {
        "rsc_name": "pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca",
        "rsc_dfn_uuid": "6a484c67-62c4-4d5c-9859-86c4e57ef7e9",
        "vlm_dfns": [],
        "rsc_dfn_props": [
          {
            "key": "DrbdPrimarySetOn",
            "value": "M1C29"
          },
          {
            "key": "Aux/csi-volume-annotations",
            "value": "{\"name\":\"pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca\",\"id\":\"pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca\",\"createdBy\":\"linstor.csi.linbit.com\",\"creationTime\":\"2021-04-08T12:32:41.068078808Z\",\"sizeBytes\":53687091200,\"readonly\":false,\"parameters\":{\"disklessStoragePool\":\"DfltDisklessStorPool\",\"resourceGroup\":\"linstor-1\",\"storagePool\":\"thindata\"}}"
          },
          {
            "key": "DrbdOptions/Resource/on-no-quorum",
            "value": "io-error"
          },
          {
            "key": "DrbdOptions/Resource/quorum",
            "value": "majority"
          }
        ],
        "rsc_dfn_port": 7003,
        "rsc_dfn_secret": "YebNHTSQyZVngVC/F3bz"
      }
    ]
  }
]
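Note that vlm_dfns is empty in the output above. The debug log shows the HA controller querying /v1/resource-definitions/<rd>/resources on the LINSTOR REST API; the request can be replayed exactly as copied from the log to see what the controller gets back (depending on the TLS setup this may additionally need -k or client certificates):

# curl -s -X GET -H 'Accept: application/json' 'https://linstor-controller:3371/v1/resource-definitions/pvc-efb31302-5feb-4dbe-93f5-8994eb08c6ca/resources'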
vholer commented 2 years ago

I'm hitting the very same issue quite easily; the current HA controller doesn't seem to be very reliable.

WanzenBug commented 2 years ago

Yeah, as I wrote in #13, the current HA controller is easily confused and relies too heavily on LINSTOR always being available. That's why I created #14, which still needs more testing, but it should solve these issues and be generally much more reliable (and also a bit more "aggressive") in failing over workloads.

WanzenBug commented 2 years ago

This should be resolved by the rewrite for 1.0.0.