openebs / dynamic-localpv-provisioner

Dynamically deploy Stateful Persistent Node-Local Volumes & Filesystems for Kubernetes, provisioned from simple Local-Hostpath storage.
https://openebs.io
Apache License 2.0

[bug] failed to delete large pv, thus making node unschedulable. #181

Open · bernardgut opened this issue 7 months ago

bernardgut commented 7 months ago

Describe the bug: Purposefully create a large PVC (80% of the node's ephemeral storage) to test the behaviour of the localpv-provisioner. The provisioner creates it successfully, but fails to delete the PV after the PVC is removed, leaving the node with diskPressure=true and preventing further scheduling of pods on the node. Manually deleting the PV in Kubernetes leaves the data on disk, so the issue persists. Observed on Talos 1.7.0 using the OpenEBS localpv-provisioner (Helm) deployed per the default Talos instructions in the docs.

Expected behaviour: The provisioner deletes the PV after the PVC is deleted (and/or deletes the data after the PV is manually deleted), diskPressure=true is cleared, and the node resumes normal operation.

Steps to reproduce the bug:

1. Create a PVC sized at roughly 80% of the node's ephemeral storage (a sketch of such a PVC follows this list).
2. From a pod, write data into the volume until the node reports diskPressure=true.
3. Delete the PVC.
4. Observe that the provisioner fails to delete the PV and the data stays on disk.
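A minimal sketch of such a PVC, assuming the default `openebs-hostpath` StorageClass; the name and the 17Gi request (~80% of the 22 GB disk listed under environment details) are illustrative:

```yaml
# Hypothetical test PVC; size it to ~80% of your node's ephemeral storage.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: large-test-pvc
spec:
  storageClassName: openebs-hostpath
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 17Gi
```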

The output of the following commands will help us better understand what's going on: these are the logs of the localpv-provisioner container after the deletion. They repeat the following in a loop:

...
I0429 21:12:46.177550       1 controller.go:1509] delete "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": started
2024-04-29T21:12:46.177Z    INFO    app/provisioner_hostpath.go:270 Get the Node Object with label {map[kubernetes.io/hostname:v2]}
I0429 21:12:46.181181       1 provisioner_hostpath.go:282] Deleting volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 at v2:/var/openebs/local/pvc-600bcff4-c26f-43c4-bebb-6b989110c715
2024-04-29T21:14:46.664Z    ERROR   app/provisioner.go:188      {"eventcode": "local.pv.delete.failure", "msg": "Failed to delete Local PV", "rname": "pvc-600bcff4-c26f-43c4-bebb-6b989110c715", "reason": "failed to delete host path", "storagetype": "local-hostpath"}
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Delete
    /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:188
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).deleteVolumeOperation
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1511
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolume
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1115
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolumeHandler
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1045
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem.func1
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:987
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1004
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).runVolumeWorker
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:905
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).Run.func1.3
    /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:857
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:157
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:158
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:135
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:92
E0429 21:14:46.664896       1 controller.go:1519] delete "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": volume deletion failed: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: clean up volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 failed: create process timeout after 120 seconds
E0429 21:14:46.664948       1 controller.go:995] Giving up syncing volume "pvc-600bcff4-c26f-43c4-bebb-6b989110c715" because failures 15 >= threshold 15
E0429 21:14:46.664972       1 controller.go:1007] error syncing volume "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: clean up volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 failed: create process timeout after 120 seconds
I0429 21:14:46.665321       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-600bcff4-c26f-43c4-bebb-6b989110c715", UID:"4f6f7084-bdb3-4ea5-89cc-ed217aa78da1", APIVersion:"v1", ResourceVersion:"931637", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: failed to delete volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715: clean up volume pvc-600bcff4-c26f-43c4-bebb-6b989110c715 failed: create process timeout after 120 seconds
I0429 21:27:46.178478       1 controller.go:1509] delete "pvc-600bcff4-c26f-43c4-bebb-6b989110c715": started
...
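As a stop-gap, the orphaned data can usually be removed by hand with a one-off pod that mounts the hostpath base directory and deletes the stale PV directory. A sketch, assuming the default /var/openebs/local base path; the node name (v2) and PV directory come from the logs above, and the pod name is illustrative:

```yaml
# Hypothetical one-off cleanup pod; it tolerates the disk-pressure taint so it
# can still be scheduled onto the affected node. Delete the pod afterwards.
apiVersion: v1
kind: Pod
metadata:
  name: manual-hostpath-cleanup
spec:
  nodeName: v2                      # node from the log line above
  restartPolicy: Never
  tolerations:
    - key: node.kubernetes.io/disk-pressure
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cleanup
      image: busybox
      command: ["rm", "-rf", "/data/pvc-600bcff4-c26f-43c4-bebb-6b989110c715"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      hostPath:
        path: /var/openebs/local
```

Once the pod completes, kubelet should clear diskPressure after disk usage drops, and the orphaned PV object can then be removed with `kubectl delete pv pvc-600bcff4-c26f-43c4-bebb-6b989110c715`.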
Anything else we need to know?:
NA

Environment details:
- OpenEBS version (use `kubectl get po -n openebs --show-labels`): see above
- Kubernetes version (use `kubectl version`): Server Version: v1.29.3
- Cloud provider or hardware configuration: Talos 1.7.0 on Proxmox nodes with "SSD emulation":

talosctl -n v1 disks
NODE       DEV        MODEL           SERIAL   TYPE   UUID   WWID   MODALIAS      NAME   SIZE    BUS_PATH                                                                    SUBSYSTEM          READ_ONLY   SYSTEM_DISK
10.2.0.8   /dev/sda   QEMU HARDDISK   -        SSD    -      -      scsi:t-0x00   -      22 GB   /pci0000:00/0000:00:05.0/0000:01:01.0/virtio1/host2/target2:0:0/2:0:0:0/   /sys/class/block               *


- OS (e.g: `cat /etc/os-release`): Talos 1.7.0
- kernel (e.g: `uname -a`): `Linux v1 6.6.28-talos #1 SMP Thu Apr 18 16:21:02 UTC 2024 x86_64 Linux`
- others:
tiagolobocastro commented 1 month ago

@niladrih do we need to add a toleration for DiskPressure to the cleanup Pod?
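For reference, kubelet taints nodes under disk pressure with the standard node.kubernetes.io/disk-pressure key (effect NoSchedule), so such a toleration on the cleanup helper Pod's spec would look roughly like this; a sketch, not the provisioner's current spec:

```yaml
tolerations:
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
```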

D1StrX commented 1 month ago

I'm getting the same error on openebs/provisioner-localpv:4.1.1:

E1027 15:04:33.648888       1 controller.go:1007] error syncing volume "pvc-43c22848-1f2c-4471-9201-77ff5179c25c": failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: clean up volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c failed: create process timeout after 120 seconds
I1027 15:04:33.648953       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-43c22848-1f2c-4471-9201-77ff5179c25c", UID:"a1fdb419-d2c6-4a05-90cd-7c437d439bab", APIVersion:"v1", ResourceVersion:"159262224", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: failed to delete volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c: clean up volume pvc-43c22848-1f2c-4471-9201-77ff5179c25c failed: create process timeout after 120 seconds
2024-10-27T15:04:33.653Z        ERROR   app/provisioner.go:174          {"eventcode": "local.pv.delete.failure", "msg": "Failed to delete Local PV", "rname": "pvc-0c25df70-d565-4172-ae84-c79432cac3f5", "reason": "failed to delete host path", "storagetype": "local-hostpath"}
github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app.(*Provisioner).Delete
        /go/src/github.com/openebs/dynamic-localpv-provisioner/cmd/provisioner-localpv/app/provisioner.go:174
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).deleteVolumeOperation
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1511
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolume
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1115
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).syncVolumeHandler
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1045
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem.func1
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:987
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).processNextVolumeWorkItem
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:1004
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).runVolumeWorker
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:905
sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller.(*ProvisionController).Run.func1.3
        /go/pkg/mod/sigs.k8s.io/sig-storage-lib-external-provisioner/v9@v9.0.3/controller/controller.go:857
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:157
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:158
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:135
k8s.io/apimachinery/pkg/util/wait.Until
        /go/pkg/mod/k8s.io/apimachinery@v0.25.16/pkg/util/wait/wait.go:92
E1027 15:04:33.653151       1 controller.go:1519] delete "pvc-0c25df70-d565-4172-ae84-c79432cac3f5": volume deletion failed: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: clean up volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5 failed: create process timeout after 120 seconds
I1027 15:04:33.653273       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-0c25df70-d565-4172-ae84-c79432cac3f5", UID:"1a09afdb-1288-4428-ac7f-c00dd6f0800d", APIVersion:"v1", ResourceVersion:"161476499", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: clean up volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5 failed: create process timeout after 120 seconds
W1027 15:04:33.653187       1 controller.go:992] Retrying syncing volume "pvc-0c25df70-d565-4172-ae84-c79432cac3f5" because failures 0 < threshold 15
E1027 15:04:33.653752       1 controller.go:1007] error syncing volume "pvc-0c25df70-d565-4172-ae84-c79432cac3f5": failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: failed to delete volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5: clean up volume pvc-0c25df70-d565-4172-ae84-c79432cac3f5 failed: create process timeout after 120 seconds