hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

Issues with k8s 1.28 #43

Closed msilcher closed 4 months ago

msilcher commented 1 year ago

Hi!

Maybe someone can help me. I've updated my k8s homelab to version 1.28.1 (from 1.27.4) and I'm getting the following error on pods that rely on the HPE CSI driver:

MountVolume.MountDevice failed for volume "pvc-b6b9bb0e-a422-49d8-a79a-3034d1a719c1" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name csi.hpe.com not found in the list of registered CSI drivers

Looking at the CSI node pod I see the following:

I0829 13:23:31.219372 1 main.go:166] Version: v2.6.1
I0829 13:23:31.219406 1 main.go:167] Running node-driver-registrar in mode=registration
I0829 13:23:31.219794 1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0829 13:23:31.219814 1 connection.go:154] Connecting to unix:///csi/csi.sock
I0829 13:23:33.325192 1 main.go:198] Calling CSI driver to discover driver name
I0829 13:23:33.325203 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0829 13:23:33.325205 1 connection.go:184] GRPC request: {}
I0829 13:23:33.328070 1 connection.go:186] GRPC response: {"name":"csi.hpe.com","vendor_version":"1.3"}
I0829 13:23:33.328107 1 connection.go:187] GRPC error:
I0829 13:23:33.328112 1 main.go:208] CSI driver name: "csi.hpe.com"
I0829 13:23:33.328128 1 node_register.go:53] Starting Registration Server at: /registration/csi.hpe.com-reg.sock
I0829 13:23:33.328254 1 node_register.go:62] Registration Server started at: /registration/csi.hpe.com-reg.sock
I0829 13:23:33.328303 1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0829 13:23:33.674712 1 main.go:102] Received GetInfo call: &InfoRequest{}
E0829 13:23:33.674741 1 main.go:107] "Failed to create registration probe file" err="open /var/lib/kubelet/plugins/csi.hpe.com/registration: no such file or directory" registrationProbePath="/var/lib/kubelet/plugins/csi.hpe.com/registration"
I0829 13:23:33.674748 1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.hpe.com/registration"
I0829 13:23:33.711252 1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

CSI controller logs:

I0829 13:21:48.488722 1 feature_gate.go:245] feature gates: &{map[]}
I0829 13:21:48.488765 1 csi-provisioner.go:150] Version: v3.3.0
I0829 13:21:48.488770 1 csi-provisioner.go:173] Building kube configs for running in cluster...
I0829 13:21:48.489029 1 connection.go:154] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
I0829 13:21:49.489960 1 common.go:111] Probing CSI driver for readiness
I0829 13:21:49.489971 1 connection.go:183] GRPC call: /csi.v1.Identity/Probe
I0829 13:21:49.489974 1 connection.go:184] GRPC request: {}
I0829 13:21:49.492126 1 connection.go:186] GRPC response: {}
I0829 13:21:49.492160 1 connection.go:187] GRPC error:
I0829 13:21:49.492166 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0829 13:21:49.492167 1 connection.go:184] GRPC request: {}
I0829 13:21:49.492520 1 connection.go:186] GRPC response: {"name":"csi.hpe.com","vendor_version":"1.3"}
I0829 13:21:49.492542 1 connection.go:187] GRPC error:
I0829 13:21:49.492553 1 csi-provisioner.go:217] Detected CSI driver csi.hpe.com
I0829 13:21:49.492559 1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0829 13:21:49.492561 1 connection.go:184] GRPC request: {}
I0829 13:21:49.492965 1 connection.go:186] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0829 13:21:49.493028 1 connection.go:187] GRPC error:
I0829 13:21:49.493034 1 connection.go:183] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0829 13:21:49.493036 1 connection.go:184] GRPC request: {}
I0829 13:21:49.493460 1 connection.go:186] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":6}}},{"Type":{"Rpc":{"type":7}}},{"Type":{"Rpc":{"type":8}}},{"Type":{"Rpc":{"type":9}}}]}
I0829 13:21:49.493732 1 connection.go:187] GRPC error:
I0829 13:21:49.493776 1 csi-provisioner.go:286] CSI driver supports PUBLISH_UNPUBLISH_VOLUME, watching VolumeAttachments
I0829 13:21:49.494197 1 controller.go:732] Using saving PVs to API server in background
I0829 13:21:49.494354 1 reflector.go:221] Starting reflector v1.PersistentVolumeClaim (15m0s) from k8s.io/client-go/informers/factory.go:134
I0829 13:21:49.494364 1 reflector.go:257] Listing and watching v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I0829 13:21:49.494375 1 reflector.go:221] Starting reflector v1.VolumeAttachment (1h0m0s) from k8s.io/client-go/informers/factory.go:134
I0829 13:21:49.494384 1 reflector.go:257] Listing and watching v1.VolumeAttachment from k8s.io/client-go/informers/factory.go:134
I0829 13:21:49.494458 1 reflector.go:221] Starting reflector v1.StorageClass (1h0m0s) from k8s.io/client-go/informers/factory.go:134
I0829 13:21:49.494464 1 reflector.go:257] Listing and watching v1.StorageClass from k8s.io/client-go/informers/factory.go:134
I0829 13:21:49.594383 1 shared_informer.go:285] caches populated
I0829 13:21:49.594405 1 shared_informer.go:285] caches populated
I0829 13:21:49.594410 1 shared_informer.go:285] caches populated
I0829 13:21:49.594417 1 controller.go:811] Starting provisioner controller csi.hpe.com_debian-k8s_1b5149f8-ad9c-40d7-960e-4f4fa491b433!
I0829 13:21:49.594441 1 volume_store.go:97] Starting save volume queue
I0829 13:21:49.594467 1 clone_controller.go:66] Starting CloningProtection controller
I0829 13:21:49.594502 1 clone_controller.go:82] Started CloningProtection controller
I0829 13:21:49.594538 1 reflector.go:221] Starting reflector v1.PersistentVolume (15m0s) from sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:845
I0829 13:21:49.594549 1 reflector.go:257] Listing and watching v1.PersistentVolume from sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:845
I0829 13:21:49.594566 1 reflector.go:221] Starting reflector v1.StorageClass (15m0s) from sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:848
I0829 13:21:49.594575 1 reflector.go:257] Listing and watching v1.StorageClass from sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:848
I0829 13:21:49.694728 1 shared_informer.go:285] caches populated
I0829 13:21:49.694801 1 controller.go:860] Started provisioner controller csi.hpe.com_debian-k8s_1b5149f8-ad9c-40d7-960e-4f4fa491b433!
I0829 13:28:56.596768 1 reflector.go:559] sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:848: Watch close - v1.StorageClass total 8 items received
I0829 13:29:00.500862 1 reflector.go:559] k8s.io/client-go/informers/factory.go:134: Watch close - v1.VolumeAttachment total 8 items received

It was working fine with 1.26 & 1.27.

Thank you!

datamattsson commented 1 year ago

Looking at the CSI node pod I see the following:

Can you check the CSI driver pod? This is the log for the registrar, which looks OK.

I've tested v1.28 so it shouldn't be a problem. What does kubectl get csidrivers look like?
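
For reference, a quick way to double-check what the kubelet thinks is registered (the node name below is a placeholder):

# List the cluster-wide CSI driver objects; csi.hpe.com should be present
kubectl get csidrivers

# Check per-node registration; the drivers list on the CSINode object
# should also contain csi.hpe.com ("your-node" is a placeholder)
kubectl get csinode your-node -o jsonpath='{.spec.drivers[*].name}'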

msilcher commented 1 year ago

"kubectl get csidrivers" looks good to me:

image

I'm attaching the hpe-csi-driver log which is quite big. hpe-csi-driver.log

Thank you!

datamattsson commented 1 year ago

This is peculiar. I haven't seen this before and the logs don't reveal any clues. Can you restart the controller and node driver with kubectl rollout restart -n hpe-storage ds/hpe-csi-node deploy/hpe-csi-controller and see if that helps? If you create a new PVC, does the same issue persist?
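
As a minimal sketch of that PVC test (the storageClassName below is a placeholder; substitute whatever StorageClass your TrueNAS CSP backend uses):

# Create a throwaway PVC against the existing StorageClass
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: truenas-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: hpe-storageclass   # placeholder, use your actual class name
EOF

# Watch it bind, then clean up
kubectl get pvc truenas-test-pvc -w
kubectl delete pvc truenas-test-pvc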

msilcher commented 1 year ago

Yes, really strange. I might have messed up something without knowing it. Restarting the controller did not help but creating new PVCs seems to work.

Thanks again for your support!

msilcher commented 1 year ago

I've been playing around a little more and now I'm getting the following from the CSI driver:

time="2023-08-29T23:29:11Z" level=error msg="GRPC error: rpc error: code = Internal desc = Failed to stage volume Data_K8s_pvc-375cf244-c723-40a0-bc6d-59a9631a8564, err: rpc error: code = Internal desc = Error creating device for volume Data_K8s_pvc-375cf244-c723-40a0-bc6d-59a9631a8564, err: device not found with serial 6589cfc000000b3f51a0d014e297a74f or target " file="utils.go:73"

I deleted and reinstalled the TrueNAS CSP from scratch and deleted all PVCs. After that I provisioned a single PVC as a test on k8s version 1.27.4. All looks good so far: deleting the pod gets it recreated and claiming the existing PVC.

It seems that the problem starts after upgrading to 1.28.1. Rolling back to a previous snapshot of the k8s VM (v1.27.5) also manifests the above issue. The PVC is the same before and after the upgrade, so there should be no mismatch there, at least from the TrueNAS/storage side.

datamattsson commented 1 year ago

The message err: device not found with serial 6589cfc000000b3f51a0d014e297a74f indicates that the worker node is unable to find the multipath device on the node. The CSI node log you attached earlier did not have these messages, so maybe that was from a different node in the cluster?
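
A rough way to check on the worker node whether a device for that serial actually exists (serial taken from the error above):

# Look for the multipath map backing the volume
sudo multipath -ll | grep -i 6589cfc000000b3f51a0d014e297a74f

# The same serial should show up under /dev/disk/by-id
ls -l /dev/disk/by-id/ | grep -i 6589cfc000000b3f51a0d014e297a74f

# And the underlying iSCSI sessions should be logged in
sudo iscsiadm -m session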

msilcher commented 1 year ago

The strange thing is that I'm working on a single-node lab. Also, on version 1.28, pods that have PVCs attached get stuck in "Terminating" when I delete them (no finalizers are holding them back, I checked). I'll continue to investigate, as it is probably something related to my specific scenario.

datamattsson commented 1 year ago

It really sounds odd, given there are no finalizers. Keep this thread posted.

msilcher commented 1 year ago

New issue with small disks (100 MB):

MountVolume.MountDevice failed for volume "pvc-e5eac187-be0f-48a7-9a26-0bedef795b8a" : rpc error: code = Internal desc = Failed to stage volume Data_K8s_pvc-e5eac187-be0f-48a7-9a26-0bedef795b8a, err: failed to mount device /dev/mapper/mpatht, Error creating filesystem xfs on device with serialNumber 6589cfc00000020ddc79c3a1c9beeaa2, unable to create filesystem: xfs with args [/dev/mapper/mpatht]. Error: command mkfs.xfs failed with rc=1 err=specified blocksize 4096 is less than device physical sector size 8192 switching to logical sector size 512 Filesystem must be larger than 300MB. Usage: mkfs.xfs / blocksize / [-b size=num] / config file / [-c options=xxx] / metadata / [-m crc=0|1,finobt=0|1,uuid=xxx,rmapbt=0|1,reflink=0|1, inobtcount=0|1,bigtime=0|1] / data subvol / [-d agcount=n,agsize=n,file,name=xxx,size=num, (sunit=value,swidth=value|su=num,sw=num|noalign), sectsize=num / force overwrite / [-f] / inode size / [-i perblock=n|size=num,maxpct=n,attr=0|1|2, projid32bit=0|1,sparse=0|1,nrext64=0|1] / no discard / [-K] / log subvol / [-l agnum=n,internal,size=num,logdev=xxx,version=n sunit=value|su=num,sectsize=num,lazy-count=0|1] / label / [-L label (maximum 12 characters)] / naming / [-n size=num,version=2|ci,ftype=0|1] / no-op info only / [-N] / prototype file / [-p fname] / quiet / [-q] / realtime subvol / [-r extsize=num,size=num,rtdev=xxx] / sectorsize / [-s size=num] / version / [-V] devicename is required unless -d name=xxx is given. is xxx (bytes), xxxs (sectors), xxxb (fs blocks), xxxk (xxx KiB), xxxm (xxx MiB), xxxg (xxx GiB), xxxt (xxx TiB) or xxxp (xxx PiB). is xxx (512 byte blocks).

I used the same manifests some time ago without issues, maybe new versions of TrueNAS (or Debian) complain now. Not really critical though.

datamattsson commented 1 year ago

You can specify a different filesystem; ext3/ext4 have less overhead IIRC.
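
For example, a StorageClass that provisions ext4 instead of xfs could look roughly like this (the class name is a placeholder, and you would keep whatever backend/secret parameters your existing class already uses):

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hpe-ext4                        # placeholder name
provisioner: csi.hpe.com
parameters:
  csi.storage.k8s.io/fstype: ext4       # use ext4 instead of the default xfs
  # ...plus the secret/backend parameters from your existing StorageClass
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF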

msilcher commented 1 year ago

Regarding the original issue, it seems related to cri-o 1.28.0. I did a test upgrading k8s to 1.28.1 but staying with the previous cri-o version (1.27.1) and there are no issues. After upgrading cri-o from 1.27.1 to 1.28.0 the problems start. I've attached an extract of the kubelet log where I see "unmount" issues. Besides the unmount issues, which I believe are related to the pods stuck in Terminating, I also observe that pods mount volumes but do not see any data on them (as if they were starting from scratch). Maybe because the previous pod did not unmount the volume properly...?
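
In case it helps anyone, a rough way to confirm the runtime version and keep the package from upgrading in the meantime (package name assumes the standard Debian/Ubuntu cri-o repository):

# Check the running cri-o version on the node
crio version

# Hold the cri-o package at the currently installed (working) version
sudo apt-mark hold cri-o

# Release the hold once a fixed release is available
# sudo apt-mark unhold cri-o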

I'll keep digging.

kubelet.log

msilcher commented 1 year ago

Also seeing this in the csi-provisioner log:

E0902 16:27:40.008434 1 controller.go:1512] delete "pvc-c6271b5f-fbbd-4e5d-8a49-9b624fa2f1ae": volume deletion failed: persistentvolume pvc-c6271b5f-fbbd-4e5d-8a49-9b624fa2f1ae is still attached to node debian-k8s
E0902 16:27:40.008434 1 controller.go:1512] delete "pvc-75a3aca1-9f49-4c87-a1d0-292c7ee14199": volume deletion failed: persistentvolume pvc-75a3aca1-9f49-4c87-a1d0-292c7ee14199 is still attached to node debian-k8s
W0902 16:27:40.008459 1 controller.go:989] Retrying syncing volume "pvc-c6271b5f-fbbd-4e5d-8a49-9b624fa2f1ae", failure 8
E0902 16:27:40.008473 1 controller.go:1007] error syncing volume "pvc-c6271b5f-fbbd-4e5d-8a49-9b624fa2f1ae": persistentvolume pvc-c6271b5f-fbbd-4e5d-8a49-9b624fa2f1ae is still attached to node debian-k8s

datamattsson commented 11 months ago

Did you manage to resolve this? TrueNAS CSP v2.4.0 has a good track record on v1.28.

msilcher commented 11 months ago

No, I narrowed it down to cri-o 1.28.x but no luck so far. I ended up reporting the bug but have had no feedback yet (https://github.com/cri-o/cri-o/issues/7308). I tested the new HPE CSI driver 2.4.0 (together with TrueNAS 2.3.10) but the issue persists; downgrading cri-o to 1.27.1 and recreating the PVCs fixes it.
It all comes down to a "directory not empty" error, which prevents the provisioner from releasing the volume, so the pod stays forever in Terminating state:

kubernetes.io/csi: Unmounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount]: remove /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount: directory not empty

Removing the directory by hand gets things going...
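
For reference, the manual workaround sketched out (only after confirming nothing is still mounted there; the pod UID and PVC name are taken from the error above):

# On the node: make sure the path is no longer a live mount point
findmnt /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount

# See what was left behind in the directory
ls -la /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/

# Remove the stale directory so kubelet can finish tearing down the pod
sudo rm -rf /var/lib/kubelet/pods/9e70d42e-fead-4c03-900f-e0f0fe61e64e/volumes/kubernetes.io~csi/pvc-4c72b78c-5333-49bb-8061-28f545a48f89/mount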

datamattsson commented 11 months ago

Is there anything in the actual directory or is there a race condition?

msilcher commented 11 months ago

Is there anything in the actual directory or is there a race condition?

I recall there was a single file related to the "mount config" or something similar. I really need to check again but I'm on a business trip for the whole week. I could confirm in a few days though.

msilcher commented 10 months ago

Just as a follow-up: the issue is related to cri-o in its latest versions. They have proposed a fix that will come out in the next version (https://github.com/cri-o/cri-o/issues/7308 & https://github.com/cri-o/cri-o/pull/7428). I've tested it with a patched static build they provided and it works fine!

Thanks!!

datamattsson commented 4 months ago

Please re-open if it resurfaces.