hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

Issues with k8s 1.29 #53

Closed: msilcher closed this issue 7 months ago

msilcher commented 8 months ago

I was testing k8s 1.29 today and I'm facing provisioning issues. Attaching a few logs. It seems that the truenas-csp pod is restarting its workers. Any feedback is appreciated.

Thanks!

csi-provisioner.txt truenas-csp.txt hpe-csi-driver.txt
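(For anyone reproducing this: logs like the ones attached can be captured with standard kubectl commands. The namespace, deployment, and container names below are assumptions based on a default chart install, not anything confirmed in this thread; adjust to your deployment.)

```shell
# Assumed names from a default chart install; adjust as needed.
kubectl -n hpe-storage logs deploy/truenas-csp > truenas-csp.txt
kubectl -n hpe-storage logs deploy/hpe-csi-controller -c csi-provisioner > csi-provisioner.txt
kubectl -n hpe-storage logs deploy/hpe-csi-controller -c hpe-csi-driver > hpe-csi-driver.txt
```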

msilcher commented 8 months ago

Using a previous snapshot of the k8s VM (when k8s was at version 1.28.4 and was working) I now get this error when trying to mount existing volumes:

```
MountVolume.MountDevice failed for volume "pvc-c0521570-1e18-4e40-b9a6-c5eaa4a30cf6" : rpc error: code = Internal desc = Failed to stage volume Data_K8s_pvc-c0521570-1e18-4e40-b9a6-c5eaa4a30cf6, err: rpc error: code = Internal desc = Error creating device for volume Data_K8s_pvc-c0521570-1e18-4e40-b9a6-c5eaa4a30cf6, err: device not found with serial 6589cfc000000a4ad9bbee637791cafb or target
```

Is it possible that something changed in TrueNAS after the k8s upgrade to 1.29? I'm using the same existing PVCs/PVs before and after the upgrade, and now I get "device not found with serial XXX". The volumes and targets are still there in TrueNAS; I don't know what could have happened.

Note: comparing the expected volume serial number in K8s and the Extent NAA in TrueNAS, both seem to have the same format but they don't match! This could be the problem, but I don't know why they stopped matching...
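(One way to compare the two sides, sketched below: extent NAAs from the TrueNAS v2.0 REST API versus the SCSI device identifiers visible on the worker. The API host and key are placeholders, not anything from this thread.)

```shell
# List iSCSI extent names and NAA identifiers from TrueNAS
# (replace truenas.local and $APIKEY with your host and API key).
curl -s -H "Authorization: Bearer $APIKEY" \
  https://truenas.local/api/v2.0/iscsi/extent | \
  python3 -c 'import json,sys; [print(e["name"], e["naa"]) for e in json.load(sys.stdin)]'

# On the worker node: the NAA (minus its 0x prefix) should show up in the
# by-id links, typically with a leading "3" (the NAA designator type).
ls -l /dev/disk/by-id/ | grep scsi-
lsblk -o NAME,SIZE,SERIAL,WWN
```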

datamattsson commented 8 months ago

Looking at the truenas-csp.txt log I find this very peculiar. It's like the container is not even starting? Can you turn on debug logging?
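(For reference, a minimal sketch of turning on CSP debug logging, assuming the container honors a LOG_DEBUG environment variable and is deployed as truenas-csp in the hpe-storage namespace; check the chart/docs for the actual knob in your install.)

```shell
# Assumed names and env var; adjust to your install.
kubectl -n hpe-storage set env deploy/truenas-csp LOG_DEBUG=1
kubectl -n hpe-storage rollout status deploy/truenas-csp
kubectl -n hpe-storage logs -f deploy/truenas-csp
```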

msilcher commented 8 months ago

I detected an issue with the CNI I'm using (Calico); it seems not to work well with k8s 1.29. Let me check that and then I'll test the CSP again. Thanks, and sorry for bothering you.

Any hint about the issue with mismatching serial numbers? The volumes were provisioned and working before. It seems the rollback of the VM snapshot messed things up.
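(For anyone hitting the same thing: a quick CNI health check, assuming a typical Calico install; the namespace differs between manifest and operator installs.)

```shell
# Manifest installs usually land in kube-system, operator installs in calico-system.
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
kubectl get pods -n calico-system -o wide
kubectl get nodes        # NotReady nodes often point at a CNI problem
```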

datamattsson commented 8 months ago

Which VM did you roll back? The Kubernetes worker or TrueNAS?

msilcher commented 8 months ago

The kubernetes worker. I always do snapshots before upgrading k8s components because sometimes things go wrong :)

Note: I did an upgrade from 1.28.4 to 1.28.5 on the restored worker node and PVs/PVCs started working again... strange. Maybe draining the node replaces pods and that helps? Or maybe it's related to some other part of the upgrade process...
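(A drain/uncordon cycle on its own, without upgrading, would isolate the "replaced pods" theory; the node name below is a placeholder.)

```shell
# DaemonSet pods such as hpe-csi-node are left in place by --ignore-daemonsets;
# everything else is evicted and rescheduled.
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
kubectl uncordon worker-1
```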

Where are the serial numbers and other related info from the mapped LUNs stored on K8s? I suspect this info gets pulled from TrueNAS via the CSP each time the worker starts, so I cannot imagine where the mismatch comes from.

datamattsson commented 8 months ago

> Where are the serial numbers and other related info from the mapped LUNs stored on K8s? I suspect this info gets pulled from TrueNAS via the CSP each time the worker starts, so I cannot imagine where the mismatch comes from.

The serial number is the serial number of the iSCSI extent mapped to the ZVOL, if I recall correctly. The CSP returns it to the CSI driver, which in turn starts the discovery of the iSCSI target and device maps.
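(To make that concrete on the node side, a sketch using the serial from the error above; it assumes open-iscsi and multipath-tools on the worker. Once the driver logs in to the target, the extent serial should be visible in the device identifiers.)

```shell
# Active iSCSI sessions the CSI node driver has logged in to
iscsiadm -m session

# The extent serial should appear in the by-id links (NAA identifiers
# carry a leading "3" on scsi-/wwn- links) and in the multipath map.
ls -l /dev/disk/by-id/ | grep -i 6589cfc000000a4ad9bbee637791cafb
multipath -ll | grep -i 6589cfc000000a4ad9bbee637791cafb
```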

msilcher commented 7 months ago

Just as an update: the TrueNAS CSP & HPE CSI driver work fine with k8s 1.29.0. I had an issue with Calico (CNI) and I'm working with their team to get it sorted out for 1.29. This issue can be closed if you like; sorry for any inconvenience.