Closed: pchang388 closed this issue 1 year ago.
Should Longhorn continue to support the deprecated annotation-style storageClass specifier? Or should Longhorn put a disclaimer on 1.5.0 saying that this is a known issue and users should fix their deprecated annotations? In my view this is a regression, since I did not see it mentioned (maybe I missed it though).
https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/#using-dynamic-provisioning
Yes, as you said, this is a regression (codes). I think we can fall back to the deprecated volume.beta.kubernetes.io/storage-class annotation if spec.storageClassName is not set. WDYT? @innobead @ChanYiLin
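For illustration, a minimal sketch of what such a fallback could look like on the manager side (the helper name and surrounding code are hypothetical, not the actual longhorn-manager implementation):

```go
package util

import (
	corev1 "k8s.io/api/core/v1"
)

// GetPVCStorageClassName returns the storage class of a PVC, preferring
// spec.storageClassName and falling back to the deprecated beta annotation.
// Hypothetical helper, sketched for illustration only.
func GetPVCStorageClassName(pvc *corev1.PersistentVolumeClaim) string {
	if pvc.Spec.StorageClassName != nil && *pvc.Spec.StorageClassName != "" {
		return *pvc.Spec.StorageClassName
	}
	// volume.beta.kubernetes.io/storage-class is deprecated since Kubernetes 1.6,
	// but the PV controller still honors it, so callers should too.
	return pvc.Annotations["volume.beta.kubernetes.io/storage-class"]
}
```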
This one is a bit tough to answer, but maybe some users have already gone through the experience. Since rollbacks are not supported, and I did a rollback to clean up orphan resources during the upgrade, are there any concerns about stability or future upgrades due to the rollback? Specifically, in this case, 1.5.0 to 1.4.0 and upgrading back to 1.5.0 afterwards.
The upgrade path mainly updates the resources' spec/status fields and the pods (codes). If you're worried about it, you can send us a support bundle and I can help check the resources' values.
@pchang388 BTW, should it be **Longhorn Manager** Pods CrashLoop After Upgrade From 1.4.0?
@ChanYiLin please help with this. Need to get this to 1.5.1.
> @pchang388 BTW, should it be **Longhorn Manager** Pods CrashLoop After Upgrade From 1.4.0?
Yes, sorry for the confusion - I fixed the details wording to reflect the correction as well. Thanks!
No worries. Thanks for raising the issue.
[x] Where are the reproduce steps/test steps documented? The reproduce steps/test steps are at: create a PVC that sets the storage class via the deprecated annotation instead of spec.storageClassName:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  annotations:
    volume.beta.kubernetes.io/storage-class: longhorn
spec:
  accessModes:
    - ReadWriteOnce
  # storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```
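For contrast, the supported form of the same claim sets the class in spec instead (an equivalent manifest, shown for illustration):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```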
```shell
k describe lhbv -n longhorn-system
```
[x] Is there a workaround for the issue? If so, where is it documented? The workaround is at: manually edit the PVC to add storageClassName to pvc.spec.
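A minimal sketch of that edit, assuming the API server accepts adding spec.storageClassName to the existing claim (PVC spec fields are otherwise immutable, so recreating the PVC may be needed if the patch is rejected):

```shell
# Placeholder PVC name/namespace; point these at the claim that only has the
# deprecated beta annotation.
kubectl -n <namespace> patch pvc <pvc-name> \
  --type=merge -p '{"spec":{"storageClassName":"longhorn"}}'
```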
[x] Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc.) (including backport-needed/*)? The PR is at
Hey @derekbit, I noticed an issue following the upgrade. As mentioned, it might be related to the rollback I did to fix the orphan issue.
I am no longer able to create new volumes. Existing volumes attach/run normally after the upgrade, but when I tried to do a ct install for an unrelated project, the test PVC was stuck in the Pending phase; the volume was eventually created after ~3-5 minutes, but it never attached, and when the ct install timed out, the volume was not cleaned up even after the pod and PVC were deleted. Just noting: during the time the original issue was taking place, and while I encountered this new issue, no backup job was running, or at least shouldn't have been according to the recurring job cron.
Support bundle attached, but I'm seeing a lot of errors in general and would definitely appreciate it if you could take a look. Currently, I am unsure where to start/look.
supportbundle_cc4fef94-13ea-4569-af93-87480124d212_2023-07-10T05-41-18Z.zip
Please let me know if you prefer to move this conversation to a new issue and I will delete this comment and move the below info there.
Some relevant logs/events:
events
```
### kubectl events for test pvc ###
5m37s Normal SuccessfulCreate replicaset/test-nginx-pvc-dfb4986fb Created pod: test-nginx-pvc-dfb4986fb-4xx8g
5m36s Warning FailedScheduling pod/test-nginx-pvc-dfb4986fb-4xx8g 0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod..
4m17s Normal ExternalProvisioning persistentvolumeclaim/test-nginx-pvc waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator
4m6s Warning ProvisioningFailed persistentvolumeclaim/test-nginx-pvc failed to provision volume with StorageClass "longhorn": rpc error: code = DeadlineExceeded desc = failed to wait for volume creation to complete
4m5s Normal Provisioning persistentvolumeclaim/test-nginx-pvc External provisioner is provisioning volume for claim "default/test-nginx-pvc"
4m5s Normal ProvisioningSucceeded persistentvolumeclaim/test-nginx-pvc Successfully provisioned volume pvc-11a77edd-6744-4b15-8655-061f5bcef15b
4m5s Warning FailedScheduling pod/test-nginx-pvc-dfb4986fb-4xx8g 0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod..
4m2s Normal Scheduled pod/test-nginx-pvc-dfb4986fb-4xx8g Successfully assigned default/test-nginx-pvc-dfb4986fb-4xx8g to k3s-worker-0
5m46s Warning FailedScheduling pod/test-nginx-pvc-dfb4986fb-csrdm 0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod..
2m Warning FailedMount pod/test-nginx-pvc-dfb4986fb-4xx8g Unable to attach or mount volumes: unmounted volumes=[vol1], unattached volumes=[vol1], failed to process volumes=[]: timed out waiting for the condition
110s Warning FailedAttachVolume pod/test-nginx-pvc-dfb4986fb-4xx8g AttachVolume.Attach failed for volume "pvc-11a77edd-6744-4b15-8655-061f5bcef15b" : rpc error: code = Aborted desc = volume pvc-11a77edd-6744-4b15-8655-061f5bcef15b is not ready for workloads
```
volumes
```shell
$ k get volumes.longhorn.io -n longhorn-system
...
...
pvc-c474295d-581c-4e99-8c7b-a848ca010e28 attached healthy 53687091200 k3s-worker-3 152d
pvc-ada7f3d9-78d5-40cf-8962-b00e22072007 attached healthy 10737418240 k3s-worker-0 84d
pvc-11a77edd-6744-4b15-8655-061f5bcef15b 5368709120 21m
test 2147483648 8m51s
```
provisioner logs
```shell
$ k logs -f -n longhorn-system csi-provisioner-65cb5cc4ff-7jnqk | grep -i error
E0710 03:27:38.089578 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:39.090764 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:41.091643 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:45.092498 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:53.092627 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:28:09.094148 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:28:41.093804 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:29:45.094373 1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
I0710 03:33:10.438872 1 controller.go:1082] Temporary error received, adding PVC 8891335d-2c57-4dd8-8738-e3de09dc3a89 to claims in progress
E0710 03:33:10.438912 1 controller.go:957] error syncing claim "8891335d-2c57-4dd8-8738-e3de09dc3a89": failed to provision volume with StorageClass "longhorn": rpc error: code = DeadlineExceeded desc = failed to wait for volume creation to complete
I0710 03:33:10.438961 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"portainer-jhzkjen7ro", Name:"portainer-jhzkjen7ro", UID:"8891335d-2c57-4dd8-8738-e3de09dc3a89", APIVersion:"v1", ResourceVersion:"65540989", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "longhorn": rpc error: code = DeadlineExceeded desc = failed to wait for volume creation to complete
```
manager logs
```shell
$ k logs -f -n longhorn-system longhorn-manager-qwqht | grep -i error
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/v2-data-engine" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/v2-data-engine" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/v2-data-engine" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Dropping Longhorn setting longhorn-system/storage-network out of the queue" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Dropping Longhorn setting longhorn-system/v2-data-engine out of the queue" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:52:29Z" level=warning msg="HTTP handling error" error="websocket: close sent"
time="2023-07-10T05:52:29Z" level=error msg="Error in request: websocket: close sent"
time="2023-07-10T05:52:29Z" level=error msg="Failed to write err: websocket: close sent"
```
I see the related logs:

```
2023-07-10T05:34:14.679046451Z time="2023-07-10T05:34:14Z" level=error msg="Failed to sync Longhorn volume longhorn-system/pvc-11a77edd-6744-4b15-8655-061f5bcef15b" controller=longhorn-volume error="failed to sync longhorn-system/pvc-11a77edd-6744-4b15-8655-061f5bcef15b: create not allowed while custom resource definition is terminating" node=k3s-worker-0
```
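That error comes from the API server refusing to create instances of a CRD whose deletionTimestamp is set. One way to spot any CRD stuck this way (an illustrative command, not from the original thread; assumes jq is installed):

```shell
# List CRDs that are stuck terminating, i.e. have a deletionTimestamp.
kubectl get crd -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'
```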
and the stuck CustomResourceDefinition:
```yaml
- apiVersion: apiextensions.k8s.io/v1
  kind: CustomResourceDefinition
  metadata:
    annotations:
      controller-gen.kubebuilder.io/version: v0.7.0
      meta.helm.sh/release-name: longhorn
      meta.helm.sh/release-namespace: longhorn-system
    creationTimestamp: "2023-07-08T09:37:04Z"
    deletionTimestamp: "2023-07-08T09:56:08Z"
    finalizers:
    - customresourcecleanup.apiextensions.k8s.io
    generation: 1
    labels:
      app.kubernetes.io/instance: longhorn
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: longhorn
      app.kubernetes.io/version: v1.5.0
      helm.sh/chart: longhorn-1.5.0
      longhorn-manager: "null"
    ...
    name: volumeattachments.longhorn.io
    resourceVersion: "65594052"
    uid: 58280162-bb41-46d8-916d-c1d107a7abe6
  spec:
    ...
  status:
    acceptedNames:
      kind: VolumeAttachment
      listKind: VolumeAttachmentList
      plural: volumeattachments
      shortNames:
      - lhva
      singular: volumeattachment
    conditions:
    - lastTransitionTime: "2023-07-08T09:37:04Z"
      message: no conflicts found
      reason: NoConflicts
      status: "True"
      type: NamesAccepted
    - lastTransitionTime: "2023-07-08T09:37:04Z"
      message: the initial names have been accepted
      reason: InitialNamesAccepted
      status: "True"
      type: Established
    - lastTransitionTime: "2023-07-08T09:56:08Z"
      message: CustomResource deletion is in progress
      reason: InstanceDeletionInProgress
      status: "True"
      type: Terminating
    storedVersions:
    - v1beta2
kind: List
metadata:
  resourceVersion: "65594102"
```
It looks like we hit the upstream issue https://github.com/kubernetes/kubernetes/issues/60538. If you want to try the workaround mentioned in that thread, I would recommend backing up volumes before applying it. I will see if I can reproduce it on my side.
@pchang388 I can reproduce the issue on my side. It's caused by the unexpected rollback.
Reproduce steps
...immutable fields...
The volumeattachments.longhorn.io CRD gets stuck with CustomResource deletion is in progress. Then, you cannot create any volume.
Solution steps
```shell
kubectl -n longhorn-system patch crd volumeattachments.longhorn.io -p '{"metadata":{"finalizers":[]}}' --type=merge
kubectl -n longhorn-system get CustomResourceDefinition -o yaml
```
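Note that once the finalizer is removed, the pending deletion completes and the CRD disappears, so it has to be re-created afterwards. One way, as @pchang388 confirms below, is re-running the helm upgrade for the installed chart version (a sketch assuming a Helm release named longhorn):

```shell
helm upgrade longhorn longhorn/longhorn --version 1.5.0 -n longhorn-system
```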
@pchang388 Please let me know if it works. Thank you.
A side question, @mantissahz: downgrade prevention only works when the source version is 1.5.0 or later, right?
@innobead Yes, it is for the official longhorn-manager image.
Hey @derekbit, thank you very much for the suggestion. I was actually trying it and saw that you also came to the same conclusion.
First, run the patch. This is not recommended, since it can leave behind orphaned resources in the k8s datastore, but it was done to get back to normal operations for development purposes.
The patch addresses the deadlock with finalizers on CRDs, as you mentioned:
```shell
kubectl patch crd/volumeattachments.longhorn.io -p '{"metadata":{"finalizers":[]}}' --type=merge
```
I then noticed the volumeattachments.longhorn.io CRD was missing and reinstalled it by writing out a helm template and taking the volumeattachments section, but your method is much better.
```shell
## spot the error - notice it's gone
k get volumeattachments.longhorn.io -n longhorn-system
error: the server doesn't have a resource type "volumeattachments"

## template output
helm template longhorn longhorn/longhorn -f helm/custom-values.yaml --version 1.5.0 -n longhorn-system > out.yaml

## create again
k apply -f volumeattachment.yaml
```
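For reference, a sketch of extracting just that CRD from the template output (assumes yq v4 is installed; the file names match the commands above):

```shell
yq 'select(.kind == "CustomResourceDefinition" and .metadata.name == "volumeattachments.longhorn.io")' \
  out.yaml > volumeattachment.yaml
```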
I tested it again by applying the patch and doing a helm upgrade instead; that worked as well:
```shell
$ make upgrade
helm upgrade longhorn longhorn/longhorn -f helm/custom-values.yaml --version 1.5.0 -n longhorn-system
Release "longhorn" has been upgraded. Happy Helming!
NAME: longhorn
LAST DEPLOYED: Mon Jul 10 03:30:09 2023
NAMESPACE: longhorn-system
STATUS: deployed
REVISION: 18
TEST SUITE: None
NOTES:
Longhorn is now installed on the cluster!
Please wait a few minutes for other Longhorn components such as CSI deployments, Engine Images, and Instance Managers to be initialized.
Visit our documentation at https://longhorn.io/docs/
```
But for 1.5.0, just remember to delete the old deployments again after the helm upgrade:

```shell
kubectl delete deployments.apps longhorn-admission-webhook longhorn-conversion-webhook longhorn-recovery-backend -n longhorn-system
```
So far it looks like things are working again. Thank you again for your help and the quick responses. I hope for no more issues due to the downgrade, and I definitely won't downgrade again unless there's no other way.
Verified on master-head 20230712
The test steps:
1. Prepare a v1.4.2 cluster with some orphan resources.
2. Create a Pod with a PVC that has the storageClassName specified in the annotation instead of the spec:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-volv-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: longhorn
spec:
  accessModes:
    - ReadWriteOnce
  # storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: volume-test
  namespace: default
spec:
  restartPolicy: Always
  containers:
  - name: volume-test
    image: nginx
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - ls
        - /data/lost+found
      initialDelaySeconds: 5
      periodSeconds: 5
    volumeMounts:
    - name: volv
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: volv
    persistentVolumeClaim:
      claimName: longhorn-volv-pvc
```
3. Run kubectl describe lhbv -n longhorn-system.
4. Upgrade to master-head.

Result: Passed
Describe the bug (🐛 if you encounter this issue)
I first tried to upgrade to 1.5.0 from 1.4.0 and encountered the new pods crashing due to the orphan resource issue already documented here: #6246. I tried to delete the orphans directly after getting their IDs: k get orphans.longhorn.io -n longhorn-system. That was not working at the time, and even though rollback is not supported, I followed the advice of one of the comments here: https://github.com/longhorn/longhorn/issues/6246#issuecomment-1625207593.

I was able to roll back to 1.4.0, the pods came back up and running, and I then went to the UI and manually deleted all orphans before reapplying the upgrade. Applied the known workaround after the 1.5.0 upgrade.

Then I had 2 out of 5 longhorn managers (-l app=longhorn-manager) failing due to an error in backup_controller.go (https://github.com/longhorn/longhorn-manager/blob/v1.5.0/controller/backup_controller.go). Log output:

I then dug around in the longhorn-manager repo and found this section in the referenced 1.5.0 tagged branch:

From that repo and the diff against the 1.4.0 version, I could see that this appears to be new logic/code. I checked the PVCs across all namespaces for spec.storageClassName and noticed that one of the PVCs did not have a storage class at all; instead it had an annotation (I'm not familiar with volume annotations yet): volume.beta.kubernetes.io/storage-class: longhorn. Full output of the PVC below:

As you can see, spec.storageClassName is missing, since the upstream helm chart for this PVC uses the annotation method instead: https://github.com/portainer/k8s/blob/master/charts/portainer/templates/pvc.yaml

According to the K8s docs (excerpt below), the upstream chart (portainer) should adjust its template to use storageClassName instead of the old annotation method, which is deprecated but still working.

I was able to resolve this issue for now by manually editing the PVC and adding a storage class in the spec section; this fixes the NPE and the managers came back up fine. These same PVCs/pods worked fine in 1.4.0. I am going to open a PR for portainer to use the storageClassName spec field instead of the annotation, since it's deprecated anyway, but this does appear to be a regression.

Questions:

- Should Longhorn continue to support the deprecated annotation-style storageClass specifier? Or should Longhorn put a disclaimer on 1.5.0 saying that this is a known issue and users should fix their deprecated annotations? In my view this is a regression, since I did not see it mentioned (maybe I missed it though).
- Since rollbacks are not supported and I did a rollback to clean up orphan resources during the upgrade, are there any concerns with stability or future upgrades due to the rollback? Specifically, in this case, 1.5.0 to 1.4.0 and upgrading back to 1.5.0 afterwards.
To Reproduce
Steps to reproduce the behavior:

1. Upgrade from 1.4.0 to 1.5.0 (unsure if the rollback caused this issue, but you might have to do that as well).
2. Have a PVC that uses the deprecated storage-class annotation and has no spec.storageClassName.
3. Watch the longhorn-manager pods crash in the new backup controller logic from the 1.5.0 tagged branch.

Expected behavior
Since the annotation-style volume storage class specifier is deprecated but still works within K8s, Longhorn should still support it or put a disclaimer/notice in the 1.5.0 upgrade notes.

Log or Support bundle
If applicable, add the Longhorn managers' log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment

- Longhorn version: 1.5.0
- Installation method: helm
- Kubernetes distro and version: v1.27.3+k3s1
  - Number of management nodes in the cluster: 3
  - Number of worker nodes in the cluster: 5
- Node config
  - OS type and version: Ubuntu 22.04 LTS Jammy
  - CPU per node: 4
  - Memory per node: 12
  - Disk type: ZFS on mirrored vdevs (ssd), presented as ext4 to guest VM
  - Network bandwidth between the nodes: 1Gbps
- Underlying infrastructure: Proxmox
- Number of Longhorn volumes in the cluster: 16
Additional context
Add any other context about the problem here.
Workaround
Manually edit the PVC and add a storageClassName in the spec section; this fixes the NPE and the managers come back up fine.