longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Longhorn Manager Pods CrashLoop after upgrade from 1.4.0 to 1.5.0 while backing up volumes #6264

Closed: pchang388 closed this issue 1 year ago

pchang388 commented 1 year ago

Describe the bug (🐛 if you encounter this issue)

I first tried to upgrade from 1.4.0 to 1.5.0 and encountered the new pods crashing due to the orphan resource issue already documented in #6246. I tried to delete the orphans directly after getting their ids with k get orphans.longhorn.io -n longhorn-system. That was not working at the time, and even though rollback is not supported, I followed the advice of one of the comments here: https://github.com/longhorn/longhorn/issues/6246#issuecomment-1625207593.
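
For reference, the direct CLI cleanup I was attempting looked roughly like this (a sketch; in my case the delete hung at the time, which is why I ended up rolling back):

## list the orphan CRs and their ids
kubectl get orphans.longhorn.io -n longhorn-system

## delete a single orphan, or all of them at once
kubectl delete orphans.longhorn.io <orphan-name> -n longhorn-system
kubectl delete orphans.longhorn.io --all -n longhorn-system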

I was able to roll back to 1.4.0, the pods came back up and running, and I then went to the UI and manually deleted all orphans before applying the upgrade again. After the 1.5.0 upgrade I applied the known workaround:

kubectl delete deployments.apps longhorn-admission-webhook longhorn-conversion-webhook longhorn-recovery-backend -n longhorn-system

Then 2 out of 5 longhorn-manager pods (-l app=longhorn-manager) were failing due to an error in backup_controller.go (https://github.com/longhorn/longhorn-manager/blob/v1.5.0/controller/backup_controller.go). Log output:

time="2023-07-09T21:39:56Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
E0709 21:39:56.214037       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2290 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x241cb80?, 0x4125b90})
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc006ac45e0?})
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x241cb80, 0x4125b90})
        /usr/local/go/src/runtime/panic.go:884 +0x213
github.com/longhorn/longhorn-manager/controller.(*BackupController).checkMonitor(0xc008757f40, 0xc00a413900, 0xc0092a3b00, 0xc005cd9500)
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:639 +0x1e2
github.com/longhorn/longhorn-manager/controller.(*BackupController).reconcile(0xc008757f40, {0xc008f60a90, 0x17})
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:376 +0xa98
github.com/longhorn/longhorn-manager/controller.(*BackupController).syncHandler(0xc008757f40, {0xc008f60a80?, 0x0?})
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:179 +0x114
github.com/longhorn/longhorn-manager/controller.(*BackupController).processNextWorkItem(0xc008757f40)
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:152 +0xd2
github.com/longhorn/longhorn-manager/controller.(*BackupController).worker(...)
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:142
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x8b4226d8927af429?, {0x2bd26c0, 0xc0092b3a70}, 0x1, 0xc0008b2300)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x58?, 0xc0015bcfd0?)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0008b2300?, 0xc00667f080?, 0xc0071c6580?)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/longhorn/longhorn-manager/controller.(*BackupController).Run
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:136 +0x1da
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1f247e2]

goroutine 2290 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc006ac45e0?})
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd7
panic({0x241cb80, 0x4125b90})
        /usr/local/go/src/runtime/panic.go:884 +0x213
github.com/longhorn/longhorn-manager/controller.(*BackupController).checkMonitor(0xc008757f40, 0xc00a413900, 0xc0092a3b00, 0xc005cd9500)
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:639 +0x1e2
github.com/longhorn/longhorn-manager/controller.(*BackupController).reconcile(0xc008757f40, {0xc008f60a90, 0x17})
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:376 +0xa98
github.com/longhorn/longhorn-manager/controller.(*BackupController).syncHandler(0xc008757f40, {0xc008f60a80?, 0x0?})
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:179 +0x114
github.com/longhorn/longhorn-manager/controller.(*BackupController).processNextWorkItem(0xc008757f40)
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:152 +0xd2
github.com/longhorn/longhorn-manager/controller.(*BackupController).worker(...)
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:142
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x8b4226d8927af429?, {0x2bd26c0, 0xc0092b3a70}, 0x1, 0xc0008b2300)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x58?, 0xc0015bcfd0?)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0008b2300?, 0xc00667f080?, 0xc0071c6580?)
        /go/src/github.com/longhorn/longhorn-manager/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/longhorn/longhorn-manager/controller.(*BackupController).Run
        /go/src/github.com/longhorn/longhorn-manager/controller/backup_controller.go:136 +0x1d

I then dug around in the longhorn-manager repo and found this section in the referenced 1.5.0 tagged branch:

// get storage class of the pvc binding with the volume
    kubernetesStatus := &volume.Status.KubernetesStatus
    storageClassName := ""
    if kubernetesStatus.PVCName != "" && kubernetesStatus.LastPVCRefAt == "" {
        pvc, _ := bc.ds.GetPersistentVolumeClaim(kubernetesStatus.Namespace, kubernetesStatus.PVCName)
        if pvc != nil {
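            // NOTE: pvc.Spec.StorageClassName is a *string; when the PVC only sets the
            // deprecated volume.beta.kubernetes.io/storage-class annotation this pointer
            // is nil, so the dereference below is the likely source of the panic at
            // backup_controller.go:639 in the stack trace above.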
            storageClassName = *pvc.Spec.StorageClassName
        }
    }

From that repo and the diff against the 1.4.0 version, this appears to be new logic/code. I checked the PVCs across all namespaces for spec.storageClassName and noticed that one of the PVCs did not have a storage class set at all; instead it had the annotation volume.beta.kubernetes.io/storage-class: longhorn (I'm not familiar with volume annotations yet). Full output of the PVC below:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    meta.helm.sh/release-name: portainer
    meta.helm.sh/release-namespace: portainer
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-class: longhorn
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
    volume.kubernetes.io/storage-provisioner: driver.longhorn.io
  creationTimestamp: "2023-05-09T01:18:13Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/instance: portainer
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: portainer
    app.kubernetes.io/version: ce-latest-ee-2.18.4
    helm.sh/chart: portainer-1.0.44
    io.portainer.kubernetes.application.stack: portainer
    recurring-job-group.longhorn.io/default: enabled
  name: portainer
  namespace: portainer
  resourceVersion: "65402379"
  uid: 783fffdd-94cf-4408-b56c-0c6d727b22a1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeMode: Filesystem
  volumeName: pvc-783fffdd-94cf-4408-b56c-0c6d727b22a1
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  phase: Bound

As you can see, spec.storageClassName is missing because the upstream Helm chart for this PVC uses the annotation method instead: https://github.com/portainer/k8s/blob/master/charts/portainer/templates/pvc.yaml

---
kind: "PersistentVolumeClaim"
apiVersion: "v1"
metadata:
  name: {{ template "portainer.fullname" . }}
  namespace: {{ .Release.Namespace }}
  annotations:
  {{- if .Values.persistence.storageClass }}
    volume.beta.kubernetes.io/storage-class: {{ .Values.persistence.storageClass | quote }}
  {{- else }}
    volume.alpha.kubernetes.io/storage-class: "generic"
  {{- end }}
  ....
  ....

According to the Kubernetes docs (excerpt below), the upstream chart (portainer) should adjust its template to use storageClassName instead of the old annotation method, which is deprecated but still works.

In the past, the annotation volume.beta.kubernetes.io/storage-class was used instead of the storageClassName attribute. 

This annotation is still working; however, it will become fully deprecated in a future Kubernetes release.

I was able to resolve this for now by manually editing the PVC and adding a storageClassName in the spec section; that fixes the NPE and the managers came back up fine. These same PVCs/pods worked fine in 1.4.0. I am going to open a PR for portainer to use the storageClassName spec field instead of the annotation, since it is deprecated anyway, but this does appear to be a regression.
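
To find any other PVCs that rely on the deprecated annotation rather than spec.storageClassName, something along these lines should work (a sketch using jq; jsonpath would work as well):

## list PVCs in all namespaces that have no spec.storageClassName but do carry
## the deprecated volume.beta.kubernetes.io/storage-class annotation
kubectl get pvc -A -o json | jq -r '.items[]
  | select(.spec.storageClassName == null
      and .metadata.annotations["volume.beta.kubernetes.io/storage-class"] != null)
  | "\(.metadata.namespace)/\(.metadata.name)"'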

Questions:

  1. Should Longhorn continue to support the deprecated annotation-style storageClass specifier? Or should Longhorn put a disclaimer on 1.5.0 saying that this is a known issue and users should fix their deprecated annotations? In my view this is a regression, since I did not see it mentioned (maybe I missed it though).
  2. This one is a bit tough to answer, but maybe some users have already gone through the experience. Since rollbacks are not supported and I did the rollback to clean up orphan resources during the upgrade, are there any concerns about stability or future upgrades due to a rollback having been done? Specifically, in this case, 1.5.0 to 1.4.0 and then back up to 1.5.0.

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade from 1.4.0 to 1.5.0 (unsure if the rollback contributed to this issue, but it may need to be part of the reproduction)
  2. Have a PVC that specifies its storage class via the deprecated annotation instead of spec.storageClassName
  3. See the Longhorn manager(s) fail due to the NPE referenced in the code snippet from the 1.5.0 tagged branch
  4. Adding a storageClassName in the spec section resolves the issue

Expected behavior

Since the annotation-based storage class specifier is deprecated but still works within Kubernetes, Longhorn should either continue to support it or add a disclaimer/notice to the 1.5.0 upgrade notes.

Log or Support bundle


Environment

Additional context


Workaround

Manually edit the PVC and add a storageClassName in the spec section; this fixes the NPE and the managers come back up fine.
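
A one-liner that should be equivalent to the manual edit (a sketch; the value must match the class the PVC is already bound to via the annotation, longhorn in this case):

kubectl -n portainer patch pvc portainer --type=merge \
  -p '{"spec":{"storageClassName":"longhorn"}}'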

derekbit commented 1 year ago

Should Longhorn continue to support the deprecated annotation-style storageClass specifier? Or should Longhorn put a disclaimer on 1.5.0 saying that this is a known issue and users should fix their deprecated annotations? In my view this is a regression, since I did not see it mentioned (maybe I missed it though).

https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/#using-dynamic-provisioning

Yes, as you said, this is a regression (code). I think we can fall back to the deprecated volume.beta.kubernetes.io/storage-class annotation if spec.storageClassName is not set. WDYT? @innobead @ChanYiLin

This one is a bit tough to answer, but maybe some users have already gone through the experience. Since rollbacks are not supported and I did the rollback to clean up orphan resources during the upgrade, are there any concerns about stability or future upgrades due to a rollback having been done? Specifically, in this case, 1.5.0 to 1.4.0 and then back up to 1.5.0.

The upgrade path mainly updates the resources' spec/status fields and the pods (code). If you're worried about it, you can send us a support bundle and I can help check the resources' values.

derekbit commented 1 year ago

@pchang388 BTW, should it be **Longhorn Manager** Pods CrashLoop After Upgrade From 1.4.0?

innobead commented 1 year ago

@ChanYiLin please help with this. Need to get this to 1.5.1.

pchang388 commented 1 year ago

@pchang388 BTW, should it be **Longhorn Manager** Pods CrashLoop After Upgrade From 1.4.0?

Yes, sorry for the confusion - I fixed the details wording to reflect the correction as well. Thanks!

derekbit commented 1 year ago

@pchang388 BTW, should it be **Longhorn Manager** Pods CrashLoop After Upgrade From 1.4.0?

Yes, sorry for the confusion - I fixed the details wording to reflect the correction as well. Thanks!

No worries. Thanks for raising the issue.

longhorn-io-github-bot commented 1 year ago

Pre Ready-For-Testing Checklist

pchang388 commented 1 year ago

Should Longhorn continue to support the deprecated annotation-style storageClass specifier? Or should Longhorn put a disclaimer on 1.5.0 saying that this is a known issue and users should fix their deprecated annotations? In my view this is a regression, since I did not see it mentioned (maybe I missed it though).

https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/#using-dynamic-provisioning

Yes, as you said, this is a regression (code). I think we can fall back to the deprecated volume.beta.kubernetes.io/storage-class annotation if spec.storageClassName is not set. WDYT? @innobead @ChanYiLin

This one is a bit tough to answer, but maybe some users have already gone through the experience. Since rollbacks are not supported and I did the rollback to clean up orphan resources during the upgrade, are there any concerns about stability or future upgrades due to a rollback having been done? Specifically, in this case, 1.5.0 to 1.4.0 and then back up to 1.5.0.

The upgrade path mainly updates the resources' spec/status fields and the pods (code). If you're worried about it, you can send us a support bundle and I can help check the resources' values.

Hey @derekbit, I noticed an issue following the upgrade. As mentioned, it might be related to the rollback I did to fix the orphan issue.

I am no longer able to create new volumes. Existing volumes attach/run normally after the upgrade, but when I tried a ct install for an unrelated project, the test PVC was stuck in the Pending phase and the volume was only created after roughly 3-5 minutes. It never attached, and when the ct install timed out, the volume was not cleaned up even after the pod and PVC were deleted. Just noting: while the original issue was taking place and while I encountered this new issue, no backup job was running, or at least should not have been according to the recurring job cron.

Support bundle attached; I'm seeing a lot of errors in general and would definitely appreciate it if you could take a look. Currently I am unsure where to start.

supportbundle_cc4fef94-13ea-4569-af93-87480124d212_2023-07-10T05-41-18Z.zip

Please let me know if you prefer to move this conversation to a new issue and I will delete this comment and move the below info there.

Some relevant logs/events:

events

### kubectl events for test pvc ###
5m37s       Normal    SuccessfulCreate        replicaset/test-nginx-pvc-dfb4986fb    Created pod: test-nginx-pvc-dfb4986fb-4xx8g
5m36s       Warning   FailedScheduling        pod/test-nginx-pvc-dfb4986fb-4xx8g     0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod..
4m17s       Normal    ExternalProvisioning    persistentvolumeclaim/test-nginx-pvc   waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator
4m6s        Warning   ProvisioningFailed      persistentvolumeclaim/test-nginx-pvc   failed to provision volume with StorageClass "longhorn": rpc error: code = DeadlineExceeded desc = failed to wait for volume creation to complete
4m5s        Normal    Provisioning            persistentvolumeclaim/test-nginx-pvc   External provisioner is provisioning volume for claim "default/test-nginx-pvc"
4m5s        Normal    ProvisioningSucceeded   persistentvolumeclaim/test-nginx-pvc   Successfully provisioned volume pvc-11a77edd-6744-4b15-8655-061f5bcef15b
4m5s        Warning   FailedScheduling        pod/test-nginx-pvc-dfb4986fb-4xx8g     0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod..
4m2s        Normal    Scheduled               pod/test-nginx-pvc-dfb4986fb-4xx8g     Successfully assigned default/test-nginx-pvc-dfb4986fb-4xx8g to k3s-worker-0
5m46s       Warning   FailedScheduling        pod/test-nginx-pvc-dfb4986fb-csrdm     0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod..
2m          Warning   FailedMount             pod/test-nginx-pvc-dfb4986fb-4xx8g     Unable to attach or mount volumes: unmounted volumes=[vol1], unattached volumes=[vol1], failed to process volumes=[]: timed out waiting for the condition
110s        Warning   FailedAttachVolume      pod/test-nginx-pvc-dfb4986fb-4xx8g     AttachVolume.Attach failed for volume "pvc-11a77edd-6744-4b15-8655-061f5bcef15b" : rpc error: code = Aborted desc = volume pvc-11a77edd-6744-4b15-8655-061f5bcef15b is not ready for workloads

volumes

$ k get volumes.longhorn.io -n longhorn-system
...
...
pvc-c474295d-581c-4e99-8c7b-a848ca010e28   attached   healthy                  53687091200   k3s-worker-3   152d
pvc-ada7f3d9-78d5-40cf-8962-b00e22072007   attached   healthy                  10737418240   k3s-worker-0   84d
pvc-11a77edd-6744-4b15-8655-061f5bcef15b                                       5368709120                   21m
test                                                                           2147483648                   8m51s

provisioner logs

$ k logs -f -n longhorn-system csi-provisioner-65cb5cc4ff-7jnqk | grep -i error
E0710 03:27:38.089578       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:39.090764       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:41.091643       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:45.092498       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:27:53.092627       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:28:09.094148       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:28:41.093804       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
E0710 03:29:45.094373       1 controller.go:1007] error syncing volume "pvc-5360f54f-34df-4976-9ee7-f0777dd499b4": persistentvolume pvc-5360f54f-34df-4976-9ee7-f0777dd499b4 is still attached to node k3s-worker-2
I0710 03:33:10.438872       1 controller.go:1082] Temporary error received, adding PVC 8891335d-2c57-4dd8-8738-e3de09dc3a89 to claims in progress
E0710 03:33:10.438912       1 controller.go:957] error syncing claim "8891335d-2c57-4dd8-8738-e3de09dc3a89": failed to provision volume with StorageClass "longhorn": rpc error: code = DeadlineExceeded desc = failed to wait for volume creation to complete
I0710 03:33:10.438961       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"portainer-jhzkjen7ro", Name:"portainer-jhzkjen7ro", UID:"8891335d-2c57-4dd8-8738-e3de09dc3a89", APIVersion:"v1", ResourceVersion:"65540989", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "longhorn": rpc error: code = DeadlineExceeded desc = failed to wait for volume creation to complete

manager logs

$ k logs -f -n longhorn-system longhorn-manager-qwqht | grep -i error
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/v2-data-engine" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/v2-data-engine" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/storage-network" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Failed to sync Longhorn setting longhorn-system/v2-data-engine" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Dropping Longhorn setting longhorn-system/storage-network out of the queue" controller=longhorn-setting error="failed to sync setting for longhorn-system/storage-network: failed to apply storage-network setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:27:27Z" level=error msg="Dropping Longhorn setting longhorn-system/v2-data-engine out of the queue" controller=longhorn-setting error="failed to sync setting for longhorn-system/v2-data-engine: cannot apply v2-data-engine setting to Longhorn workloads when there are attached volumes" node=k3s-worker-2
time="2023-07-10T05:52:29Z" level=warning msg="HTTP handling error" error="websocket: close sent"
time="2023-07-10T05:52:29Z" level=error msg="Error in request: websocket: close sent"
time="2023-07-10T05:52:29Z" level=error msg="Failed to write err: websocket: close sent"

derekbit commented 1 year ago

I see the related logs

2023-07-10T05:34:14.679046451Z time="2023-07-10T05:34:14Z" level=error msg="Failed to sync Longhorn volume longhorn-system/pvc-11a77edd-6744-4b15-8655-061f5bcef15b" controller=longhorn-volume error="failed to sync longhorn-system/pvc-11a77edd-6744-4b15-8655-061f5bcef15b: create not allowed while custom resource definition is terminating" node=k3s-worker-0

and the stuck CustomResourceDefinition:

- apiVersion: apiextensions.k8s.io/v1
  kind: CustomResourceDefinition
  metadata:
    annotations:
      controller-gen.kubebuilder.io/version: v0.7.0
      meta.helm.sh/release-name: longhorn
      meta.helm.sh/release-namespace: longhorn-system
    creationTimestamp: "2023-07-08T09:37:04Z"
    deletionTimestamp: "2023-07-08T09:56:08Z"
    finalizers:
    - customresourcecleanup.apiextensions.k8s.io
    generation: 1
    labels:
      app.kubernetes.io/instance: longhorn
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: longhorn
      app.kubernetes.io/version: v1.5.0
      helm.sh/chart: longhorn-1.5.0
      longhorn-manager: "null"
 ...
    name: volumeattachments.longhorn.io
    resourceVersion: "65594052"
    uid: 58280162-bb41-46d8-916d-c1d107a7abe6
  spec:
...
  status:
    acceptedNames:
      kind: VolumeAttachment
      listKind: VolumeAttachmentList
      plural: volumeattachments
      shortNames:
      - lhva
      singular: volumeattachment
    conditions:
    - lastTransitionTime: "2023-07-08T09:37:04Z"
      message: no conflicts found
      reason: NoConflicts
      status: "True"
      type: NamesAccepted
    - lastTransitionTime: "2023-07-08T09:37:04Z"
      message: the initial names have been accepted
      reason: InitialNamesAccepted
      status: "True"
      type: Established
    - lastTransitionTime: "2023-07-08T09:56:08Z"
      message: CustomResource deletion is in progress
      reason: InstanceDeletionInProgress
      status: "True"
      type: Terminating
    storedVersions:
    - v1beta2
kind: List
metadata:
  resourceVersion: "65594102"

It looks like it hit the upstream issue https://github.com/kubernetes/kubernetes/issues/60538. If you want to try the workaround mentioned in that thread, I would recommend backing up volumes before applying it. I will see if I can reproduce it on my side.

derekbit commented 1 year ago

@pchang388 I can reproduce the issue on my side. It's caused by the unexpected rollback.

Reproduce steps

  1. Longhorn v1.4.2 cluster with some orphan resources
  2. Upgrade to v1.5.0, and then get stuck at ...immutable fields...
  3. Roll back to v1.4.2
  4. Delete the orphan resources
  5. Upgrade to v1.5.0 successfully
  6. The volumeattachments.longhorn.io CRD status is stuck at CustomResource deletion is in progress. Then, you cannot create any volume.

Solution steps

  1. kubectl -n longhorn-system patch crd volumeattachments.longhorn.io -p '{"metadata":{"finalizers":[]}}' --type=merge
  2. Check that the volumeattachments.longhorn.io CRD is automatically and successfully deleted with kubectl -n longhorn-system get CustomResourceDefinition -o yaml (see the sketch after this list)
  3. Re-upgrade the cluster to v1.5.0. (It is already v1.5.0, so the helm upgrade will only update the CRDs and add back the missing volumeattachments CRD)
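
For example, a quick way to confirm the stuck CRD is gone and then re-created and Established after the helm upgrade (a sketch, not from the original thread):

## after steps 1 and 2 the terminating CRD should disappear (expect "not found")
kubectl get crd volumeattachments.longhorn.io

## after step 3 (helm upgrade) it should be back and Established
kubectl get crd volumeattachments.longhorn.io \
  -o jsonpath='{.status.conditions[?(@.type=="Established")].status}'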

derekbit commented 1 year ago

@pchang388 Please let me know if it works. Thank you.

innobead commented 1 year ago

A side question, @mantissahz: downgrade prevention only works when the source version is equal to or later than 1.5.0, right?

mantissahz commented 1 year ago

A side question, @mantissahz: downgrade prevention only works when the source version is equal to or later than 1.5.0, right?

@innobead Yes, it is for the official longhorn-manager image.

pchang388 commented 1 year ago

Hey @derekbit, thank you very much for the suggestion. I was actually trying your suggestion and saw that you also came to the same conclusion.

First, run the patch. This is not the recommended way, since it can leave behind orphaned resources in the Kubernetes datastore, but it was done to get back to normal operations for development purposes.

The patch addresses the deadlock with finalizers on the CRD, as you mentioned:

kubectl patch crd/volumeattachments.longhorn.io -p '{"metadata":{"finalizers":[]}}' --type=merge

I then noticed the volumeattachments.longhorn.io CRD was missing and reinstalled it by rendering the helm template and taking the volumeattachments section, but your method is much better.

## spot the error - notice it's gone
k get volumeattachments.longhorn.io  -n longhorn-system
error: the server doesn't have a resource type "volumeattachments"

## template output
helm template longhorn longhorn/longhorn -f helm/custom-values.yaml --version 1.5.0 -n longhorn-system > out.yaml

## create again
k apply -f volumeattachment.yaml

I tested it again by applying the patch and doing a helm upgrade instead; that worked as well.

$ make upgrade 
helm upgrade longhorn longhorn/longhorn -f helm/custom-values.yaml --version 1.5.0 -n longhorn-system
Release "longhorn" has been upgraded. Happy Helming!
NAME: longhorn
LAST DEPLOYED: Mon Jul 10 03:30:09 2023
NAMESPACE: longhorn-system
STATUS: deployed
REVISION: 18
TEST SUITE: None
NOTES:
Longhorn is now installed on the cluster!

Please wait a few minutes for other Longhorn components such as CSI deployments, Engine Images, and Instance Managers to be initialized.

Visit our documentation at https://longhorn.io/docs/

But for 1.5.0, just remember to delete the old deployments again after the helm upgrade:

kubectl delete deployments.apps longhorn-admission-webhook longhorn-conversion-webhook longhorn-recovery-backend -n longhorn-system

So far it looks like things are working again. Thank you again for your help and quick responses. I hope there are no more issues due to the downgrade, and I definitely won't downgrade again unless there's no other way.

roger-ryao commented 1 year ago

Verified on master-head 20230712

The test steps

  1. Set up a Longhorn v1.4.2 cluster with some orphan resources.
  2. Create a Pod with a PVC that has the storageClassName specified in the annotation instead of the spec:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: longhorn-volv-pvc
      annotations:
        volume.beta.kubernetes.io/storage-class: longhorn
    spec:
      accessModes:
        - ReadWriteOnce
      # storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: volume-test
      namespace: default
    spec:
      restartPolicy: Always
      containers:
        - name: volume-test
          image: nginx
          imagePullPolicy: IfNotPresent
          livenessProbe:
            exec:
              command:
                - ls
                - /data/lost+found
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: volv
              mountPath: /data
          ports:
            - containerPort: 80
      volumes:
        - name: volv
          persistentVolumeClaim:
            claimName: longhorn-volv-pvc
  3. Write some data into the Volume
  4. Backup the Volume
  5. describe the BackupVolume
    kubectl describe lhbv -n longhorn-system
  6. Upgrade to master-head
  7. Describe the BackupVolume after the upgrade and verify that the storageClassName is present in the BackupVolume CR status.

Result: Passed

  1. The storageClassName was present in the BackupVolume CR status after the upgrade (screenshot: Screenshot_20230712_134053).