longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Some users report that sometimes a volume cannot be attached after upgrading Longhorn from v1.4.x to v1.5.x due to an unexpected `longhorn-ui` attachment ticket #8339

Open PhanLe1010 opened 4 months ago

PhanLe1010 commented 4 months ago

Describe the bug

Some users report that sometimes a volume cannot be attached after upgrading Longhorn from v1.4.x to v1.5.x due to an unexpected longhorn-ui attachment ticket. For example, this ticket https://github.com/longhorn/longhorn/discussions/6281 has the longhorn-ui ticket in one of the Longhorn VolumeAttachment objects:

      longhorn-ui:
        generation: 0
        id: longhorn-ui
        nodeID: k3s-agent-acf067
        parameters:
          disableFrontend: "false"
        type: longhorn-api

Since the user confirmed that no manual attach action was performed during this time, we need to investigate whether there is a bug in the upgrade path from v1.4.x to v1.5.x.
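For context, the ticket above sits under spec.attachmentTickets of the Longhorn VolumeAttachment CR (volumeattachments.longhorn.io in the longhorn-system namespace, named after the volume). A rough sketch of the surrounding object, with the volume name made up for illustration:

    apiVersion: longhorn.io/v1beta2
    kind: VolumeAttachment
    metadata:
      name: pvc-example
      namespace: longhorn-system
    spec:
      attachmentTickets:
        longhorn-ui:
          generation: 0
          id: longhorn-ui
          nodeID: k3s-agent-acf067
          parameters:
            disableFrontend: "false"
          type: longhorn-api
      volume: pvc-example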

To Reproduce

We don't have reproduction steps yet.

Expected behavior

If there is no manual attach action during this time, there should be no longhorn-ui AD ticket.

Support bundle for troubleshooting

See ref in the ticket https://github.com/longhorn/longhorn/discussions/6281

Environment

ChanYiLin commented 2 months ago

Hi @PhanLe1010 I have done some research

I have been looking at our upgrade path, specifically upgradeVolumeAttachments() (ref):

    ...
    kubeVAList, err := kubeClient.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        return errors.Wrapf(err, "failed to list all Kubernetes VolumeAttachments during the Longhorn VolumeAttachment upgrade")
    }
    ...
    // A csi-attacher ticket is created for every live Kubernetes VolumeAttachment.
    for _, kubeVA := range kubeVAs {
        if !kubeVA.DeletionTimestamp.IsZero() {
            continue
        }
        ticketID := kubeVA.Name
        attachmentTickets[ticketID] = &longhorn.AttachmentTicket{
            ID:     ticketID,
            Type:   longhorn.AttacherTypeCSIAttacher,
            NodeID: kubeVA.Spec.NodeName,
            Parameters: map[string]string{
                longhorn.AttachmentParameterDisableFrontend: longhorn.FalseValue,
            },
        }
    }
    ...
    // If the volume is attached but no ticket was created above, fall back to a longhorn-ui ticket.
    if vol.Spec.NodeID != "" && len(attachmentTickets) == 0 {
        ticketID := "longhorn-ui"
        attachmentTickets[ticketID] = &longhorn.AttachmentTicket{
            ID:     ticketID,
            Type:   longhorn.AttacherTypeLonghornAPI,
            NodeID: vol.Spec.NodeID,
            Parameters: map[string]string{
                longhorn.AttachmentParameterDisableFrontend: strconv.FormatBool(vol.Spec.DisableFrontend),
            },
        }
    }

Maybe somehow, at the time of the upgrade, the Kubernetes VolumeAttachment for the workload and volume on NodeA was missing or marked for deletion while vol.Spec.NodeID still pointed to NodeA, so the fallback above created a longhorn-ui ticket for NodeA.

And when the pod was recreated, it happened to land on NodeA again, so the csi-attacher ticket was added without any conflict. That is the only place I can find where we add a ui ticket to the attachment CR besides a manual attachment.

Later, when NodeA was cordoned and drained by the user or the Kubernetes management system, the workload was rescheduled to another node, NodeB, and then the longhorn-ui attachment issue appeared.

This is only my assumption; I haven't found a way to reproduce it yet.
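If that sequence is what happened, after the drain the Longhorn VolumeAttachment could end up with two conflicting tickets, roughly like the sketch below (node names, the csi ticket ID, and the volume name are just examples):

    spec:
      attachmentTickets:
        csi-0123456789abcdef:
          generation: 0
          id: csi-0123456789abcdef
          nodeID: node-b          # csi-attacher ticket for the rescheduled workload pod
          parameters:
            disableFrontend: "false"
          type: csi-attacher
        longhorn-ui:
          generation: 0
          id: longhorn-ui
          nodeID: node-a          # stale ticket created during the upgrade, still pinned to NodeA
          parameters:
            disableFrontend: "false"
          type: longhorn-api
      volume: pvc-example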

ChanYiLin commented 2 months ago

BTW @PhanLe1010, may I confirm that there is a workaround for this case? Users can simply delete the ui attachment ticket so that the volume can be successfully attached to the other node with the workload pod, is that correct?

PhanLe1010 commented 2 months ago

BTW @PhanLe1010, may I confirm that there is a workaround for this case? Users can simply delete the ui attachment ticket so that the volume can be successfully attached to the other node with the workload pod, is that correct?

Hi @ChanYiLin Yes, you are correct
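For reference, the workaround is to edit the Longhorn VolumeAttachment (for example with kubectl edit volumeattachments.longhorn.io <volume-name> -n longhorn-system, assuming the default namespace) and delete the longhorn-ui entry from spec.attachmentTickets, leaving only the csi-attacher ticket, roughly like this (names are illustrative):

    spec:
      attachmentTickets:
        csi-0123456789abcdef:
          generation: 0
          id: csi-0123456789abcdef
          nodeID: node-b          # only the ticket for the node running the workload pod remains
          parameters:
            disableFrontend: "false"
          type: csi-attacher
      volume: pvc-example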

PhanLe1010 commented 2 months ago

The theory you mentioned above https://github.com/longhorn/longhorn/issues/8339#issuecomment-2175516408 looks plausible to me. I will try to see if there are any potential races and try to reproduce it when I have some time too.

derekbit commented 2 months ago

There is a workaround for this issue. Can we move it to v1.8.0? @innobead