longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Some users report that sometimes a volume cannot be attached after upgrading Longhorn from v1.4.x to v1.5.x due to an unexpected `longhorn-ui` attachment ticket #8339

Open PhanLe1010 opened 4 months ago

PhanLe1010 commented 4 months ago

Describe the bug

Some users report that sometimes a volume cannot be attached after upgrading Longhorn from v1.4.x to v1.5.x due to an unexpected longhorn-ui attachment ticket. For example, this ticket https://github.com/longhorn/longhorn/discussions/6281 has the longhorn-ui ticket in one of the Longhorn VolumeAttachment objects:

      longhorn-ui:
        generation: 0
        id: longhorn-ui
        nodeID: k3s-agent-acf067
        parameters:
          disableFrontend: "false"
        type: longhorn-api

Since the user confirmed that no manual attach action was performed during this time, we need to investigate whether there is a bug in the upgrade path from v1.4.x to v1.5.x.
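For context, the ticket above sits under spec.attachmentTickets of the Longhorn VolumeAttachment CR (volumeattachments.longhorn.io in the longhorn-system namespace, named after the volume). A rough sketch of the surrounding object, with the volume name made up for illustration:

    apiVersion: longhorn.io/v1beta2
    kind: VolumeAttachment
    metadata:
      name: pvc-example
      namespace: longhorn-system
    spec:
      attachmentTickets:
        longhorn-ui:
          generation: 0
          id: longhorn-ui
          nodeID: k3s-agent-acf067
          parameters:
            disableFrontend: "false"
          type: longhorn-api
      volume: pvc-example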

To Reproduce

We don't have reproduction steps yet.

Expected behavior

If there is no manual attach action during this time, there should be no longhorn-ui AD ticket.

Support bundle for troubleshooting

See ref in the ticket https://github.com/longhorn/longhorn/discussions/6281

Environment

ChanYiLin commented 2 months ago

Hi @PhanLe1010 I have done some research

I have been looking at our upgrade path, specifically upgradeVolumeAttachments() (ref):

    ...
    kubeVAList, err := kubeClient.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        return errors.Wrapf(err, "failed to list all Kubernetes VolumeAttachments during the Longhorn VolumeAttachment upgrade")
    }
    ...
    // A csi-attacher ticket is created for every live Kubernetes VolumeAttachment.
    for _, kubeVA := range kubeVAs {
        if !kubeVA.DeletionTimestamp.IsZero() {
            continue
        }
        ticketID := kubeVA.Name
        attachmentTickets[ticketID] = &longhorn.AttachmentTicket{
            ID:     ticketID,
            Type:   longhorn.AttacherTypeCSIAttacher,
            NodeID: kubeVA.Spec.NodeName,
            Parameters: map[string]string{
                longhorn.AttachmentParameterDisableFrontend: longhorn.FalseValue,
            },
        }
    }
    ...
    // If the volume is attached but no ticket was created above, fall back to a longhorn-ui ticket.
    if vol.Spec.NodeID != "" && len(attachmentTickets) == 0 {
        ticketID := "longhorn-ui"
        attachmentTickets[ticketID] = &longhorn.AttachmentTicket{
            ID:     ticketID,
            Type:   longhorn.AttacherTypeLonghornAPI,
            NodeID: vol.Spec.NodeID,
            Parameters: map[string]string{
                longhorn.AttachmentParameterDisableFrontend: strconv.FormatBool(vol.Spec.DisableFrontend),
            },
        }
    }

Maybe somehow, at the time of the upgrade, the Kubernetes VolumeAttachment for the workload and volume on NodeA was missing or marked for deletion while vol.Spec.NodeID still pointed to NodeA, so the fallback above created a longhorn-ui ticket for NodeA.

And when the pod was recreated, it happened to land on NodeA again, so the csi-attacher ticket was added without any conflict. That is the only place I can find where we add a ui ticket to the attachment CR besides a manual attachment.

Later, when NodeA was cordoned and drained by the user or the Kubernetes management system, the workload was rescheduled to another node, NodeB, and then the longhorn-ui attachment issue appeared.

This is only my assumption; I haven't found a way to reproduce it yet.
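If that sequence is what happened, after the drain the Longhorn VolumeAttachment could end up with two conflicting tickets, roughly like the sketch below (node names, the csi ticket ID, and the volume name are just examples):

    spec:
      attachmentTickets:
        csi-0123456789abcdef:
          generation: 0
          id: csi-0123456789abcdef
          nodeID: node-b          # csi-attacher ticket for the rescheduled workload pod
          parameters:
            disableFrontend: "false"
          type: csi-attacher
        longhorn-ui:
          generation: 0
          id: longhorn-ui
          nodeID: node-a          # stale ticket created during the upgrade, still pinned to NodeA
          parameters:
            disableFrontend: "false"
          type: longhorn-api
      volume: pvc-example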

ChanYiLin commented 2 months ago

BTW @PhanLe1010, may I confirm that there is a workaround for this case? Users can simply delete the ui attachment ticket so that the volume can be successfully attached to the other node with the workload pod, is that correct?

PhanLe1010 commented 2 months ago

BTW @PhanLe1010, may I confirm that there is a workaround for this case? Users can simply delete the ui attachment ticket so that the volume can be successfully attached to the other node with the workload pod, is that correct?

Hi @ChanYiLin Yes, you are correct
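For reference, the workaround is to edit the Longhorn VolumeAttachment (for example with kubectl edit volumeattachments.longhorn.io <volume-name> -n longhorn-system, assuming the default namespace) and delete the longhorn-ui entry from spec.attachmentTickets, leaving only the csi-attacher ticket, roughly like this (names are illustrative):

    spec:
      attachmentTickets:
        csi-0123456789abcdef:
          generation: 0
          id: csi-0123456789abcdef
          nodeID: node-b          # only the ticket for the node running the workload pod remains
          parameters:
            disableFrontend: "false"
          type: csi-attacher
      volume: pvc-example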

PhanLe1010 commented 2 months ago

The theory you mentioned above https://github.com/longhorn/longhorn/issues/8339#issuecomment-2175516408 looks plausible to me. I will try to see if there are any potential races and try to reproduce it when I have some time too.

derekbit commented 2 months ago

There is a workaround for this issue. Can we move it to v1.8.0? @innobead