longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] GET error for volume attachment on node reboot #4188

Closed diamonwiggins closed 8 months ago

diamonwiggins commented 2 years ago

Describe the bug

After a reboot of a node in a 4-node cluster, a user is seeing the following:

Warning  FailedMount  48s (x3 over 4m52s)   kubelet            MountVolume.WaitForAttach failed for volume "pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1" : volume pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1 has GET error for volume attachment csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172: volumeattachments.storage.k8s.io "csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172" not found

To recover, the user had to create the VolumeAttachment object manually for the Pod to mount its storage again.
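
For reference, recreating the missing VolumeAttachment by hand looks roughly like the sketch below. This is an illustration only: the object name and PV name are taken from the error above, the attacher for Longhorn volumes is driver.longhorn.io, and the node name is a placeholder that must match the node where the Pod is scheduled.

    # Minimal sketch of the manual recovery; replace <node-running-the-pod>
    # with the node the Pod is scheduled on.
    kubectl apply -f - <<'EOF'
    apiVersion: storage.k8s.io/v1
    kind: VolumeAttachment
    metadata:
      name: csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172
    spec:
      attacher: driver.longhorn.io
      nodeName: <node-running-the-pod>
      source:
        persistentVolumeName: pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1
    EOF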

To Reproduce

I have not been able to reproduce this yet, unfortunately.

Expected behavior

A pod can successfully mount its storage despite a node reboot in the cluster

Log or Support bundle

longhorn-support-bundle_a8118729-480f-4d38-9b91-26a755d2e0cc_2022-06-28T20-34-47Z.zip

Environment

mantissahz commented 2 years ago

Hi @diamonwiggins (ref: #2629), did the 'FailedMount' happen for a long time before you created the volumeattachment object?

diamonwiggins commented 2 years ago

@mantissahz I can get clarification from the end user if the amount of time is relevant. At the very least, 5 minutes had passed, but it's likely that much more time had passed before the user was assisted with manually creating the volume attachment.

Also worth noting, this customer is on 1.1.2 where #2629 is supposedly fixed. Happy to provide any other information that could help track this down.

PhanLe1010 commented 2 years ago

It could take up to 6 or 7 minutes for Kubernetes to retry creating the volumeattachment object.

How long did the node go down?

diamonwiggins commented 2 years ago

@PhanLe1010 The node went down for only minutes. Maybe 5 minutes or so. However it was a full 24 hours before the user manually created the VolumeAttachment objects.

diamonwiggins commented 2 years ago

Is there any additional information I can gather to assist here?

PhanLe1010 commented 2 years ago

@diamonwiggins I can't figure out why the VA was deleted and never recreated automatically, given that, as you mentioned earlier, the VA removal feature was removed as of Longhorn 1.1.2.

I would suggest upgrading to a newer stable Longhorn version (1.1.3 or 1.2.5) and reporting back if you hit the issue again.

diamonwiggins commented 1 year ago

@PhanLe1010 Understood. We've seen a similar issue with another customer after a reboot. The error is slightly different this time with:

MountVolume.WaitForAttach failed for volume "pvc-e25ec426-043d-496d-9ddd-e4920e8c1096" : volume pvc-e25ec426-043d-496d-9ddd-e4920e8c1096 has GET error for volume attachment csi-845807a0d4e3617baaadf26f975d24db606458cb640455aaac527298e9a2c4bd: volumeattachments.storage.k8s.io "csi-845807a0d4e3617baaadf26f975d24db606458cb640455aaac527298e9a2c4bd" is forbidden: User "system:node:ip-10-0-1-200" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'ip-10-0-1-200' and this object

We've confirmed that node names and IP addresses had not changed, and our customer was able to reproduce this in two separate environments.

longhorn-support-bundle_b27f4748-ec4a-45a1-8d75-04fe278d3584_2022-09-07T18-57-23Z (1).zip

If this warrants a separate GitHub issue, let me know and I'll open one.

Environment

innobead commented 1 year ago

cc @joshimoo

rajivml commented 1 year ago

We see this issue quite often with Longhorn. Today we had another repro where, after a node restart in a multi-node environment, the Alertmanager StatefulSet pods were not able to mount their PVCs even after 30-40 minutes. We see this issue with both Deployments and StatefulSets.

This is happening with Longhorn 1.3.1 as well; this particular repro is on 1.3.1 itself.

Whenever this happens, we scale the workload replicas down to 0 and back up so that the volume attachment flow is triggered again, but this is not an acceptable solution for production workloads.

       {
            "apiVersion": "v1",
            "count": 60,
            "eventTime": null,
            "firstTimestamp": "2022-10-12T06:05:07Z",
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "alertmanager-rancher-monitoring-alertmanager-1",
                "namespace": "cattle-monitoring-system",
                "resourceVersion": "83537",
                "uid": "2088fca6-b6cb-458f-8297-44fa477b0e81"
            },
            "kind": "Event",
            "lastTimestamp": "2022-10-12T07:50:59Z",
            "message": "MountVolume.WaitForAttach failed for volume \"pvc-84933541-a66d-4ca2-a710-6db17e6643ba\" : volume pvc-84933541-a66d-4ca2-a710-6db17e6643ba has GET error for volume attachment csi-0c400de43ff27c65fa12afab1248675317dbb2b8fc07ae6582df5ce218fa6ff7: volumeattachments.storage.k8s.io \"csi-0c400de43ff27c65fa12afab1248675317dbb2b8fc07ae6582df5ce218fa6ff7\" is forbidden: User \"system:node:server1\" cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope: no relationship found between node 'server1' and this object",
            "metadata": {
                "creationTimestamp": "2022-10-12T06:05:07Z",
                "name": "alertmanager-rancher-monitoring-alertmanager-1.171d3d2e89354c2e",
                "namespace": "cattle-monitoring-system",
                "resourceVersion": "167371",
                "uid": "1874d80b-43e9-4242-9df3-bc39b68c0cc1"
            },
            "reason": "FailedMount",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kubelet",
                "host": "server1"
            },
            "type": "Warning"
        },

PhanLe1010 commented 1 year ago

@rajivml

Could you help us troubleshoot by providing the reproducing steps and env information (or provide us an env)?

Environment

rajivml commented 1 year ago

Hi @PhanLe1010

We are seeing it on both single node and multi-node environments

I will share an environment for your offline analysis via DM over slack

Longhorn version: 1.3.1
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
Number of management node in the cluster: 3 nodes which act as both master + worker
Number of worker node in the cluster: 3 nodes which act as both master + worker
Node config: 32 Core, 128GB RAM
OS type and version: RHEL
CPU per node: 32
Memory per node: 128GB RAM
Disk type (e.g. SSD/NVMe): SSD
Network bandwidth between the nodes: Azure Provided
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Azure Disks
Number of Longhorn volumes in the cluster: Around 20

JoshuaWatt commented 1 year ago

I saw this today also. It was after I upgraded k3s from 1.23.4 -> 1.23.13, but that may be a coincidence.

Specifically, I saw the User "USER" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope error

PhanLe1010 commented 1 year ago

User "USER" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope

This error is not related to this issue. It indicates that the client is missing the RBAC permission. Where did you see that error (from which pods)?
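
A quick way to check whether a given identity has that permission is a kubectl access review. A hedged example (the service account name below is only an assumption; substitute whatever identity actually logged the error):

    # Check whether an identity may GET VolumeAttachments at cluster scope.
    # The service account name here is an assumption, not necessarily the
    # identity that produced the error.
    kubectl auth can-i get volumeattachments.storage.k8s.io \
        --as=system:serviceaccount:longhorn-system:longhorn-service-account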

hedefalk commented 1 year ago

I have the same issue. I have a two-node RPi cluster. Any time the master reboots, I get something like:

  Warning  FailedMount  37s   kubelet            MountVolume.WaitForAttach failed for volume "ghost-db" : volume ghost-db has GET error for volume attachment csi-be2cb4dfc03d99eef9aa0e05cb28e59ac52f0c0c5e832c68d142a2ba76827bdb: volumeattachments.storage.k8s.io "csi-be2cb4dfc03d99eef9aa0e05cb28e59ac52f0c0c5e832c68d142a2ba76827bdb" is forbidden: User "system:node:pi4" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'pi4' and this object
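
The "no relationship found between node ... and this object" part typically means the kubelet's GET was denied because, from the API server's point of view, the VolumeAttachment either no longer exists or is not bound to that node. A quick check (the PV name is taken from the error above; the namespace is a placeholder):

    # Does the VolumeAttachment for this PV exist, and which node is it bound to?
    # The NODE column should match the node where the pod is scheduled.
    kubectl get volumeattachments | grep ghost-db
    kubectl get pod -o wide -n <namespace-of-the-pod>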

simonreddy2001 commented 1 year ago

Hi, I have the same issue.

MountVolume.WaitForAttach failed for volume "pvc-xx" : volume vol-xx has GET error for volume attachment csi-xx: volumeattachments.storage.k8s.io "csi-xx" is forbidden: User "system:node:ip-xx.compute.internal" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'ip-xx.compute.internal' and this object

But we scale the StatefulSet replicas down to 0 and back up so that the volume attachment flow gets triggered again.

Orhideous commented 1 year ago

Also ran into this issue. Can confirm that the workaround suggested by @simonreddy2001 works.

innobead commented 11 months ago

We need a resilient way to recover from this automatically.

cc @derekbit @shuo-wu @PhanLe1010

derekbit commented 11 months ago

@diamonwiggins @hedefalk @simonreddy2001 @Orhideous @rajivml I tried to reproduce the issue using Longhorn v1.3.2 and a StatefulSet with 2 replicas on a 2-node cluster. I rebooted the two nodes repeatedly but still cannot reproduce the issue.
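
For anyone attempting a similar reproduction, the reboot loop can be scripted roughly as below. This is only a sketch, assuming SSH access to the nodes; the node names are placeholders.

    #!/bin/bash
    # Rough reproduction-attempt loop: reboot each node in turn, wait for it
    # to report Ready again, then look for the VolumeAttachment GET error.
    for i in $(seq 1 20); do
        for node in node-1 node-2; do
            ssh "$node" sudo reboot || true
            sleep 60   # give the node time to actually go down
            until kubectl get node "$node" --no-headers | grep -qw Ready; do
                sleep 10
            done
        done
        if kubectl get events -A | grep -q "GET error for volume attachment"; then
            echo "Reproduced on iteration $i"
            break
        fi
    done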

Could you please provide the reproducing steps? If you run into the issue again, could you provide a support bundle as well? Thanks.

derekbit commented 11 months ago

Ref: https://github.com/kubernetes/kubernetes/issues/120571

PhanLe1010 commented 8 months ago

Theoretically, this issue could be very well related to the upstream issue https://github.com/kubernetes/kubernetes/issues/120571.

However, attempting to reproduce using similar steps as in the upstream issue yields no success. The attempted reproducing steps are:

  1. Install Kubernetes v1.25.15+rke2r2/v1.27.5+rke2r1
  2. Install Longhorn v1.5.3 using this longhorn-manager image phanle1010/longhorn-manager:v1.5.3-injected-detach-error. This longhorn-manager image adds logic to artificially inject a detach error into longhorn-csi-plugin in order to simulate a temporary detach failure. The code is at https://github.com/PhanLe1010/longhorn-manager/commit/a29962e468806eb8209ad45b53ec4be204d4266d
  3. Deploy this deployment into the cluster
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: '1'
      generation: 7
      labels:
        workload.user.cattle.io/workloadselector: apps.deployment-default-test-dep
      name: test-dep
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          workload.user.cattle.io/workloadselector: apps.deployment-default-test-dep
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            workload.user.cattle.io/workloadselector: apps.deployment-default-test-dep
          namespace: default
        spec:
          affinity: {}
          containers:
            - image: ubuntu:xenial
              imagePullPolicy: Always
              name: container-0
              resources: {}
              securityContext:
                allowPrivilegeEscalation: false
                privileged: false
                readOnlyRootFilesystem: false
                runAsNonRoot: false
              stdin: true
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              tty: true
              volumeMounts:
                - mountPath: /mnt
                  name: vol-7rasu
          dnsPolicy: ClusterFirst
          nodeName: phan-v603-pool2-e46dd713-75pnq
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: vol-7rasu
              persistentVolumeClaim:
                claimName: test-pvc
  4. Use this script to simulate the reproducing steps in the upstream issue

    #!/bin/bash
    
    set -o errexit
    set -o nounset
    set -o pipefail
    set -x
    
    # Inject detach error
    kubectl -n longhorn-system patch -p '{"value": "26"}' --type=merge lhs storage-minimal-available-percentage
    
    # Run application pod
    kubectl scale deployment test-dep --replicas 1
    kubectl wait --for=condition=available deployment/test-dep
    # Delete the app
    kubectl scale deployment test-dep --replicas 0
    kubectl wait --for=delete pod --selector=workload.user.cattle.io/workloadselector=apps.deployment-default-test-dep
    
    # Wait for detach error
    while true; do
        if kubectl get volumeattachment -o json | grep "Simulated detach error"; then
            break
        fi
        echo "Waiting for volumeAttachment to get error..."
        sleep 1
    done
    
    # Kill KCM 
    kubectl -n kube-system delete pod -l component=kube-controller-manager
    sleep 2
    
    # Start a new KCM
    kubectl -n kube-system wait --for condition=Ready=true pod --selector=component=kube-controller-manager
    
    # there is no way how to wait for KCM to process the volumeattachment...
    sleep 13
    
    # Create a new pod *after* KCM started processing volumeattachments
    kubectl scale deployment test-dep --replicas 1
    sleep 1
    kubectl wait --for condition=PodScheduled=true pod --selector=workload.user.cattle.io/workloadselector=apps.deployment-default-test-dep
    
    # Stop injecting errors to detach
    kubectl -n longhorn-system patch -p '{"value": "25"}' --type=merge lhs storage-minimal-available-percentage
    
    # Now, the second pod should start, but it's stuck at "no relationship found between node '127.0.0.1' and this object"
    
  5. Unfortunately, the end result is that the new pod is always able to come up, so the issue cannot be reproduced.

PhanLe1010 commented 8 months ago

Next action

Even though we are not able to reproduce the upstream issue, from code analysis, I do think that the race condition in the upstream issue COULD be the root cause of this ticket. The upstream issue is fixed in:

Therefore, I think the next step for this ticket would be:

  1. Ask the user to try the fixed Kubernetes versions to see if the issue still persists. (cc @diamonwiggins Could you try to upgrade Kubernetes to the fixed versions?)
  2. Close this ticket.
  3. If the user still hits the issue after upgrading Kubernetes to the fixed versions, we can reopen the ticket.

WDYT @derekbit @innobead @ejweber ?

Workaround:

Additionally, from code analysis, I think the workaround may be to scale down the workload, wait for the workload to be fully terminated, then scale the workload back up again. The kube-controller-manager should then be able to recreate the VolumeAttachment for the new pod.
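
A minimal sketch of that workaround, mirroring the pattern already used in the reproduction script above (the Deployment name and label selector are placeholders):

    # Scale the workload down, wait until its pods are fully terminated, then
    # scale it back up so kube-controller-manager recreates the VolumeAttachment.
    kubectl scale deployment test-dep --replicas 0
    kubectl wait --for=delete pod --selector=<app-label-selector> --timeout=5m
    kubectl scale deployment test-dep --replicas 1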

Hanson-Tsai commented 4 months ago

Hi, I have faced a similar issue on Kubernetes v1.29.

adamcharnock commented 3 months ago

I'm seeing the same behaviour when using Mayastor. In my case I drained the node of Mayastor volumes, restarted the Mayastor pod (openebs-io-engine-xxx), then uncordoned the node for Mayastor volumes. I then noted that some of the StatefulSet pods were stuck in 'ContainerCreating', reporting an I/O error with no further details.

The OpenEBS CSI controller was reporting:

I0511 17:34:46.274399       1 csi_handler.go:234] Error processing "csi31ad7af564f89fe04d71d5cc0e2240ee1f5b73d9da88f3e933c1b26d9f501219": failed to  detach: could not mark as detached: volumeattachments.storage.k8s.io"csi31ad7af564f89fe04d71d5cc0e2240ee1f5b73d9da88f3e933c1b26d9f501219" not found

Scaling down the StatefulSet to 0, then scaling back up resolved the issue.

I know Mayastor is an entirely different project, but I thought this would be helpful information for 1) anyone also googling their way here, and 2) adding information towards the "is this an upstream issue" question.