longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] GET error for volume attachment on node reboot #4188

Closed diamonwiggins closed 8 months ago

diamonwiggins commented 2 years ago

Describe the bug

After a reboot of a node in a 4-node cluster, a user is seeing the following:

Warning  FailedMount  48s (x3 over 4m52s)   kubelet            MountVolume.WaitForAttach failed for volume "pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1" : volume pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1 has GET error for volume attachment csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172: volumeattachments.storage.k8s.io "csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172" not found

To recover, the user had to create the VolumeAttachment object manually for the Pod to mount its storage again.
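
For reference, recreating the missing VolumeAttachment by hand looks roughly like the sketch below. This is an illustration only: the object name and PV name are taken from the error above, the attacher for Longhorn volumes is driver.longhorn.io, and the node name is a placeholder that must match the node where the Pod is scheduled.

    # Minimal sketch of the manual recovery; replace <node-running-the-pod>
    # with the node the Pod is scheduled on.
    kubectl apply -f - <<'EOF'
    apiVersion: storage.k8s.io/v1
    kind: VolumeAttachment
    metadata:
      name: csi-b21170ee9729a55ec3e64e6bd4ed0a11ac70ac2272e0e3b7bb3f6fdeac262172
    spec:
      attacher: driver.longhorn.io
      nodeName: <node-running-the-pod>
      source:
        persistentVolumeName: pvc-7d2e2124-4b0c-4d79-890a-fcee02a185a1
    EOF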

To Reproduce

I have not been able to reproduce this yet, unfortunately.

Expected behavior

A pod can successfully mount its storage despite a node reboot in the cluster

Log or Support bundle

longhorn-support-bundle_a8118729-480f-4d38-9b91-26a755d2e0cc_2022-06-28T20-34-47Z.zip

Environment

mantissahz commented 2 years ago

Hi @diamonwiggins (ref: #2629), did the 'FailedMount' happen for a long time before you created the volumeattachment object?

diamonwiggins commented 2 years ago

@mantissahz I can get clarification from the end user if the amount of time is relevant. At the very least, 5 minutes had passed, but it's likely that much more time had passed before the user was assisted with manually creating the volume attachment.

Also worth noting, this customer is on 1.1.2 where #2629 is supposedly fixed. Happy to provide any other information that could help track this down.

PhanLe1010 commented 2 years ago

It could take up to 6 or 7 minutes for Kubernetes to retry creating the volumeattachment object.

How long did the node go down?

diamonwiggins commented 2 years ago

@PhanLe1010 The node went down for only minutes. Maybe 5 minutes or so. However it was a full 24 hours before the user manually created the VolumeAttachment objects.

diamonwiggins commented 2 years ago

Is there any additional information I can gather to assist here?

PhanLe1010 commented 2 years ago

@diamonwiggins I can't figure out why the VA was deleted and never recreated automatically, given that, as you mentioned earlier, the VA removal feature was removed as of Longhorn 1.1.2.

I would suggest upgrading to a newer stable Longhorn version (1.1.3 or 1.2.5) and reporting back if you hit the issue again.

diamonwiggins commented 1 year ago

@PhanLe1010 Understood. We've seen a similar issue with another customer after a reboot. The error is slightly different this time with:

MountVolume.WaitForAttach failed for volume "pvc-e25ec426-043d-496d-9ddd-e4920e8c1096" : volume pvc-e25ec426-043d-496d-9ddd-e4920e8c1096 has GET error for volume attachment csi-845807a0d4e3617baaadf26f975d24db606458cb640455aaac527298e9a2c4bd: volumeattachments.storage.k8s.io "csi-845807a0d4e3617baaadf26f975d24db606458cb640455aaac527298e9a2c4bd" is forbidden: User "system:node:ip-10-0-1-200" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'ip-10-0-1-200' and this object

We've confirmed that node names and IP addresses had not changed, and our customer was able to reproduce this in two separate environments.

longhorn-support-bundle_b27f4748-ec4a-45a1-8d75-04fe278d3584_2022-09-07T18-57-23Z (1).zip

If this warrants a separate GitHub issue, let me know and I'll open one.

Environment

innobead commented 1 year ago

cc @joshimoo

rajivml commented 1 year ago

We see this issue quite often with Longhorn. Today we had another repro where, after a node restart in a multi-node environment, the Alertmanager StatefulSet pods were not able to mount their PVCs even after 30-40 minutes. We see this issue with both Deployments and StatefulSets.

This is happening with Longhorn 1.3.1 as well; this particular repro is on 1.3.1 itself.

Whenever this happens, we scale the workload replicas down to 0 and back up so that the volume attachment flow is triggered again, but this is not an acceptable solution for production workloads.

       {
            "apiVersion": "v1",
            "count": 60,
            "eventTime": null,
            "firstTimestamp": "2022-10-12T06:05:07Z",
            "involvedObject": {
                "apiVersion": "v1",
                "kind": "Pod",
                "name": "alertmanager-rancher-monitoring-alertmanager-1",
                "namespace": "cattle-monitoring-system",
                "resourceVersion": "83537",
                "uid": "2088fca6-b6cb-458f-8297-44fa477b0e81"
            },
            "kind": "Event",
            "lastTimestamp": "2022-10-12T07:50:59Z",
            "message": "MountVolume.WaitForAttach failed for volume \"pvc-84933541-a66d-4ca2-a710-6db17e6643ba\" : volume pvc-84933541-a66d-4ca2-a710-6db17e6643ba has GET error for volume attachment csi-0c400de43ff27c65fa12afab1248675317dbb2b8fc07ae6582df5ce218fa6ff7: volumeattachments.storage.k8s.io \"csi-0c400de43ff27c65fa12afab1248675317dbb2b8fc07ae6582df5ce218fa6ff7\" is forbidden: User \"system:node:server1\" cannot get resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope: no relationship found between node 'server1' and this object",
            "metadata": {
                "creationTimestamp": "2022-10-12T06:05:07Z",
                "name": "alertmanager-rancher-monitoring-alertmanager-1.171d3d2e89354c2e",
                "namespace": "cattle-monitoring-system",
                "resourceVersion": "167371",
                "uid": "1874d80b-43e9-4242-9df3-bc39b68c0cc1"
            },
            "reason": "FailedMount",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "kubelet",
                "host": "server1"
            },
            "type": "Warning"
        },

PhanLe1010 commented 1 year ago

@rajivml

Could you help us troubleshoot by providing the reproducing steps and env information (or provide us an env)?

Environment

rajivml commented 1 year ago

Hi @PhanLe1010

We are seeing it on both single node and multi-node environments

I will share an environment for your offline analysis via DM over slack

Longhorn version: 1.3.1
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2
Number of management node in the cluster: 3 nodes which act as both master + worker
Number of worker node in the cluster: 3 nodes which act as both master + worker
Node config: 32 Core, 128GB RAM
OS type and version: RHEL
CPU per node: 32
Memory per node: 128GB RAM
Disk type (e.g. SSD/NVMe): SSD
Network bandwidth between the nodes: Azure Provided
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Azure Disks
Number of Longhorn volumes in the cluster: Around 20

JoshuaWatt commented 1 year ago

I saw this today also. It was after I upgraded k3s from 1.23.4 -> 1.23.13, but that may be a coincidence.

Specifically, I saw the User "USER" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope error

PhanLe1010 commented 1 year ago

User "USER" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope

This error is not related to this issue. It indicates that the client is missing the RBAC permission. Where did you see that error (from which pods)?
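
A quick way to check whether a given identity has that permission is a kubectl access review. A hedged example (the service account name below is only an assumption; substitute whatever identity actually logged the error):

    # Check whether an identity may GET VolumeAttachments at cluster scope.
    # The service account name here is an assumption, not necessarily the
    # identity that produced the error.
    kubectl auth can-i get volumeattachments.storage.k8s.io \
        --as=system:serviceaccount:longhorn-system:longhorn-service-account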

hedefalk commented 1 year ago

I have the same issue. I have a two-node RPi cluster. Any time the master reboots, I get something like:

  Warning  FailedMount  37s   kubelet            MountVolume.WaitForAttach failed for volume "ghost-db" : volume ghost-db has GET error for volume attachment csi-be2cb4dfc03d99eef9aa0e05cb28e59ac52f0c0c5e832c68d142a2ba76827bdb: volumeattachments.storage.k8s.io "csi-be2cb4dfc03d99eef9aa0e05cb28e59ac52f0c0c5e832c68d142a2ba76827bdb" is forbidden: User "system:node:pi4" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'pi4' and this object
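
The "no relationship found between node ... and this object" part typically means the kubelet's GET was denied because, from the API server's point of view, the VolumeAttachment either no longer exists or is not bound to that node. A quick check (the PV name is taken from the error above; the namespace is a placeholder):

    # Does the VolumeAttachment for this PV exist, and which node is it bound to?
    # The NODE column should match the node where the pod is scheduled.
    kubectl get volumeattachments | grep ghost-db
    kubectl get pod -o wide -n <namespace-of-the-pod>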

simonreddy2001 commented 1 year ago

Hi, I have the same issue.

MountVolume.WaitForAttach failed for volume "pvc-xx" : volume vol-xx has GET error for volume attachment csi-xx: volumeattachments.storage.k8s.io "csi-xx" is forbidden: User "system:node:ip-xx.compute.internal" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'ip-xx.compute.internal' and this object

But we scale the StatefulSet replicas down to 0 and back up so that the volume attachment flow gets triggered again.

Orhideous commented 1 year ago

Also ran into this issue. Can confirm that the workaround suggested by @simonreddy2001 works.

innobead commented 11 months ago

We need a resilient way to recover from this automatically.

cc @derekbit @shuo-wu @PhanLe1010

derekbit commented 11 months ago

@diamonwiggins @hedefalk @simonreddy2001 @Orhideous @rajivml I tried to reproduce the issue using Longhorn v1.3.2 and a StatefulSet with 2 replicas on a 2-node cluster. I rebooted the two nodes repeatedly but still cannot reproduce the issue.
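
For anyone attempting a similar reproduction, the reboot loop can be scripted roughly as below. This is only a sketch, assuming SSH access to the nodes; the node names are placeholders.

    #!/bin/bash
    # Rough reproduction-attempt loop: reboot each node in turn, wait for it
    # to report Ready again, then look for the VolumeAttachment GET error.
    for i in $(seq 1 20); do
        for node in node-1 node-2; do
            ssh "$node" sudo reboot || true
            sleep 60   # give the node time to actually go down
            until kubectl get node "$node" --no-headers | grep -qw Ready; do
                sleep 10
            done
        done
        if kubectl get events -A | grep -q "GET error for volume attachment"; then
            echo "Reproduced on iteration $i"
            break
        fi
    done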

Could you please provide the reproducing steps? If you run into the issue again, could you provide a support bundle as well? Thanks.

derekbit commented 11 months ago

Ref: https://github.com/kubernetes/kubernetes/issues/120571

PhanLe1010 commented 8 months ago

Theoretically, this issue could be very well related to the upstream issue https://github.com/kubernetes/kubernetes/issues/120571.

However, attempting to reproduce using similar steps as in the upstream issue yields no success. The attempted reproducing steps are:

  1. Install Kubernetes v1.25.15+rke2r2/v1.27.5+rke2r1
  2. Install Longhorn v1.5.3 using this longhorn-manager image phanle1010/longhorn-manager:v1.5.3-injected-detach-error. This longhorn-manager image adds logic to artificially inject a detach error into longhorn-csi-plugin in order to simulate a temporary detach failure. The code is at https://github.com/PhanLe1010/longhorn-manager/commit/a29962e468806eb8209ad45b53ec4be204d4266d
  3. Deploy this deployment into the cluster
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: '1'
      generation: 7
      labels:
        workload.user.cattle.io/workloadselector: apps.deployment-default-test-dep
      name: test-dep
      namespace: default
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          workload.user.cattle.io/workloadselector: apps.deployment-default-test-dep
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            workload.user.cattle.io/workloadselector: apps.deployment-default-test-dep
          namespace: default
        spec:
          affinity: {}
          containers:
            - image: ubuntu:xenial
              imagePullPolicy: Always
              name: container-0
              resources: {}
              securityContext:
                allowPrivilegeEscalation: false
                privileged: false
                readOnlyRootFilesystem: false
                runAsNonRoot: false
              stdin: true
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              tty: true
              volumeMounts:
                - mountPath: /mnt
                  name: vol-7rasu
          dnsPolicy: ClusterFirst
          nodeName: phan-v603-pool2-e46dd713-75pnq
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - name: vol-7rasu
              persistentVolumeClaim:
                claimName: test-pvc
  4. Use this script to simulate the reproducing steps in the upstream issue

    #!/bin/bash
    
    set -o errexit
    set -o nounset
    set -o pipefail
    set -x
    
    # Inject detach error
    kubectl -n longhorn-system patch -p '{"value": "26"}' --type=merge lhs storage-minimal-available-percentage
    
    # Run application pod
    kubectl scale deployment test-dep --replicas 1
    kubectl wait --for=condition=available deployment/test-dep
    # Delete the app
    kubectl scale deployment test-dep --replicas 0
    kubectl wait --for=delete pod --selector=workload.user.cattle.io/workloadselector=apps.deployment-default-test-dep
    
    # Wait for detach error
    while true; do
        if kubectl get volumeattachment -o json | grep "Simulated detach error"; then
            break
        fi
        echo "Waiting for volumeAttachment to get error..."
        sleep 1
    done
    
    # Kill KCM 
    kubectl -n kube-system delete pod -l component=kube-controller-manager
    sleep 2
    
    # Start a new KCM
    kubectl -n kube-system wait --for condition=Ready=true pod --selector=component=kube-controller-manager
    
    # there is no way how to wait for KCM to process the volumeattachment...
    sleep 13
    
    # Create a new pod *after* KCM started processing volumeattachments
    kubectl scale deployment test-dep --replicas 1
    sleep 1
    kubectl wait --for condition=PodScheduled=true pod --selector=workload.user.cattle.io/workloadselector=apps.deployment-default-test-dep
    
    # Stop injecting errors to detach
    kubectl -n longhorn-system patch -p '{"value": "25"}' --type=merge lhs storage-minimal-available-percentage
    
    # Now, the second pod should start, but it's stuck at "no relationship found between node '127.0.0.1' and this object"
    
  5. Unfortunately, the end result is that the new pod is always able to come up, so the issue cannot be reproduced.

PhanLe1010 commented 8 months ago

Next action

Even though we are not able to reproduce the upstream issue, from code analysis, I do think that the race condition in the upstream issue COULD be the root cause of this ticket. The upstream issue is fixed in:

Therefore, I think the next step for this ticket would be:

  1. Ask the user to try the fixed Kubernetes versions to see if the issue still persists. (cc @diamonwiggins Could you try to upgrade Kubernetes to the fixed versions?)
  2. Close this ticket.
  3. If the user still hits the issue after upgrading Kubernetes to the fixed versions, we can reopen the ticket.

WDYT @derekbit @innobead @ejweber ?

Workaround:

Additionally, from code analysis, I think the workaround may be to scale down the workload, wait for the workload to be fully terminated, then scale the workload back up again. The kube-controller-manager should then be able to recreate the VolumeAttachment for the new pod.
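
A minimal sketch of that workaround, mirroring the pattern already used in the reproduction script above (the Deployment name and label selector are placeholders):

    # Scale the workload down, wait until its pods are fully terminated, then
    # scale it back up so kube-controller-manager recreates the VolumeAttachment.
    kubectl scale deployment test-dep --replicas 0
    kubectl wait --for=delete pod --selector=<app-label-selector> --timeout=5m
    kubectl scale deployment test-dep --replicas 1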

Hanson-Tsai commented 4 months ago

Hi, I have faced a similar issue on Kubernetes v1.29.

adamcharnock commented 3 months ago

I'm seeing the same behaviour when using Mayastor. In my case I drained the node of Mayastor volumes, restarted the Mayastor pod (openebs-io-engine-xxx), then uncordoned the node for Mayastor volumes. I then noted that some of the StatefulSet pods were stuck in 'ContainerCreating', reporting an I/O error with no further details.

The OpenEBS CSI controller was reporting:

I0511 17:34:46.274399       1 csi_handler.go:234] Error processing "csi31ad7af564f89fe04d71d5cc0e2240ee1f5b73d9da88f3e933c1b26d9f501219": failed to  detach: could not mark as detached: volumeattachments.storage.k8s.io"csi31ad7af564f89fe04d71d5cc0e2240ee1f5b73d9da88f3e933c1b26d9f501219" not found

Scaling down the StatefulSet to 0, then scaling back up resolved the issue.

I know Mayastor is an entirely different project, but I thought this would be helpful information for 1) anyone also googling their way here, and 2) adding information towards the "is this an upstream issue" question.