rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

MountDevice failed log spamming for statefulset volumes after RKE version upgrade #3203

Open bandak2 opened 1 year ago

bandak2 commented 1 year ago

RKE version v1.3.18

Docker version 20.10.17-ce

Operating system and kernel SUSE Linux Enterprise Server 15 SP4 -- Kernel 5.14.21-150400.22-default

Type/provider of hosts VMware VM

cluster.yml file:

nodes:
- address: worker1
  port: "22"
  internal_address: ""
  hostname_override: ""
  user: "root"
  ssh_key: ""
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
  ssh_key_path: ~/.ssh/id_rsa
  docker_socket: /var/run/docker.sock
  role: [worker]
- address: master
  port: "22"
  internal_address: ""
  hostname_override: ""
  user: "root"
  ssh_key: ""
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
  ssh_key_path: ~/.ssh/id_rsa
  docker_socket: /var/run/docker.sock
  role: [controlplane, etcd]
- address: worker2
  port: "22"
  internal_address: ""
  hostname_override: ""
  user: "root"
  ssh_key: ""
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
  ssh_key_path: ~/.ssh/id_rsa
  docker_socket: /var/run/docker.sock
  role: [worker]
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: true
    retention: 24h
    creation: 12h
    backup_config:
      interval_hours: 12
      retention: 6
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args:
      max-pods: 150
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: "calico"
  options:
    canal_flannel_backend_type: vxlan
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: ""
  alpine: ""
  nginx_proxy: ""
  cert_downloader: ""
  kubernetes_services_sidecar: ""
  kubedns: ""
  dnsmasq: ""
  kubedns_sidecar: ""
  kubedns_autoscaler: ""
  coredns: ""
  coredns_autoscaler: ""
  nodelocal: ""
  kubernetes: ""
  flannel: ""
  flannel_cni: ""
  calico_node: ""
  calico_cni: ""
  calico_controllers: ""
  calico_ctl: ""
  calico_flexvol: ""
  canal_node: ""
  canal_cni: ""
  canal_controllers: ""
  canal_flannel: ""
  canal_flexvol: ""
  weave_node: ""
  weave_cni: ""
  pod_infra_container: ""
  ingress: ""
  ingress_backend: ""
  ingress_webhook: ""
  metrics_server: ""
  windows_pod_infra_container: ""
  aci_cni_deploy_container: ""
  aci_host_container: ""
  aci_opflex_container: ""
  aci_mcast_container: ""
  aci_ovs_container: ""
  aci_controller_container: ""
  aci_gbp_server_container: ""
  aci_opflex_server_container: ""
ssh_key_path: ""
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
enable_cri_dockerd: null
kubernetes_version: v1.24.9-rancher1-1
private_registries: []
ingress:
  provider: nginx
  options:
    use-forwarded-headers: "true"
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
  default_ingress_class: null
cluster_name: rke
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: bastion
  port: ""
  user: "root"
  ssh_key: ""
  ssh_key_path: "~/.ssh/id_rsa"
  ssh_cert: ""
  ssh_cert_path: ""
  ignore_proxy_env_vars: false
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null

Steps to Reproduce:

Results:

  "log": "I0316 12:14:58.508866   22179 operation_generator.go:626] \"MountVolume.WaitForAttach succeeded for volume \\\"csivol-d8ed8f41a7\\\" (UniqueName: \\\"kubernetes.io/csi/csi-unity.dellemc.com^csivol-d8ed8f41a7-iSCSI-storagearray-sv_268814\\\") pod \\\"web-0\\\" (UID: \\\"98a7514b-414b-4ddf-ab23-572e606c7244\\\") DevicePath \\\"csi-240ae18dc9f334d896c39711cbf58fd23c51db537b8bd6bcaffc35edf25abe1f\\\"\" pod=\"default/web-0\"\n",
  "stream": "stderr",
  "time": "2023-03-16T12:14:58.513491344Z"
}
{
  "log": "I0316 12:14:59.411402   22179 kubelet.go:2182] \"SyncLoop (probe)\" probe=\"readiness\" status=\"\" pod=\"kube-system/coredns-autoscaler-79dcc864f5-pmwrf\"\n",
  "stream": "stderr",
  "time": "2023-03-16T12:14:59.412760158Z"
}
{
  "log": "I0316 12:14:59.412556   22179 kubelet.go:2182] \"SyncLoop (probe)\" probe=\"readiness\" status=\"ready\" pod=\"kube-system/coredns-autoscaler-79dcc864f5-pmwrf\"\n",
  "stream": "stderr",
  "time": "2023-03-16T12:14:59.412787709Z"
}
{
  "log": "I0316 12:15:00.152054   22179 kubelet.go:2182] \"SyncLoop (probe)\" probe=\"readiness\" status=\"\" pod=\"kube-system/calico-node-vfc9r\"\n",
  "stream": "stderr",
  "time": "2023-03-16T12:15:00.15371283Z"
}
{
  "log": "E0316 12:15:00.286052   22179 csi_attacher.go:344] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc =  runid=34 device already in use and mounted elsewhere. Cannot do private mount\n",
  "stream": "stderr",
  "time": "2023-03-16T12:15:00.286173505Z"
}
{
  "log": "E0316 12:15:00.286292   22179 nestedpendingoperations.go:348] Operation for \"{volumeName:kubernetes.io/csi/csi-unity.dellemc.com^csivol-d8ed8f41a7-iSCSI-storagearray-sv_268814 podName: nodeName:}\" failed. No retries permitted until 2023-03-16 12:15:01.286274144 +0000 UTC m=+15.145119628 (durationBeforeRetry 1s). Error: MountVolume.MountDevice failed for volume \"csivol-d8ed8f41a7\" (UniqueName: \"kubernetes.io/csi/csi-unity.dellemc.com^csivol-d8ed8f41a7-iSCSI-storagearray-sv_268814\") pod \"web-0\" (UID: \"98a7514b-414b-4ddf-ab23-572e606c7244\") : rpc error: code = Internal desc =  runid=34 device already in use and mounted elsewhere. Cannot do private mount\n",
  "stream": "stderr",
  "time": "2023-03-16T12:15:00.28633199Z"
}

SURE-6124

bandak2 commented 1 year ago

Can someone please help us take a look at this? Our driver uses the kubelet path /var/lib/kubelet. The update was done using the command rke up --config ./cluster.yaml after changing kubernetes_version. If any additional info is required, please let us know.
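To spell out the upgrade flow, this is roughly what we ran (a sketch; the config file name and target version are the ones mentioned above):

```bash
# 1. Edit the cluster config and bump the target release, e.g.
#      kubernetes_version: v1.24.9-rancher1-1
# 2. Re-run RKE against the same config to upgrade the cluster in place:
rke up --config ./cluster.yaml
```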

bandak2 commented 1 year ago

Hi, can someone please help with the issue?

bandak2 commented 1 year ago

This is in relation to our Dell CSI storage driver, where an RKE upgrade leads to kubelet spamming mount errors for statefulsets. Our CSI driver is in sync with the CSI spec, and we would like to understand what in the upgrade process is making mount calls for already existing mounts. This is leading to failed-mount error spam in the kubelet logs. If you need additional info, please let us know.

Tejeev commented 1 year ago

Hi @bandak2, I'm working on the internal ticket for this. Could you please attach information on the csi version and any configuration details that might help us reproduce?

Thank you, TJ

bandak2 commented 1 year ago

Hi TJ,

Our Dell csi-unity storage driver implements CSI spec 1.5. The behavior can be reproduced by upgrading the RKE Kubernetes version from 1.22.17 to 1.24.9 with our storage driver installed and statefulsets running with dynamic PVC provisioning. We were running RKE v1.3.18 for these upgrades.

sample statefulset used:

---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:

keerthi652 commented 1 year ago

Hi TJ, Do you have an update on this, or if you need anything from us?

Tejeev commented 1 year ago

Hi Keerthi, my apologies for the lapse in updates. Our engineer is still reviewing the issue and working to reproduce it. To give you a little insight into what we're doing: he's checking whether this can be reproduced in environments with different setups/versions (such as environments upgraded from 1.22 to 1.24, fresh 1.24.9 installs, and even a newer 1.24 release we are preparing). Hopefully this will turn up some information on what might be different on our end, or whether there's a common denominator to the failure states.

I'll see if there is any information we can provide him to speed up this part of the process. If anything comes up that you may think is relevant, please don't hesitate to let us know.

Tejeev commented 1 year ago

I've been reviewing things, and it would help our reproduction attempts if we could double-check the formatting of the statefulset you provided, to keep any misunderstanding on our part from propagating through our tests. Please review my edits below for accuracy:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  - ports:
    port: 80
    name: web
    clusterIP: None
selector:
  app: nginx

---

apiVersion: apps/v1
  kind: StatefulSet
metadata:
  name: web
labels:
  app: nginx
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  minReadySeconds: 10
  template:
    metadata:
      labels:
        app: nginx
      spec:
        containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.8
          ports:
          - containerPort: 80
            name: web
          volumeMounts:
          - name: www
            mountPath: /usr/share/nginx/html
            volumeClaimTemplates:
          - metadata:
            name: www
            spec:
              accessModes: [ "ReadWriteOnce" ]
              resources:
                requests:
                  storage: 10Gi
                storageClassName: unity-iscsi

(Took me a sec to figure out but it looks like the code block syntax for github is three back ticks (```))

bandak2 commented 1 year ago

Thanks for the response, @Tejeev. We use volumeClaimTemplates in the statefulset manifest spec for dynamic PVC generation for the statefulset pods. This generates a PVC using our csi-unity driver, which is then used for the volume mount. I think either way would work for the test.

  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
      storageClassName: unity-iscsi
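For reference, the unity-iscsi class named in storageClassName is backed by our csi-unity driver. A minimal sketch of what such a StorageClass looks like (the provisioner name is the driver name seen in the kubelet logs above; the remaining fields are illustrative and the array-specific parameters are omitted):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: unity-iscsi
# driver name as it appears in the kubelet logs
provisioner: csi-unity.dellemc.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
# parameters (array ID, protocol, etc.) are environment-specific and omitted here
```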

Tejeev commented 1 year ago

Is this correct?

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  - ports:
    port: 80
    name: web
    clusterIP: None
selector:
  app: nginx

---

apiVersion: apps/v1
  kind: StatefulSet
metadata:
  name: web
labels:
  app: nginx
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  minReadySeconds: 10
  template:
    metadata:
      labels:
        app: nginx
      spec:
        containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.8
          ports:
          - containerPort: 80
            name: web
          volumeMounts:
          - name: www
            mountPath: /usr/share/nginx/html
    volumeClaimTemplates:
    - metadata:
        name: www
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
        storageClassName: unity-iscsi

bandak2 commented 1 year ago

Seems like the indentation is a little off for the volumeClaimTemplates. You can use the following, which works for me:

---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  labels:
    app: nginx
spec:
  serviceName: "nginx"
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  minReadySeconds: 10
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
      storageClassName: unity-iscsi

Tejeev commented 1 year ago

Thanks!

Tejeev commented 1 year ago

I'm told we're having some difficulty securing access to the specific server hardware (and, therefore, also the CSI driver) being used here. Could you please help us investigate by trying the following:

  1. See if the same issue occurs on a different distribution of the same Kubernetes version. For example, try the same setup with kubeadm instead of RKE1.
  2. See if the same issue occurs after a fresh installation of K8s v1.24, or if it only happens on upgrade from v1.22 to v1.24.

The key here is to try the above steps on the same hardware with the same CSI drivers, as I think trying it on different hardware with different drivers wouldn't produce useful results for this particular issue.

bandak2 commented 1 year ago

Apologies for the delayed response; it was a long weekend. Responses to the points you raised:

  1. For regular deployments, our configuration is based on kubeadm. We can give the scenario a try there.
  2. It only happens after an upgrade, when the remount is triggered. We are finding that the STS mount is checked and returns success. We've dug into the logs and found that the mount gets retried with a different path once the mount returns success.

We'll get back to you on point 1.

bandak2 commented 1 year ago

We tried the upgrade scenario with RKE v1.3.20 from 1.22 through 1.24, and we hit the issue once we landed on 1.24. It seems the staging path used for NodeStageVolume is different in 1.24 compared to k8s 1.22 and k8s 1.23, when tested with the same version of our driver. Here are the findings:

RKE v1.3.20 with k8s v1.22.17-rancher1-2 driver v2.4.0

NODE_STAGE_VOLUME path is different here
StagingTargetPath=/var/lib/kubelet/plugins/kubernetes.io/csi/pv/csivol-50d0f5ffdd/globalmount

Upgrading to v1.23.16-rancher2-2, the NODE_STAGE_VOLUME path remained the same:

StagingTargetPath:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/csivol-50d0f5ffdd/globalmount
...
...
time="2023-05-17T13:17:22Z" level=debug arrayid=apm00213404195 runid=25 msg="volume already published to target" func="github.com/dell/csi-unity/service.publishVolume()" file="/go/src/csi-unity/service/mount.go:364"

Upgrading from v1.23.16-rancher2-2 to v1.24.13-rancher2-1:

StagingTargetPath=/var/lib/kubelet/plugins/kubernetes.io/csi/csi-unity.dellemc.com/faf27cb97b007ad80dd62a9ffe53c84c03f9ded373e527a41654ae32a31c14c7/globalmount

time="2023-05-17T13:35:41Z" level=info msg="/csi.v1.Node/NodeStageVolume: REP 0073: rpc error: code = Internal desc =  runid=73 device already in use and mounted elsewhere. Cannot do private mount"

We are not sure why the staging path changes once we reach k8s 1.24.
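If it helps the comparison, both staging-path layouts can be inspected directly on a worker node; a quick sketch (the paths are the ones from the logs above):

```bash
# pre-1.24 layout: staging paths keyed by PV name
ls /var/lib/kubelet/plugins/kubernetes.io/csi/pv/*/globalmount 2>/dev/null

# 1.24 layout: staging paths keyed by driver name plus a hashed volume handle
ls /var/lib/kubelet/plugins/kubernetes.io/csi/csi-unity.dellemc.com/*/globalmount 2>/dev/null

# show which staging paths currently hold an active mount
mount | grep globalmount
```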

bfbachmann commented 1 year ago

Hi @bandak2

I've been looking into this issue. As @Tejeev mentioned, it's difficult for us to reproduce because we don't have the required Dell hardware. From going through the code, I understand that this error happens when NodeStageVolume is called on your CSI (unity) driver. I believe it's kubelet that is making this call and passing an unexpected value as the StagingTargetPath. Would you be able to check the kubelet logs to see what it's doing differently in 1.24 vs 1.22/1.23?
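Since RKE1 runs the kubelet as a Docker container, one way to pull those logs on the affected node is something like this (a sketch):

```bash
# on an RKE1 node the kubelet container is named "kubelet";
# filter its logs for CSI staging/mount activity
docker logs kubelet 2>&1 | grep -iE "NodeStageVolume|MountDevice|globalmount"
```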

bandak2 commented 1 year ago

We're trying this on vanilla k8s and will update our findings from our end.

snasovich commented 1 year ago

@bandak2 , just curious if there is any update on it?

bandak2 commented 1 year ago

Hi @snasovich, thanks for checking. Yes, we've tried this on vanilla k8s (deployed using kubeadm), going 1.22 -> 1.23 -> 1.24 with our driver installed, and the issue is not reproducible there. The orchestrator calls NodeUnstageVolume correctly after the upgrade to 1.24 and then remounts with the new updated path via NodeStageVolume calls. I'll share the logs later today after sanitizing them. Just a note: when upgrading k8s the kubeadm way, we cordon and drain the nodes, and we follow the procedure in the Kubernetes guide, upgrading one version at a time: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
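In outline, the per-node flow from that guide is roughly the following (a condensed sketch; the kubeadm/kubelet package upgrades are distro-specific and omitted):

```bash
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# on the first control-plane node, then on the remaining nodes respectively:
kubeadm upgrade apply v1.23.x
kubeadm upgrade node

# restart the upgraded kubelet and put the node back in service
systemctl restart kubelet
kubectl uncordon <node>
```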

bandak2 commented 1 year ago

There are too many logs to sanitize given that we are including the driver logs. We're reaching out to your team via a different channel where we can share those securely, without having to go through sanitizing all of these.

bandak2 commented 1 year ago

The logs for this have been provided through backend channels. Once you have some insights please let us know.

bfbachmann commented 1 year ago

Hi @bandak2. Thanks for the logs. I've spent some time digging through them, and the only meaningful thing I've found so far is, as you mentioned before, the new mount calls to a different path.

Before v1.24, all the mounts use the mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/krv11506-d05ec878ad/globalmount. For example:

mount -t ext4 -o defaults /dev/dm-0 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/krv11506-d05ec878ad/globalmount

After v1.24, I see that there are still mount calls using the same mount point, but there are also new ones that use /var/lib/kubelet/plugins/kubernetes.io/csi/csi-unity.dellemc.com/4bb10b984d5ca1bf16a8c79733866361f7061d665cba703eef3307a3426d3b21/globalmount. For example:

mount -t ext4 -o defaults /dev/dm-1 /var/lib/kubelet/plugins/kubernetes.io/csi/csi-unity.dellemc.com/4bb10b984d5ca1bf16a8c79733866361f7061d665cba703eef3307a3426d3b21/globalmount

I'm not sure whether this is a red herring, though, because these mounts all seem to succeed eventually. I did, however, notice that multipathd seems to be running on your nodes, and I'm wondering if you might be experiencing an issue similar to this one. Is this something you'd be able to check?

bandak2 commented 1 year ago

Thanks @bfbachmann, I'll check the linked issue to see if anything stands out on our end. As for multipath, our multipath config is at /etc/multipath.conf on the node and has the following contents for Kubernetes or RKE:

defaults {
  user_friendly_names yes
  find_multipaths yes
}

bandak2 commented 1 year ago

While we're at it: as you can see in the logs, during the k8s upgrade to 1.24 the older path was unpublished and then re-published with the new path; I believe the CO (container orchestrator) does that. These actions don't seem to happen when upgrading RKE, and the older path remains the StagingPath. I'll check whether there are any hints on the multipath side.
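The first things to look at there will be the device-mapper state on the node (a sketch of the standard tooling):

```bash
multipath -ll            # list multipath maps and the paths behind them
multipathd show maps     # the daemon's view of the active maps
lsblk                    # confirm which dm-* device backs each globalmount
```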

bfbachmann commented 1 year ago

Hi @bandak2, did you find anything interesting on the multipath front?