kubernetes-sigs / azuredisk-csi-driver

Azure Disk CSI Driver

Problems attaching PV when disks are added to multipath group #1717

Closed: lovey89 closed this issue 1 year ago

lovey89 commented 1 year ago

What happened: Note: we have not configured anything related to multipath ourselves. This behavior is seen on an AKS-managed cluster.

We use Kubernetes for CI builds and we attach and detach disks to pods fairly often. Multiple times per day we hit a problem where a PV refuses to mount on the worker node. When the problem happens we see the following event:

MountVolume.MountDevice failed for volume "pvc-dc-chameleon-aosp-runner-pvc-1" : rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32 Mounting command: mount Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.

It complains that /dev/sdg is already mounted or the mount point is busy. Connecting to the node where the pod is trying to start and running lsblk, we see the following output (the sdg lines are probably the interesting part):

NAME     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda        8:0    0   128G  0 disk  
|-sda1     8:1    0 127.9G  0 part  /var/lib/kubelet
|                                   /
|-sda14    8:14   0     4M  0 part  
`-sda15    8:15   0   106M  0 part  /boot/efi
sdb        8:16   0    32G  0 disk  
`-sdb1     8:17   0    32G  0 part  /mnt
sdf        8:80   0   500G  0 disk  /var/lib/kubelet/pods/f4ff239e-968a-4957-93f8-a7722e022ffc/volumes/kubernetes.io~csi/pvc-dc-chameleon-nonhlos-runner-pvc-1/mount
                                    /var/lib/kubelet/pods/f4ff239e-968a-4957-93f8-a7722e022ffc/volumes/kubernetes.io~csi/pvc-dc-chameleon-nonhlos-runner-pvc-1/mount
                                    /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/856d773a2c1c390e6e7a1409ab8dfee330cffb6be53dcfa8e4dbc264659c83b2/globalmount
                                    /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/856d773a2c1c390e6e7a1409ab8dfee330cffb6be53dcfa8e4dbc264659c83b2/globalmount
sdg        8:96   0   500G  0 disk  
`-mpatha 253:0    0   500G  0 mpath 
sdh        8:112  0   500G  0 disk  /var/lib/kubelet/pods/b01de0aa-a408-4da0-987f-8c7f0ec822c3/volumes/kubernetes.io~csi/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb/mount
                                    /var/lib/kubelet/pods/b01de0aa-a408-4da0-987f-8c7f0ec822c3/volumes/kubernetes.io~csi/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb/mount
                                    /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/76057489aa1ba434b13a02879ae65dce8d6f3310a7982a7b151d4d3c1c34091f/globalmount
                                    /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/76057489aa1ba434b13a02879ae65dce8d6f3310a7982a7b151d4d3c1c34091f/globalmount
sr0       11:0    1   750K  0 rom

multipath -ll returns:

mpatha (3600224800e0c957b70e59ee5d2bd2f87) dm-0 Msft,Virtual Disk
size=500G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  `- 1:0:0:5 sdg 8:96 active ready running

/lib/udev/scsi_id -g /dev/sdg returns:

3600224800e0c957b70e59ee5d2bd2f87

cat /etc/multipath/bindings returns:

# Multipath bindings, Version : 1.0
# NOTE: this file is automatically maintained by the multipath program.
# You should not need to edit this file in normal circumstances.
#
# Format:
# alias wwid
#
mpatha 3600224800e0c957b70e59ee5d2bd2f87

cat /etc/multipath/wwids returns:

# Multipath wwids, Version : 1.0
# NOTE: This file is automatically maintained by multipath and multipathd.
# You should not need to edit this file in normal circumstances.
#
# Valid WWIDs:
/3600224800e0c957b70e59ee5d2bd2f87/

cat /etc/multipath.conf returns:

defaults {
    user_friendly_names yes
}

A snippet from journalctl around the time it happened that looks related:

...
...
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: scsi 1:0:0:5: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: Attached scsi generic sg7 type 0
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: [sdg] 1048576000 512-byte logical blocks: (537 GB/500 GiB)
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: [sdg] 4096-byte physical blocks
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: [sdg] Write Protect is off
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: [sdg] Mode Sense: 0f 00 10 00
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: [sdg] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 07 06:58:24 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:5: [sdg] Attached SCSI disk
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG multipathd[232]: mpatha: setting dev_loss_tmo is unsupported for protocol scsi:unspec
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG multipathd[232]: mpatha: addmap [0 1048576000 multipath 0 0 2 1 service-time 0 1 1 8:48 1 service-time 0 1 1 8:96 1]
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: device-mapper: multipath service-time: version 0.3.0 loaded
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG multipathd[232]: sdg [8:96]: path added to devmap mpatha
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: scsi 1:0:0:4: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: Attached scsi generic sg8 type 0
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: [sdh] 1048576000 512-byte logical blocks: (537 GB/500 GiB)
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: [sdh] 4096-byte physical blocks
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: [sdh] Write Protect is off
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: [sdh] Mode Sense: 0f 00 10 00
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: [sdh] Write cache: disabled, read cache: enabled, supports DPO and FUA
Feb 07 06:58:25 aks-icijobsz1-14256148-vmss0002ZG kernel: sd 1:0:0:4: [sdh] Attached SCSI disk
...
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.099708    1860 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-80
0a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") " pod="gitlab/runner-tyq825be-project-4964-conc
urrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.126369    1860 operation_generator.go:1582] "Controller attach succeeded for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/res
ourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") device path: \"\"" pod="gitlab/runner-tyq825be-project-4964-concur
rent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.200308    1860 reconciler.go:269] "operationExecutor.MountVolume started for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/res
ourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") " pod="gitlab/runner-tyq825be-project-4964-concurrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.200764    1860 operation_generator.go:616] "MountVolume.WaitForAttach entering for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f
58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") DevicePath \"\"" pod="gitlab/runner-tyq825be-project-4964-co
ncurrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.211266    1860 operation_generator.go:626] "MountVolume.WaitForAttach succeeded for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2
f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") DevicePath \"csi-b66c066922033f6cc57efaf2746f05732bb99e7c34
1d260ec40fe3ca6126bb07\"" pod="gitlab/runner-tyq825be-project-4964-concurrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: E0207 06:58:30.238719    1860 csi_attacher.go:345] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/
disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting command: mount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: E0207 06:58:30.238934    1860 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeu
rope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1 podName: nodeName:}" failed. No retries permitted until 2023-02-07 06:58:30.738907572 +0000 UTC m=+320.891870666 (durationBeforeRetry 500ms). Error: MountVolume.MountDevice failed for volume "pvc-dc-chameleon-aosp-runner-pvc-1" (UniqueN
ame: "kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1") pod "runner-tyq825be-project-4964-concurrent-0cd8jx" (UID: "1045b38b-f99d-439a-bbc8-9e9dd5bf2974") : rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting command: mount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.703389    1860 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") " pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.717109    1860 operation_generator.go:1582] "Controller attach succeeded for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") device path: \"\"" pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.804526    1860 reconciler.go:269] "operationExecutor.MountVolume started for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") " pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.804608    1860 reconciler.go:269] "operationExecutor.MountVolume started for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") " pod="gitlab/runner-tyq825be-project-4964-concurrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.804952    1860 operation_generator.go:616] "MountVolume.WaitForAttach entering for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") DevicePath \"\"" pod="gitlab/runner-tyq825be-project-4964-concurrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.804962    1860 operation_generator.go:616] "MountVolume.WaitForAttach entering for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") DevicePath \"\"" pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.814493    1860 operation_generator.go:626] "MountVolume.WaitForAttach succeeded for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") DevicePath \"csi-6b7f2f46b58edffe3b8e5e48695be305c37222ec025c263f0c359a9331c91bad\"" pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.817668    1860 operation_generator.go:626] "MountVolume.WaitForAttach succeeded for volume \"pvc-dc-chameleon-aosp-runner-pvc-1\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1\") pod \"runner-tyq825be-project-4964-concurrent-0cd8jx\" (UID: \"1045b38b-f99d-439a-bbc8-9e9dd5bf2974\") DevicePath \"csi-b66c066922033f6cc57efaf2746f05732bb99e7c341d260ec40fe3ca6126bb07\"" pod="gitlab/runner-tyq825be-project-4964-concurrent-0cd8jx"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: E0207 06:58:30.847852    1860 csi_attacher.go:345] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting command: mount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: E0207 06:58:30.848043    1860 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1 podName: nodeName:}" failed. No retries permitted until 2023-02-07 06:58:31.848023116 +0000 UTC m=+322.000986210 (durationBeforeRetry 1s). Error: MountVolume.MountDevice failed for volume "pvc-dc-chameleon-aosp-runner-pvc-1" (UniqueName: "kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1") pod "runner-tyq825be-project-4964-concurrent-0cd8jx" (UID: "1045b38b-f99d-439a-bbc8-9e9dd5bf2974") : rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting command: mount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kernel: EXT4-fs (sdh): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.899681    1860 operation_generator.go:658] "MountVolume.MountDevice succeeded for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/76057489aa1ba434b13a02879ae65dce8d6f3310a7982a7b151d4d3c1c34091f/globalmount\"" pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:30 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:30.918526    1860 operation_generator.go:730] "MountVolume.SetUp succeeded for volume \"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\" (UniqueName: \"kubernetes.io/csi/disk.csi.azure.com^/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb\") pod \"runner-dc8ktqbz-project-4964-concurrent-0sntdb\" (UID: \"b01de0aa-a408-4da0-987f-8c7f0ec822c3\") " pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
Feb 07 06:58:31 aks-icijobsz1-14256148-vmss0002ZG python3[1093]: 2023-02-07T06:58:31.024665Z INFO EnvHandler ExtHandler Set block dev timeout: sdf with timeout: 300
Feb 07 06:58:31 aks-icijobsz1-14256148-vmss0002ZG python3[1093]: 2023-02-07T06:58:31.024893Z INFO EnvHandler ExtHandler Set block dev timeout: sdg with timeout: 300
Feb 07 06:58:31 aks-icijobsz1-14256148-vmss0002ZG python3[1093]: 2023-02-07T06:58:31.025106Z INFO EnvHandler ExtHandler Set block dev timeout: sdh with timeout: 300
Feb 07 06:58:31 aks-icijobsz1-14256148-vmss0002ZG kubelet[1860]: I0207 06:58:31.093659    1860 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="gitlab/runner-dc8ktqbz-project-4964-concurrent-0sntdb"
...

A snippet from the csi-azuredisk-node-xxxxx pod's log around the time it happened that looks related:

I0207 06:58:19.444241       1 utils.go:84] GRPC response: {}
I0207 06:58:30.222924       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0207 06:58:30.222942       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"5"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-dc-chameleon-aosp-runner-pvc-1","csi.storage.k8s.io/pvc/name":"dc-chameleon-aosp-runner-pvc-1","csi.storage.k8s.io/pvc/namespace":"gitlab","requestedsizegib":"500","skuname":"Standard_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1632463390283-8081-disk.csi.azure.com"},"volume_id":"/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1"}
I0207 06:58:30.229644       1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun5 by sdg under /dev/disk/azure/scsi1/
I0207 06:58:30.229679       1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1/lun5. perfProfile none accountType Standard_LRS
I0207 06:58:30.230250       1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun5 and mounting at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount with mount options([])
I0207 06:58:30.230272       1 mount_linux.go:487] Attempting to determine if disk "/dev/disk/azure/scsi1/lun5" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun5])
I0207 06:58:30.232679       1 mount_linux.go:490] Output: "DEVNAME=/dev/disk/azure/scsi1/lun5\nTYPE=ext4\n"
I0207 06:58:30.232694       1 mount_linux.go:376] Checking for issues with fsck on disk: /dev/disk/azure/scsi1/lun5
I0207 06:58:30.236674       1 mount_linux.go:389] `fsck` error fsck from util-linux 2.36.1
/dev/sdg is in use.
e2fsck: Cannot continue, aborting.

I0207 06:58:30.236691       1 mount_linux.go:477] Attempting to mount disk /dev/disk/azure/scsi1/lun5 in ext4 format at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
I0207 06:58:30.236708       1 mount_linux.go:183] Mounting cmd (mount) with arguments (-t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount)
E0207 06:58:30.238438       1 mount_linux.go:195] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.

E0207 06:58:30.238477       1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.
I0207 06:58:30.827074       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0207 06:58:30.827094       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"4"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/76057489aa1ba434b13a02879ae65dce8d6f3310a7982a7b151d4d3c1c34091f/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb","csi.storage.k8s.io/pvc/name":"digital-cockpit-nonhlos-runner-pvc-2","csi.storage.k8s.io/pvc/namespace":"gitlab","requestedsizegib":"500","skuname":"StandardSSD_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1644050947020-8081-disk.csi.azure.com"},"volume_id":"/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-0b59f8a2-7fe0-49c4-8822-87ff9326cdcb"}
I0207 06:58:30.829636       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0207 06:58:30.829649       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"5"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-dc-chameleon-aosp-runner-pvc-1","csi.storage.k8s.io/pvc/name":"dc-chameleon-aosp-runner-pvc-1","csi.storage.k8s.io/pvc/namespace":"gitlab","requestedsizegib":"500","skuname":"Standard_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1632463390283-8081-disk.csi.azure.com"},"volume_id":"/subscriptions/c9b52f6a-7de9-47d1-800a-bd0692be2f58/resourceGroups/mc_rg-kubernetes-live-euwest-001_k8s-got-live_westeurope/providers/Microsoft.Compute/disks/pvc-dc-chameleon-aosp-runner-pvc-1"}
I0207 06:58:30.835088       1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun4 by sdh under /dev/disk/azure/scsi1/
I0207 06:58:30.835106       1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1/lun4. perfProfile none accountType StandardSSD_LRS
I0207 06:58:30.835405       1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun4 and mounting at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/76057489aa1ba434b13a02879ae65dce8d6f3310a7982a7b151d4d3c1c34091f/globalmount with mount options([])
I0207 06:58:30.835422       1 mount_linux.go:487] Attempting to determine if disk "/dev/disk/azure/scsi1/lun4" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun4])
I0207 06:58:30.838591       1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun5 by sdg under /dev/disk/azure/scsi1/
I0207 06:58:30.838608       1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1/lun5. perfProfile none accountType Standard_LRS
I0207 06:58:30.838937       1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun5 and mounting at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount with mount options([])
I0207 06:58:30.838950       1 mount_linux.go:487] Attempting to determine if disk "/dev/disk/azure/scsi1/lun5" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun5])
I0207 06:58:30.841304       1 mount_linux.go:490] Output: "DEVNAME=/dev/disk/azure/scsi1/lun5\nTYPE=ext4\n"
I0207 06:58:30.841327       1 mount_linux.go:376] Checking for issues with fsck on disk: /dev/disk/azure/scsi1/lun5
I0207 06:58:30.841304       1 mount_linux.go:490] Output: "DEVNAME=/dev/disk/azure/scsi1/lun4\nTYPE=ext4\n"
I0207 06:58:30.841426       1 mount_linux.go:376] Checking for issues with fsck on disk: /dev/disk/azure/scsi1/lun4
I0207 06:58:30.845500       1 mount_linux.go:389] `fsck` error fsck from util-linux 2.36.1
/dev/sdg is in use.
e2fsck: Cannot continue, aborting.

I0207 06:58:30.845521       1 mount_linux.go:477] Attempting to mount disk /dev/disk/azure/scsi1/lun5 in ext4 format at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
I0207 06:58:30.845536       1 mount_linux.go:183] Mounting cmd (mount) with arguments (-t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount)
E0207 06:58:30.847580       1 mount_linux.go:195] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.

E0207 06:58:30.847611       1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure/scsi1/lun5(lun: 5), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount, failed with mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/disk/azure/scsi1/lun5 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ef0b8a82acb31a16bec8d33eb717e94fa7ff016d4c34f3f9a56734744ffe200c/globalmount: /dev/sdg already mounted or mount point busy.

What you expected to happen: We expected the PV to be mounted to the pod successfully.

How to reproduce it: I'm not totally sure, but multipath seems to be used when two disks have the same wwid. So maybe try to attach two PVs where the underlying Azure Disks have the same wwid? In the example above it looks like there was only one disk in the multipath group, but we also see occurrences where there are two disks in the group.
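If that guess is right, one way to look for candidates (a sketch using the same scsi_id call as above; the loop and its output format are only illustrative) is to compare the reported WWIDs of all attached data disks and see whether two of them collide:

for d in /dev/disk/azure/scsi1/lun*; do
  dev=$(readlink -f "$d")
  # print the symlink, the resolved block device, and its SCSI WWID
  printf '%s -> %s  %s\n' "$d" "$dev" "$(/lib/udev/scsi_id -g "$dev")"
done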

Anything else we need to know?: Running multipath -f mpatha (replace mpatha with whatever was returned by multipath -ll) resolves the problem. The pod will then start and work as expected, but when a new pod is started later the problem may come back.
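Roughly, the recovery sequence on an affected node looks like this (a sketch; mpatha and sdg are just the names from this particular occurrence):

# see which multipath map has claimed the device named in the mount error
lsblk /dev/sdg
multipath -ll

# flush that map; the raw device is then free to be mounted by the driver
sudo multipath -f mpatha

# verify that the mpath child is gone
lsblk /dev/sdg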

We think this problem started after we updated our cluster from 1.24.x to 1.25.4.

Environment:

andyzhangx commented 1 year ago

Not sure why multipath is configured when the driver has attached a disk. I think if you apt remove multipath-tools it should work, but we still need to find out why multipath is configured when a data disk is attached.

lovey89 commented 1 year ago

Glad to hear that you also find it strange :) Today I tried to disable multipath with systemctl stop multipathd on each node to see if I get a more stable environment. I will let it run like that tomorrow as well to see if that helps (and will report the result here). Unfortunately I will have to do this every time my node pool scales up, as I'm not aware of a way to change the VM image for the nodes running Kubernetes (since we're using AKS). Otherwise I will also try to remove multipath-tools as you suggested.

Please tell me if I can provide any more information. I'm by no means a multipath expert so I just tried to collect everything I thought could be relevant.

andyzhangx commented 1 year ago

@lovey89 I am not a multipath expert either; multipath is not enabled on my test AKS 22.04 node. @alexeldeib do you happen to know whether there is a multipath config change in AgentBaker?

lovey89 commented 1 year ago

So systemctl stop multipathd didn't help, as it looks like the service was started again when a new disk was attached. But apt remove multipath-tools works better; at least I haven't seen any problems for some time now. Thanks!
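For what it's worth, the TriggeredBy: multipathd.socket line in the status output further down suggests why a plain stop doesn't stick: multipathd is socket-activated, so a later uevent can start it again. A hedged sketch of making the stop persistent on a node (until the instance is reimaged) is to disable and mask both units:

sudo systemctl disable --now multipathd.service multipathd.socket
sudo systemctl mask multipathd.service multipathd.socket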

I have another cluster for testing purposes and I see that multipathd is running in that cluster as well (same config as the one I mentioned above).

$ systemctl status multipathd
● multipathd.service - Device-Mapper Multipath Device Controller
     Loaded: loaded (/lib/systemd/system/multipathd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-01-24 11:14:43 UTC; 2 weeks 0 days ago
TriggeredBy: ● multipathd.socket
   Main PID: 221 (multipathd)
     Status: "up"
      Tasks: 7
     Memory: 21.9M
        CPU: 1min 41.767s
     CGroup: /system.slice/multipathd.service
             └─221 /sbin/multipathd -d -s

andyzhangx commented 1 year ago

I think this issue is already resolved in 5.15.0-1035-azure. If you create a new node pool, the default /etc/multipath.conf won't add data disks to a multipath group:

# uname -a
Linux aks-agentpool-27648854-vmss000002 5.15.0-1033-azure

# cat /etc/multipath.conf
defaults {
    user_friendly_names yes
}

This is the original /etc/multipath.conf (5.15.0-1033-azure) config, which adds data disks to a multipath group; I think Canonical has made some change in the upstream Ubuntu 22.04 image:

# uname -a
Linux aks-agentpool-38629629-vmss000000 5.15.0-1035-azure

# cat /etc/multipath.conf
defaults {
    user_friendly_names yes
    find_multipaths     no
}
devices {
    device {
        fast_io_fail_tmo     5
        vendor               "Nimble"
        failback             immediate
        dev_loss_tmo         infinity
        hardware_handler     "1 alua"
        product              "Server"
        prio                 alua
        path_selector        "service-time 0"
        path_checker         tur
        path_grouping_policy group_by_prio
        no_path_retry        30
    }
    device {
        prio                 alua
        path_selector        "round-robin 0"
        vendor               "3PARdata"
        checker              tur
        product              "VV"
        hardware_handler     "1 alua"
        no_path_retry        18
        path_grouping_policy group_by_prio
        failback             immediate
        getuid_callout       "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        rr_min_io            100
        path_checker         tur
        features             "0"
    }
}
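One way to check which of these behaviors a given node actually has (assuming multipath-tools is still installed) is to dump the effective configuration that multipathd computed, since compiled-in defaults fill in anything the file omits:

multipathd show config | grep -E 'find_multipaths|user_friendly_names'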

andyzhangx commented 1 year ago

For the old node pools with the broken multipath.conf config, you can write a DaemonSet that runs apt remove multipath-tools on each node as a workaround (a sketch follows below).
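A minimal sketch of such a DaemonSet, assuming Ubuntu-based nodes that allow privileged pods; the name, namespace, and image are placeholders:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: remove-multipath-tools
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: remove-multipath-tools
  template:
    metadata:
      labels:
        app: remove-multipath-tools
    spec:
      hostPID: true
      tolerations:
      - operator: Exists
      containers:
      - name: remove-multipath-tools
        image: ubuntu:22.04
        securityContext:
          privileged: true
        command:
        - /bin/sh
        - -c
        - |
          # flush any existing multipath maps, then remove the package from the host
          # by entering the host mount namespace via PID 1; keep the pod running afterwards
          nsenter -t 1 -m -u -i -n -- sh -c 'multipath -F || true; DEBIAN_FRONTEND=noninteractive apt-get remove -y multipath-tools' || true
          sleep infinity
EOF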

admincasper commented 1 year ago

@andyzhangx @lovey89 Hi, I am getting a similar error to this one, but running apt remove multipath-tools does nothing for me.

MountVolume.MountDevice failed for volume "pvc-dd6e690b-d41e-40e5-a338-39b57cfa7cd5" : rpc error: code = 
Internal desc = could not format /dev/disk/azure/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io
/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, 
failed with format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io
/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") 
options:("defaults") errcode:(executable file not found in $PATH) output:()

I get no logs from csi-azuredisk-node-xxxxx:

Defaulted container "liveness-probe" out of: liveness-probe, node-driver-registrar, azuredisk
I1113 12:38:52.992442       1 main.go:149] calling CSI driver to discover driver name
I1113 12:38:52.998085       1 main.go:155] CSI driver name: "disk.csi.azure.com"
I1113 12:38:52.998105       1 main.go:183] ServeMux listening at "0.0.0.0:29603"

I've been troubleshooting for days now. PVCs in StatefulSets in other environments work, and the other environments have the same CSI image versions, node types, node kernel versions, etc., so I don't even know what I should test next.

andyzhangx commented 1 year ago

Could you run kubectl logs csi-azuredisk-node-xxxxx -n kube-system -c azuredisk to get the driver logs? @admincasper

And what's the node OS?

admincasper commented 1 year ago

Thanks for the reply @andyzhangx! Of course I should've checked the correct container's logs! The output might repeat itself:

I1113 13:04:12.262639       1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under 
/dev/disk/azure/scsi1/
I1113 13:04:12.262694       1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1
/lun0. perfProfile none accountType Premium_LRS
I1113 13:04:12.263077       1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at
 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount with mount options([])
I1113 13:04:12.263098       1 mount_linux.go:567] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted
 using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I1113 13:04:12.269318       1 mount_linux.go:570] Output: ""
I1113 13:04:12.269340       1 mount_linux.go:529] Disk "/dev/disk/azure/scsi1/lun0" appears to be unformatted, attempting
 to format as type: "ext4" with options: [-F -m0 /dev/disk/azure/scsi1/lun0]
E1113 13:04:12.270405       1 mount_linux.go:535] format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:
("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults") 
errcode:(executable file not found in $PATH) output:() 
E1113 13:04:12.270433       1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure
/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, failed with format of disk 
"/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults") 
errcode:(executable file not found in $PATH) output:() 
I1113 13:06:14.301671       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I1113 13:06:14.301690       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var
/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount","volume_capability":
{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-
dd6e690b-d41e-40e5-a338-39b57cfa7cd5","csi.storage.k8s.io/pvc/name":"statefulset-storage-nikolaitesterting-
0","csi.storage.k8s.io/pvc/namespace":"nikolaitesterting","kind":"Managed","requestedsizegib":"10","skuname":"Premium_LRS","storage.kuber
netes.io/csiProvisionerIdentity":"1698199537920-3303-disk.csi.azure.com"},"volume_id":"//providers
/Microsoft.Compute/disks/pvc-dd6e690b-d41e-40e5-a338-39b57cfa7cd5"}
I1113 13:06:15.378608       1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under 
/dev/disk/azure/scsi1/
I1113 13:06:15.378640       1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1
/lun0. perfProfile none accountType Premium_LRS
I1113 13:06:15.379038       1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at
 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount with mount options([])
I1113 13:06:15.379058       1 mount_linux.go:567] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted
 using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I1113 13:06:15.384951       1 mount_linux.go:570] Output: ""
I1113 13:06:15.384968       1 mount_linux.go:529] Disk "/dev/disk/azure/scsi1/lun0" appears to be unformatted, attempting
 to format as type: "ext4" with options: [-F -m0 /dev/disk/azure/scsi1/lun0]
E1113 13:06:15.385822       1 mount_linux.go:535] format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:
("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults") 
errcode:(executable file not found in $PATH) output:() 
E1113 13:06:15.385845       1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure
/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, failed with format of disk
 "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults") 
errcode:(executable file not found in $PATH) output:() 
I1113 13:08:17.490909       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I1113 13:08:17.490930       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var
/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount","volume_capability":
{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-
dd6e690b-d41e-40e5-a338-39b57cfa7cd5","csi.storage.k8s.io/pvc/name":"statefulset-storage-nikolaitesterting-
0","csi.storage.k8s.io/pvc/namespace":"nikolaitesterting","kind":"Managed","requestedsizegib":"10","skuname":"Premium_LRS","storage.kuber
netes.io/csiProvisionerIdentity":"1698199537920-3303-disk.csi.azure.com"},"volume_id":"//providers
/Microsoft.Compute/disks/pvc-dd6e690b-d41e-40e5-a338-39b57cfa7cd5"}
I1113 13:08:18.594676       1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under 
/dev/disk/azure/scsi1/
I1113 13:08:18.594715       1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1
/lun0. perfProfile none accountType Premium_LRS
I1113 13:08:18.595149       1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at
 /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount with mount options([])
I1113 13:08:18.595175       1 mount_linux.go:567] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted
 using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I1113 13:08:18.601230       1 mount_linux.go:570] Output: ""
I1113 13:08:18.601252       1 mount_linux.go:529] Disk "/dev/disk/azure/scsi1/lun0" appears to be unformatted, attempting 
to format as type: "ext4" with options: [-F -m0 /dev/disk/azure/scsi1/lun0]
E1113 13:08:18.602255       1 mount_linux.go:535] format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:
("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults") 
errcode:(executable file not found in $PATH) output:() 
E1113 13:08:18.602284       1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure
/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, failed with format of disk 
"/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults") 
errcode:(executable file not found in $PATH) output:()

The node OS is Linux.

andyzhangx commented 1 year ago

What are the details of the Linux OS? Ubuntu, CoreOS, etc.? Have you installed the mkfs command on the node? The error says formatting the disk failed with executable file not found in $PATH.

admincasper commented 1 year ago

@andyzhangx

node-image-version=AKSUbuntu-2204gen2containerd-202310.19.2 arch=amd64.

I haven't installed anything, it is out-of-the-box AKS.

mkfs is installed on the nodes. When I run mkfs inside the node I get:

[stderr]
mkfs: no device specified
Try 'mkfs --help' for more information.

andyzhangx commented 1 year ago

@admincasper What's your AKS version and region? What's the output of mkfs.ext4 on the node?

andyzhangx commented 1 year ago

And what's the output of kubectl get no -o wide?
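One more check that could help (a sketch; the pod name below is a placeholder): the format command is executed from inside the azuredisk node container rather than in a host shell, so the executable file not found in $PATH error can also be reproduced or ruled out there:

# list the node-plugin pods, then exec into the one on the affected node
kubectl get pods -n kube-system -o wide | grep csi-azuredisk-node
kubectl exec -n kube-system csi-azuredisk-node-xxxxx -c azuredisk -- mkfs.ext4 -V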

admincasper commented 1 year ago


@andyzhangx AKS 1.27.3, Westeurope. It works in the Test environment but not in the Dev environment, even though the AKS versions are the same.

mkfs.ext4: [stderr]
Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-C cluster-size]
    [-i bytes-per-inode] [-I inode-size] [-J journal-options]
    [-G flex-group-size] [-N number-of-inodes] [-d root-directory]
    [-m reserved-blocks-percentage] [-o creator-os]
    [-g blocks-per-group] [-L volume-label] [-M last-mounted-directory]
    [-O feature[,...]] [-r fs-revision] [-E extended-option[,...]]
    [-t fs-type] [-T usage-type ] [-U UUID] [-e errors_behavior][-z undo_file]
    [-jnqvDFSV] device [blocks-count]

kubectl get nodes: (output was attached as a screenshot and is not reproduced here)

andyzhangx commented 1 year ago

@admincasper That's the same AKS version and kernel version (5.15.0-1051-azure) in the same region? The test cluster works but the dev cluster is broken?

admincasper commented 1 year ago

Yes. We tested this yesterday, and we also tried upgrading Dev to the latest version, but it didn't fix the error.

admincasper commented 1 year ago

@andyzhangx This only applies to StatefulSets; Deployments work for some reason. We tried a minimal StatefulSet but it still wouldn't work.

I will double-check whether the kernel version makes any difference.

admincasper commented 1 year ago

@andyzhangx The kernel version makes no difference. It works in another cluster in the same region with the same kernel version.

admincasper commented 1 year ago

@andyzhangx @lovey89 Any clues as to how I can troubleshoot this issue further?

admincasper commented 1 year ago

Fixed it. We had the Twistlock Prisma Cloud client (a DaemonSet) installed in another namespace, which broke everything.

admincasper commented 1 year ago

The issue was caused by Kyverno restricting the Twistlock namespace, which resulted in Twistlock breaking StatefulSets and PVC binding.