Not sure why multipath is configured when the driver has attached a disk. I think if you apt remove multipath-tools it should work, but we still need to find out why multipath is configured when a data disk is attached.
Glad to hear that you also find it strange :) Today I tried to disable multipath with systemctl stop multipathd on each node to see if I get a more stable environment. I will let it run like that tomorrow as well to see if that helps (and will report the result here). Unfortunately I will have to do this every time my node pool scales up, as I'm not aware of a way to change the VM image for the nodes running Kubernetes (since we're using AKS). Otherwise I will also try to remove multipath-tools as you suggested.
Please tell me if I can provide any more information. I'm by no means a multipath expert so I just tried to collect everything I thought could be relevant.
@lovey89 I am not a multipath expert either; multipath is not enabled on my testing AKS 22.04 node.
@alexeldeib do you happen to know whether there is a multipath config change in AgentBaker?
So systemctl stop multipathd didn't help, as it looks like the service was started when a new disk was attached. But apt remove multipath-tools works better. At least I haven't seen any problems for some time now. Thanks!
I have another cluster for testing purposes and I see that multipathd is running in that cluster as well (same config as the one I mentioned above).
$ systemctl status multipathd
● multipathd.service - Device-Mapper Multipath Device Controller
Loaded: loaded (/lib/systemd/system/multipathd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-01-24 11:14:43 UTC; 2 weeks 0 days ago
TriggeredBy: ● multipathd.socket
Main PID: 221 (multipathd)
Status: "up"
Tasks: 7
Memory: 21.9M
CPU: 1min 41.767s
CGroup: /system.slice/multipathd.service
└─221 /sbin/multipathd -d -s
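For reference, the TriggeredBy: ● multipathd.socket line above is probably why a plain systemctl stop multipathd doesn't stick: the socket unit can re-activate the service on the next device event. A minimal sketch (not verified in this thread) of stopping both units on a node:

# stop and disable both the service and the socket that re-activates it
sudo systemctl disable --now multipathd.service multipathd.socket
# optionally mask them so nothing can start them again until they are unmasked
sudo systemctl mask multipathd.service multipathd.socket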
I think this issue is already resolved in 5.15.0-1035-azure. If you create a new node pool, the default /etc/multipath.conf won't add the data disk to a multipath group:
# uname -a
Linux aks-agentpool-27648854-vmss000002 5.15.0-1033-azure
# cat /etc/multipath.conf
defaults {
    user_friendly_names yes
}
The original /etc/multipath.conf config (5.15.0-1033-azure), which adds the data disk to a multipath group; I think Canonical has made some change in the upstream Ubuntu 22.04 image:
# uname -a
Linux aks-agentpool-38629629-vmss000000 5.15.0-1035-azure
# cat /etc/multipath.conf
defaults {
    user_friendly_names yes
    find_multipaths no
}
devices {
    device {
        fast_io_fail_tmo 5
        vendor "Nimble"
        failback immediate
        dev_loss_tmo infinity
        hardware_handler "1 alua"
        product "Server"
        prio alua
        path_selector "service-time 0"
        path_checker tur
        path_grouping_policy group_by_prio
        no_path_retry 30
    }
    device {
        prio alua
        path_selector "round-robin 0"
        vendor "3PARdata"
        checker tur
        product "VV"
        hardware_handler "1 alua"
        no_path_retry 18
        path_grouping_policy group_by_prio
        failback immediate
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        rr_min_io 100
        path_checker tur
        features "0"
    }
}
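If removing multipath-tools is not desirable, another possible mitigation (not something verified in this thread) is to blacklist the Azure virtual disks in /etc/multipath.conf so multipath never claims the data disks. Azure/Hyper-V disks usually report vendor Msft and product Virtual Disk, but double-check the strings on your own nodes first (e.g. with lsblk -o NAME,VENDOR,MODEL):

blacklist {
    device {
        vendor "Msft"
        product "Virtual Disk"
    }
}

After editing the file, run multipathd reconfigure (or restart multipathd) so the new blacklist takes effect.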
For the old node pool with the broken multipath.conf config, write a DaemonSet that runs apt remove multipath-tools as a workaround.
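A rough sketch of such a DaemonSet, assuming Ubuntu-based nodes and that privileged pods are allowed; the name, namespace and image are just illustrative:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: remove-multipath-tools
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: remove-multipath-tools
  template:
    metadata:
      labels:
        app: remove-multipath-tools
    spec:
      hostPID: true
      containers:
      - name: remove-multipath-tools
        image: ubuntu:22.04   # any image that ships nsenter (util-linux) works
        securityContext:
          privileged: true
        # enter the host namespaces via PID 1 and remove the package there,
        # then sleep so the DaemonSet pod does not restart in a loop
        command:
        - nsenter
        - --target=1
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - "--"
        - sh
        - -c
        - apt-get remove -y multipath-tools || true; sleep infinity

New nodes added by a scale-up then get the same treatment automatically, which addresses the earlier point about node pools scaling up.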
@andyzhangx @lovey89 Hi, I am getting a similar error to this one, but running apt remove multipath-tools does nothing for me.
MountVolume.MountDevice failed for volume "pvc-dd6e690b-d41e-40e5-a338-39b57cfa7cd5" : rpc error: code =
Internal desc = could not format /dev/disk/azure/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io
/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount,
failed with format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io
/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount")
options:("defaults") errcode:(executable file not found in $PATH) output:()
I get no logs from csi-azuredisk-node-xxxxx:
Defaulted container "liveness-probe" out of: liveness-probe, node-driver-registrar, azuredisk
I1113 12:38:52.992442 1 main.go:149] calling CSI driver to discover driver name
I1113 12:38:52.998085 1 main.go:155] CSI driver name: "disk.csi.azure.com"
I1113 12:38:52.998105 1 main.go:183] ServeMux listening at "0.0.0.0:29603"
I've been troubleshooting for days now. PVCs in StatefulSets in other environments work. The other environments have the same CSI image versions, node types, node kernel versions, etc. So I don't even know what I should test next.
Could you run kubectl logs csi-azuredisk-node-xxxxx -n kube-system -c azuredisk to get the driver logs? @admincasper And what's the node OS?
Thanks for the reply @andyzhangx! Of course I should've checked the correct container's logs! I might repeat myself:
I1113 13:04:12.262639 1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under
/dev/disk/azure/scsi1/
I1113 13:04:12.262694 1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1
/lun0. perfProfile none accountType Premium_LRS
I1113 13:04:12.263077 1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at
/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount with mount options([])
I1113 13:04:12.263098 1 mount_linux.go:567] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted
using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I1113 13:04:12.269318 1 mount_linux.go:570] Output: ""
I1113 13:04:12.269340 1 mount_linux.go:529] Disk "/dev/disk/azure/scsi1/lun0" appears to be unformatted, attempting
to format as type: "ext4" with options: [-F -m0 /dev/disk/azure/scsi1/lun0]
E1113 13:04:12.270405 1 mount_linux.go:535] format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:
("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults")
errcode:(executable file not found in $PATH) output:()
E1113 13:04:12.270433 1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure
/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, failed with format of disk
"/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults")
errcode:(executable file not found in $PATH) output:()
I1113 13:06:14.301671 1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I1113 13:06:14.301690 1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var
/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount","volume_capability":
{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-
dd6e690b-d41e-40e5-a338-39b57cfa7cd5","csi.storage.k8s.io/pvc/name":"statefulset-storage-nikolaitesterting-
0","csi.storage.k8s.io/pvc/namespace":"nikolaitesterting","kind":"Managed","requestedsizegib":"10","skuname":"Premium_LRS","storage.kuber
netes.io/csiProvisionerIdentity":"1698199537920-3303-disk.csi.azure.com"},"volume_id":"//providers
/Microsoft.Compute/disks/pvc-dd6e690b-d41e-40e5-a338-39b57cfa7cd5"}
I1113 13:06:15.378608 1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under
/dev/disk/azure/scsi1/
I1113 13:06:15.378640 1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1
/lun0. perfProfile none accountType Premium_LRS
I1113 13:06:15.379038 1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at
/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount with mount options([])
I1113 13:06:15.379058 1 mount_linux.go:567] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted
using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I1113 13:06:15.384951 1 mount_linux.go:570] Output: ""
I1113 13:06:15.384968 1 mount_linux.go:529] Disk "/dev/disk/azure/scsi1/lun0" appears to be unformatted, attempting
to format as type: "ext4" with options: [-F -m0 /dev/disk/azure/scsi1/lun0]
E1113 13:06:15.385822 1 mount_linux.go:535] format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:
("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults")
errcode:(executable file not found in $PATH) output:()
E1113 13:06:15.385845 1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure
/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, failed with format of disk
"/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults")
errcode:(executable file not found in $PATH) output:()
I1113 13:08:17.490909 1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I1113 13:08:17.490930 1 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/var
/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount","volume_capability":
{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-
dd6e690b-d41e-40e5-a338-39b57cfa7cd5","csi.storage.k8s.io/pvc/name":"statefulset-storage-nikolaitesterting-
0","csi.storage.k8s.io/pvc/namespace":"nikolaitesterting","kind":"Managed","requestedsizegib":"10","skuname":"Premium_LRS","storage.kuber
netes.io/csiProvisionerIdentity":"1698199537920-3303-disk.csi.azure.com"},"volume_id":"//providers
/Microsoft.Compute/disks/pvc-dd6e690b-d41e-40e5-a338-39b57cfa7cd5"}
I1113 13:08:18.594676 1 azure_common_linux.go:185] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under
/dev/disk/azure/scsi1/
I1113 13:08:18.594715 1 nodeserver.go:116] NodeStageVolume: perf optimization is disabled for /dev/disk/azure/scsi1
/lun0. perfProfile none accountType Premium_LRS
I1113 13:08:18.595149 1 nodeserver.go:157] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at
/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount with mount options([])
I1113 13:08:18.595175 1 mount_linux.go:567] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted
using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I1113 13:08:18.601230 1 mount_linux.go:570] Output: ""
I1113 13:08:18.601252 1 mount_linux.go:529] Disk "/dev/disk/azure/scsi1/lun0" appears to be unformatted, attempting
to format as type: "ext4" with options: [-F -m0 /dev/disk/azure/scsi1/lun0]
E1113 13:08:18.602255 1 mount_linux.go:535] format of disk "/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:
("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults")
errcode:(executable file not found in $PATH) output:()
E1113 13:08:18.602284 1 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format /dev/disk/azure
/scsi1/lun0(lun: 0), and mount it at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount, failed with format of disk
"/dev/disk/azure/scsi1/lun0" failed: type:("ext4") target:("/var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com
/da3dad2f6eb8873d623943acd4aeb4a22b7e2100e09d21ba88e8c754ac85fd5f/globalmount") options:("defaults")
errcode:(executable file not found in $PATH) output:()
Node OS is Linux.
What are the details of the Linux OS? Ubuntu, CoreOS, etc.?
Have you installed the mkfs command on the node? The error says the disk format failed with executable file not found in $PATH.
@andyzhangx
node-image-version=AKSUbuntu-2204gen2containerd-202310.19.2, arch=amd64.
I haven't installed anything; it is out-of-the-box AKS.
mkfs is installed on the nodes. When I run mkfs inside the node I get:
[stderr]
mkfs: no device specified
Try 'mkfs --help' for more information.
@admincasper what's your AKS version and region? What's the output of mkfs.ext4 on the node? And what's the output of kubectl get no -o wide?
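One way to check that without SSH, assuming kubectl debug node access is allowed (the node name is just an example):

# start a debug pod on the node and look for mkfs.ext4 on the host filesystem (mounted at /host)
kubectl debug node/aks-nodepool1-12345678-vmss000000 -it --image=ubuntu:22.04 -- chroot /host sh -c 'command -v mkfs.ext4 && mkfs.ext4 -V'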
@andyzhangx AKS 1.27.3, Westeurope. It works in the Test environment but not in Dev. Same AKS versions.
mkfs.ext4 [stderr]:
Usage: mkfs.ext4 [-c|-l filename] [-b block-size] [-C cluster-size]
    [-i bytes-per-inode] [-I inode-size] [-J journal-options]
    [-G flex-group-size] [-N number-of-inodes] [-d root-directory]
    [-m reserved-blocks-percentage] [-o creator-os]
    [-g blocks-per-group] [-L volume-label] [-M last-mounted-directory]
    [-O feature[,...]] [-r fs-revision] [-E extended-option[,...]]
    [-t fs-type] [-T usage-type ] [-U UUID] [-e errors_behavior][-z undo_file]
    [-jnqvDFSV] device [blocks-count]
kubectl get nodes:
@admincasper that's the same AKS version and kernel version 5.15.0-1051-azure in the same region? The Test cluster works but the Dev cluster is broken?
Yes. We tested this yesterday, and also tried to upgrade Dev to the latest version, but it didn't fix the error.
@andyzhangx This only applies to StatefulSets; Deployments work for some reason. We tried a minimal StatefulSet but it still wouldn't work.
I will double-check whether the kernel version makes any difference.
@andyzhangx The kernel version makes no difference. It works in another cluster in the same region with the same kernel version.
@andyzhangx @lovey89 Any clues as to how I can troubleshoot this issue further?
Fixed it. We had the Twistlock Prisma Cloud client (DaemonSet) installed in another namespace, which F'd everything up.
The issue was caused by Kyverno restricting the Twistlock namespace, which resulted in Twistlock breaking StatefulSets and PVC binding.
What happened: Note: We have not configured anything about multipath ourselves. This behavior is seen on an AKS managed cluster.
We use Kubernetes for CI builds and we attach and detach disks to pods fairly often. Multiple times per day we hit a problem where the PV refuses to mount on the worker node. When the problem happens we see the following event:
It complains that /dev/sdg is already mounted or mount point is busy. Connecting to the node where the pod is trying to start and running lsblk, we see the following output (I guess the sdg lines are the interesting ones though):
multipath -ll returns:
/lib/udev/scsi_id -g /dev/sdg returns:
cat /etc/multipath/bindings returns:
cat /etc/multipath/wwids returns:
cat /etc/multipath.conf returns:
A snippet from journalctl around the time it happened which looks related:
A snippet from the csi-azuredisk-node-xxxxx pod's log around the time it happened which looks related:
What you expected to happen: We expected the PV to be mounted to the pod successfully.
How to reproduce it: I'm not totally sure, but multipath seems to be used when two disks have the same wwid. So maybe try to attach two PVs where the underlying Azure Disks have the same wwid? In the example above it looks like there was only one disk in the multipath group, but we also see occurrences where there are two disks in the group.
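If it helps anyone reproduce this, a quick way to spot duplicate wwids on a node is just the scsi_id call from above in a loop (run as root on the node):

# print the wwid for every SCSI disk; two devices printing the same value are what multipath would group together
for d in /dev/sd?; do printf '%s %s\n' "$d" "$(/lib/udev/scsi_id -g "$d")"; done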
Anything else we need to know?: Running multipath -f mpatha (replace mpatha with whatever was returned by multipath -ll) resolves the problem; see the short sequence after this paragraph. The pod will start and work as expected, but when a new pod is started later the problem may come back. We think this problem started after we updated our cluster from 1.24.x to 1.25.4.
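For completeness, the manual recovery on the affected node looks roughly like this (the map name mpatha varies per node):

# see which multipath map has claimed the data disk
sudo multipath -ll
# flush that map so the underlying /dev/sdX can be formatted/mounted again
sudo multipath -f mpatha
# alternatively, flush all unused maps
sudo multipath -F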
Environment:
- CSI driver: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.24.0.2
- Kubernetes version (kubectl version): AKS managed: Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"b969368e201e1f09440892d03007c62e791091f8", GitTreeState:"clean", BuildDate:"2022-12-16T19:44:08Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
- Kernel (uname -a): Linux aks-icijobsz1-14256148-vmss0002ZG 5.15.0-1029-azure #36-Ubuntu SMP Mon Dec 5 19:31:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux