/assign
By the way, these are the logs from the CSI driver when a new XFS volume fails to be mounted:
I0812 20:16:00.182375 1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.33.0"
I0812 20:16:00.182570 1 driver.go:138] "Listening for connections" address="/csi/csi.sock"
I0812 20:17:21.329399 1 mount_linux.go:634] Attempting to determine if disk "/dev/nvme1n1" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/nvme1n1])
I0812 20:17:21.345634 1 mount_linux.go:637] Output: ""
I0812 20:17:21.345656 1 mount_linux.go:572] Disk "/dev/nvme1n1" appears to be unformatted, attempting to format as type: "xfs" with options: [-f /dev/nvme1n1]
I0812 20:17:21.690774 1 mount_linux.go:583] Disk successfully formatted (mkfs): xfs - /dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount
I0812 20:17:21.690808 1 mount_linux.go:601] Attempting to mount disk /dev/nvme1n1 in xfs format at /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount
I0812 20:17:21.690917 1 mount_linux.go:249] Detected OS without systemd
I0812 20:17:21.690927 1 mount_linux.go:224] Mounting cmd (mount) with arguments (-t xfs -o nouuid,defaults /dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount)
E0812 20:17:21.698938 1 mount_linux.go:236] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t xfs -o nouuid,defaults /dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount: wrong fs type, bad option, bad superblock on /dev/nvme1n1, missing codepage or helper program, or other error.
E0812 20:17:21.699007 1 driver.go:108] "GRPC error" err=<
rpc error: code = Internal desc = could not format "/dev/nvme1n1" and mount it at "/var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t xfs -o nouuid,defaults /dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/044d53b8de23a158c12416d02e5f15f2e4e960decdf224387c8ab41902205426/globalmount: wrong fs type, bad option, bad superblock on /dev/nvme1n1, missing codepage or helper program, or other error.
>
Manually re-formatting the XFS volume on the worker node using mkfs.xfs -f /dev/nvme1n1 will allow the CSI driver to automatically mount the volume after a second or two.
Thanks for the very detailed bug report @mpb10.
This issue is caused by a compatibility mismatch between the version of xfsprogs used by the driver and the kernel version on the worker nodes - the driver utilizes xfsprogs v5.18 which formats XFS volumes with features requiring kernel v5.18 or higher. However, as noted above, the custom Ubuntu 20.04 worker nodes are running an older kernel version (v5.4), which does not support newer XFS features.
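At the mkfs level, the difference comes down to the metadata feature flags. A sketch, using the device path from your logs (the exact defaults depend on the xfsprogs release):

# What the driver's xfsprogs v5.18 effectively runs (newer metadata features on by default):
mkfs.xfs -f /dev/nvme1n1

# Explicitly turning those features off yields a superblock the v5.4 kernel accepts:
mkfs.xfs -f -m bigtime=0,inobtcount=0,reflink=0 /dev/nvme1n1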
Relevant dmesg output:
[ 7383.213514] XFS (nvme1n1): Superblock has unknown read-only compatible features (0x8) enabled.
[ 7383.214947] XFS (nvme1n1): Attempted to mount read-only compatible filesystem read-write.
[ 7383.214948] XFS (nvme1n1): Filesystem can only be safely mounted read only.
[ 7383.214959] XFS (nvme1n1): SB validate failed with error -22.
> Manually re-formatting the XFS volume on the worker node using mkfs.xfs -f /dev/nvme1n1 will allow the CSI driver to automatically mount the volume after a second or two.
The fact that manually reformatting the volume using the host's older xfsprogs version (v5.3.0) resolves the issue further confirms that the problem lies in the kernel's inability to mount volumes formatted by the newer xfsprogs version used by the driver.
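If you want to double-check this on an affected node, the superblock feature bits can be inspected without mounting. A sketch, assuming xfs_db from the host's xfsprogs and the device path from your logs:

# Open the device read-only and print the on-disk version/feature flags
xfs_db -r -c version /dev/nvme1n1
# On a driver-formatted volume this lists newer read-only compatible features
# (the ones dmesg reports as unknown, e.g. the 0x8 bit above); a volume
# re-formatted with the host's xfsprogs v5.3 won't have them.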
Ideally, the best solution here would be upgrading the kernel or using an AMI that includes a more recent kernel version :)
I understand this may be challenging or not feasible. In that case, as far as workarounds go, formatting the volumes with the older xfsprogs version available on the host before they are mounted by the driver (as you are currently doing) or using statically provisioned volumes that are pre-formatted would be viable options.
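For the statically provisioned route, a minimal sketch (volume ID, size, zone, and names are all placeholders): the EBS volume is pre-formatted once with the host's mkfs.xfs (e.g. by attaching it to an instance), after which the driver detects the existing filesystem and only mounts it.

# One-time, on a host with the older xfsprogs (placeholder device path):
mkfs.xfs -f /dev/nvme1n1

# Then register the pre-formatted volume as a static PV:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: xfs-static-pv
spec:
  capacity:
    storage: 100Gi                        # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: ebs.csi.aws.com
    fsType: xfs
    volumeHandle: vol-0123456789abcdef0   # placeholder EBS volume ID
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-east-1a              # placeholder AZ
EOF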
I'll discuss this pain point with the team during our next sync-up and follow up here with the long term view for this class of issue.
Thank you for the fast response to this!
To add some context, the Ubuntu 20.04 kernel version that we're using is their FIPS-enabled kernel, which only goes up to version 5.4 at the moment. We have a requirement to use FIPS-enabled kernels, so unfortunately we're unable to upgrade to kernel version 5.18 to solve this issue. Also, Canonical doesn't have any exact dates for when a FIPS-enabled kernel will be officially available for Ubuntu 22.04, so we will likely have to continue using version 5.4 for a while.
Also, having to manually format our volumes during the provisioning process is rather inefficient and breaks up our automation workflows, so this solution isn't ideal either.
I think having the ability to build our own version of the CSI driver image with an older version of xfsprogs would be great and would solve our issue, although I understand this isn't recommended by you guys. Any other way to "opt out" of using the updated version of xfsprogs would solve our issue too.
Thanks!
Thanks for that feedback @mpb10, it's very helpful.
The team is looking to implement a new optional parameter on the node plugin to let users disable some of the newer XFS formatting features. This should solve the compatibility issues you're seeing with the older kernel.
To be clear, this will be an opt-in feature. We're doing it this way to preserve existing behavior, and more importantly, disabling the newer XFS features may result in other compatibility issues down the road.
Relevant WIP PR: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/2121 - feel free to leave further feedback/questions either here or directly on the PR.
This is great! This solution will work perfectly for us, and we eagerly await it.
Thank you very much @torredil and team!
Hi @torredil, is there an ETA for the release of the opt-in feature proposed for this issue? Thank you, Chethan Das
@chethan-das We're actively working on this feature and we hope to release it in the near future but won't have a firm ETA until it's fully ready and tested. I (or somebody else from the team) will update this issue when we have a firm ETA or other information available.
/close
This should be fixed in aws-ebs-csi-driver v1.35.0, and has been tested by a user with nodes running Linux kernel versions ≤ 5.4. Thank you for raising this issue!
Please set the node.legacyXFS Helm chart parameter to true to format XFS volumes with bigtime=0,inobtcount=0,reflink=0, so that they can be mounted onto nodes with Linux kernel ≤ 5.4. Warning: volumes formatted with this option may experience issues after 2038 and will be unable to use some XFS features (for example, reflinks).
See our driver options documentation or PR #2121 for more details.
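For example, with the Helm chart from our installation docs (release name and namespace are whatever your setup uses):

# Requires a chart version that ships driver >= v1.35.0
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.legacyXFS=true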
@AndrewSirenko: Closing this issue.
Just wanted to drop in and say thank you to @AndrewSirenko and @ConnorJC3 as this really helped us!
/kind bug
We're running into an issue with the aws-ebs-csi-driver on an Ubuntu 20.04 worker node where we believe it is incorrectly formatting XFS volumes, and therefore they can't be mounted in the pod's containers.

What happened?
We get the following pod error when trying to create and mount an XFS volume to a pod that is running on an Ubuntu 20.04 worker node:
What you expected to happen?
We expect the XFS volume to be mounted automatically by the CSI driver without any errors.
How to reproduce it (as minimally and precisely as possible)?
We use the following manifest to reproduce the error:
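For illustration, a minimal manifest of this shape triggers the failure: a StorageClass requesting XFS, a PVC, and a pod consuming it (names, size, volume type, and image below are placeholders, not our exact values):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-xfs              # placeholder name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3                  # placeholder volume type
  csi.storage.k8s.io/fstype: xfs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xfs-claim
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-xfs
  resources:
    requests:
      storage: 4Gi           # placeholder size
---
apiVersion: v1
kind: Pod
metadata:
  name: xfs-test
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
      volumeMounts:
        - mountPath: /data
          name: data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: xfs-claim
EOF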
Our worker nodes are Ubuntu 20.04 AWS EC2 instances running in FIPS mode. This issue does not occur on a different OS, such as SUSE chost images. We also tried disabling FIPS mode, however that didn't make a difference.

We've ensured that our xfsprogs package, which supplies mkfs.xfs, is up to date, and we're running the Kubernetes version listed under Environment below.

We've replicated this on aws-ebs-csi-driver versions v1.29.0 and v1.33.0; changing the CSI driver version doesn't seem to make a difference, and we don't see any mention of this issue in the changelog.

Anything else we need to know?
This error does not occur if the volume type is EXT4 or if we change the worker node OS to something other than Ubuntu 20.04.

We believe that the CSI driver is formatting the XFS volumes incorrectly: if we SSH into the worker node and manually re-format the XFS volume that is failing to mount, the aws-ebs-csi-driver can then mount it without any issues. This leads us to believe the driver is incorrectly formatting the XFS volume before attempting to mount it.

Additionally, the xfs_info values for the XFS volume are slightly different when the CSI driver formats it and when it's manually re-formatted, even though both formatting commands use the default mkfs.xfs parameters. Comparing the xfs_info output of the volume as formatted by the aws-ebs-csi-driver with its output after being manually re-formatted (at which point it mounts without issue), the only value that changes is blocks under the log section.

As a temporary workaround, we've created a DaemonSet that watches the worker nodes for newly created XFS volumes, tests whether they mount successfully, and re-formats them automatically if they don't. While this does work, it's risky and not production-worthy.
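For illustration, the core of what that DaemonSet runs per node is roughly the following loop (device glob, probe mount point, and interval are placeholders; as noted, this is risky and destructive if pointed at the wrong device):

# Rough per-node check loop (illustrative only)
mkdir -p /mnt/xfs-probe
while true; do
  for dev in /dev/nvme[1-9]n1; do                 # placeholder glob for attached EBS devices
    [ -b "$dev" ] || continue
    # Skip devices that are already mounted somewhere
    findmnt -S "$dev" >/dev/null && continue
    # Only consider devices already formatted as XFS (same probe the driver uses)
    blkid -p -s TYPE -o export "$dev" | grep -qx 'TYPE=xfs' || continue
    # Probe whether this kernel can actually mount it
    if mount -t xfs -o nouuid "$dev" /mnt/xfs-probe 2>/dev/null; then
      umount /mnt/xfs-probe
    else
      # Kernel rejected the superblock: re-format with the host's older mkfs.xfs
      mkfs.xfs -f "$dev"
    fi
  done
  sleep 10
done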
Environment
- Kubernetes version (kubectl version): Client Version v1.30.3, Kustomize Version v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version v1.29.3
- Driver version: v1.29.0 and v1.33.0
Thanks!