Open MrMarble opened 10 months ago
hmm RWX is indeed not supported. What's your use case for it? The filesystems we allow on top of our volumes are not distributed, which is why we have this as not allowed. If you have some application that can coordinate access to the raw block device by itself, then maybe this is something we can consider looking into.
I have a small k8s cluster at home and I use Mayastor as my primary storage solution. Most of the time RWO works fine, but sometimes I get errors when the descheduler moves an app to another node, or when I need different apps to use the same PVC.
If you have apps using the same PVC then ext4 or xfs will not suit your use case; you need a distributed filesystem. This is not something we're trying to solve in Mayastor, so you might want to consider putting something like NFS on top, I guess.
Sharing the same PVC is a rare thing to do; I'm OK with having to use nodeAffinity, but what about apps getting scheduled on different nodes?
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned default/jellyseerr to caos
Warning FailedAttachVolume 12s attachdetach-controller Multi-Attach error for volume "pvc-5c622d91" Volume is already exclusively attached to one node and can't be attached to another
I can manually re-attach the PVC to a new node by deleting the volumeattachments.storage.k8s.io resource:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
jellyseerr Bound pvc-5c622d91-8dcf-4a47-b84a-859a606ced94 5Gi RWO mayastor-thin <unset> 3d
$ kubectl get volumeattachments.storage.k8s.io
NAME ATTACHER PV NODE ATTACHED AGE
csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8 io.openebs.csi-mayastor pvc-5c622d91-8dcf-4a47-b84a-859a606ced94 gea true 24h
$ kubectl patch volumeattachments.storage.k8s.io csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8 -p '{"metadata":{"finalizers":[]}}' --type=merge
volumeattachment.storage.k8s.io/csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8 patched
$ kubectl delete volumeattachments.storage.k8s.io csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8
volumeattachment.storage.k8s.io "csi-80304c558ff6466b5f78f1de5633c760f350743f03b6de5cb6b7be24d2dcdfe8" deleted
$ kubectl rollout restart deployment jellyseerr
With this the PVC can be attached to a new node, solving the error mentioned before (Volume is already exclusively attached to one node and can't be attached to another).
Should this be automatic, like a failover mode? The error persisted for 23 hours until I manually removed the volumeattachment.
This may not be related to Mayastor, but I don't know where else to ask
I don't know the exact context of what's being done, but patching the attachments to allow multiple nodes access to the same filesystem volume can be a recipe for data corruption, because the filesystem drivers on the two nodes don't know about each other.
I have a use case for this where I would like to use Mayastor for high performance VM storage with KubeVirt, but also preserve the ability to live migrate VMs which requires RWX.
In this case, access to the block device is orchestrated by KubeVirt and KVM; RWX is simply needed to attach the PV to both VM pods simultaneously during the live migration. The volume is not accessed concurrently, but both the source and destination VM pods require access to the PV at the same time for the "handover" of the block device, before the source pod terminates.
Please see https://kubevirt.io/user-guide/operations/live_migration/#enabling-the-live-migration-support
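For reference, on older KubeVirt releases live migration has to be enabled explicitly via a feature gate on the KubeVirt CR (newer releases enable it by default). A minimal sketch, to be checked against the linked user guide:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - LiveMigration    # only needed on releases where it is not enabled by default
```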
Does kubevirt use reservations on the "handover"? There are other questions too, such as how do we ensure the data on the source is flushed to disk? Perhaps this is handled by the memory copy?
I'm not a Go programmer, but looking at kubevirt's documentation, architecture, and code, it seems to me that kubevirt doesn't get involved in the actual migration itself and offloads this entirely to the libvirt/qemu layer. What kubevirt does through the kubernetes layer is attach the storage to both hosts, set up the network proxy between the hypervisor hosts, tell libvirt on the source host to migrate the source VM to the destination host, then poll libvirt for status on the migration and handle cleanup and orchestration in the kubernetes layer based on libvirt's responses from the involved hosts.
So for the precise handling of the block devices on either host, we would have to look at libvirt's source. There is an explanation at https://www.linux-kvm.org/page/Migration, though I have no idea how up to date it is. It states the following:
Algorithm (the short version):
1. Setup: start guest on destination, connect, enable dirty page logging and more.
2. Transfer memory: guest continues to run; bandwidth limitation (controlled by the user); first transfer the whole memory, then iteratively transfer all dirty pages (pages that were written to by the guest).
3. Stop the guest and sync VM image(s) (guest's hard drives).
4. Transfer state: as fast as possible (no bandwidth limitation); all VM devices' state and dirty pages yet to be transferred.
5. Continue the guest: on destination upon success, broadcast an "I'm over here" Ethernet packet to announce the new location of the NIC(s); on source upon failure (with one exception).
Based on this, I can only assume that step 4 would flush to disk on the source after it has stopped the source VM, then inform libvirt on the destination that the block device is ready for takeover?
Edit: I found the actual migration handler here: https://gitlab.com/libvirt/libvirt/-/blob/master/src/qemu/qemu_migration.c?ref_type=heads
This seems to be handled by the methods qemuMigrationSrcNBDStorageCopyReady, qemuMigrationAnyCompleted, and qemuMigrationSrcBeginPhaseBlockDirtyBitmaps.
I think qemuMigrationSrcBeginPhaseBlockDirtyBitmaps is of most interest for this.
ref: https://www.linux-kvm.org/page/Migration https://raw.githubusercontent.com/kubevirt/kubevirt/a6f4f91428c2acdc795ba7e9e8b20fa0d021244b/docs/kubevirt-create-vmi-seq-diagram.svg https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-handler/vm.go#L2020
I second this. Mayastor is really compelling for running as a datastore for Kubevirt. RWX would be a really great addition.
Another use case: I want to serve LLMs, and not having the ability to use the same storage means I would have to keep an identical copy of the model in each PV for every replica. This is a huge waste of storage space when you could easily have a single model that each replica loads as needed. I really like that Mayastor is future-focused and not a mess (at least according to Talos and OpenEBS), so having the ability to share storage between pods would be ideal.
I get that there are other methods to do this, but they require significantly more complex setups, when simply allowing the PV to connect to multiple pods would simplify everything and reduce the complexity of the entire deployment.
I also get that there are security implications, but these can be mitigated using other methods.
For my personal use case, the ideal version of this would be read-many/write-once, but it doesn't seem like anyone supports that.
Also, because you don't support volume expansion, there's no straightforward way to transfer files to a new PV...
What you're asking for here is a clustered filesystem. We could in theory make raw block volumes available on N nodes, but you'd need to use a clustered filesystem on top; XFS and ext4 are not designed to be mounted on multiple nodes simultaneously.
@Mechputer Would the openebs nfs provisioner not suit your needs? https://github.com/openebs/dynamic-nfs-provisioner
@tsteine Unfortunately, no. I'm running Talos, which supports Mayastor. I can't find anywhere whether an NFS server can be put on Talos, or how. I don't know if they just consider that excessive, or unsafe. There are several factors that make this not work:
I'm looking for a solution that uses newer technology (Mayastor/NVMe-oF), works on a read-only filesystem (and has instructions somewhere), and doesn't fail to reconnect if something goes down. I need redundancy and stability, not assumed functionality, unless there's something I'm not aware of that would allow a new NFS container to have the same ID as the previous one (which I'm pretty sure goes against how k8s works in the first place).
In fact, this seems like an obvious problem with the Mayastor system as a whole: anything you run that uses a Mayastor PVC assumes the container will never crash or go down. The only alternative seems to be constantly backing up the Mayastor volume, keeping a current image somewhere outside the container, and restoring the volume from that backup in the one container that has access to the PVC (new or old, since only one container can ever access a Mayastor volume).
The NFS service or node goes down, the new NFS container can't access the Mayastor PVC because it's ReadWriteOnce, and all data is lost.
Why is the data lost? Why can't the new nfs container access the PVC? You might need to manually delete the volumeattachment in this case.
In fact, this seems like an obvious problem with the Mayastor system as a whole: anything you run that uses a Mayastor PVC assumes the container will never crash or go down.
I don't understand what you're implying here. Mayastor survives both container crashes and node crashes. As mentioned above, you might need to delete the volumeattachment manually, and that is not specific to Mayastor for that matter, though perhaps it could be better automated in certain cases.
@tsteine I'm meeting up with some folks to discuss RWX for live migration at KubeCon :)
I'm realizing I might be an idiot. If the node that goes down is the one holding the PVC's data, of course it's not accessible to the rest of the cluster. If the data is replicated to multiple drives on multiple machines, that should preserve the data in the PVC. Is that correct?
If the node that goes down is the one holding the PVC's data, of course it's not accessible to the rest of the cluster. If the data is replicated to multiple drives on multiple machines, that should preserve the data in the PVC. Is that correct?
If you have a mayastor volume with N replicas, then we can generally support the loss of N-1 nodes, because the data exists on N nodes. You might lose availability if your pod is down, but not the data itself, as it's redundant. What you might want to do is check why you cannot recreate the pod on another node, which is most likely due to the volumeattachment CR: https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/volume-attachment-v1/.
Yeah, no, please ignore my idiocy. I only had 1 PVC with no extra replicas. Both the container and the PVC were on the node that went down, so of course I lost everything when it was recreated, with nothing to copy from.
Still, RWX please.
To be fair, I'm only 3 months into learning anything k8s related.
@tsteine
So: I can't create an NFS server on Talos, because the filesystem is read-only and/or it's not explicitly supported and/or there are no instructions. I can't add RWX to Mayastor, because that's not allowed. I can't use NFS on top of Mayastor, because it needs kernel permissions and I don't have an NFS server. I can't use Ganesha with OpenEBS's NFS provisioner, because for whatever reason they don't include it. I can even create an NFS PVC, but then my pods can't access it, because it's expected to be accessed by root. Once again, it sounds like all of the issues stem from using a read-only-filesystem OS like Talos.
Am I just a complete idiot? Is there a way for this to work that I'm just not able to find? Do I have to switch to a mutable filesystem in order to use anything?
talos 1.6.5, k8s 1.29.
I'm able to create PVCs using Mayastor, they're just not RWX. I can create "kernel" NFS PVCs, they just can't be accessed. All of my add-ons and services function without issue, except for the things trying to access the read-only filesystem of Talos, which is expected.
@Mechputer I think this is getting off topic with regards to the project; a Mayastor issue on RWX support is not the appropriate forum for how to run NFS servers on Talos.
That being said, the way this would normally work is that you set up the OpenEBS NFS provisioner in the Kubernetes cluster, pointing it at the Mayastor storage class, and create a new storage class with an appropriate name like "openebs-nfs-rwx" for the NFS provisioner. The NFS provisioner then requisitions a volume from the Mayastor storage class and attaches it to a new pod running an NFS server, which exports that volume and allows RWX. It just provides an RWX-capable network sharing protocol in front of a Mayastor RWO volume. It should be noted that this adds overhead and a single point of failure in the NFS pod.
I don't see why this kind of setup shouldn't work just fine on Talos, since the NFS server is run in a pod.
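As a rough sketch of that layering (class names are illustrative, and the exact annotation keys should be verified against the dynamic-nfs-provisioner docs), the RWX class would point at a Mayastor-backed class roughly like this:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-nfs-rwx                  # the RWX class that applications would use
  annotations:
    openebs.io/cas-type: nfsrwx
    cas.openebs.io/config: |
      - name: NFSServerType
        value: "kernel"
      - name: BackendStorageClass
        value: "mayastor-thin"           # the Mayastor RWO class backing the NFS share
provisioner: openebs.io/nfsrwx
reclaimPolicy: Delete
```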
edit: see here for included packages in the kubelet for talos, nfs-common is included. https://github.com/siderolabs/kubelet/blob/main/Dockerfile
Edit 2: You might be running into this issue: https://kubernetes.io/blog/2021/11/09/non-root-containers-and-devices/ and you may need to have "device_ownership_from_security_context = true" set for containerd
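If that is the issue, my understanding is that on Talos the containerd CRI config can be extended with a machine-config patch along these lines (a sketch only; the plugin section name may differ between containerd versions):

```yaml
machine:
  files:
    - op: create
      path: /etc/cri/conf.d/20-customization.part   # merged into the CRI config on Talos
      content: |
        [plugins."io.containerd.grpc.v1.cri"]
          device_ownership_from_security_context = true
```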
It should be noted that this adds overhead and a single point of failure in the NFS pod.
A single point of failure is usually a no-go for production, and in that state it doesn't really solve RWX support on Mayastor.
Even manually setting 2 replicas for the NFS PVC doesn't fully solve it, though it might help a bit. Also, I think there is no option to set a default number of replicas for newly created NFS PVCs. And since these pods mount Mayastor RWO volumes, they can still only be deployed on the same node, so there is no way to tolerate node failures.
Hi @tsteine @synthe102 ... as @tiagolobocastro mentioned earlier, our team was set to meet with the KubeVirt / Red Hat storage engineering team (i.e. the KubeVirt, KVM & QEMU folks) at KubeCon Paris 2024 last week. This meeting did happen and was very good. We discussed this issue and how to safely enable this RWX label/tag explicitly for KVM/KubeVirt/QEMU.
We are going to do this engineering work, as it's not too complex, so your solution is coming soon.
Please note that we are going to add some safety precautions around enabling the RWX label, so that it can only be done for KVM/KubeVirt/QEMU, as it's a dangerous recipe for data corruption if we allow it to be generally and widely open to any app. It's guaranteed that users would try all sorts of unsupported operations with this and corrupt their filesystems (and block devices) very easily.
That's great to hear.
As for restricting it to Kubevirt/KVM/QEMU, that makes perfect sense.
Was the kubevirt CSI driver mentioned during the meeting? Kubevirt has its own CSI driver for downstream K8s clusters running as VMs on kubevirt, which provisions storage from the upstream k8s cluster and hotplugs it to the k8s VM running the pod that requires storage. https://github.com/kubevirt/csi-driver
I am making note of this as I would like to be able to live-migrate my downstream k8s VM nodes; however, I suspect that if we set RWX on the PVC in the virtualized cluster, and on the PVC in the KubeVirt infrastructure cluster, it would be possible to hotplug the volume to multiple VMs simultaneously outside of live migration, since this is all done by KubeVirt.
I think it may be necessary to look at a mechanism for restricting kubevirt-csi provisioned volumes to RWO only when using Mayastor with the kubevirt-csi in virtualized K8s clusters. Edit for clarity: restrict PVCs in the VM k8s cluster to RWO, as the Mayastor PVC would still need RWX for live migration.
Either that, or document specifically that the kubevirt-csi driver is not supported with Mayastor, and if you use it and corrupt your data, that is "your bad". I don't mind using a different CSI in a downstream cluster; being able to hook into upstream storage just makes it easier.
Firstly, the meeting with Red Hat / KubeVirt was super technical and we went into a lot of deep, low-level KVM/QEMU storage minutiae, as well as what's really happening when KVM is doing a live migration: who is writing what data to which storage devices, what transport the data flows over, who is in control of which disk write operations, at what times, etc. It was pretty intense, but it was a very good meeting. (The team that does this work is based in Germany.) Anyway, that's the RWX-for-live-migration part.
Peppered in with that discussion, we did diverge a little to the downstream KubeVirt CSI driver, but the conversation quickly came back to QEMU/KVM/live migration, so we didn't spend much time on the KV CSI driver. We (Red Hat + KubeVirt + OpenEBS) have a large common customer who has deployed a very large amount of KubeVirt/KVM in their K8s infrastructure. That customer uses the downstream KubeVirt CSI driver with OpenEBS and live migration, so we have a very good environment (test and prod) for developing, evaluating, and testing the code. You're welcome to test with us.
The issues you bring up with the downstream KV CSI driver are complex and very tricky, especially when you add the live-migration component into that mix via the CSI of the storage platform (OpenEBS). It feels complex and a bit dangerous, and I get the feeling we may be doing some pioneering work here that is probably going to be a bit messy until we figure it all out. I don't have the answers yet, but this is a high-priority project for OpenEBS + Red Hat that we (OpenEBS) will start coding up once we drop the next release of OpenEBS (v2.6) in a week or two.
Not sure how effective this is, but maybe it could be an improvement for the openebs NFS approach?
All Robin NFS server Pods are High Availability (HA) compliant. Robin monitors the health of the NFS server Pods and executes a failover if any NFS server Pod is offline
from:
https://docs.robin.io/storage/5.3.4/storage.html#nfs-server-pod
Longhorn also has some failure handling https://longhorn.io/docs/1.6.1/nodes-and-volumes/volumes/rwx-volumes/
Although from my testing, if NFS terminates, the pods using the volume are also terminated. The best option would probably be to fail over to a new NFS server and recover the connection on the pods that used the failed one.
You're welcome to test with us.
I probably should've replied earlier, but I'm certainly interested in helping test this feature.
Found this thread while trying to set up RWX on a 3-node replicated Mayastor storage class. I am seeing the same error mentioned above:
Warning ProvisioningFailed 15s (x5 over 30s) io.openebs.csi-mayastor_m720q2_e2f5b3d1-2b67-4768-ad42-4808e281e520 failed to provision volume with StorageClass "mayastor-3": rpc error: code = InvalidArgument desc = Invalid volume access mode: 5
Here are the storage-class details:
Name: mayastor-3
IsDefaultClass: No
Annotations: meta.helm.sh/release-name=mayastor,meta.helm.sh/release-namespace=mayastor
Provisioner: io.openebs.csi-mayastor
Parameters: ioTimeout=60,local=true,protocol=nvmf,repl=3
AllowVolumeExpansion: <unset>
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: WaitForFirstConsumer
Events: <none>
Here is my use case: I'm trying to run 2 replicas of an application that only works with a local filesystem. I want to create a deployment with 2 replicas sitting behind a load balancer, with both pods having RW access to the same PVC, backed by the same storage class, backed by 3 Mayastor disk pools. The intended topology should tolerate NodeNotReady scenarios for either the application or Mayastor in any combination: as long as there is one node available in both the control plane and the data plane, the overall application should work seamlessly.
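For context, the kind of claim I'm trying to create looks roughly like this (the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data          # illustrative name
spec:
  accessModes:
    - ReadWriteMany          # this is the mode that gets rejected today
  storageClassName: mayastor-3
  resources:
    requests:
      storage: 10Gi
```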
Is this possible today?
@anmshkmr, that is not possible, because the local filesystem is "local": it is mounted on the node where the application is running. For RWX we'd need some kind of distributed filesystem, which is why NFS tends to be the solution for RWX.
@tiagolobocastro appreciate the helpful response. A few questions if you could answer them:
1. Would the storage class definition from above change to enable RWX? I am guessing not, because the storage class is just about defining the replication parameter.
The storage class would not change, I'd say, if you use an NFS provisioner.
2. What is the prescribed way to create NFS on top of mayastor replicated storage class?
We don't currently have docs for this. There might be some older docs for other engines you could use perhaps, maybe @avishnu can suggest some?
You may refer to the NFS-on-cStor provisioning steps here and replicate the same setup, replacing cStor with Mayastor.
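Roughly, that means installing the nfs-server-provisioner Helm chart with its backing persistence pointed at a Mayastor class. A sketch of the values, assuming the value names used by the (now archived) nfs-server-provisioner chart; check them against your chart version:

```yaml
# values-nfs.yaml (sketch only)
persistence:
  enabled: true
  storageClass: "mayastor-3"    # a Mayastor-backed RWO class for the NFS server's data
  size: 10Gi
storageClass:
  name: mayastor-nfs-rwx        # the RWX class that application PVCs will reference
# install with something like:
#   helm install nfs-server <nfs-server-provisioner chart> -f values-nfs.yaml
```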
Thank you. I am going to try that and let you know.
Hi @tiagolobocastro Is there a plan to support exporting a volume through multiple nodes to achieve NVMe-oF multipathing and support RWX features?
@avishnu @tiagolobocastro Just a follow-up here: unfortunately, nfs-server-provisioner has been deprecated. When I install it using the Helm chart mentioned on the page you linked, the nfs-provisioner pod cannot be rescheduled on a multi-node cluster, because the PVC created for the NFS server's internal use is bound to a single node.
This appears to be an alternative: https://github.com/openebs-archive/dynamic-nfs-provisioner, but that is part of openebs-archive as well.
I also found this: https://blog.mayadata.io/openebs/setting-up-persistent-volumes-in-rwx-mode-using-openebs, but that looks pretty involved.
Do you have a recommendation for a good alternative to nfs-server-provisioner that can work seamlessly on a multi-node cluster?
Hey @anmshkmr, we've now documented this: https://openebs.io/docs/Solutioning/read-write-many/nfspvc
Hi @tiagolobocastro Is there a plan to support exporting a volume through multiple nodes to achieve NVMe-oF multipathing and support RWX features?
For a block mode volume?
Hey @anmshkmr, we've now documented this: https://openebs.io/docs/Solutioning/read-write-many/nfspvc
Thanks, that's very helpful in general.
I was able to set up the whole stack end-to-end a few weeks ago, though. I can tell you that the setup is still not practical to use, because the CPU usage from the io-engine is very high; it makes the overall k8s cluster very unstable for running other workloads. Any suggestion for keeping the overall resource footprint low?
You can reduce the io-engine cpu usage to 1 core. In the future we may implement interrupt mode which will help reduce footprint when no load is present, but it's not supported at the moment. See: #1745
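As a sketch, assuming the io_engine.cpuCount and io_engine.coreList Helm values exposed by recent mayastor chart versions (check your chart's values.yaml), the override would look something like:

```yaml
# Helm values override (sketch) -- key names may differ between chart versions
io_engine:
  cpuCount: "1"          # pin the io-engine to a single core
  # coreList: ["3"]      # optionally select which core it runs on
```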
@tiagolobocastro is there any advantage to using the configuration from https://openebs.io/docs/Solutioning/read-write-many/nfspvc over the old way? https://github.com/openebs-archive/dynamic-nfs-provisioner/blob/develop/docs/intro.md#quickstart
In practice, from a user perspective, the main difference I see in the new approach is that it uses only one PVC for NFS with subdirectories for volumes, versus multiple volumes in the old approach. For me that is rather a disadvantage if I'd like to, for example, snapshot individual volumes.
But maybe there are some other pros to the new way?
Issue #1127 was closed as completed but trying to create a PVC with RWX access mode throws an error:
failed to provision volume with StorageClass "mayastor-thin": rpc error: code = InvalidArgument desc = Invalid volume access mode: 5
Is this not supported?
PVC manifest
StorageClass manifest