stevefan1999-personal opened this issue 1 year ago
@stevefan1999-personal, thanks for raising the issue. Could you provide some more context? Are you not able to use an Oracle block volume as a PV for LVM?
@abhilashshetty04 Yes and no. I can create a block volume on OCI, but with a minimum of 50GB per block volume. As I have a 200GB free quota, that means I can only have 4 block volumes, and I clearly need more than that. Since I will be running solely on a single node, having LVM on top of a PVC attached from OCI is the most suitable option, but I just didn't see a way to do that here.
I remember that an OCI PVC can be freely migrated to different VMs using iSCSI, so basically we don't need NFS for that. This technique could be ported to other cloud platforms as a competitor to Rook/Ceph if we supported LVM on top of another block-based PVC. I do understand the pros and cons of Ceph (I would be using RBD as the alternative here).
@stevefan1999-personal, the LV that gets created is tied to a particular LVM node. Did I understand correctly that you are trying to move the PV to a different LVM node, as Oracle allows? If yes, did it work? Even so, I believe the pod using the LV would lose access to it, right?
@abhilashshetty04 Yes. iSCSI lets you move the volume to other nodes in case one is down; this is done behind the scenes by Oracle's block volume provisioner. I want to preserve this behavior so that I don't need to intervene when one of the nodes suddenly goes down, for example due to overloading.
@stevefan1999-personal, with this functionality, suppose the PV is attached to lvmnode1. If lvmnode1 goes down somehow, the iSCSI volume backing the PV gets mounted on lvmnode2. Won't you have to create the PV with that volume manually? Or is it mounted read-only by the other nodes all the time (was this your ask when you said ReadWriteOnce)?
With this, Kubernetes also needs to be aware of which node in the cluster has acquired access to the volume, since the pod needs to be scheduled to the correct node.
Let's call the LVM PV `lvm-backstore`, and the LocalPV created on top of `lvm-backstore` `virtual-lvm`.

Consider that when lvmnode1 goes down, the pods are supposed to be migrated by the Kubernetes scheduler too. The lock on `lvm-backstore` would then be released, and another node could take it. Of course, this comes with the downside that all the pods would have to be migrated to the specific node now hosting the LVM PV. So, if lvmnode1 goes down and lvmnode2 acquires `lvm-backstore`, all the pods that reference `virtual-lvm` would have to run on lvmnode2 from then on, since `virtual-lvm` references `lvm-backstore`, which is now bound to lvmnode2. There may also be some issues regarding metadata flushing, but that would be handled by the users themselves.

All the volumes under this special setup should be ReadWriteOnce. It's like local-path-provisioner, but migratable. `lvm-backstore` must have a `volumeMode` of `Block`, since you wouldn't want to make an LVM on top of files.
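For illustration, here is a minimal sketch of what the `lvm-backstore` claim could look like as a raw block PVC; the storage class name `oci-bv` and the size are assumptions and only illustrative:

```bash
# Hypothetical claim for the backing block device (lvm-backstore).
# Assumes an OCI block volume storage class named "oci-bv"; adjust the
# class name and size to whatever the cluster actually provides.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lvm-backstore
spec:
  accessModes:
    - ReadWriteOnce          # only one node may attach it at a time
  volumeMode: Block          # raw block device, no filesystem on top
  storageClassName: oci-bv   # assumed OCI block volume class name
  resources:
    requests:
      storage: 200Gi
EOF
```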
@stevefan1999-personal Thanks for explaining. This seems like it requires a shared VG. Let me know if I am wrong. I still have some questions:
- Is `lvm-backstore` going to be a PV object that, from the LVM perspective, is hosted externally and mounted by all LVM nodes?
- You have not mentioned the VG in the use case. Is it going to be accessible (a `shared-vg` created on top of `lvm-backstore`) by all nodes, if my previous assumption is correct?

FYI, we had tried `shared-vg` some time back, but due to some hurdles we had to shelve the PR.
`lvm-backstore` is going to be a PersistentVolume of any kind. It will most likely be provisioned by a storage controller that provisions block-type volumes (for example, local-static-provisioner), although it is not strictly required that the PersistentVolume be provisioned by a storage controller. You can, in fact, make a block volume that references a local disk path yourself without any trouble; I did this two years ago.
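For illustration, a minimal sketch of such a hand-written block PV pointing at a local disk; the device path `/dev/sdb`, the node name `worker-1`, and the class name are placeholders:

```bash
# Hypothetical statically-provisioned block PV referencing a local disk.
# No storage controller is involved; names and paths are illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: lvm-backstore-static
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  storageClassName: manual-block
  local:
    path: /dev/sdb             # raw disk on the node, not a directory
  nodeAffinity:                # local PVs must be pinned to a node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-1
EOF
```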
We should be concerned with the LVM Physical Volume first. Essentially, the end goal of this feature request is that we can treat a Kubernetes PV as an LVM PV. A Kubernetes PV is supposed to be distributed, or marked for certain nodes to access, so while it technically requires all valid nodes to have access at some point, only one node can have exclusive access to that specific PV at a time; since LVM does not allow concurrent access, we need an exclusivity lock here.
Logical Volumes and Volume Groups are actually out of scope for now, but I think they will have to be tackled eventually.
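As a rough sketch of that layering: once the backing device is attached to whichever node currently holds the lock, the LVM side is just the usual stack (the device path and names below are assumptions):

```bash
# Run on the node that currently has exclusive access to the backing device.
# /dev/oracleoci/oraclevdb is an assumed attachment path; use the real one.
DEV=/dev/oracleoci/oraclevdb

pvcreate "$DEV"                       # the Kubernetes PV's block device becomes an LVM PV
vgcreate lvmvg "$DEV"                 # a volume group on top of it
lvcreate -L 20G -n virtual-lvm lvmvg  # an LV that the localpv-style volume would map to
```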
Hi @stevefan1999-personal, apologies for the delayed response. We have done some product restructuring: if you notice, the lvm and zfs local-pv engines are now bundled with the Mayastor platform, although each still has its own provisioner and components.
Coming back to your requirement: I still don't get your point about having the LVM PV reference a local disk path for your use case. How can that be accessible from some other node in the cluster?
In case the device backing the PV is on a remote storage device with respect to all cluster members, LVM has a shared-vg feature which uses a lock manager such as sanlock or dlm for coordinating access to LVs on the shared VG. Does this make sense?
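For reference, a minimal sketch of that shared-VG setup with lvmlockd and sanlock, assuming the shared device shows up as `/dev/sdX` on every node (names and paths are illustrative):

```bash
# On every node that should see the shared VG.
# Requires use_lvmlockd = 1 in /etc/lvm/lvm.conf.
systemctl enable --now lvmlockd sanlock

# On one node: create the shared VG on the commonly visible device.
vgcreate --shared shared-vg /dev/sdX

# On each node: join the lockspace before using the VG.
vgchange --lockstart shared-vg

# Create an LV and activate it exclusively on the node that will use it.
lvcreate -L 20G -n lv0 shared-vg
lvchange -aey shared-vg/lv0
```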
My use case is remote attachment of the LVM node's backing storage: some K8s storage provisioners use iSCSI as the remote mounting source, so the volume can be attached to another node at any time, which is currently how Oracle Cloud handles block storage. That means that although the block device is local and exclusive to one specific node at a time, it can still be remounted on another node at any given time for quick recovery, provided the exclusive lock is released or expires for any reason.
This would be a very useful feature since we could bypass a network layer such as GlusterFS/NFS/Ceph, because the underlying block storage is already virtualized through the host-provided network. LVM + iSCSI is a validated solution for storage virtualization, and I think we can do this on K8s too.
That said, I think the idea can be more general and apply to persistent volumes as a whole. Other cloud providers such as Azure, AWS, and GKE would also benefit from this, especially with regard to their block storage options. Otherwise, the best choice for me right now is just to use Rook/Ceph, which does support this kind of PVC layering use case.
The solution you want still has a single point of failure, right? What if the node hosting the remotely accessible PV goes down? HCI storage engines should replicate volumes for high availability. LVM localpv was designed to use native LVM capabilities; keeping the storage object and its consumer local was a driving force of the development. We have not planned inter-node storage access as of yet.
If you want a storage solution where storage objects hosted on a local device can be accessed by other cluster members, you can give Mayastor a try. Mayastor is based on NVMe; it replicates volumes as replicas for redundancy and supports thin provisioning, snapshots, volume resize, performance monitoring, etc.
Please find more information about Mayastor here: https://openebs.io/docs#replicated-volumes
Let me know if you have more questions.
@stevefan1999-personal This project is specifically for the localPV use case, hence there is no support for remote mounting of LVM LVs. The Mayastor offering under OpenEBS supports that over NVMe. The default backend there is not LVM but SPDK-based; however, we have very recently introduced support for an LVM-based backend as well that you may want to check out and provide feedback on, though it doesn't support all the features yet.
@stevefan1999-personal just for clarification, what you'd like is for the localpv-lvm driver to be capable of detecting that the underlying block device (LVM PV) has moved to another node, and making the localpv volumes accessible once again?
I think by accepting `Block` volumes and adding a job to prepare that volume (creating PVs/LVs), we will be able to do that, but we would have to make the provisioner able to be deployed as a Deployment to cover more needs.
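As a rough illustration of that idea, here is a hypothetical prepare Job that consumes the block PVC and lays down the LVM metadata; the image, names, and device path are assumptions, and this is not an existing localpv-lvm feature:

```bash
# Hypothetical one-shot Job that turns the raw block PVC into an LVM PV/VG.
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: prepare-lvm-backstore
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prepare
          image: alpine:3.19            # assumed image; lvm2 installed at runtime
          securityContext:
            privileged: true            # required to run LVM against the device
          command:
            - /bin/sh
            - -c
            - |
              apk add --no-cache lvm2
              pvcreate /dev/backstore || true    # best-effort, roughly idempotent
              vgcreate lvmvg /dev/backstore || true
          volumeDevices:
            - name: backstore
              devicePath: /dev/backstore         # raw device exposed to the pod
      volumes:
        - name: backstore
          persistentVolumeClaim:
            claimName: lvm-backstore
EOF
```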
> @stevefan1999-personal just for clarification, what you'd like is for the localpv-lvm driver to be capable of detecting that the underlying block device (LVM PV) has moved to another node, and making the localpv volumes accessible once again?
This is one possible scenario for addressing storage migration in a distributed system. For example, if your underlying storage is based on iSCSI (Internet Small Computer Systems Interface) or Ceph RBD (RADOS Block Device), you can migrate it to another node without significant downtime or data loss. I want a more general approach, because I have abstracted that distributed storage into the form of a Persistent Volume; that is why I want LVM on top of another PV.
Describe the problem/challenge you have
There is currently no way to deploy LVM on top of another persistent volume.
Describe the solution you'd like: Let us compose the LVM on top of another PV.
Anything else you would like to add:
I want to use Oracle Cloud's block storage driver to create a 200GB persistent volume that is a Block Storage resource in Oracle Cloud, which can be reattached to other nodes but accessed by only one at a time (in other words, ReadWriteOnce), so I can do node migration if things go wrong.
I've considered using `local-pv` before, as this is one of the supported features, but I need thin provisioning and quota support, with snapshots being a nice add-on.
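For context on the thin-provisioning part, this is roughly what I have in mind once a VG exists on top of the backing PV; the VG name and sizes are just placeholders:

```bash
# Assumes a VG named "lvmvg" already sits on top of the backing block PV.

# Carve most of the 200G backing store into a thin pool (quota by pool size).
lvcreate -L 180G -T lvmvg/thinpool

# Over-provisioned thin volumes, each capped at its virtual size.
lvcreate -V 50G -T lvmvg/thinpool -n data-a
lvcreate -V 50G -T lvmvg/thinpool -n data-b

# Cheap copy-on-write snapshot of a thin volume.
lvcreate -s -n data-a-snap lvmvg/data-a
```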