openebs / lvm-localpv

Dynamically provision Stateful Persistent Node-Local Volumes & Filesystems for Kubernetes, integrated with a backend LVM2 data storage stack.
Apache License 2.0

lvm-localpv + FC + PureStorage #257

Open metbog opened 1 year ago

metbog commented 1 year ago

Hi there,

We have a Kubernetes cluster (k8s) with BareMetal (BM) workers. These BM workers are connected via Fibre Channel (FC) to PureStorage FA. Our goal is to create a shared volume for our BM workers and use it with lvm-localpv.

PureStorage -> (BM1, BM2) -> /dev/mapper/sharevolume (attached to each BM worker via FC) -> PV -> VG1

Here is the StorageClass:

allowVolumeExpansion: false
allowedTopologies:
- matchLabelExpressions:
  - key: kubernetes.io/hostname
    values:
    - bm-worker1
    - bm-worker2
    - bm-worker3
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pure-lvm
parameters:
  storage: lvm
  volgroup: test-pure-volume
provisioner: local.csi.openebs.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
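
For context, a minimal PVC and Pod consuming this class could look roughly like the sketch below (names and sizes are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pure-lvm-claim           # illustrative name
spec:
  storageClassName: pure-lvm
  accessModes:
    - ReadWriteOnce              # lvm-localpv volumes are node-local, single-node access
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app-using-lvm            # illustrative name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pure-lvm-claim

With volumeBindingMode: WaitForFirstConsumer, the logical volume is carved out of the test-pure-volume VG on whichever worker the pod is first scheduled to, and the resulting PV is then pinned to that worker.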

One idea is to make it possible to reattach the LVM volume to any of the BM workers, because currently it creates a Persistent Volume bound to one worker (the one where it was originally created). This limitation prevents pods from starting on other workers.

Is it possible to achieve this? Perhaps there is already a solution available for this issue?

abhilashshetty04 commented 1 year ago

@metbog This seems like a shared vg feature requirement. We had tried shared vg previously but had to shelve the task due to technical roadblocks.

dsharma-dc commented 6 months ago

If I understand the requirement correctly, the same PVC needs to be used by applications on two (or more) different nodes, with the underlying PV being a shared VG managed by LVM, and assuming the lock managers required for a shared VG are up and running on all worker nodes. I don't see the current CSI provisioner being able to provide this kind of RWX capability.
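
For clarity, the RWX request in question would be a PVC roughly like this hedged sketch (name and size are illustrative); today the driver cannot satisfy it because the backing logical volume exists on a single node:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-claim             # illustrative name
spec:
  storageClassName: pure-lvm
  accessModes:
    - ReadWriteMany              # RWX: multiple nodes mounting the same volume - not supported by lvm-localpv
  resources:
    requests:
      storage: 10Gi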

orville-wright commented 6 months ago

> @metbog This seems like a shared vg feature requirement. We had tried shared vg previously but had to shelve the task due to technical roadblocks.

@abhilashshetty04 what was the technical roadblock that we identified?

abhilashshetty04 commented 5 months ago

@orville-wright, a former OpenEBS employee attempted this feature. AFAIK, there were some roadblocks due to kernel semaphore dependencies.

This is the PR for your reference. https://github.com/openebs/lvm-localpv/pull/184

m-czarnik-exa commented 5 months ago

> If I understand the requirement correctly, the same PVC needs to be used by applications on two (or more) different nodes, with the underlying PV being a shared VG managed by LVM, and assuming the lock managers required for a shared VG are up and running on all worker nodes. I don't see the current CSI provisioner being able to provide this kind of RWX capability.

Is there another way to achieve this with OpenEBS? For instance, VMware uses VMFS to reattach volumes or disks between VMs. I would like to find a way to use shared storage between nodes, and the only solution I've found so far involves replication. Is there an alternative approach?

orville-wright commented 5 months ago

Hi @m-czarnik-exa - I run Product Mgmt for openEBS. OK, let's dig into your use case, figure out some stuff, and see if we can help.

openEBS is primarily designed as a Hyper-converged vSAN system. This means that...

  1. We prefer the storage media to be installed locally & physically as disk media in your physical cluster nodes.
  2. We provide an NVMe-oF (TCP & RDMA) vSAN Fabric between all Hyper-Converged nodes in the cluster. (we call our vSAN fabric the Nexus).
    • The Nexus is a block-mode Storage Area Network (SAN) and works like a SAN within the cluster.
    • It can leverage 2 protocols: NVMe-TCP and NVMe-RDMA (iWARP and RoCE).
    • RDMA is a new addition. It was prototyped a couple of months ago and is in dev/eng right now. It significantly offloads CPU and memory by reducing reliance on the kernel network stack. All RDMA vSAN Fabric I/O goes via RDMA directly between nodes.
  3. We have a Block Allocator stack that owns the physical disk media on each node and presents block devices into our DiskPool (a sketch DiskPool CR appears after this list).
  4. We carve out PVs (LUNs) from our DiskPool (based on PVCs), and each PV (LUN) is made addressable anywhere within the Nexus vSAN fabric address space, to any node running the Nexus, via the NVMe-TCP or NVMe-RDMA protocols.
  5. A PV can only have 1 single reservation from 1 single node, and I/O can only be done from 1 node. That I/O doesn't have to be node-local: any node in the Nexus can address any LUN that has been presented to the Nexus. Basically an internal vSAN.
  6. From here, we build a filesystem on top of the PV (LUN): ext3/ext4, XFS, BTRFS. A node can then mount that PV just like a LUN gets mounted.
  7. Since that PV (LUN) is a fabric-attached block-mode kernel NVMe LBA namespace (acting as a disk), your mount operation is treated as a local kernel block-device operation. Your node claims the PV, mounts the filesystem, and all is well.
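
To make point 3 concrete, a Mayastor DiskPool is declared per node with a CR roughly like the sketch below (the apiVersion and namespace vary across Mayastor releases; node and disk names are illustrative):

apiVersion: "openebs.io/v1beta2"   # assumption: recent releases; older ones used earlier versions
kind: DiskPool
metadata:
  name: pool-on-node-1             # illustrative name
  namespace: mayastor
spec:
  node: worker-node-1              # the node whose local disk backs this pool
  disks: ["/dev/sdc"]              # illustrative device path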

Operations 5 ... 7 are node-exclusive operations. Only 1 node can safely claim a PV (LUN) device, because that block device is presented into the kernel of the node. There is **no way** to safely arbitrate multiple kernels claiming the same PV (LUN) - i.e. no easy, simple way - without the complexity of a clustered kernel block-device subsystem. YES these exist, but they're complex, slow, painful to work with, and difficult to manage. On top of this, you would also need a clustered file system with a distributed lock manager that understands that multiple nodes have physically claimed 1 single LUN and are sharing I/O to the same LUN. This would also require arbitration and a very complex clustered I/O data-plane. YES these exist... but again, they are complex, many are not free/open source, and they are horribly complex to deploy and manage.

There are ways to do things that are close to what you are asking for.

You mention VMFS.

VMware vSAN and our Nexus Fabric are similar in that any node in the vSAN fabric can address any block-mode disk device on the fabric. But... only 1 node can safely claim and do I/O to that LUN. VMFS and VMware vSAN are not a clustered block-device system or a clustered file system that allows multiple nodes to shared-mount and shared-write to exactly the same single LUN at the same time.

openEBS

For openEBS, we have 5 Storage Engines that the user can choose to deploy. Each has different characteristics and different backend Block Allocator kernels. In all of the above, I am referring to openEBS Mayastor (see attached pics) - not openEBS Local-PV LVM, because Mayastor is the only Storage Engine that currently contains the Nexus vSAN.

openEBS Local-PV LVM utilizes the LVM2 kernel (i.e. PE, PV, VG, LV structures) but does not currently utilize the Nexus vSAN fabric. All I/O is node-local.

openEBS Local-PV LVM

Our LVM2 kernel is very mature, rock solid and high performance. It does inherit the native LVM2 concept of a clustered/shared VG (Volume Group), which allows multiple nodes to share access to 1 single VG. This is somewhat like VMFS or VMware vSAN. You can extend LVM2 to work in a clustered LVM mode, but we have not prototyped or tested this.

So... after all of this... as a starting primer: what problem are you trying to solve, when you say the words...



[Image: overview_fabric]

[Image: SPDK_Structure_components_v4]

I'm looping in @tiagolobocastro @niladrih and @abhilashshetty04 for any further commentary.

m-czarnik-exa commented 5 months ago

@orville-wright

First of all, thank you for this elaborate explanation :)

Basically what I'm trying to achieve is a migration from VMware CSI to open source virtualization platforms like proxmox/opennebula etc... or possibly a bare metal solution.

The setup that I'm trying to configure (sorry for the simplifications, but I don't have deep expertise in the field of storage) is to connect, let's say, three k8s nodes to SAN storage with FC or iSCSI (to allocate block storage for these nodes) and be able to attach PVs created on one node to another node (VMware CSI just reattaches the VMDK from one VM to another when a pod with the PVC starts on another node). Taking into account that the SAN storage has RAID configured and that I would like to use Velero for backups, I don't need to replicate data from one node to another because it would affect performance. What I'm looking for is a CSI driver that will handle shared storage from the disk array between nodes and will simultaneously be as fast as possible.

dsharma-dc commented 5 months ago

@m-czarnik-exa Today it won't be possible to let a PV (Persistent Volume) be used by multiple nodes (or by a node different from the one where the PV was created). With the LVM-localPV engine, a PV represents the LVM logical volume created on that node.
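
To illustrate why the volume cannot follow a pod to another node, the PV provisioned by lvm-localpv carries a node affinity roughly like this hedged excerpt (sketch, not actual output; the exact topology key depends on the driver version and configuration):

# excerpt from a provisioned PV (sketch)
spec:
  csi:
    driver: local.csi.openebs.io
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: openebs.io/nodename     # assumption: default topology key reported by the driver
          operator: In
          values:
          - bm-worker1                 # the node where the logical volume was created

The scheduler will therefore only place pods using this PV on bm-worker1, which is the behaviour described in this thread.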

There is a slightly similar use case of using an LVM VG as a shared VG so that multiple nodes can create PVs on the same VG. This isn't complete yet; it is discussed and designed here: https://github.com/openebs/lvm-localpv/issues/134

avishnu commented 1 month ago

This requirement needs the LVM shared VG support; it will be tracked after #134.