openebs / lvm-localpv

Dynamically provision Stateful Persistent Node-Local Volumes & Filesystems for Kubernetes, integrated with a backend LVM2 data storage stack.
Apache License 2.0

Support for creating volumes on LVM2 shared volume group #134

Open kmova opened 3 years ago

kmova commented 3 years ago

Describe the problem/challenge you have

LVM2 supports shared volume groups that are implemented with different types of clustered LVM modules. Shared volume groups are used in deployment topologies where storage devices are connected to multiple storage nodes. For example, using SAS modules to connect multiple storage nodes to a single (or multiple) JBODs.

In such cases, administrators can create a shared VG using projects like ( https://github.com/Seagate/propeller ), that allow for:

Describe the solution you'd like

For the LVM Local PV to use shared VGs, the following needs to be changed/modified:

Anything else you would like to add: The solution should work for both shared and non-shared volume groups. The design should attempt to keep the changes as generic as possible. The specifics, which depend on the type of clustered solution used, should be clearly abstracted via the storage class, where the VG parameters are specified.

TProhofsky commented 3 years ago

VG discovery is a good feature for changing Publish behavior: both Node Publish and then Controller Publish, which covers other deployments with virtual machines and iSCSI/NVMe proxy mode.

One thing to keep in mind is possible vgname collisions across multiple nodes. Adding the vguuid to the Storage Class definition would remove this possibility. It could be made optional, and only checked by the CSI agent when passed in, to reduce complexity if end users know they will avoid this scenario. If the agent uses lvmlockctl -i -d to fetch a list of shared VGs, it is presented with the UUID and some other states, like kill_vg and drop_vg, which flag that more action is required to resolve.
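A StorageClass along these lines might carry the VG UUID. This is only a sketch of the proposal above: the vguuid parameter is hypothetical (it does not exist in the driver today), while provisioner, storage, and volgroup follow the driver's readme.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-sharedvg
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "sharedvg"
  vguuid: "<vg-uuid-here>"  # hypothetical parameter: pins the exact VG across nodes
```

With the UUID present, the agent could verify it against the VG it discovers on the node and reject a same-named but different VG.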

Graceful handling of a failure to activate an LV on a node during NodePublish is difficult within the scope of CSI. LVM2 will prohibit the action, so no check is required before issuing lvchange -ay, since the VG locks have this covered. When another server holds the LV active, it can be due to several conditions, including:

  1. User manually created the lv
  2. User attempting to have two pods using the same PVC
  3. NodeUnPublish failed
  4. Node with running pod and PVC failed

The node-failure case will automatically resolve in 30 to 60 seconds as the shared VG lease is lost. The others are orchestration issues requiring user intervention, or perhaps CRD Operator smarts, to resolve automatically.

Ab-hishek commented 2 years ago

Here is an initial design/approach document for LVM2 shared volumes: https://docs.google.com/document/d/13D8Ht__66n66c5ZBThPwqhKjyJey0GFKVO3dgK3w8OA/edit?usp=sharing

Ab-hishek commented 2 years ago

Hi @TProhofsky, the shared vg document is ready and shared above. Can you PTAL and confirm if this looks okay to you or if any changes are required?

TProhofsky commented 2 years ago

I added a comment to issue a vgscan if the LV is not found during mounting. I also suggested we add the lock method as the value instead of simply "shared". This could be debated, since at this time there is no action the CSI agent needs to take that is unique to one particular mode of lvmlockd.
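The retry-with-vgscan suggestion can be sketched as the following control flow. lvchange and vgscan are stubbed here so the sketch runs anywhere; on a real node they would be the actual LVM2 binaries, and the VG/LV names are illustrative.

```shell
# Sketch: retry LV activation, issuing a vgscan between attempts if the
# LV is not yet visible on this node.
attempt=0
lvchange() { [ "$attempt" -ge 2 ]; }     # stub: "succeeds" on the 3rd try
vgscan()   { attempt=$((attempt + 1)); } # stub: pretend to rescan VGs
for try in 1 2 3; do
  if lvchange -aey lvmvg/pvc-example; then
    echo "activated"
    break
  fi
  vgscan  # LV not found/active yet: rescan and retry
done
```

The -aey flag requests exclusive activation, which is the mode a shared VG would need for a node-local mount.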

I don't understand how OwnerNodeID is being consumed, but to me it looks unnecessary. LVM2 holds the source of truth about which node owns the volume, via active status. Creating a shadow copy is OK, but root users can manually modify LVM2 state, and if the agent misses a transition there is a risk of OwnerNodeID going stale. Whatever looks at OwnerNodeID should check the lvs state on each node to get a definitive answer.
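The cross-check suggested above (ask LVM2 itself which node holds the LV active, rather than trusting a stored OwnerNodeID) could look like this. lvs is stubbed with one line of sample output so the sketch runs anywhere; the -o report fields (lv_name, lv_active) are real lvs fields, while the VG/LV names are illustrative.

```shell
# Stub of real `lvs` output for one LV that is active on this node.
lvs() { printf '  pvc-example active\n'; }

# Query the per-node activation state for a specific LV.
state=$(lvs --noheadings -o lv_name,lv_active lvmvg |
        awk '$1 == "pvc-example" { print $2 }')
echo "$state"
```

Running this per node and treating the node reporting "active" as the owner avoids ever serving a stale OwnerNodeID.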

Ab-hishek commented 2 years ago

We have hit a challenge while implementing this feature. To add shared-VG support, we cannot use separate lvm binaries inside the container for lvm operations, because we need to use the same lvm configuration that is applied on the node. Therefore, we tried to use the binaries the node already has and perform the operations with those.

A small POC for this is in progress here -> https://github.com/Ab-hishek/lvm-localpv/commit/09de4553c637c9b364ffc6f15a0e2b7131177b2e

But while using the node's lvm binaries and executing the commands from inside the container, the commands hang and do not run to completion. Upon further investigation, the strace -f output for the commands being executed can be found here: https://pastebin.com/raw/DwQfdmr8.

This problem is also being discussed with the lvm community. Follow that thread for updates on the discussion of this issue.

We need to investigate further which kernel call is causing the issue and why. Will keep posting findings as the investigation progresses.
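One workaround pattern used by drivers that face this same constraint is to execute the host's own lvm binaries in the host mount namespace via nsenter, so the node's /etc/lvm configuration and locking state are used instead of anything baked into the container image. This is only a sketch of that pattern, not the approach this issue has settled on: run_on_host is stubbed to run locally so the sketch executes anywhere, and the commented nsenter line shows the real form (which requires a privileged container sharing the host PID namespace).

```shell
# Run a command as if on the host node.
run_on_host() {
  # real form: nsenter --mount=/proc/1/ns/mnt -- "$@"
  "$@"  # stub for illustration: run in the current namespace
}

# Example: list VGs with their lock type (vg_lock_type is a real vgs field;
# echo stands in for actually invoking vgs here).
result=$(run_on_host echo "vgs -o vg_name,vg_lock_type")
echo "$result"
```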

Ab-hishek commented 2 years ago

Have updated the issue with all the findings and bottlenecks. For better clarity, the following action items can be taken up to find the root cause of the above problem and fix it. Point to note: after forcefully interrupting the process (the lvcreate command), we see that the LV is created and shows up in lvs output, but we are not sure whether the created LV is healthy, i.e. able to serve IOs successfully. To be sure of that:

  1. Test with the above POC code to create local LVs first on a local volume group, and see whether all the commands execute properly and the LVs are able to serve IOs. If so, move on to experimenting with shared LVs.
  2. If the commands do not execute to completion but the LV is still visible in the volume group, test that LV with a fio pod for IO reads/writes. Similarly, try with a shared LV.

This way, maybe we can narrow down our debugging domain. If anybody is interested in working on this issue, feel free to jump in and get your hands dirty with the above experimental tasks and the subsequent fixes :)

Unassigning myself for others who are interested in taking up the issue. Feel free to reach out to me for any help.

TProhofsky commented 2 years ago

DLM and Sanlock have several known issues, and I would suspect some unknown ones. Seagate has contributed the IDM locking mode to LVM2 2.03.13, which is now in Red Hat Enterprise Linux and Rocky Linux 8.6. The IDM method replaces the DLM/Sanlock version of lvmlockd with something more robust. Let me know if you want to collaborate on getting an IDM-compatible system set up for development.

nward commented 2 years ago

Hi, I am new to this project and am looking for a way to expose a single physical volume (backed by a libvirt volume) to multiple nodes (multiple VMs on the same libvirt host). This issue sounds like exactly what I'm looking for, so I'm happy to get involved. I'm probably not able to solve the problems here, at least in the short term, but I'm happy to test etc. The topolvm project referenced in the LVM ML thread looked like it might have an interesting way to solve the issue of running lvm binaries in a container: a daemon that runs on the node directly.

I note that the readme on this repo refers to VGs being available on multiple nodes (https://github.com/openebs/lvm-localpv#volumegroup-availability). I'm not sure exactly what this means: on the surface, it implies multiple nodes interacting with the same VG, but the existence of this issue makes me doubt that assumption. Is this really just talking about two VGs with the same name but different backing disks (so, different data of course)?

TProhofsky commented 2 years ago

Nathan, the feature of LVM2 that allows shared storage architectures (multiple servers connected to the same block device(s)) is described by the lvmlockd man page: https://man7.org/linux/man-pages/man8/lvmlockd.8.html. When an LVM2 Volume Group is created with the --shared flag and locks are started, LVM2 will manage metadata updates and LV activation across all attached servers in the cluster.
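The lvmlockd workflow from the man page can be sketched as the command sequence below (note that --shared with lvmlockd superseded the older clvmd-era --clustered flag). The commands are stubbed with echo so the sequence is printable anywhere; on real shared storage these would be the actual LVM2 invocations, and the device path and names are illustrative.

```shell
# Stub standing in for the real LVM2 binaries.
lvm_cmd() { echo "$*"; }

lvm_cmd vgcreate --shared sharedvg /dev/sdb  # create a shared (lvmlockd) VG
lvm_cmd vgchange --lock-start sharedvg       # start the VG's lock on this node
lvm_cmd lvcreate -an -n lv0 -L 1G sharedvg   # create the LV without activating it
lvm_cmd lvchange -aey sharedvg/lv0           # activate it exclusively on one node
```

Each attached server runs the lock-start step; the exclusive activation then succeeds on only one of them at a time.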

nward commented 2 years ago

Hi! I am familiar with this feature - I have used this on iSCSI shared volumes in the past. The flag is --clustered, rather than --shared though, right? Or is there another flag I'm not aware of?

My confusion is that the readme for this repo describes a VG available on multiple nodes - which I understand that this issue is addressing and is not yet complete, so I am not clear what the current behaviour is intended to be. From the readme:

> The above storage class tells that volume group "lvmvg" is available on nodes lvmpv-node1 and lvmpv-node2 only. The LVM driver will create volumes on those nodes only.

Is this describing VGs with a common name, but with different underlying PVs (i.e. one per node) and so different data on disk? If so, I can look at clarifying that documentation so others don't trip on it like I have :-)

Or does this describe shared storage and expect that the VGs are running with --clustered, in which case what does this issue intend to deliver that's different from that?

dsharma-dc commented 5 months ago

Needs to be scoped as an enhancement. Bringing this to notice.

avishnu commented 1 month ago

This issue needs more investigation and debugging. Scoping this for the v4.3 release.