Open kmova opened 3 years ago
VG discovery is a good feature for changing Publish behavior: first Node Publish, and then Controller Publish, which covers other deployments with virtual machines and iSCSI/NVMe proxy mode.
One thing to keep in mind is possible VG name collisions across multiple nodes. Adding the vguuid to the Storage Class definition removes this possibility. It could be made optional and checked by the CSI agent only when passed in, which reduces complexity if end users know they will avoid this scenario. If the agent uses lvmlockctl -i -d, the list of shared VGs is presented with the UUID and some other states, such as kill_vg and drop_vg, which flag that more action is required to resolve.
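A minimal sketch of the optional vguuid check, assuming the agent reads the output of `vgs --noheadings -o vg_name,vg_uuid`. The function name `select_vg`, the `vg_uuid` parameter, and the shortened UUIDs in `SAMPLE` are hypothetical and only for illustration; this is not the driver's actual code.

```python
def select_vg(vgs_output, vg_name, vg_uuid=None):
    """Pick the VG to use from `vgs --noheadings -o vg_name,vg_uuid` output.

    vg_uuid is optional: when it is None only the name is compared, and an
    ambiguous name is reported as an error instead of being guessed.
    """
    matches = []
    for line in vgs_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, uuid = fields
        if name == vg_name and (vg_uuid is None or uuid == vg_uuid):
            matches.append((name, uuid))
    if not matches:
        raise ValueError(f"no VG found for name={vg_name!r} uuid={vg_uuid!r}")
    if len(matches) > 1:
        raise ValueError(f"VG name {vg_name!r} is ambiguous; pass the vguuid")
    return matches[0]


# Illustrative output only: two VGs share a name, UUIDs are shortened.
SAMPLE = """
  lvmvg  AAAAAA-0001
  lvmvg  BBBBBB-0002
  other  CCCCCC-0003
"""

print(select_vg(SAMPLE, "lvmvg", "BBBBBB-0002"))
```

With the UUID omitted, `select_vg(SAMPLE, "lvmvg")` raises `ValueError`, which is the collision case described above.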
Graceful handling of failing to activate a LV on a node during NodePublish is difficult in the scope of CSI. LVM2 will prohibit the action so no check is required before issuing the lvchange -ay since the VG locks have this covered. When another server holds the LV active it can be from several conditions including:
The node failure will automatically resolve in 30 to 60 seconds as the shared VG lease is lost. The others are orchestration issues requiring user or maybe CRD Operator smarts to automatically resolve.
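The "just issue lvchange -ay and let LVM2 refuse" approach could look roughly like the sketch below. The helper name `activate_lv` and the injectable `run` parameter are assumptions made for illustration; how the failure is reported back through NodePublish is left open.

```python
import subprocess


def activate_lv(vg, lv, run=subprocess.run):
    """Try to activate vg/lv and return (ok, message).

    No pre-check is done: if another host holds the LV active, the VG locks
    make lvchange fail, and its stderr is passed back to the caller.
    """
    result = run(
        ["lvchange", "-ay", f"{vg}/{lv}"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True, ""
    return False, result.stderr.strip()
```

The caller can surface the returned message in the NodePublish error so that a user or operator sees why activation was refused.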
Here is an initial design/approach document for LVM2 shared volumes: https://docs.google.com/document/d/13D8Ht__66n66c5ZBThPwqhKjyJey0GFKVO3dgK3w8OA/edit?usp=sharing
Hi @TProhofsky, the shared vg document is ready and shared above. Can you PTAL and confirm if this looks okay to you or if any changes are required?
I added a comment to issue a vgscan if the LV is not found during mounting. I also suggested we add the lock method as the value instead of simply shared. This could be debated, since at this time no action needs to be taken by the CSI agent that is unique to one particular mode of lvmlockd.
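As a rough sketch of the vgscan suggestion, assuming the agent looks the LV up with `lvs --noheadings -o lv_name <vg>` and rescans once before giving up. The names `lv_exists` and `find_lv_with_rescan` are hypothetical.

```python
import subprocess


def lv_exists(vg, lv, run=subprocess.run):
    """Return True if `lvs` lists an LV called lv in volume group vg."""
    result = run(
        ["lvs", "--noheadings", "-o", "lv_name", vg],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False
    return lv in result.stdout.split()


def find_lv_with_rescan(vg, lv, run=subprocess.run):
    """Look for the LV; if it is missing, run vgscan once and look again."""
    if lv_exists(vg, lv, run):
        return True
    run(["vgscan"], capture_output=True, text=True)
    return lv_exists(vg, lv, run)
```

Only one rescan is attempted here; whether to retry more often is a design choice left to the agent.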
I don't understand how OwnerNodeID is being consumed, but to me it looks unnecessary. LVM2 holds the source of truth for which node owns the volume, by active status. Creating a shadow copy is OK, but root users can manually modify LVM2 and the agent can miss a transition, opening the risk of OwnerNodeID becoming stale. Whatever is looking at OwnerNodeID should look at the lvs state on each node to get a definitive answer.
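One way to derive the owner from LVM2 itself, sketched under the assumption that something collects `lvs --noheadings -o lv_name,lv_attr` output from every node. The fifth character of lv_attr is the state field, where `a` means active. The function names and the node names in the example are hypothetical.

```python
def is_active(lv_attr):
    """The fifth lv_attr character is the state; 'a' means active."""
    return len(lv_attr) >= 5 and lv_attr[4] == "a"


def active_nodes(lvs_output_by_node, lv_name):
    """Return the sorted list of nodes on which lv_name is active.

    lvs_output_by_node maps a node name to the text printed on that node by
    `lvs --noheadings -o lv_name,lv_attr`.
    """
    owners = []
    for node, output in sorted(lvs_output_by_node.items()):
        for line in output.splitlines():
            fields = line.split()
            if len(fields) == 2 and fields[0] == lv_name and is_active(fields[1]):
                owners.append(node)
    return owners


# Illustrative per-node output: the LV is active only on node2.
outputs = {
    "node1": "  pvc-1 -wi-------\n",
    "node2": "  pvc-1 -wi-a-----\n",
}
print(active_nodes(outputs, "pvc-1"))
```

An empty result means no node currently holds the LV active, and more than one entry would indicate shared activation or a problem worth flagging.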
We have hit a challenge while implementing this feature. To add shared VG support, we cannot use separate lvm binaries inside a container for LVM operations, because we need to use the same LVM configuration that is applied on the node. Therefore, we tried to perform the operations using the binaries already present on the node.
A small POC for this is in progress here -> https://github.com/Ab-hishek/lvm-localpv/commit/09de4553c637c9b364ffc6f15a0e2b7131177b2e
However, when the node's lvm binaries are executed from inside the container, the commands hang and do not run to completion. The strace -f output for the commands being executed can be found here: https://pastebin.com/raw/DwQfdmr8.
This problem is also being discussed with the lvm community. Follow this thread for updates regarding the discussion going on for this issue.
We need to investigate further, which kernel call is causing the issue and why. Will keep posting the findings on it as it progresses.
I have updated the issue with all the findings and bottlenecks. For better clarity, the following action items can be taken up to find the root cause of the above problem and fix it. Point to note: after forcefully interrupting the process (the lvm create command), the LV is created and shows up in the lvs command output, but we are not sure whether the created LV is healthy, i.e. able to perform I/O successfully. To be sure of that:
This way, maybe we can narrow down our debugging domain. If anybody is interested in working on this issue, feel free to jump in and get your hands dirty with the above experimental tasks and the subsequent fixes :)
Unassigning myself for others who are interested to take up the issue. Feel free to reach out to me for any help.
DLM and Sanlock have several known issues and I would suspect some unknown issues. Seagate has contributed the IDM locking mode to LVM2 2.3.13 which is now in Red Hat and Rocky Linux 8.6. The IDM method replaces the DLM/Sanlock version of lvmlockd with something more robust. Let me know if you want to collaborate on getting an IDM compatible system setup for development.
Hi, I am new to this project and am looking for a way to expose a single physical volume (backed by a libvirt volume) to multiple nodes (multiple VMs, on the same libvirt host). This issue sounds like exactly what I'm looking for - so am happy to get involved - I'm probably not able to solve the problems here, at least in the short term, but I'm happy to test etc. The topolvm project that was referenced in the LVM ML thread looked like it might have an interesting way to solve the issue of running lvm binaries in a container - with a daemon that runs on the node directly.
I note that the readme on this repo refers to VGs being available on multiple nodes (https://github.com/openebs/lvm-localpv#volumegroup-availability) - I'm not sure exactly what this means.. on the surface, this implies it is multiple nodes interacting with the same VG, but the existence of this issue makes me doubt my assumption. Is this really just talking about 2 VGs with the same name, but different backing disk or something (so, different data of course)?
Nathan, the feature of LVM2 that allows shared storage architectures (multiple servers connected to the same block device(s)) is described by the lvmlockd man page: https://man7.org/linux/man-pages/man8/lvmlockd.8.html. When an LVM2 Volume Group is created with the --shared flag and locks are started, LVM2 will manage metadata updates and LV activation across all attached servers of the cluster.
Hi! I am familiar with this feature - I have used this on iSCSI shared volumes in the past. The flag is --clustered, rather than --shared though, right? Or is there another flag I'm not aware of?
My confusion is that the readme for this repo describes a VG available on multiple nodes - which I understand that this issue is addressing and is not yet complete, so I am not clear what the current behaviour is intended to be. From the readme:
The above storage class tells that volume group "lvmvg" is available on nodes lvmpv-node1 and lvmpv-node2 only. The LVM driver will create volumes on those nodes only.
Is this describing VGs with a common name, but with different underlying PVs (i.e. one per node) and so different data on disk? If so, I can look at clarifying that documentation so others don't trip on it like I have :-)
Or does this describe shared storage and expect that the VGs are running with --clustered - in which case what does this issue intend to deliver that's different to that?
Needs to be scoped as an enhancement. Bringing to notice.
This issue needs more investigation and debugging. Scoping this for the v4.3 release.
Describe the problem/challenge you have
LVM2 supports shared volume groups that are implemented with different types of clustered LVM modules. Shared volume groups are used in deployment topologies where storage devices are connected to multiple storage nodes, for example, using SAS modules to connect multiple storage nodes to a single (or multiple) JBODs.
In such cases, administrators can create a shared VG using projects like ( https://github.com/Seagate/propeller ), that allow for:
Describe the solution you'd like
For the LVM Local PV to use shared VGs, the following needs to be changed/modified:
Anything else you would like to add: The solution should work for both shared and non-shared volume groups. The design should attempt to keep the changes as generic as possible. The specifics, depending on the type of clustered solution used, should be clearly abstracted via the storage class, where VG parameters are specified.