lvmteam / lvm2

Mirror of upstream LVM2 repository
https://gitlab.com/lvmteam/lvm2
GNU General Public License v2.0

LVM volumes mapped without udev cause systemd to stall booting trying to activate swap #153

Closed: desultory closed this issue 3 months ago

desultory commented 3 months ago

Systemd seems to think this is an issue with your udev rules. Volumes are activated using vgchange -ay, lvscan, then vgscan --mknodes: https://github.com/desultory/ugrd/blob/main/src/ugrd/fs/lvm.py

https://github.com/systemd/systemd/issues/33920

prajnoha commented 3 months ago

After switching from the initrd to the root fs, the udev trigger is called (systemd-udev-trigger.service) to reevaluate all the devices that have been activated up to this point (including the ones from the initrd phase). That trigger uses add udev events. However, DM (including subsystems like LVM, crypt...) udev rules do not normally react to these add events because, during DM device activation and switch-root, there is this sequence:

  1. DM device created in the kernel; no udev event generated yet; device not usable/ready yet
  2. DM device table loaded; add uevent generated; device not ready yet (note: for kernels < 5.15, the event was generated during 1.)
  3. DM device resumed (the DM table is active now); change uevent generated; device ready now
  4. switch-root
  5. Calling a trigger with an add event any time after 3.: the rules are reevaluated (otherwise they are skipped, as the device is considered "not ready yet")

To be able to tell the difference between 1. and 5. within udev rules, we need to look at the current udev database records. For that, we need to keep the udev records from the initrd phase too; udev rules in the initrd use OPTIONS+="db_persist" for that.

If udevd is not running in initrd to track the state properly and keep the records around the switch-root and then we run the trigger for udevd in rootfs to reevaluate, the DM/LVM udev rules are skipped due to inability to jump into the device-tracking state machine properly. That means, we do not normally support the case when devices in initrd are created manually without udevd and then suddenly tracked by udevd in rootfs.

I'll see if we can add some other form of a simple check in the udev rules to cover this situation somehow, but normally, right now, we do not support that.
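
The event sequence described above can be illustrated with a toy model. This is only a sketch of the logic as explained in this comment, not the actual DM/LVM udev rules: the function name, the "ready" marker, and the dictionary standing in for the udev database are all hypothetical.

```python
def rules_evaluated(db, dev, action, genuine):
    """Toy model: return True if the DM rules would fully process this event.

    db      -- dict standing in for the udev database (hypothetical)
    genuine -- True for a kernel-generated uevent, False for a synthetic
               one such as those sent by a udev trigger
    """
    if genuine and action == "change":
        # Step 3: the table was resumed, so record that the device is ready.
        db[dev] = "ready"
        return True
    # A genuine "add" (step 2) or any synthetic event is only processed
    # if an earlier record says the device already became ready.
    return db.get(dev) == "ready"


# Case 1: udevd ran in the initrd and its records survived switch-root.
db = {}
with_udevd = [
    rules_evaluated(db, "dm-0", "add", genuine=True),     # step 2
    rules_evaluated(db, "dm-0", "change", genuine=True),  # step 3
    rules_evaluated(db, "dm-0", "add", genuine=False),    # step 5 (trigger)
]

# Case 2: no udevd in the initrd, so the rootfs udevd starts with no records.
without_udevd = rules_evaluated({}, "dm-0", "add", genuine=False)

print(with_udevd, without_udevd)
```

In the second case the synthetic add event finds no record of the device ever becoming ready, which corresponds to the rules being skipped after switch-root.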

prajnoha commented 3 months ago

This problematic scenario comes from the fact that udev has just two usable events for notification: add and change. Also, all events can be triggered artificially, which is what happens during systemd-udev-trigger after switch-root. Building a proper state machine from just these two events then becomes a bit tricky, as we need to be able to tell that the event sequence during the actual DM device activation has passed correctly.

zkabelac commented 3 months ago

It's worth noting that we are thinking about enhancing the usage of the DM UUID to eventually encode all the persistent info into it. However, this is a really tricky and complicated part, so it's not clear whether we will make it.

prajnoha commented 3 months ago

It's worth noting that we are thinking about enhancing the usage of the DM UUID to eventually encode all the persistent info into it. However, this is a really tricky and complicated part, so it's not clear whether we will make it.

Right, that reminds me that at the moment the (genuine) change event (step 3. in the sequence above) also comes with flags encoded inside the uevent environment that tell udev which /dev/ content should be created for the device (based on whether that device is "private" or not, etc.). So yes, probably the best and quickest way to support an initrd without udev while switching to a udev-powered root fs is to simply record the udev events (at least for block devs/DM) in the initrd and then replay them before the actual trigger is called in the root fs.

desultory commented 3 months ago

It's worth noting that we are thinking about enhancing the usage of the DM UUID to eventually encode all the persistent info into it. However, this is a really tricky and complicated part, so it's not clear whether we will make it.

Right, that reminds me that at the moment the (genuine) change event (step 3. in the sequence above) also comes with flags encoded inside the uevent environment that tell udev which /dev/ content should be created for the device (based on whether that device is "private" or not, etc.). So yes, probably the best and quickest way to support an initrd without udev while switching to a udev-powered root fs is to simply record the udev events (at least for block devs/DM) in the initrd and then replay them before the actual trigger is called in the root fs.

Do you know of a straightforward way to record these events and play them back? I would like to avoid adding large hacks if I can avoid it, and if this can be fixed upstream, I'd rather wait for a patch. I'm adding this support to make my tool compatible with the encrypted setups on Debian and Fedora installers. It works but the boot stalls because of this issue. Nobody is really asking for that support, and this was supposed to be a sanity check/test case.

desultory commented 3 months ago

If udevd is not running in initrd to track the state properly and keep the records around the switch-root and then we run the trigger for udevd in rootfs to reevaluate, the DM/LVM udev rules are skipped due to inability to jump into the device-tracking state machine properly. That means, we do not normally support the case when devices in initrd are created manually without udevd and then suddenly tracked by udevd in rootfs.

I'll see if we can add some other form of a simple check in the udev rules to cover this situation somehow, but normally, right now, we do not support that.

So does this mean that udev is a hard requirement at mount time for LVM2 when systemd is used, or at least its fstab generator is used?

zkabelac commented 3 months ago

systemd and its units & services are based around udevd - so it's no surprise lvm2 is expecting these two daemons working together.

When you use a system WITHOUT udev, you should also be using lvm2 compiled without udev support, or at least 'decorate' commands so that they do not wait for udev and let lvm2 create the dev nodes/symlinks itself.

Clearly the idea of running a system without udev, then later enabling udevd and expecting things to 'magically' start working, was never among our supported scenarios.

Udev rules do some work based on the ordering of events, and things cannot work properly without knowing the order. It is also critical to 'separate' operations that create/add devices from those that remove devices and make sure these 'rule ops' do not overlap; otherwise the resulting state of /dev is simply non-deterministic.

I'm not clear what the final goal should be here, but I see the idea of building a ramdisk without udev and later switching to a rootfs with udev running as a highly fragile plan that may work properly only on a very restricted subset of supported systems (maybe some 'virtual' machines)....

desultory commented 3 months ago

systemd and its units & services are based around udevd - so it's no surprise lvm2 is expecting these two daemons working together.

When you use a system WITHOUT udev, you should also be using lvm2 compiled without udev support, or at least 'decorate' commands so that they do not wait for udev and let lvm2 create the dev nodes/symlinks itself.

Clearly the idea of running a system without udev, then later enabling udevd and expecting things to 'magically' start working, was never among our supported scenarios.

Udev rules do some work based on the ordering of events, and things cannot work properly without knowing the order. It is also critical to 'separate' operations that create/add devices from those that remove devices and make sure these 'rule ops' do not overlap; otherwise the resulting state of /dev is simply non-deterministic.

I'm not clear what the final goal should be here, but I see the idea of building a ramdisk without udev and later switching to a rootfs with udev running as a highly fragile plan that may work properly only on a very restricted subset of supported systems (maybe some 'virtual' machines)....

The initramfs I'm making specifically tries to only pull in drivers needed for mounting the root filesystem, mounts it, and hands things off to the main init system. If systemd requires a very specific pre-boot environment, I think these requirements should be made clear.

Mounting the rootfs in a plain and sensible way that works the same way on every boot has been working very well for me. I use udev on many of my systems, and it doesn't seem to really have any issue with being started "late", as in not in the initramfs. This only seems to be an issue with the systemd fstab generator, since it takes a working fstab and breaks it by making it depend on udev, which clearly can't be relied on.

There has to be a better solution than "systemd mounts only work if you started udev before everything, even in the initramfs, even if you don't need its functionality at that time".

desultory commented 3 months ago

For what it's worth, I don't think this issue is related to swap, I think it applies for any mount but the root mount, if the filesystem was activated before udev was: https://github.com/systemd/systemd/issues/33948

teigland commented 3 months ago

I wonder if you could avoid these issues by narrowing the lvm activation to only the root LV. i.e. replace 'vgchange -ay' with 'lvchange -ay vg/root'. In dracut we attempt to only activate LVs named on the kernel command line, because activating everything has a number of potential issues.
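
As a rough illustration of activating only the LVs named on the kernel command line, the sketch below extracts `rd.lvm.lv=` arguments (the dracut option name) from a command-line string. The helper name and the example string are hypothetical; a real initramfs would read `/proc/cmdline` instead.

```python
def lvs_from_cmdline(cmdline):
    """Return the 'vg/lv' value of every rd.lvm.lv= argument, in order."""
    prefix = "rd.lvm.lv="
    return [arg[len(prefix):] for arg in cmdline.split() if arg.startswith(prefix)]


# Made-up command line for illustration only.
example = "root=UUID=hypothetical-uuid rd.lvm.lv=debian-vg/root ro quiet"
print(lvs_from_cmdline(example))
```

Each returned value could then be passed to 'lvchange -ay' instead of running 'vgchange -ay' on everything.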

zkabelac commented 3 months ago

I believe you are partially looking at this issue from an incorrect angle.

It's not about 'systemd mounts only work'.... it's rather that you are trying to 'combine' things in an unsupported way.

Udev works properly when it's properly notified about the state of devices from the beginning. The transfer of udev DB info between the ramdisk and the rootfs is clearly there for a reason.

What you are trying to get is a 'perfectly' working udev when it's missing some info that can currently only be collected at the moment of device initialization, as there is simply no other place where the same information is stored. And since udevd is normally taken for 'granted', no one really bothers to solve this issue, as it is in fact non-trivial.

zkabelac commented 3 months ago

The systemd mounting does require some sequence of operations, and is based on proper udevd notification.

However, you can always opt out of this 'systemd' automation and do things the old-fashioned way, dropping the 'event based' logic and replacing it with traditional shell scripting, which should still work...

teigland commented 3 months ago

Part of the problem with udev is that it's so fragile -- one little detail doesn't work quite as expected and things fail. IMO we should be more resilient than that.

desultory commented 3 months ago

I wonder if you could avoid these issues by narrowing the lvm activation to only the root LV. i.e. replace 'vgchange -ay' with 'lvchange -ay vg/root'. In dracut we attempt to only activate LVs named on the kernel command line, because activating everything has a number of potential issues.

I can try doing this. The current logic is designed to be open ended in how it activates things, as later it tries to find fs uuids from those mounts. If this is unreliable for some reason, I can make the activation more narrow, but I hesitate to make it narrower in what it mounts since that requires extra logic, and I was under the impression that it was safe to just let the lvm tools do their things.

I believe you are partially looking at this issue from an incorrect angle.

It's not about 'systemd mounts only work'.... it's rather that you are trying to 'combine' things in an unsupported way.

Maybe so, but I'd like to know where it says "udev must be running in the initramfs for fstab mounts to work".

Udev works properly when it's properly notified about the state of devices from the beginning. The transfer of udev DB info between the ramdisk and the rootfs is clearly there for a reason.

This seems very very fragile.

What you are trying to get is a 'perfectly' working udev when it's missing some info that can currently only be collected at the moment of device initialization, as there is simply no other place where the same information is stored. And since udevd is normally taken for 'granted', no one really bothers to solve this issue, as it is in fact non-trivial.

Maybe using udev shouldn't be taken for granted, especially before the real PID1 runs. I think it makes sense for an initramfs to fill the simple role of mounting the rootfs. If it's able to do that in a manner that allows the init to run, that seems like a success.

The systemd mounting does require some sequence of operations, and is based on proper udevd notification.

However, you can always opt out of this 'systemd' automation and do things the old-fashioned way, dropping the 'event based' logic and replacing it with traditional shell scripting, which should still work...

This is what I've tried to do. Things are mounted properly and simply before systemd runs, but then systemd runs and can't do additional mounts. If I could tell systemd to just use the fstab like normal, this would be a non-issue.

Part of the problem with udev is that it's so fragile -- one little detail doesn't work quite as expected and things fail. IMO we should be more resilient than that.

I agree with this 100%, I think things could be better.

teigland commented 3 months ago

The initrd should be focused on only mounting the root fs, and nothing more. That also means only activating the root LV and nothing more.

desultory commented 3 months ago

The initrd should be focused on only mounting the root fs, and nothing more. That also means only activating the root LV and nothing more.

Currently I do the following:

vgchange -ay
lvscan
vgscan --mknodes

Would this alone work:

lvchange -ay <vgname>/<rootlvname>

I added the vgscan --mknodes bit in an attempt to fix this issue, and kept it because having those mapped device nodes seemed generally helpful.

Within scope of the initramfs, it only needs to activate the volume enough that it can find the uuid and use that to mount it later.


An important note is that this does not fix issues with btrfs subvol mounting later. Even if the rootfs is mounted properly, systemd will fail to mount subvolumes from that fs, if the fs was not mounted with udev.

teigland commented 3 months ago

Would this alone work:

lvchange -ay <vgname>/<rootlvname>

Yes

I added the vgscan --mknodes bit in an attempt to fix this issue, and kept it because having those mapped device nodes seemed generally helpful.

Drop that, it will probably cause problems.

desultory commented 3 months ago

Thanks, so is that lvscan bit useless here too?

teigland commented 3 months ago

Thanks, so is that lvscan bit useless here too?

Yes

zkabelac commented 3 months ago

vgscan --mknodes is something that should be used in 'recovery' cases.

Normally if you don't have running udev and you expect lvm2 to create dev nodes - just set in lvm.conf

verify_udev_operations = 1

desultory commented 3 months ago

vgscan --mknodes is something that should be used in 'recovery' cases.

Normally if you don't have running udev and you expect lvm2 to create dev nodes - just set in lvm.conf

verify_udev_operations = 1

I'm not really sure what should be expected here. I was under the impression that things expected those nodes to be made, by udev or lvm itself, and that these would not be made later, since udev makes them at the time the device is first recognized.

teigland commented 3 months ago

I'm not really sure what should be expected here. I was under the impression that things expected those nodes to be made, by udev or lvm itself, and that these would not be made later, since udev makes them at the time the device is first recognized.

I'm not sure what will happen if the symlinks to the root LV are not created by anything, it's possible there are still problems.

desultory commented 3 months ago

I'm not really sure what should be expected here. I was under the impression that things expected those nodes to be made, by udev or lvm itself, and that these would not be made later, since udev makes them at the time the device is first recognized.

I'm not sure what will happen if the symlinks to the root LV are not created by anything, it's possible there are still problems.

so it seems using vgscan --mknodes would be a good idea?

desultory commented 3 months ago

Is there a way to get info such as the VG/LV names from /dev or /sys? I'm currently resolving DM stuff using that, and can't find that info in there. Do I need to determine that using the name entry? It's called debian--vg-root and that should be for VG/LV debian-vg/root
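
A minimal sketch of going from a device-mapper name to VG/LV names, assuming the usual LVM naming scheme in which hyphens inside a VG or LV name are doubled and a single hyphen separates the two. The function name is hypothetical, and the sketch ignores internal sub-LV suffixes; it should only be applied to devices already known to belong to LVM.

```python
def split_dm_name(dm_name):
    """Split a DM name such as 'debian--vg-root' into (vg, lv).

    Returns None when there is no single hyphen to split on.
    """
    placeholder = "\0"  # cannot occur in a device-mapper name
    # Hide the doubled (escaped) hyphens so only the separator remains.
    collapsed = dm_name.replace("--", placeholder)
    vg, sep, lv = collapsed.partition("-")
    if not sep:
        return None
    # Restore each escaped hyphen as a literal one.
    return vg.replace(placeholder, "-"), lv.replace(placeholder, "-")


print(split_dm_name("debian--vg-root"))
```

A non-LVM device whose name happens to contain a hyphen would be split as well, so a check that the device really is an LV needs to come first.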

zkabelac commented 3 months ago

Do not use 'vgscan --mknodes' unless you experience some missing nodes.

Use the lvm.conf option; that should ensure links are created in the proper location if udev is not running.

desultory commented 3 months ago

is there a cmdline arg to do verify_udev_operations = 1? I see no mentions of udev in the lvm.conf man page.

desultory commented 3 months ago

maybe I want the DM_DISABLE_UDEV environment variable?

teigland commented 3 months ago

Get the VG/LV name from lvm commands, e.g. 'lvs'.

lvchange -ay --config activation/verify_udev_operations=1 vg/root
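
For a build tool that generates the activation step, the command might be assembled as below. This is a sketch only: the helper name is hypothetical, and it simply combines the '-ay' activation flag with the '--config' override discussed in this thread.

```python
def lvchange_activate_cmd(vg, lv, verify_udev=True):
    """Build an argv list that activates a single LV."""
    cmd = ["lvchange", "-ay"]
    if verify_udev:
        # Override the lvm.conf setting for this one command.
        cmd += ["--config", "activation/verify_udev_operations=1"]
    cmd.append(f"{vg}/{lv}")
    return cmd


print(" ".join(lvchange_activate_cmd("debian-vg", "root")))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` so that VG or LV names never pass through a shell.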

desultory commented 3 months ago

Get the VG/LV name from lvm commands, e.g. 'lvs'.

I'm confused about this operation, it seems to not want to let you use the plain 'dm-x' device path, but will use the mapper path? I would like to use the dm-x device nodes since they can be checked with /sys

My bad, I was using the wrong path, it seems to work. Should I be using --devices?

Wait, it actually doesn't get the info I want:

$ sudo lvs --report-format json /dev/mapper/debian--vg-root; sudo lvs --report-format json --devices /dev/dm-0
  {
      "report": [
          {
              "lv": [
                  {"lv_name":"root", "vg_name":"debian-vg", "lv_attr":"-wi-ao----", "lv_size":"<30.05g", "pool_lv":"", "origin":"", "data_percent":"", "metadata_percent":"", "move_pv":"", "mirror_log":"", "copy_percent":"", "convert_lv":""}
              ]
          }
      ]
  }
  {
      "report": [
          {
              "lv": [
                  {"lv_name":"root", "vg_name":"debian-vg", "lv_attr":"-wi-ao----", "lv_size":"<30.05g", "pool_lv":"", "origin":"", "data_percent":"", "metadata_percent":"", "move_pv":"", "mirror_log":"", "copy_percent":"", "convert_lv":""},
                  {"lv_name":"swap_1", "vg_name":"debian-vg", "lv_attr":"-wi-a-----", "lv_size":"980.00m", "pool_lv":"", "origin":"", "data_percent":"", "metadata_percent":"", "move_pv":"", "mirror_log":"", "copy_percent":"", "convert_lv":""}
              ]
          }
      ]
  }

If I run that against the mapped logical volume, it gives me the info I want, but I have to run it against dm-0 which is the LUKS volume to get the associated LVM devices, which returns all of them. If I run it against dm-1 which is the root lv, it returns nothing. I'd have to do some guesswork to determine which of those is really associated with the root since the device node/path is not mentioned.
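
One way to avoid the guesswork might be to ask lvs to report each LV's device-mapper path and match on that. The sketch below assumes the report was produced with something like 'lvs --reportformat json -o lv_name,vg_name,lv_dm_path'; the `lv_dm_path` field should be checked against 'lvs -o help' on the target system. The sample JSON is made up for illustration and the helper name is hypothetical.

```python
import json


def find_lv_by_dm_path(report_json, dm_path):
    """Return (vg_name, lv_name) for the LV whose lv_dm_path matches, else None."""
    report = json.loads(report_json)
    for section in report.get("report", []):
        for lv in section.get("lv", []):
            if lv.get("lv_dm_path") == dm_path:
                return lv["vg_name"], lv["lv_name"]
    return None


# Hand-written sample shaped like the lvs JSON report shown above.
sample = json.dumps({"report": [{"lv": [
    {"lv_name": "root", "vg_name": "debian-vg",
     "lv_dm_path": "/dev/mapper/debian--vg-root"},
    {"lv_name": "swap_1", "vg_name": "debian-vg",
     "lv_dm_path": "/dev/mapper/debian--vg-swap_1"},
]}]})

print(find_lv_by_dm_path(sample, "/dev/mapper/debian--vg-root"))
```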

desultory commented 3 months ago

Get the VG/LV name from lvm commands, e.g. 'lvs'.

lvchange -ay --config activation/verify_udev_operations=1 vg/root

Are there docs on these options? I see a few similar options like that environment variable, and I also see --noudevsync. I'm assuming that the verify_udev_operations config changes the operation of --activate?

zkabelac commented 3 months ago

You are misusing the 'lvs' command. Its syntax takes either nothing (it will then print all LVs it can see in the system with the given config) or the VG name; passing a device path makes no sense (please follow 'man lvs').

Also, lvm2 (man lvm) recommends using /dev/vgname/lvname as a path (as this path clearly refers to a public LV).

Using /dev/dm-XXX is possible. However, I cannot imagine how anyone could use such a device path in a meaningful way with fstab when dm-XXX depends on the order of device creation and removal, so it is pretty much random... (unless the user knows there is just one activated DM device)

The list of configurable settings can be seen in lvm.conf (normally should be distributed with comments)

You can also get it via command:

lvmconfig --type default --withcomments

(see man lvmconfig for full option description)

desultory commented 3 months ago

Using /dev/dm-XXX is possible. However, I cannot imagine how anyone could use such a device path in a meaningful way with fstab when dm-XXX depends on the order of device creation and removal, so it is pretty much random... (unless the user knows there is just one activated DM device)

I would like to use the "dm-x" names because this is what my system uses for all device-mapper mounts. It tracks the following info: [image]. I could use the name and query /dev/mapper, but I'm not sure how much I can rely on this. A /dev/dm-x node should always be made, if the lvm volume is mounted, without exception, right?

Also lvm2 (man lvm) recommends using /dev/vgname/lvname as a path (as this path is a clearly public LV)

My issue with this is that I'm trying to determine the lvm vg/lv name from the dm-x name, major/minor, or dm "name" attribute. How do I even know if a dir in /dev/ is for a vg, or something else. I would like to avoid checking tons of stuff's maj/minor in hopes of it being a match, then passing that to an external tool.

zkabelac commented 3 months ago

Using /dev/dm-XXX names doesn't make much sense, since these names are NOT stable and may easily change between activations. For example, if you simply create a snapshot of your rootfs, on your next boot your debian-vg/root LV will have a completely different DM node. With your crypt devices you should likely use the /dev/mapper/ path, as again dm-XXX is just a 'generic' sequence made by the kernel when new DM devices appear in your system (XXX is just the 'minor' number). (Using /dev/mapper for LVs comes with the problem that you need to handle '-' escaping within the name; you will not have this problem if you use /dev/vgname/lvname.)

You can also try to use UUID (i.e. for fstab) - however this is somewhat hard to remember for human brain - but not a problem with mouse cut&paste....

And how do you know if the /dev is your VG - well simply create a unique VG name :) that's the whole magic it takes....

desultory commented 3 months ago

Using /dev/dm-XXX names doesn't make much sense, since these names are NOT stable and may easily change between activations. For example, if you simply create a snapshot of your rootfs, on your next boot your debian-vg/root LV will have a completely different DM node. With your crypt devices you should likely use the /dev/mapper/ path, as again dm-XXX is just a 'generic' sequence made by the kernel when new DM devices appear in your system (XXX is just the 'minor' number). (Using /dev/mapper for LVs comes with the problem that you need to handle '-' escaping within the name; you will not have this problem if you use /dev/vgname/lvname.)

This is fine, this information is used at build time. It already has all required info to mount things, it's just missing the vg/lv name which I would be using to restrict what is initialized by lvm in the initramfs.

You can also try to use UUID (i.e. for fstab) - however this is somewhat hard to remember for human brain - but not a problem with mouse cut&paste....

And how do you know if the /dev is your VG - well simply create a unique VG name :) that's the whole magic it takes....

The issue is that I don't know how I'm supposed to go from the mountpoint listed for /, to a vg/lv name, simply. I think the solution is to use the "name" listed in the sys info, which would always correspond to a /dev/mapper/ entry. I'm not sure if that is the best way. It would be great if there were some LVM tool I could feed a mountpoint into, and it tells me if it's part of a lv/vg and info about that.

like:

lvresolve /
vg0/root
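
No 'lvresolve' tool is named anywhere in this thread; the first step of such a lookup, finding the source device of a mountpoint, can be sketched from the /proc/mounts format. The function name and sample text are hypothetical, and a real caller would pass in the contents of /proc/mounts.

```python
def mount_source(mounts_text, mountpoint):
    """Return the source device mounted at mountpoint, or None."""
    source = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # /proc/mounts writes spaces inside paths as the octal escape \040.
        target = fields[1].replace("\\040", " ")
        if target == mountpoint:
            # Keep the last match: later mounts shadow earlier ones.
            source = fields[0]
    return source


# Made-up sample in /proc/mounts format.
sample = (
    "proc /proc proc rw,nosuid 0 0\n"
    "/dev/mapper/debian--vg-root / ext4 rw,relatime 0 0\n"
)
print(mount_source(sample, "/"))
```

The returned /dev/mapper path still has to be translated to a vg/lv pair in a separate step.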

zkabelac commented 3 months ago

The issue is that I don't know how I'm supposed to go from the mountpoint listed for /, to a vg/lv name

And we are closing the circle now. That's why there is udev: it examines arriving devices to see whether they hold something needed to bring up the rootfs, so we are able to explore devices and figure out what needs to be activated by systemd.

And since you want to hard-code things in your ramdisk, you need to place 'lvchange -ay vg/lv' somewhere and then just mount your /dev/vg/lv path as your '/' rootfs...

desultory commented 3 months ago

The issue is that I don't know how I'm supposed to go from the mountpoint listed for /, to a vg/lv name

And we are closing the circle now. That's why there is udev: it examines arriving devices to see whether they hold something needed to bring up the rootfs, so we are able to explore devices and figure out what needs to be activated by systemd.

And since you want to hard-code things in your ramdisk, you need to place 'lvchange -ay vg/lv' somewhere and then just mount your /dev/vg/lv path as your '/' rootfs...

Currently it simply activates LVM stuff, and uses like "mount UUID=x /newroot" to mount the root. What it really cares about is the UUID of the underlying filesystem, it doesn't necessarily need to care about LVM stuff for the sake of mounting things. This only seems to cause an issue with systemd things later because it's confused that these things are mapped but not setup by udev.

My system very reliably mounts the rootfs, as long as it's on some lvm device that is initialized by vgchange -ay. Right now the only missing piece to limit what it initializes, in hope of appeasing systemd, is to reliably determine the vg/lv name of a certain mount point.

desultory commented 3 months ago

And to be very clear, even if i limited it to only init the root lv, systemd would still fail to mount subvolumes, if btrfs is used. I think this fix I'm trying to make will potentially fix the lvm issues, by only setting up lvm for the root, but it won't really solve the underlying issue of systemd fundamentally expecting every single device is mounted with udev running, even if it happens pre-init.

zkabelac commented 3 months ago

Clearly we cannot solve all your problems here. btrfs is even something you should possibly not combine with lvm2, as btrfs uses its own volume management; you are basically degrading your performance by duplicating the volume management layer in your device stack.

ATM I'm not seeing a bug in lvm2, and it's not even clear whether there is some RFE to enhance the lvm2 code base to handle something in your weird setup better, as lvm2 can likely support all your requirements with some optional lvm2 command annotation.

This issue tracker for the lvm2 project is likely not the best discussion forum for your 'new ramdisk solution/project' for Gentoo development.

desultory commented 3 months ago

Clearly we cannot solve all your problems here. btrfs is even something you should possibly not combine with lvm2, as btrfs uses its own volume management; you are basically degrading your performance by duplicating the volume management layer in your device stack.

The thing is that I'm not trying to add any volume management. For the sake of booting, running vgchange -ay does literally everything needed to stage things to mount the rootfs and switch_root. The issue is that this confuses systemd because it seems to expect that udev is running before the init even runs.

ATM I'm not seeing a bug in lvm2, and it's not even clear whether there is some RFE to enhance the lvm2 code base to handle something in your weird setup better, as lvm2 can likely support all your requirements with some optional lvm2 command annotation.

Systemd told me it was a bug with the lvm udev rules, so I opened an issue here. I'm trying to understand the problem. When it comes to improving LVM, I think it would make sense if there was a simple/straightforward way to resolve the vg/lv info from a certain mountpoint, without extra steps or guesses. I mean simply saying "I know something is mounted at /mnt/whatever, is that a LVM volume? if so, what is the vg/lv info so I can know how to address it." At the moment, you can see the mount source device from /proc/mounts, but that seems to use the /dev/mapper path and not the /dev/vgname/lvname path, meaning I have to use several operations to get vg/lv info.

To be honest, I'm surprised this info isn't present in the sysfs info, like at /sys/block/dm-0/dm/vg
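
What sysfs does expose for a DM device is its name and UUID, under /sys/block/dm-N/dm/. A minimal sketch of reading them follows, with the assumption that UUIDs of LVM-managed devices begin with "LVM-". The function name and the `sysfs_root` parameter (there to make the code testable) are hypothetical.

```python
from pathlib import Path


def dm_info(dm_node, sysfs_root="/sys"):
    """Read the DM name and UUID for a node such as 'dm-1' from sysfs."""
    dm_dir = Path(sysfs_root) / "block" / dm_node / "dm"
    name = (dm_dir / "name").read_text().strip()
    uuid = (dm_dir / "uuid").read_text().strip()
    # Assumption: LVM-managed devices carry a UUID starting with "LVM-".
    return {"name": name, "uuid": uuid, "is_lvm": uuid.startswith("LVM-")}
```

On a system laid out like the one in this thread, `dm_info("dm-1")` would be expected to report the name debian--vg-root, which can then be mapped back to a vg/lv pair.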

This issue tracker for the lvm2 project is likely not the best discussion forum for your 'new ramdisk solution/project' for Gentoo development.

FWIW it works fine with Gentoo; I'm hitting these issues because I'm trying to add support for Fedora and Debian. I would like this project to work on any distro and to be as simple/generic as possible. I think adding udev to the system because doing simple mount operations breaks systemd is a problem with systemd, or specifically udev. I don't see the sense in integrating a fragile, breakable system. An initramfs should simply mount the rootfs; including udev means it's doing much more than that.

zkabelac commented 3 months ago

The thing is that I'm not trying to add any volume management. For the sake of booting, running vgchange -ay does literally everything needed to stage things to mount the rootfs and switch_root. The issue is that this confuses systemd because it seems to expect that udev is running before the init even runs.

And as has been explained, to activate just the one particular LV you need, pass the info about which devices are needed into your ramdisk. For example, check how dracut handles the rd.lvm & rd.luks options to see how to solve this Catch-22 issue.

Systemd told me it was a bug with the lvm udev rules, so I opened an issue here. I'm trying to understand the problem. When it comes to improving LVM, I think it would make sense if there was a simple/straightforward way to resolve the vg/lv info from a certain mountpoint, without extra steps or guesses. I mean simply saying "I know something is mounted at /

There is no bug at lvm2 and likely not even in systemd - it's just incorrect usage on your side - systemd is integrated with udev in a pretty strong way.

lvm2 is a 'simple' volume manager project, not a project to resolve all mounting troubles in every Linux system ;) There are many ways to skin a cat, so we can likely only advise the way we see as the most natural...

However when you use tools that expect udev will be there - you need to notify them about your changes

mnt/whatever, is that a LVM volume? if so, what is the vg/lv info so I can know how to address it." At the moment, you can see the mount source device from /proc/mounts, but that seems to use the /dev/mapper path and not the /dev/vgname/lvname path, meaning I have to use several operations to get vg/lv info.

As said, lvm2 tries to understand many ways of passing in a VG/LV. The kernel has historically (likely from the lvm1 era) displayed some info, but users are not required to use it, and we simply advise using /dev/vgname/lvname as the device path, as otherwise you need to solve far more naming troubles.

It's worth noting that even today's lvm2 still works on systems with a 2.6 kernel, which do not have anything like /dev/dm-xxx....

This issue tracker for the lvm2 project is likely not the best discussion forum for your 'new ramdisk solution/project' for Gentoo development.

FWIW it works fine with Gentoo; I'm hitting these issues because I'm trying to add support for Fedora and Debian. I would like this project to work on any distro and to be as simple/generic as possible. I think adding udev to the system because doing simple mount operations breaks systemd is a problem with systemd, or specifically udev. I don't see the sense in integrating a fragile, breakable system. An initramfs should simply mount the rootfs; including udev means it's doing much more than that.

This however makes your project limited to be used only with a limited subset of system with some very basic disk layouts (which could be perfectly ok for your solution) .... however as soon as you will start to worry about any raid, multipath, iscsi setup you will start to notice wider picture.... and your simple 'solution' will start to complicate....

Anyway it's worth to be noted there is on going SID project....

desultory commented 3 months ago

The thing is that I'm not trying to add any volume management. For the sake of booting, running vgchange -ay does literally everything needed to stage things to mount the rootfs and switch_root. The issue is that this confuses systemd because it seems to expect that udev is running before the init even runs.

And, as already explained: to activate just the one particular LV you need, you have to pass your ramdisk the info about which devices are needed - e.g. check how dracut handles the rd.lvm & rd.luks options as an example of how to solve the Head22 issue.
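
As a rough illustration of the "pass the needed devices in" approach mentioned above, here is a minimal Python sketch that picks `rd.lvm.lv=<vg>/<lv>` entries out of a kernel command line and turns them into per-LV activation commands. The `rd.lvm.lv` option name follows dracut's documented command-line options; the function names are hypothetical, and this is not how dracut itself is implemented.

```python
import shlex


def lvs_from_cmdline(cmdline):
    """Return the <vg>/<lv> values given as rd.lvm.lv=... on a kernel command line."""
    lvs = []
    for token in shlex.split(cmdline):
        key, _, value = token.partition("=")
        if key == "rd.lvm.lv" and "/" in value:
            lvs.append(value)
    return lvs


def activation_commands(cmdline):
    """Build one `lvchange -ay vg/lv` command per requested LV (hypothetical helper)."""
    return [["lvchange", "-ay", lv] for lv in lvs_from_cmdline(cmdline)]
```

In a real initramfs the command line would be read from /proc/cmdline; activating only the listed LVs avoids the blanket `vgchange -ay`.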

Systemd told me it was a bug with the lvm udev rules, so I opened an issue here. I'm trying to understand the problem. When it comes to improving LVM, I think it would make sense if there was a simple/straightforward way to resolve the vg/lv info from a certain mountpoint, without extra steps or guesses. I mean simply saying "I know something is mounted at /mnt/whatever, is that a LVM volume? if so, what is the vg/lv info so I can know how to address it." At the moment, you can see the mount source device from /proc/mounts, but that seems to use the /dev/mapper path and not the /dev/vgname/lvname path, meaning I have to use several operations to get vg/lv info.

There is no bug in lvm2 and likely not even in systemd - it's just incorrect usage on your side - systemd is integrated with udev in a pretty strong way.

lvm2 is a 'simple' volume manager project - not a project to resolve all mounting troubles in every Linux system ;) There are many ways to skin a cat - so we can likely advise the way we see as the most natural...

However, when you use tools that expect udev to be there - you need to notify them about your changes

As said, lvm2 tries to understand many ways of passing in VG/LV - the kernel historically (likely from the lvm1 era) displays some info - but this info does not need to be used by the user - and we simply advise using /dev/vgname/lvname for the device path, as otherwise you need to solve far more naming troubles

It's worth noting that even today's lvm2 still works on systems with a 2.6 kernel - which do not have anything like /dev/dm-xxx....

Would it be impossible to make lvm2 attempt to add this information to the sysfs stuff? I think this information is useful for working backwards. My system attempts to generate an init script for the initramfs that mounts the rootfs as it's currently mounted, on the build host (by default). In order to do this successfully, it should obtain information about the currently active mounts. Having some mechanism to clearly resolve the vg/lv used for a mount, given the mountpoint and nothing else, would be very nice to have.

This issue tracker for the lvm2 project is likely not the best discussion forum for your 'new ramdisk solution/project' for Gentoo development.

FWIW it works fine with Gentoo, I'm hitting these issues because I'm trying to add support for Fedora and Debian. I would like for this project to work on any distro and to be as simple/generic as possible. I think adding udev to the system because doing simple mount operations breaks systemd is a problem with systemd, or specifically udev. I don't see the sense in integrating a fragile, breakable system. An initramfs should simply mount the rootfs, including udev means it's doing much more than that.

This however limits your project to a small subset of systems with some very basic disk layouts (which could be perfectly ok for your solution) .... however as soon as you start to worry about any raid, multipath, iscsi setup you will start to notice the wider picture.... and your simple 'solution' will start to get complicated....

Anyway, it's worth noting there is the ongoing SID project....

My system barely uses LVM as is. I added LVM support in the form of a very basic module: https://github.com/desultory/ugrd/blob/main/src/ugrd/fs/lvm.py All the module aims to do is ensure lvm kmods are added, and activate lvm stuff blindly, so it can later run mount UUID=x with the uuid of the underlying filesystem. This is a very simple and reliable approach, but it upsets udev later.

The lvm autodetection is more advanced and is handled in the mount module: https://github.com/desultory/ugrd/blob/main/src/ugrd/fs/mounts.py#L296-L371 Where most of the logic is generic to "device mapper" types. The logic works about the same for detecting the source of a LUKS and LVM mount, I haven't added any other DM types, but I don't think it should be too different. It's mostly using the logic to find the parent container of mounts.

The support for root filesystem types is limited by what it's able to mount. It supports LVM very well with this method, it just seems that it "does too much". The fundamental issue here is that even if it did less, activating only the root volume, udev may still eventually be upset by that not being activated properly.

I did most of the testing with openrc and it doesn't care that you happen to scan for all logical volumes. I think this is sensible behavior. I think it's the same for dinit. My main issue is that systemd seems to reinvent the fstab, in a way that is more breakable and has more dependencies. I think it makes sense to try to support simpler initialization methods with fewer dependencies. I'm not sure how that works concerning support for older versions, but I would hope simple methods can be built out if possible.


The project goal of ugrd is to make simple, mostly static images that are designed to only do what they need to mount a specific root filesystem. I can see how this can become somewhat complicated adding many mount options, but it has various ways to organize this and allow slotting various methods in. It also makes attempts to validate most of the config, and is designed mostly for booting into the same (local) setup you are currently using. The initial design goal was to simply setup LUKS mounts, and support loading a few common key types. Many people end up making their own initramfs that's just a few lines of bash to simply mount something like this, then have trouble maintaining it because you have to worry about adding in kmods with every update, pulling in library dependencies for files, etc etc. Adding udev to the initramfs isn't very hard, but it's not trivial, and most of its features will not be used.

prajnoha commented 3 months ago

Well, thing is, that upstream moved to dynamic discovery of devices. That's the fact. That means using udev as that's what we have available at the moment for this purpose. 

The udev is core part of the system, if we like it or not, and many projects (including LVM and systemd) depend on that now. The change was happening 15 years ago. For LVM, it's still possible to switch to non-udev operation, but for systemd this is a core dependency. Unless there's a decent replacement, we need to use and depend on udev. I don't mean only userspace part of udev (the udev daemon and rules), but also the whole notification infrastructure bound to kobjects we have in kernel. Unless there's someone brave enough to change the kernel side as well, we should probably not rant about it.

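I think there are these main reasons for the udev usage in initrd:

* Sharing a common solution for both rootfs and initrd environment so we don't need to duplicate the functionality using different ways of discovery/activation.

* The number of combinations that could be used for a storage stack is rising as Linux ecosystem evolves through time (we simply have more possibilities now than 15-20 years ago, we have LVM with thin, RAID, VDO support; we have MD; iSCSI, NVMe, NVMe-oF.... lots of them... interleaved with crypt, whatever else).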

Sure, it doesn't seem to be that much to support if we say it's just raw disk, and then maybe encrypted and then maybe some LVM/MD layer on top of that. But that simply doesn't cover all the possible use cases that users have. There's a high risk people will end up asking for more and more support and that may end up unmaintainable if doing that the static way. I remember this exact situation with Fedora and the pre-dracut (the pre-udev-powered-initrd) era. We simply ended up saying "not supported" here.

With the dynamic, uevent-based initrds, with sharing the udev rules from the actual rootfs, we could cover that without reinventing the wheel for using different methods. And technically, it's not that much to add for the initrd to have udev support. The notifications are there anyway, can't be switched off. It's the daemon and rules, mostly copied from the rootfs.

Saying that, sure, udev is not perfect, it has its own pitfalls. But I'll be honest, doing this discovery/activation the static way without reacting to events, is actually a step back, several years back. Yes, it may cover some portion of use cases. But to support that upstream for all, we need a justification for spending resources on trying to make the non-udev-based and udev-based environments compatible.

We would need to provide a workaround, some hacks here - as for DM (and its subsystems like crypt and LVM), we would need to provide a simple uevent listener in the initrd, record the DM_COOKIE udev event variable for DM devices and then push them somehow for the trigger events that are used in rootfs to properly initialize udev database with existing devices. Since we can't create the environment variable directly when doing the trigger, only with the SYNTH_ARG_ prefix (https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-uevent), the handling around this can add an unnecessary source of complication in the udev rules, creating an exception here, which, as I remember, nobody has asked for in the 15 years since we moved to udev in LVM.

MD is very similar here - it also has that OPTIONS+="db_persist" in its rules so it also depends on transferring the udev records from initrd to rootfs. Maybe MD is a bit better positioned here as it has more sysfs content from which we can deduce the state of the device (the /sys/block/mdX/md/array_state for example), which DM doesn't provide (it has only /sys/block/dm-X/dm/{name,uuid,suspended} which is not that useful for tracking the state).

Why do DM and MD need that? Because they're device abstraction mechanisms usually encompassing more than one device. Unfortunately, udev was not designed for these, not taking them into account when designed to be more concrete (DM pre-dates udev). Udev was designed for simpler devices without activation consisting of several steps which we need to do to attach the desired functionality to the existing block-device infrastructure in kernel.

Frankly, I think the best here is just to use udev in initrd. Unless there's a strong justification not to do that.

desultory commented 3 months ago

Well, thing is, that upstream moved to dynamic discovery of devices. That's the fact. That means using udev as that's what we have available at the moment for this purpose.

I guess it depends on your definition of upstream, not every distro or person uses udev, it's good to have choice. If this is a LVM requirement, that's news to me, as it works fine without udev on OpenRC systems.

The udev is core part of the system, if we like it or not, and many projects (including LVM and systemd) depend on that now. The change was happening 15 years ago. For LVM, it's still possible to switch to non-udev operation, but for systemd this is a core dependency. Unless there's a decent replacement, we need to use and depend on udev. I don't mean only userspace part of udev (the udev daemon and rules), but also the whole notification infrastructure bound to kobjects we have in kernel. Unless there's someone brave enough to change the kernel side as well, we should probably not rant about it.

Core part of systemd, yes, not necessarily part of "Linux" as a whole. I think a monoculture should be avoided and can lead to poor design choices.

I think there are these main reasons for the udev usage in initrd:

* Sharing a common solution for both rootfs and initrd environment so we don't need to duplicate the functionality using different ways of discovery/activation.

* The number of combinations that could be used for a storage stack is rising as Linux ecosystem evolves through time (we simply have more possibilities now than 15-20 years ago, we have LVM with thin, RAID, VDO support; we have MD; iSCSI, NVMe, NVMe-oF.... lots of them... interleaved with crypt, whatever else).

Sure, it doesn't seem to be that much to support if we say it's just raw disk, and then maybe encrypted and then maybe some LVM/MD layer on top of that. But that simply doesn't cover all the possible use cases that users have. There's a high risk people will end up asking for more and more support and that may end up unmaintainable if doing that the static way. I remember this exact situation with Fedora and the pre-dracut (the pre-udev-powered-initrd) era. We simply ended up saying "not supported" here.

With the dynamic, uevent-based initrds, with sharing the udev rules from the actual rootfs, we could cover that without reinventing the wheel for using different methods. And technically, it's not that much to add for the initrd to have udev support. The notifications are there anyway, can't be switched off. It's the daemon and rules, mostly copied from the rootfs.

I don't have too much of an issue with bringing in udev rules for storage only, I think this makes sense. My issue is bringing in a whole library of unverified udev rules that do who knows what, with what dependencies. If things use pure udev, that's fine, but how do I know until runtime?

Saying that, sure, udev is not perfect, it has its own pitfalls. But I'll be honest, doing this discovery/activation the static way without reacting to events, is actually a step back, several years back. Yes, it may cover some portion of use cases. But to support that upstream for all, we need a justification for spending resources on trying to make the non-udev-based and udev-based environments compatible.

In some ways I think static is still better, it can be a lot simpler and easier to review. The point of the system I'm making is to create very simple images that boot a single system. There isn't really much of a need for anything dynamic, and it makes the boot process much simpler, and lowers the attack surface. I think it's important that at least one initramfs generator tries to "just mount the rootfs" instead of doing tons more and tying itself into a single particular system.

We would need to provide a workaround, some hacks here - as for DM (and its subsystems like crypt and LVM), we would need to provide a simple uevent listener in the initrd, record the DM_COOKIE udev event variable for DM devices and then push them somehow for the trigger events that are used in rootfs to properly initialize udev database with existing devices. Since we can't create the environment variable directly when doing the trigger, only with the SYNTH_ARG_ prefix (https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-uevent), the handling around this can add an unnecessary source of complication in the udev rules, creating an exception here, which, as I remember, nobody has asked for in the 15 years since we moved to udev in LVM.

I have been considering making my own system like this. I think it's a real problem that only one tool does what udev does and is widely used (as far as i can tell). I'm not sure if using a replacement would be supported. I think for the sake of booting, listening to events makes sense. That would actually solve one of the few things I dislike about UGRD's design - it doesn't know when storage is added. This isn't a big deal because I have it on a timer and users can press a key once they see the kernel line for storage, but this could obviously be better. I don't see the sense in adding all of udev to fix this small issue though, and many users don't want udev or don't even have it on the system that built it.

MD is very similar here - it also has that OPTIONS+="db_persist" in its rules so it also depends on transferring the udev records from initrd to rootfs. Maybe MD is a bit better positioned here as it has more sysfs content from which we can deduce the state of the device (the /sys/block/mdX/md/array_state for example), which DM doesn't provide (it has only /sys/block/dm-X/dm/{name,uuid,suspended} which is not that useful for tracking the state).

Why do DM and MD need that? Because they're device abstraction mechanisms usually encompassing more than one device. Unfortunately, udev was not designed for these, not taking them into account when designed to be more concrete (DM pre-dates udev). Udev was designed for simpler devices without activation consisting of several steps which we need to do to attach the desired functionality to the existing block-device infrastructure in kernel.

Frankly, I think the best here is just to use udev in initrd. Unless there's a strong justification not to do that.

The issue is that this system is supposed to boot any Linux system. OpenRC based, Systemd based, whatever. It builds that system by taking things from the host environment. I could make it try to pull udev on Systemd systems, maybe that's the best option, but then I'm maintaining a split brained codebase, or I just have udev in there as a vestigial artifact to make systemd not break.

I may just make my own simple udev recorder or something, that just captures events and passes them forward so things don't break, but I really do think there should at least be some alternative to udev. I don't think it should be relied on so much that things don't work without it, especially when it's not a given in the software world. I use udev on a lot of my systems, but not all of them. I know some people avoid it entirely. I don't think it's healthy to just expect everyone is using a particular component, especially one that is somewhat fragile.

teigland commented 3 months ago

When it comes to improving LVM, I think it would make sense if there was a simple/straightforward way to resolve the vg/lv info from a certain mountpoint, without extra steps or guesses. I mean simply saying "I know something is mounted at /mnt/whatever, is that a LVM volume? if so, what is the vg/lv info so I can know how to address it." At the moment, you can see the mount source device from /proc/mounts, but that seems to use the /dev/mapper path and not the /dev/vgname/lvname path, meaning I have to use several operations to get vg/lv info.

Translating between path names isn't difficult, but requires a couple steps.

# df /mnt
Filesystem        1K-blocks  Used Available Use% Mounted on
/dev/mapper/rr-ri    125600  7696    117904   7% /mnt
# ls -l /dev/mapper/rr-ri
lrwxrwxrwx 1 root root 8 Aug  7 11:42 /dev/mapper/rr-ri -> ../dm-10
# dmsetup splitname rr-ri
VG   LV   LVLayer
rr   ri          
# ls -l /dev/rr/ri
lrwxrwxrwx 1 root root 8 Aug  7 11:42 /dev/rr/ri -> ../dm-10
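
The steps above can be folded into one lookup. The following Python sketch is only an illustration, not an lvm2 interface: it takes the device number of the mounted filesystem, reads the `dm/name` and `dm/uuid` attributes from sysfs, and splits the name the way `dmsetup splitname` does (LVM doubles any `-` inside a VG or LV name, so a lone `-` is the separator). The function names and the `sysfs_root` parameter are made up for the example, and filesystems that report an anonymous device number (btrfs subvolumes, overlayfs) will not resolve this way.

```python
import os
import re


def split_dm_name(dm_name):
    """Split a device-mapper name into (vg, lv, layer), like `dmsetup splitname`."""
    # Lone hyphens separate the fields; doubled hyphens belong to the names.
    parts = re.split(r"(?<!-)-(?!-)", dm_name)
    parts = [p.replace("--", "-") for p in parts]
    parts += [""] * (3 - len(parts))  # pad missing lv/layer fields
    return tuple(parts[:3])


def lv_for_mountpoint(mountpoint, sysfs_root="/sys"):
    """Return (vg, lv) for the LV backing mountpoint, or None if it is not LVM."""
    dev = os.stat(mountpoint).st_dev
    dm_dir = os.path.join(
        sysfs_root, "dev", "block", f"{os.major(dev)}:{os.minor(dev)}", "dm"
    )
    try:
        with open(os.path.join(dm_dir, "uuid")) as f:
            uuid = f.read().strip()
        with open(os.path.join(dm_dir, "name")) as f:
            name = f.read().strip()
    except FileNotFoundError:
        return None  # not a device-mapper device
    if not uuid.startswith("LVM-"):
        return None  # some other DM type, e.g. crypt
    vg, lv, _layer = split_dm_name(name)
    return vg, lv
```

With the volume from the transcript mounted at /mnt, `lv_for_mountpoint("/mnt")` would be expected to give `("rr", "ri")`, which maps to /dev/rr/ri.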

desultory commented 3 months ago

When it comes to improving LVM, I think it would make sense if there was a simple/straightforward way to resolve the vg/lv info from a certain mountpoint, without extra steps or guesses. I mean simply saying "I know something is mounted at /mnt/whatever, is that a LVM volume? if so, what is the vg/lv info so I can know how to address it." At the moment, you can see the mount source device from /proc/mounts, but that seems to use the /dev/mapper path and not the /dev/vgname/lvname path, meaning I have to use several operations to get vg/lv info.

Translating between path names isn't difficult, but requires a couple steps.

# df /mnt
Filesystem        1K-blocks  Used Available Use% Mounted on
/dev/mapper/rr-ri    125600  7696    117904   7% /mnt
# ls -l /dev/mapper/rr-ri
lrwxrwxrwx 1 root root 8 Aug  7 11:42 /dev/mapper/rr-ri -> ../dm-10
# dmsetup splitname rr-ri
VG   LV   LVLayer
rr   ri          
# ls -l /dev/rr/ri
lrwxrwxrwx 1 root root 8 Aug  7 11:42 /dev/rr/ri -> ../dm-10

It may be a hardlink, not a symlink. If that is the case, you need to iterate over all device nodes and see where there is a maj/min match
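
A minimal Python sketch of that maj/min scan, assuming the goal is to list every entry in a directory such as /dev/mapper that refers to the same device as a known node. The function name is hypothetical. Because `os.stat` follows symlinks, the same comparison works whether the entry is a symlink or a separate device node.

```python
import os
import stat


def nodes_for_same_device(dev_dir, ref_path):
    """Return paths in dev_dir that refer to the same device as ref_path."""
    ref = os.stat(ref_path)
    if not (stat.S_ISBLK(ref.st_mode) or stat.S_ISCHR(ref.st_mode)):
        raise ValueError(f"{ref_path} is not a device node")
    matches = []
    for entry in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, entry)
        try:
            st = os.stat(path)  # follows symlinks, so links and real nodes both match
        except OSError:
            continue  # dangling symlink or unreadable entry
        same_type = stat.S_IFMT(st.st_mode) == stat.S_IFMT(ref.st_mode)
        if same_type and st.st_rdev == ref.st_rdev:
            matches.append(path)
    return matches
```

For the example above, `nodes_for_same_device("/dev/mapper", "/dev/dm-10")` would be expected to include /dev/mapper/rr-ri.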

prajnoha commented 3 months ago

I may just make my own simple udev recorder or something, that just captures events and passes them forward so things don't break, but I really do think there should at least be some alternative to udev. I don't think it should be relied on so much that things don't work without it, especially when it's not a given in the software world. I use udev on a lot of my systems, but not all of them. I know some people avoid it entirely. I don't think it's healthy to just expect everyone is using a particular component, especially one that is somewhat fragile.

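The options are:

* not using udev in rootfs

* creating a udev event recorder in the initrd and then replaying that through the SYNTH_ARG interface (kernels >= 4.13), recognizing this in the udev rules, executing that before the actual udevadm trigger (a la systemd-udev-trigger.service)

* tweak existing udev rules in some way to deal with this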

It's up to you. If you have an idea for tweaking existing udev rules and it will look good, we can discuss that for upstream inclusion.

desultory commented 3 months ago

I may just make my own simple udev recorder or something, that just captures events and passes them forward so things don't break, but I really do think there should at least be some alternative to udev. I don't think it should be relied on so much that things don't work without it, especially when it's not a given in the software world. I use udev on a lot of my systems, but not all of them. I know some people avoid it entirely. I don't think it's healthy to just expect everyone is using a particular component, especially one that is somewhat fragile.

The options are:

* not using udev in rootfs

* creating a udev event recorder in the initrd and then replaying that through the SYNTH_ARG interface (kernels >= 4.13), recognizing this in the udev rules, executing that before the actual udevadm trigger (a la systemd-udev-trigger.service)

* tweak existing udev rules in some way to deal with this

It's up to you. If you have an idea for tweaking existing udev rules and it will look good, we can discuss that for upstream inclusion.

I think not using udev in the rootfs is a reasonable option to have, but doesn't seem to be an option for people using Systemd. This currently seems to work fine with other inits, as the missing event handling doesn't break any expectations.

If some system were to record those rules, would that require some additional setup on the host? Like the creation of some unit that runs early and looks for an event log, and replays that? Or can the SYNTH_ARG interface be used near the end of the initrd execution to load things up for systemd?

I think it would be interesting if whatever kernel mechanism sends events to the udev listener could throw things in a ring buffer of some size until a listener is available. It would be super nice if this system had a softer failure mechanism, instead of sending events into the void.

If none of those options are reasonable, I will look into making the existing udev rules more reliable/failure tolerant. The main reason I hesitate to go in this direction is because this seems to be an underlying issue, and I'm not sure how many udev rules would have to be adjusted to account for this (btrfs subvol issues occur too). I also recognize that using a systemd-free initrd on a systemd based system is a very small use case, but I think it's totally valid.

zkabelac commented 3 months ago

As it's been already mentioned

lvm2 supports the use with udevd as well as without udevd - when properly configured.

The problem comes from your invalid usage, where you try to use a ramdisk without udev - and then switch to the rootfs with udev - this is simply unsupportable - you can argue that it works in your case ;) but globally it's just a broken setup.

So no one here is forcing anyone to use e.g. systemd with udev if you want to use lvm2 - but once you start to invent your 'hybrid' frankenstein - it is something where you need to do your own research on how that idea can 'fly' - from my POV - it's possibly a dead horse from the start - but I can be wrong ....

prajnoha commented 3 months ago

If some system were to record those rules, would that require some additional setup on the host? Like the creation of some unit that runs early and looks for an event log, and replays that? Or can the SYNTH_ARG interface be used near the end of the initrd execution to load things up for systemd?

The SYNTH_ARG interface is just reading arguments given when triggering synthetic uevents, that is, writing the action name + optional args to /sys/.../uevent - that's actually what the udevadm trigger does as well. You would need to record the events to a log somewhere where it can be transferred from initrd to rootfs (so tmpfs, like systemd/udev already does through mounting /run and passing things there). Then you would need to create a new systemd unit, order it before the systemd-udev-trigger.service and replay the events from there. The rules would need to be edited too (at least for DM) - somewhere where we detect synthetic uevents, but we'd add a rule matching some dedicated variable, say SYNTH_ARG_REPLAY=1, that would indicate this condition and we could import the recorded original variables from initrd passed through the SYNTH_ARG interface. Just a theory now, something that came to my mind, but maybe there's another way - an alternative would be to test for the existence of a file somewhere and then import the variables from that file. Honestly, I don't like any of these approaches, but it's at least something that could provide some solution.
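
To make the replay idea a little more concrete, here is a hedged Python sketch of the write side only. It formats the `ACTION [UUID [KEY=VALUE ...]]` string described in the sysfs-uevent ABI document linked earlier in the thread; as I read that document, keys and values may contain alphanumeric characters only, and each pair shows up in the event as `SYNTH_ARG_<KEY>=<VALUE>`. The function names and the `REPLAY` key are illustrative assumptions, and nothing here exists in the current udev rules.

```python
import uuid as uuidlib


def build_synth_uevent(action, args=None, synth_uuid=None):
    """Format the string written to /sys/.../uevent: ACTION [UUID [KEY=VALUE ...]]."""
    synth_uuid = synth_uuid or str(uuidlib.uuid4())
    fields = [action, synth_uuid]
    for key, value in (args or {}).items():
        # Only alphanumeric keys and values are accepted here (assumption
        # based on the ABI document); the kernel adds the SYNTH_ARG_ prefix.
        ok = key.isascii() and key.isalnum() and value.isascii() and value.isalnum()
        if not ok:
            raise ValueError(f"non-alphanumeric synthetic uevent argument: {key}={value}")
        fields.append(f"{key}={value}")
    return " ".join(fields)


def replay(sys_device_dir, action, args):
    """Write one synthetic uevent for the device at sys_device_dir (needs root)."""
    with open(f"{sys_device_dir}/uevent", "w") as f:
        f.write(build_synth_uevent(action, args))
```

Note that under this reading a key such as `DM_COOKIE` cannot be passed as-is because of the underscore, so a recorder would have to map recorded variable names onto alphanumeric keys.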

I think it would be interesting if whatever kernel mechanism sends events to the udev listener could throw things in a ring buffer of some size until a listener is available. It would be super nice if this system had a softer failure mechanism, instead of sending events into the void.

Currently, the events are passed through a netlink interface - see man 7 netlink if you're interested more. The type of the notification is NETLINK_KOBJECT_UEVENT. Also, we need to keep in mind this "nice" feature, citing from that manual: "However, reliable transmissions from kernel to user are impossible in any case. The kernel can't send a netlink message if the socket buffer is full: the message will be dropped and the kernel and the user-space process will no longer have the same view of kernel state. It is up to the application to detect when this happens (via the ENOBUFS error returned by recvmsg(2)) and resynchronize."
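
For reference, a minimal uevent listener in Python might look like the sketch below. It assumes the raw kernel message layout of an `ACTION@DEVPATH` header followed by NUL-separated `KEY=VALUE` pairs, and that kernel uevents arrive on multicast group 1; the protocol number is defined locally in case the `socket` module does not export it. The `handle` and `on_overflow` callbacks are hypothetical hooks, the latter being where a recorder would have to resynchronize after the ENOBUFS case quoted above.

```python
import errno
import socket

NETLINK_KOBJECT_UEVENT = 15  # value from <linux/netlink.h>
KERNEL_UEVENT_GROUP = 1      # multicast group carrying raw kernel uevents


def parse_uevent(data):
    """Parse a kernel uevent datagram into (header, env)."""
    fields = data.decode("utf-8", "replace").split("\0")
    env = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return fields[0], env


def listen(handle, on_overflow):
    """Call handle(header, env) per uevent; call on_overflow() when events were dropped."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM, NETLINK_KOBJECT_UEVENT)
    sock.bind((0, KERNEL_UEVENT_GROUP))  # pid 0: let the kernel pick a port id
    while True:
        try:
            data = sock.recv(65536)
        except OSError as exc:
            if exc.errno == errno.ENOBUFS:
                on_overflow()  # buffer overran: state must be rebuilt, e.g. by rescanning sysfs
                continue
            raise
        handle(*parse_uevent(data))
```

A recorder along the lines discussed in this thread would pass a `handle` that appends the DM-related events to a file under /run.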

If none of those options are reasonable, I will look into making the existing udev rules more reliable/failure tolerant. The main reason I hesitate to go in this direction is because this seems to be an underlying issue, and I'm not sure how many udev rules would have to be adjusted to account for this (btrfs subvol issues occur too). I also recognize that using a systemd-free initrd on a systemd based system is a very small use case, but I think it's totally valid.

The whole issue here revolves around the fact that it is hard to recognize the events properly when you have that multi-step activation scheme. There are just the ADD and CHANGE events, and you need to have proper environment variables attached to them; you can also store vars from the previous event, import them, and try to compare them somehow with the current variable set. The synthetic uevents coming from the udevadm trigger are not marked in any way, apart from maybe the SYNTH_UUID that is set automatically, so you need to at least see that this is not a genuine event. Then you need to deduce what state you are in right now with your device. For that, you need to prepare the udev records beforehand, during device activation, by setting certain variables. And that's exactly why we need to preserve that state from initrd: to properly track that the device is already up and running, possibly reading flags that may make the device "private" so that certain actions are not executed or certain properties set (device scanning not executed, not creating certain symlinks, giving priority to certain symlinks [in case of snapshots] etc.). Also, the synthetic uevents may appear anytime - the trigger can be called anytime and you need to take great care to be sure about the current state you're in. The SID project tries to deal with this by providing more possibilities here. But it still relies on udev events as we currently don't have anything better from the kernel side.
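
The state-tracking problem can be shown with a deliberately simplified model in Python. This is not the logic of the real DM udev rules: the `activated` flag and both function names are invented for illustration, and `db` merely stands in for the per-device udev database record that `db_persist` carries across switch-root.

```python
def handle_dm_event(action, db):
    """Return True if the rules would process this event; update db in place."""
    if action == "change":
        db["activated"] = True  # resume happened: table is live, device is usable
        return True
    if action == "add":
        # A genuine add (table load) arrives before any change, so nothing is
        # recorded yet and the event is skipped. A synthetic add from the
        # trigger is only meaningful if an earlier change was recorded.
        return db.get("activated", False)
    return False


def device_ready_after_trigger(initrd_events_seen_by_udevd):
    """Model a boot: events seen by udevd in initrd, then a synthetic add in rootfs."""
    db = {}
    for action in initrd_events_seen_by_udevd:
        handle_dm_event(action, db)
    return handle_dm_event("add", db)
```

In this model `device_ready_after_trigger(["add", "change"])` is true, while `device_ready_after_trigger([])`, the case where no udevd ran in the initrd, is false: the synthetic add is skipped and the device is never marked ready, which matches the stall described in this issue.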

desultory commented 3 months ago

...

The whole issue here revolves around the fact that it is hard to recognize the events properly when you have that multi-step activation scheme. There are just the ADD and CHANGE events, and you need to have proper environment variables attached to them; you can also store vars from the previous event, import them, and try to compare them somehow with the current variable set. The synthetic uevents coming from the udevadm trigger are not marked in any way, apart from maybe the SYNTH_UUID that is set automatically, so you need to at least see that this is not a genuine event. Then you need to deduce what state you are in right now with your device. For that, you need to prepare the udev records beforehand, during device activation, by setting certain variables. And that's exactly why we need to preserve that state from initrd: to properly track that the device is already up and running, possibly reading flags that may make the device "private" so that certain actions are not executed or certain properties set (device scanning not executed, not creating certain symlinks, giving priority to certain symlinks [in case of snapshots] etc.). Also, the synthetic uevents may appear anytime - the trigger can be called anytime and you need to take great care to be sure about the current state you're in. The SID project tries to deal with this by providing more possibilities here. But it still relies on udev events as we currently don't have anything better from the kernel side.

Thank you very much for this explanation.

SID relies on udevd, not just udev events? The idea of some framework designed purely for storage instantiation that makes simpler use of the kernel events sounds great to me. I think this working very well with udevd makes a lot of sense, but if this were possible to be used entirely separately, that sounds amazing.

I think the concept of the kernel sending events the userspace can use to handle hardware addition makes sense, but having a single service used to manage every device under the sun without realistic alternatives does not.