stratis-storage / project

A holding place for issues that affect more than one repository in the project
4 stars 0 forks source link

Work on udev synchronization in stratisd and devicemapper #242

Open jbaublitz opened 3 years ago

jbaublitz commented 3 years ago

Currently, there is a race condition in our udev rule for symlinks to stratisd filesystems. @mvollmer brought this to our attention and it appears that there are three improvements that need to be made:

@mvollmer Do you have any other comments here?

jbaublitz commented 3 years ago

One additional note is that we should explore using libdevmapper in our devicemapper library before deciding to reimplement the functionality that we need in Rust.

mvollmer commented 3 years ago

John Baublitz notifications@github.com writes:

Currently, there is a race condition in our udev rule for symlinks to stratisd filesystems.

Just to try to paint a clearer picture: The race condition is not in your udev rule. The race is between mkfs.xfs (started by stratisd) and 13-dm-disk.rules. Your udev rule is just so slow sometimes (no offence) that it pretty reliably determines the outcome of that race, depending on whether it runs before 13-dm-disk.rules or after it.

@mvollmer brought this to our attention and it appears that there are three improvements that need to be made:

  • Edit devicemapper to properly set cookies when creating devicemapper devices to include the necessary udev flags and a semaphore ID

  • Wait on the semaphore count to drop to 0 prior to creating an xfs filesystem in stratisd

This will reduce the likelyhood of the race going the wrong way to near zero, but it is still there.

  • Send a synthetic uevent from the filesystem devicemapper device when the filesystem creation has been completed.

This would be a complete fix, AFAICS, and it should ideally go into mkfs, parted, lvm, and all the tools that write things that libblkid reads. The comment in that code could be something like

We can not rely on udev synthesizing a "change" event when we close
the file descriptor here.  Udev might not be watching the device
node at all times, and there is no way to make sure that it does.

Or we fix udev to never stop watching.

jbaublitz commented 3 years ago

@mvollmer Okay, despite my lack of clarity in the issue, it sounds like we're on the same page!

Just to try to paint a clearer picture: The race condition is not in your udev rule. The race is between mkfs.xfs (started by stratisd) and 13-dm-disk.rules. Your udev rule is just so slow sometimes (no offence) that it pretty reliably determines the outcome of that race, depending on whether it runs before 13-dm-disk.rules or after it.

One note here is that the rule should speed up significantly once we get multithreading in so this will be a good thing to fix before then.

This would be a complete fix, AFAICS, and it should ideally go into mkfs, parted, lvm, and all the tools that write things that libblkid reads.

For now, I'm going to discuss this with the team and see how we want to proceed. I agree that it should go in stratisd temporarily as this work is being done and those utilities long-term.

mulkieran commented 3 years ago

@jbaublitz Can we consider this obsoleted and close it?

jbaublitz commented 3 years ago

I'm uncertain. Parts of the udev cookie are still necessary particularly around disabling the first scan of the devicemapper device that was created, but I'm unsure if we need synchronization via the semaphore.

The race condition is as follows:

Now @prajnoha had mentioned that we should absolutely disable the initial scan of the devicemapper device when it's created using udev flags in the cookie (DM_UDEV_DISABLE_DISK_RULES_FLAG, DM_UDEV_DISABLE_OTHER_RULES_FLAG, and optionally DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG to be specific) so we will at the very least need this. This avoids picking up an old signature on the devicemapper device before it is formatted.

What I'm less clear on is now that we generate a synthetic udev change event when we create the filesystem whether we still need the udev synchronization aspect of things. The reason this is unclear to me is that the condition where we need to wait for udev to finish are is if we use any of the /dev/mapper or /dev/stratis symlinks internally. For filesystem creation, we absolutely use the /dev/dm* device nodes (which are safe without udev synchronization) as that's what devicemapper-rs returns as a device node. We also use this when we generate the synthetic udev event (partially because I tried using the /dev/mapper links but we didn't have synchronization in place yet, so there were timing issues).

One additional note is that the way we generate the synthetic event currently may need to be modified because:

  1. we don't provide a UUID to the synthetic udev event. This is currently unnecessary because we're matching on the device name of the devicemapper device which, as I understand it, will be present in the kernel independent of the symlinks and the symlinks are simply generated based off of the device name itself. This device name contains the pool and filesystem UUID. However, if we use this interface for the synthetic udev event, we could potentially obsolete stratis_uuids_to_names altogether by passing the filesystem UUID via the interface and then POOL_NAME=[pool name] FS_NAME=[filesystem name] so that we never need to access our own D-Bus API to determine those things. This would be a huge win from the perspective of some of the failures we've seen around the time between something being available on our D-Bus API vs. when the filesystem is actually created. I would really like to take this approach if we are able.
  2. synthetic udev events don't have a cookie and thus don't have the DISABLE_* flags so we would potentially need to set them in our udev rule for our synthetic udev event.

Last note is that if we change the udev rule to specifically expect the udev synthetic event (by matching on SYNTH_UUID), I think the filesystem creation change event, if it happened, would become completely irrelevant because our rule would no longer match that one, but please correct me if I'm wrong.

@mvollmer @prajnoha Correct me if I've missed or misunderstood anything. This is based off of your emails. :)

prajnoha commented 3 years ago

Now @prajnoha had mentioned that we should absolutely disable the initial scan of the devicemapper device when it's created using udev flags in the cookie (DM_UDEV_DISABLE_DISK_RULES_FLAG, DM_UDEV_DISABLE_OTHER_RULES_FLAG, and optionally DM_UDEV_DISABLE_SUBSYSTEM_RULES_FLAG to be specific) so we will at the very least need this. This avoids picking up an old signature on the devicemapper device before it is formatted.

Yes, this is really necessary to be able to overwrite possible previous signatures that might otherwise fire udev rules and associated actions and then interfere...

What I'm less clear on is now that we generate a synthetic udev change event when we create the filesystem whether we still need the udev synchronization aspect of things. The reason this is unclear to me is that the condition where we need to wait for udev to finish are is if we use any of the /dev/mapper or /dev/stratis symlinks internally. For filesystem creation, we absolutely use the /dev/dm* device nodes (which are safe without udev synchronization) as that's what devicemapper-rs returns as a device node. We also use this when we generate the synthetic udev event (partially because I tried using the /dev/mapper links but we didn't have synchronization in place yet, so there were timing issues).

You're right. If you're using /dev/dm*, they're available right away when created in kernel so it's not like with all the other symlinks (like /dev/mapper/*) for which we need to wait for udevd to create them. Just two notes here which come to my mind:

One additional note is that the way we generate the synthetic event currently may need to be modified because:

  1. we don't provide a UUID to the synthetic udev event. This is currently unnecessary because we're matching on the device name of the devicemapper device which, as I understand it, will be present in the kernel independent of the symlinks and the symlinks are simply generated based off of the device name itself. This device name contains the pool and filesystem UUID. However, if we use this interface for the synthetic udev event, we could potentially obsolete stratis_uuids_to_names altogether by passing the filesystem UUID via the interface and then POOL_NAME=[pool name] FS_NAME=[filesystem name] so that we never need to access our own D-Bus API to determine those things. This would be a huge win from the perspective of some of the failures we've seen around the time between something being available on our D-Bus API vs. when the filesystem is actually created. I would really like to take this approach if we are able.

Yes, the interface is provided for cases like these exactly where we need to pass some additional info through the synthetic uevent. You just need to be aware that the interface has been added later and it is available in kernel version >= 4.13.

  1. synthetic udev events don't have a cookie and thus don't have the DISABLE_* flags so we would potentially need to set them in our udev rule for our synthetic udev event.

Yes, it needs to be done this way. The DM cookie is solely DM's own invention before the extended synthetic uevent interface was in place. The synthetic uevent interface is generic and targeted for all possible devices so it has its own naming scheme for the passed keys which are prefixed with SYNTH_ARG_ here. So if we need to pass the DM flags as we'd do with the DM cookie, we need to pass them through the synthetic uevent interface and then when they come back as SYNTH_ARG_<KEY> keys in the synthetic uevent, we need to translate them back to what DM udev rules understand (e.g.translating SYNTH_ARG_STRATIS_DM_UDEV_DISABLE_DISK_RULES to DM_UDEV_DISABLE_DISK_RULES). Currently, we don't have a rule for this. Such translation rule would probably need to go even before 10-dm.rules so even the DM_UDEV_DISABLE_DM_RULES flag can be used to change the way 10-dm.rules are executed which are the very first DM udev rules to execute.

I'm just thinking whether such key translation udev rule shouldn't be provided by device-mapper package itself and then letting other subsystems, like stratis, just use SYNTH_ARG_DM_UDEV_... that would be translated to DM_UDEV_... keys.

Last note is that if we change the udev rule to specifically expect the udev synthetic event (by matching on SYNTH_UUID), I think the filesystem creation change event, if it happened, would become completely irrelevant because our rule would no longer match that one, but please correct me if I'm wrong.

Yes, exactly! The SYNTH_UUID is specifically designed to make such matching possible.

jbaublitz commented 3 years ago

@prajnoha Thanks so much for the response! I have a prototype udev rule under /usr/lib/udev/rules.d/14-stratisd.rules:

ACTION=="change", ENV{DM_NAME}=="stratis-1-[0-9a-f]*-thin-fs-[0-9a-f]*", ENV{SYNTH_UUID}=="[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a=f]*", ENV{SYNTH_ARG_POOLNAME}!="", ENV{SYNTH_ARG_FSNAME}!="", SYMLINK+="stratis/$env{SYNTH_ARG_POOLNAME}/$env{SYNTH_ARG_FSNAME}"

This works for creating the symlink but there was one additional thing that I wanted to do: match ENV{ID_FS_UUID} with ENV{SYNTH_UUID}. Based on browsing the bundled udev rules on Fedora, it doesn't look like something like this is possible, is it?

prajnoha commented 3 years ago

@prajnoha Thanks so much for the response! I have a prototype udev rule under /usr/lib/udev/rules.d/14-stratisd.rules:

You're welcome :)


ACTION=="change", ENV{DM_NAME}=="stratis-1-[0-9a-f]*-thin-fs-[0-9a-f]*", ENV{SYNTH_UUID}=="[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a=f]*",

Here's [0-9a=f], was that meant to be [0-9a-f] - "dash" instead of "equal"?

ENV{SYNTH_ARG_POOLNAME}!="", ENV{SYNTH_ARG_FSNAME}!="", SYMLINK+="stratis/$env{SYNTH_ARG_POOLNAME}/$env{SYNTH_ARG_FSNAME}"

Maybe it would be better if you used some prefix like "STRATIS" for the synth args, so they end up like SYNTH_ARG_STRATIS_POOLNAME, SYNTH_ARGSTRATIS.... Just future-proofing that it doesn't clash with possible other uses in the future where something else than Stratis itself could generate the uevents for whatever reason. Then, we make sure that we have much better position in identifying who generated the synthetic uevent if there is more than one "generator".

This works for creating the symlink but there was one additional thing that I wanted to do: match ENV{ID_FS_UUID} with ENV{SYNTH_UUID}. Based on browsing the bundled udev rules on Fedora, it doesn't look like something like this is possible, is it?

Unfortunately, with udev rules, you can only match the value against a constant or regex. But you can't compare values of two keys against each other - this is udev's huge shortcoming and I've hit this once too, it's really annoying! I'm afraid you need to call external tool to do the comparison (through PROGRAM=... udev rule and then checking the result using RESULT rule... or alternatively, calling IMPORT{program} rule and letting the external program to export environment variable that is then imported into udev environment and then can check it directly with ENV{a_key_with_result}=... rule).

prajnoha commented 3 years ago

need to call external tool to do the comparison (through PROGRAM=... udev rule and then checking the result using RESULT rule... or alternatively, calling IMPORT{program} rule and letting the external program to export environment variable that is then imported into udev environment and then can check it directly with ENV{a_key_with_result}=... rule).

(When calling external programs from udev rules, udevd automatically sets environment variables for that program so they contain all the existing udev environment variables - so you can read them all in the external program just like you would read usual environment variables. Exporting is then just printing on output in key=value format.)

jbaublitz commented 3 years ago

@prajnoha Thanks so much for the response! I have a prototype udev rule under /usr/lib/udev/rules.d/14-stratisd.rules:

You're welcome :)

ACTION=="change", ENV{DM_NAME}=="stratis-1-[0-9a-f]*-thin-fs-[0-9a-f]*", ENV{SYNTH_UUID}=="[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a-f]*-[0-9a=f]*",

Here's [0-9a=f], was that meant to be [0-9a-f] - "dash" instead of "equal"?

Whoops! This will be completely removed with the next revision as I'll be using an external program to compare the two UUIDs.

ENV{SYNTH_ARG_POOLNAME}!="", ENV{SYNTH_ARG_FSNAME}!="", SYMLINK+="stratis/$env{SYNTH_ARG_POOLNAME}/$env{SYNTH_ARG_FSNAME}"

Maybe it would be better if you used some prefix like "STRATIS" for the synth args, so they end up like SYNTH_ARG_STRATIS_POOLNAME, SYNTH_ARGSTRATIS.... Just future-proofing that it doesn't clash with possible other uses in the future where something else than Stratis itself could generate the uevents for whatever reason. Then, we make sure that we have much better position in identifying who generated the synthetic uevent if there is more than one "generator".

One interesting thing that I noticed when working with the synthetic udev event interface is that write returns EINVAL when there are underscores in the variable names. I will go with something like STRATISPOOLNAME and STRATISFSNAME though; that should work!

This works for creating the symlink but there was one additional thing that I wanted to do: match ENV{ID_FS_UUID} with ENV{SYNTH_UUID}. Based on browsing the bundled udev rules on Fedora, it doesn't look like something like this is possible, is it?

Unfortunately, with udev rules, you can only match the value against a constant or regex. But you can't compare values of two keys against each other - this is udev's huge shortcoming and I've hit this once too, it's really annoying! I'm afraid you need to call external tool to do the comparison (through PROGRAM=... udev rule and then checking the result using RESULT rule... or alternatively, calling IMPORT{program} rule and letting the external program to export environment variable that is then imported into udev environment and then can check it directly with ENV{a_key_with_result}=... rule).

In this case we have two options: use a quick sh string comparison or write an executable to compare two strings. Usually shell invocations are discouraged in things like udev from what I've heard, but if it's simply a string comparison, does that seem like a problem to you?

jbaublitz commented 3 years ago

@prajnoha If you'd like to see the full pull request, it's stratis-storage/stratisd#2461.

prajnoha commented 3 years ago

One interesting thing that I noticed when working with the synthetic udev event interface is that write returns EINVAL when there are underscores in the variable names. I will go with something like STRATISPOOLNAME and STRATISFSNAME though; that should work!

Ah, yes! That's to minimize the allowed character set for the key names to absolute minimum (isalnum) to avoid interference with other "genuine" keys.

Unfortunately, with udev rules, you can only match the value against a constant or regex. But you can't compare values of two keys against each other - this is udev's huge shortcoming and I've hit this once too, it's really annoying! I'm afraid you need to call external tool to do the comparison (through PROGRAM=... udev rule and then checking the result using RESULT rule... or alternatively, calling IMPORT{program} rule and letting the external program to export environment variable that is then imported into udev environment and then can check it directly with ENV{a_key_with_result}=... rule).

In this case we have two options: use a quick sh string comparison or write an executable to compare two strings. Usually shell invocations are discouraged in things like udev from what I've heard, but if it's simply a string comparison, does that seem like a problem to you?

Actually, there are a few udev rules already which use sh (just grep in /lib/udev/rules.d for /bin/sh - I can see a few). I think the use of shell in udev rules is discouraged because with sh you can't be sure which shell it is actually that you're using on target system and so you can't expect that certain functionality is supported everywhere (...on all possible shells and their configuration combinations). For upstream solution, that might be a source of problems with certain distros and/or user customization or individual configurations. As for the performance impact, I can't tell, I've never profiled that. Honestly, I'm not quite sure why this simple string comparison isn't directly supported by udevd - it's really annoying and we're probably not the only ones who needed this...

jbaublitz commented 3 years ago

@mulkieran This may actually be resolved. I'm going to do a little bit more investigation into any places we may use /dev/mapper symlinks internally. If we are only using /dev/dm* device nodes, we don't need to rely on udev synchronization in stratisd. However, we may want to consider a separate action item around devicemapper-rs supporting udev cookies. That one is definitely not required though and will require more consideration because we would either have to reimplement bits of libdevmapper or do more investigation into libdevmapper's thread safety.

mulkieran commented 3 years ago

Dropping from any specific board.