multipathd: Implement `shutdownifnomultipath`

sergiodj commented 1 year ago

Currently, socket activation doesn't work properly with multipath-tools (see #76 for a lengthy discussion about the issue). Due to the way multipath-tools and udev interact, it is not possible to guarantee that the daemon will be stopped even if there are no multipath devices/maps present in the system.

In order to workaround that limitation, this commit proposes a new daemon command called shutdownifnomultipath. It does what it says on the tin: if no multipath maps have been found during the system initialization, the daemon will shut itself down. On top of that, we use a oneshot (actually "exec") systemd service that will be executed only once, after everything's been set up, to send the command mentioned above to the daemon.

Note that this approach is still not perfect: if new devices are added to the system while it's running, multipathd will be started, even if no multipath'ing is done. But that's better than the current state where the daemon stays on unconditionally.

sergiodj commented 1 year ago

@mwilck Hey, could you take a look at this, please? Is it too naïve, or does it meet what we've been discussing on #76? TIA.

bmarzins commented 1 year ago

I would prefer to see this broken up into two commits. One to add the functionality to multipathd, and the other to add the new service.

Also, since you don't have DefaultDependencies=no in multipathd/multipathd-shutdown-no-multipath.service, it should already be starting after basic.target, so your current "After" dependencies are unnecessary. But like Martin mentioned before, you might want to wait even longer for this to run, perhaps "After=multi-user.target". This would give devices unnecessary in boot as much time as possible to appear and get multipathed before you check if any multipath devices exist.

sergiodj commented 1 year ago

I would prefer to see this broken up into two commits. One to add the functionality to multipathd, and the other to add the new service.

Sure thing. I'll work on splitting the changes soon.

Also, since you don't have DefaultDependencies=no in multipathd/multipathd-shutdown-no-multipath.service, it should already be starting after basic.target, so your current "After" dependencies are unnecessary. But like Martin mentioned before, you might want to wait even longer for this to run, perhaps "After=multi-user.target". This would give devices unnecessary in boot as much time as possible to appear and get multipathed before you check if any multipath devices exist.

Interesting, indeed I thought about using After=multi-user.target in order to make sure that this is the last command to run, but wasn't sure if that'd have unintended consequences. I'll change the service file to do that. I'll also remove the current After= since it'll become unnecessary.

Thanks.

mwilck commented 12 months ago

Some remarks:

IIRC, the first step for fixing this issue should be to disable multipathd socket activation; at least nobody has contradicted my statement saying so in #76.

Secondly, I dislike the "shutdownifnomultipath" verb. multipathd supports multi-word commands, so we should be able to implement something like shutdown if-idle or even shutdown if idle.

Wrt waiting after boot, we could even combine the auto-shutdown service with a systemd timer. Basically what we could do is implement a timer that waits for (say) 60s after activation, and runs "shutdown if idle" after that. Such a timer could even be combined with a setup where multipathd.socket was active, and might fix the issue of multipathd being restarted when new devices are added.

sergiodj commented 12 months ago

IIRC, the first step for fixing this issue should be to disable multipathd socket activation; at least nobody has contradicted my statement saying so in #76.

I don't have much time to reply right now, but IIRC leaving the socket activation enabled has the benefit of covering the case when the user adds a device to a live system. Maybe I'm misremembering, though.

Secondly, I dislike the "shutdownifnomultipath" verb. multipathd supports multi-word commands, so we should be able to implement something like shutdown if-idle or even shutdown if idle.

I'll take a look at implementing the multi-word command. shutdown if-idle sounds good.

Wrt waiting after boot, we could even combine the auto-shutdown service with a systemd timer. Basically what we could do is implement a timer that waits for (say) 60s after activation, and runs "shutdown if idle" after that. Such a timer could even be combined with a setup where multipathd.socket was active, and might fix the issue of multipathd being restarted when new devices are added.

A systemd timer works, too. I like the idea of having multipathd shutting itself off even after the addition of new devices to a live system.

mwilck commented 12 months ago

I don't have much time to reply right now, but IIRC leaving the socket activation enabled has the benefit of covering the case when the user adds a device to a live system. Maybe I'm misremembering, though.

I still think this is a corner case, as I said before ("We are now discussing automation of 4. to avoid 5."). In general I think users you have multipath hardware should generally enable multipathd, and users who don't should not. Is that asking too much from users?

But of course, if you want a fully automated setup that works for everyone without investing thought in their setup, we may want to go down that route.

Secondly, I dislike the "shutdownifnomultipath" verb. multipathd supports multi-word commands, so we should be able to implement something like shutdown if-idle or even shutdown if idle.

I'll take a look at implementing the multi-word command. shutdown if-idle sounds good.

Instead of creating just VRB_SHUTDOWN_NO_MPATH you'd create KEY_IF_IDLE and Q1_IF_IDLE. In callbacks.c you would do

    set_handler_callback(VRB_SHUTDOWN | Q1_IF_IDLE, 
                HANDLER(cli_shutdown_if_no_multipath));

A systemd timer works, too. I like the idea of having multipathd shutting itself off even after the addition of new devices to a live system.

The timer would work for both boot and and device addition. By setting After=multi-user.target on the timer, we can make sure that it doesn't expire prematurely even if booting takes a long time.

We should also be aware of the find_multipaths smart case. Currently, check_path_valid() calls is_path_valid(..., check_multipathd=true) for uevents, which would trigger the socket, before calling print_cmd_valid(), which will check for timed-out devices in the "smart" case. I suppose we could check /run/multipath/find_multipaths/$DEVNODE early on to see if the device had been considered before and has timed out, before connecting to the socket. This requires changes to the is_path_valid() logic though. I don't expect you to do it, but it's a thing we should keep in mind.

All this said, if you have patches, please submit them to the dm-devel mailing list as described in README.md.

sergiodj commented 11 months ago

Sorry, this fell between the cracks...

I've now updated the branch following the suggestions from @mwilck. I haven't tested it fully yet, but would appreciate if you could take another look meanwhile and check if it follows what you had in mind.

As for the find_multipaths smart case, I confess I haven't thought about it yet. Please correct me if I'm wrong, but IIUC it's not really a blocker to this PR. If that's indeed the case, then I'd rather leave it for a subsequent iteration.

zeha commented 11 months ago

A command that gets sent some fixed time later sounds like a good place to introduce race conditions under load. Can this not be a config option that is always active on distros choosing such a setup?

mwilck commented 11 months ago

@zeha:

A command that gets sent some fixed time later sounds like a good place to introduce race conditions under load. Can this not be a config option that is always active on distros choosing such a setup?

I am not sure I understand what you're asking for. What config option do you want, and how should its semantics be? I figure that both you and @sergiodj are working for the Debian project, so perhaps the two of you want to find an agreement on the desired behavior?

Wrt the race condition, this is a valid concern. Something like the following could happen:

an idle timer is runnning for multipathd
some uevent arrives that would cause a multipath map to be created
"multipath -u" is called in udev rules and finds multipathd running
timer expires and multipathd exits
the uevent would be passed to multipathd after rule processing, but it has exited

We could avoid this scenario by doing the equivalent of an "udevadm settle" (i.e. wait for pending uevents to finish, and make sure multipathd has processed all uevents it is aware of, i.e. multipathd's internal uevent queue is empty) when multipathd receives an "shutdown if-idle" command.

I think if we did this, we'd be safe. I can't perceive other race conditions. If the timer expired before 3. (or 2.) above, "multipath -u" would be called while processing the udev rules, and start another multipathd instance.

Anyway, @sergiodj, I need to ask you to send this to dm-devel@lists.linux.dev (with @bmarzins and myself on CC). I think this needs a broader audience.

sergiodj commented 11 months ago

@zeha:

A command that gets sent some fixed time later sounds like a good place to introduce race conditions under load. Can this not be a config option that is always active on distros choosing such a setup?

I am not sure I understand what you're asking for. What config option do you want, and how should its semantics be? I figure that both you and @sergiodj are working for the Debian project, so perhaps the two of you want to find an agreement on the desired behavior?

@zeha and I talked in private yesterday and I believe we've reached a consensus on how to incorporate these changes into Debian.

Wrt the race condition, this is a valid concern. Something like the following could happen:
1. an idle timer is runnning for multipathd

2. some uevent arrives that would cause a multipath map to be created

3. "multipath -u" is called in udev rules and finds multipathd running

4. timer expires and multipathd exits

5. the uevent would be passed to multipathd after rule processing, but it has exited
We could avoid this scenario by doing the equivalent of an "udevadm settle" (i.e. wait for pending uevents to finish, and make sure multipathd has processed all uevents it is aware of, i.e. multipathd's internal uevent queue is empty) when multipathd receives an "shutdown if-idle" command.

I think if we did this, we'd be safe. I can't perceive other race conditions. If the timer expired before 3. (or 2.) above, "multipath -u" would be called while processing the udev rules, and start another multipathd instance.

OK, I'll see about implementing this change. I'll have to discover how to access multipathd's internal uevent queue and check if it's empty.

Anyway, @sergiodj, I need to ask you to send this to dm-devel@lists.linux.dev (with @bmarzins and myself on CC). I think this needs a broader audience.

Will do. Thanks.

sergiodj commented 11 months ago

Wrt the race condition, this is a valid concern. Something like the following could happen:
1. an idle timer is runnning for multipathd

2. some uevent arrives that would cause a multipath map to be created

3. "multipath -u" is called in udev rules and finds multipathd running

4. timer expires and multipathd exits

5. the uevent would be passed to multipathd after rule processing, but it has exited
We could avoid this scenario by doing the equivalent of an "udevadm settle" (i.e. wait for pending uevents to finish, and make sure multipathd has processed all uevents it is aware of, i.e. multipathd's internal uevent queue is empty) when multipathd receives an "shutdown if-idle" command. I think if we did this, we'd be safe. I can't perceive other race conditions. If the timer expired before 3. (or 2.) above, "multipath -u" would be called while processing the udev rules, and start another multipathd instance.
OK, I'll see about implementing this change. I'll have to discover how to access multipathd's internal uevent queue and check if it's empty.

BTW, I'd appreciate some pointers here. My experience dealing with libudev is very limited, and I'm no expert on multipath-tools either.

I know programs aren't supposed to use udev_queue, and I don't know offhand if there's a proper way to query udev regarding the list of pending events that need to be processed.

mwilck commented 10 months ago

BTW, I'd appreciate some pointers here. My experience dealing with libudev is very limited, and I'm no expert on multipath-tools either.

Right. On the multipathd side, you will find the relevant code in libmultipath/uevent.c. multipathd processes uevents in multiple steps. The uevent listener thread (uevent_listen()) monitors finished uevents and adds them to the uevq list. The uevent dispatcher thread (uevent_dispatch()) then passes them to uevent_trigger() to be actually processed. libmultipath provides a function is_uevent_busy() to check whether any uevents are pending. It should be sufficient to call this function to obtain multipathd's internal state.

I know programs aren't supposed to use udev_queue, and I don't know offhand if there's a proper way to query udev regarding the list of pending events that need to be processed.

I need to research this myself. There seems to be no official API for this. udevadm settle checks if /run/udev/queue exists, but that's not a documented API. I guess our "best" option would be to call udevadm settle -t 0, even though I don't like that much. Even if that command returns success, we can't be 100% certain that multipathd's uevent listener has already seen the events and added them to uevq. AFAICS, the only way to ensure this would be to make sure the uevent listener runs at a higher priority then the thread that services the socket, and call sched_yield().

A naïve implementation might look roughly like this:

int _is_udevq_idle() {
    int st;

    if (is_uevent_busy())
        return false;
    st = system("/usr/bin udevadm settle -t 0")
    if (!WIFEXITED(st) || WEXITSTATUS(st))
         return false;
    return true;
}

int is_udevq_idle() {
     if (!_is_udevq_idle())
         return false;
     // give uevent listener has a chance to  run
     // listener should have higher prio than this thread to be certain
     sched_yield();
     // then check again
     return _is_udevq_idle();
}

This isn't completely safe because new uevents can arrive at any time, but I suppose it's safe enough for practical purposes. The question is what we should do if this returns false (just fail, wait for some time, or call udevadm settle with some positive timeout).

This begs the question, again, whether the "automatic exit" idea makes sense at all. It's a lot of complexity added for a minor improvement of user experience (see my previous breakdown).

zeha commented 10 months ago

What config option do you want, and how should its semantics be?

My idea would basically be:

New config option, "multipathd_no_devices_found". (Upstream) defaults to "stay_running", keeping today's behaviour. Interested distros or users could set "multipathd_no_devices_found exit", where if multipathd doesn't see any multipath-eligible devices it should exit again. Maybe also after the last multipath device disappeared.

But this is an idle idea, without having looked at the multipathd startup code.

sergiodj commented 10 months ago

Hi,

Sorry for the delay in replying. I've been doing this work as a side task, but $DAYJOB is now a bit more hectic than expected. Mainly for this reason, but also because I agree with the assessment that this is a lot of work/complexity for a relatively minor gain, I do not intend to continue working on this feature.

Thank you all for the invaluable help and interesting discussions that made me understand more about the world of multipath'ed devices.

@zeha, I will send a Merge Request against Debian's multipath-tools package in order to revert part or all of the changes that I'd suggested a while ago regarding the socket activation feature. This whole "side project" made me realize that the changes don't work as expected, and as such should not be advertised in the d/NEWS file.

opensvc / multipath-tools

multipathd: Implement `shutdownifnomultipath` #78