Irreversibility of the cleanup actions

vorner / signal-hook

Rust library allowing to register multiple handlers for the same signal

Apache License 2.0

727 stars 71 forks source link

Irreversibility of the cleanup actions #30

Closed vorner closed 3 years ago

vorner commented 4 years ago

The cleanup module added in 0.1.12 has one drawback. The cleanup is irreversible. Once it has been done, it is no longer possible to revive signal hook for that one signal.

In a sense it is not a big problem, as the cleanup is supposed to be run just before shutdown. But it still lacks some kind of elegance and might be a good idea to do something about eventually.

There are two problems in there. The first one that any handlers chained before or after our own handler are not preserved anywhere and we don't really know about them.

The second problem is that registry still thinks it is registered while it is not. So it will never reinsert our handler.

The challenge is to figure what we would like to do in the first place. Do we want some kind of „pause/resume“ of a signal handler? If so, we want to be able to manipulate that from within a signal handler, as we want to be able to pause them as part of our actions. Maybe some kind of AtomicPtr for each signal?

Do we want to resume automatically on inserting a new signal? If so, we still want to have a manual way to enable it too, without adding anything in.

chrisduerr commented 3 years ago

Okay, so I have a few questions about a possible implementation for this. First of all this seems to be a testcase that can reproduce the problem:

Test

```rust #[cfg(test)] mod tests { use std::sync::Arc; use std::sync::atomic::{AtomicUsize, Ordering}; use libc::SIGWINCH; use crate::register; use crate::cleanup::cleanup_signal; // TODO: Since the issue is in `unregister_signal not removing the signal completely, this // entire testcase can probably be moved to the signal-hook-registry crate. // /// Attempt to temporarily reset a signal handler to the default action and then proceed with /// the signal-hook handler again. #[test] fn reregister() { let status = Arc::new(AtomicUsize::new(0)); // Register the signal. unsafe { let status = status.clone(); register(SIGWINCH, move || { status.fetch_add(1, Ordering::Relaxed); }).unwrap(); } // Verify the handler is working. unsafe { libc::raise(SIGWINCH); } assert_eq!(status.load(Ordering::Relaxed), 1); // Remove the signal handler. cleanup_signal(SIGWINCH).unwrap(); // Verify the signal handler is reset. unsafe { libc::raise(SIGWINCH); } assert_eq!(status.load(Ordering::Relaxed), 1); // Register a new signal handler. unsafe { let status = status.clone(); register(SIGWINCH, move || { status.fetch_add(2, Ordering::Relaxed); }).unwrap(); } // Verify the signal handler works again. unsafe { libc::raise(SIGWINCH); } assert_eq!(status.load(Ordering::Relaxed), 3); } } ```

It looks to me like the problem is mainly around unregister_signal in signal-hook-registry. It seems like this does not actually remove the signal handler, but instead decides to just remove all actions for the signal handler. This leaves the signal handler as "occupied" for the future.

So I have two questions about that: I'd assume it's not removed to preserve Prev, which is probably the previous action that was registered to that signal? So I'd assume in theory this could be solved by just removing the signal and replacing the current signal with Prev?

My second question is why unregister_signal cannot remove the signal completely, if the goal is to set the signal to SIG_DFL anyways. If the user calls cleanup_signal, I'd assume the goal is to remove all signal handlers and cleanup_raw also doesn't care about any previous installed handlers. In this case wouldn't it be possible to just remove it so that the signal is unoccupied again?

Lastly I'm a bit confused about how the Occupied thing works. It looks like it just inserts a new signal, which of course makes sense. I'd assume this doesn't work when everything got unregistered because the signal handler isn't re-registered with the OS? Wouldn't it be possible to check if the length of the actions is zero and conditionally then re-register with the OS?

vorner commented 3 years ago

I believe the test case illustrates the exact problem, yes.

Before answering the specific question, let me give a bit of a background.

The signal-hook-registry is not just a separation of lower-level APIs. If the application ends up with multiple versions of signal-hook in the dep graph (it can happen, through transitive dependency), then they would fight over the registration tables and signal handlers. Therefore, I used the trick from rayon and extracted the bare minimum to a separate crate that's supposed to be forever 1.*. As there's no good migration strategy (a bad one through the dtolnay trick might be possible, but I haven't tried if it actually would work), I'm a bit reluctant to add anything to it, as that API will have to live there for ever, no matter how bad I design it.
The current cleanup API was added mostly for the double-CTRL+C use case (first one for graceful shutdown, second one kills). Now, as most patterns in signal-hook work in a way to postponing anything interesting outside of the actual handler (because of the async-signal-safe pain), if the application is really stuck, that removal of the signal handler needs to happen inside the signal handler, because the postponed stuff might wait for an infinite loop to end or something. Therefore, it is split into the part that only kills the signal handler (and is async-signal-safe) and the rest of the cleanup, that can happen during the graceful shutdown.
While I would like every signal handling to be based on at least signal-hook-registry, having it the only thing dealing with low-level signal handlers, I can't reasonably assume to be true. But I've looked through several libraries that touch signal handlers and most of them seem to do a similar chained-signal-handlers trick as signal-hook-registry. That means that I have to assume that there are signal handlers chained after us (through the Prev), but also before us, when someone else registers after us, we become their prev and get called from them.

Now, to the specific answers:

It looks to me like the problem is mainly around unregister_signal in signal-hook-registry.

This is a design based on both 1. and 2. When designing it, I was asking „What is the minimum to support the cleanup, in the split way, that needs to go into signal-hook-registry.“ I was sure I needed this in there and I wasn't sure about the exact semantics of what the rest of the signal cleanup should look like, so the other half is in signal-hook.

I think changing the scope of what that function does would be breaking change. But if needed, the second half can go (probably as separate function) into signal-hook-registry too, once we figure out what exactly the second half should be.

So I have two questions about that: I'd assume it's not removed to preserve Prev

There are actually two reasons why it is not done right now:

The actual removal of the signal handler might have already happened, in the injected part. But we can't access the Prev in there, because it needs to be async-signal-safe and accessing that one requires locking.
Some other library, that registered a signal handler after us, is sitting in the signal handler and chain-calling us. If we did just that, we would make the other library not work.

My second question is why unregister_signal cannot remove the signal completely

For one, it is supposed to be only the one half of the handling. For another, having an API that destroys signal handlers of others, while pretending to be reversible seems like a footgun. Both of these together made me uneasy about bundling it in. And it kind of seemed useful to be able to wipe the hooks (that is reversible) from destroying signal handling for ever in the application.

I'd assume this doesn't work when everything got unregistered because the signal handler isn't re-registered with the OS?

Yes, that's the actual problem.

Wouldn't it be possible to check if the length of the actions is zero and conditionally then re-register with the OS?

That's tricky. For one, the „kill“ of the signal handler is not tied to the actual removal of actions, for the reasons of being able to run it inside the signal handler itself. So even if we just split it to the two parts (kill/cleanup), there's a short time in between when this assumption doesn't hold.

On the other hand, it is possible to reach 0 actions by simply unregistering the last action. That doesn't unregister the signal handler in that case and therefore re-registering it with the OS would result in calling everything twice (we could check for the previous action being us, but then, we could get someone else in the case we are being chained-called).

All in all, I don't have a clear and obviously correct answer even about what exactly should happen when, and that probably needs answering first. I think the goal is something in lines of:

Unregistering of the last action should not remove the signal handler, but it should be possible to explicitly ask for it.
It should be possible to do at least the most critical part inside a signal handler.
It should be possible to recover from that removal ‒ this is the last, missing part. We could try doing that as part of registering of a new action automatically, or, when the removal of the handler is explicit, have an explicit call for this too.

One idea (not really finished) is to add a pair of „pause_signal“ and „resume_signal“ calls to signal-hook-registry. It might be possible to have set of AtomicPtr<handler>. The pause_signal would store it there and the resume_signal could take it out it.

But I'm throwing it mostly as a brainstorming idea, not necessarily the solution.

chrisduerr commented 3 years ago

Unregistering of the last action should not remove the signal handler, but it should be possible to explicitly ask for it.

That seems fine for unregister(SigId), but I don't think it needs to be considered for cleanup_signal(SIGNAL). I suppose that's what "explicitly ask for it" means here.

It should be possible to recover from that removal ‒ this is the last, missing part. We could try doing that as part of registering of a new action automatically, or, when the removal of the handler is explicit, have an explicit call for this too.

What exactly do you mean by "recover from that removal"? If we just cleanup_signal(SIGNAL) to reset to the default signal handler and remove all internal handlers, we are in a completely clean state for that specific signal, right? So in that case it seems like the recovery is automatic and all the user needs to do is register things again?

One idea (not really finished) is to add a pair of „pause_signal“ and „resume_signal“ calls to signal-hook-registry. It might be possible to have set of AtomicPtr. The pause_signal would store it there and the resume_signal could take it out it.

I don't see why this would be necessary. I can't really follow your jump from the problems to this solution.

I've also left out some other questions about your reply, but hopefully drilling down on the "goals" that we are trying to solve is more productive than diving too deep into the details around it. Though I'm probably not understanding you entirely on that, so I just want to make this clear.

vorner commented 3 years ago

What I'm trying to get to. It makes sense to split it into the removal of actions and to manipulating the signal handler, for three reasons:

We can't manipulate the actions inside a signal handler. But we may want to remove the signal handler from within the signal handler, to make 100% sure the second CTRL+C really does kill.
It is potentially useful (though I don't have a specific use case) to be able to separately remove all the actions while leaving the signal handler in place, since removing the signal handler has other consequences outside of this library.
It feels like changing the behaviour of signal_hook_registry::unregister_signal to touching the signal handlers themselves is a breaking change, to a crate that's 1.0 and doesn't really have a good way to go 2.0.

That's why I'm reluctant to bundle both into the same call, at least on the side of signal_hook_registry (there's more flexibility on the signal_hook side).

In that sense, the recover from removal is meant as dealing with the state that the async-signal-safe minimal kill of the handler happened, but the removal of actions or adjusting the nearby data structures didn't (since they are inside a Mutex).

That's also why I'm thinking in around atomics, since one can use them inside a signal handler.

chrisduerr commented 3 years ago

What I'm trying to get to. It makes sense to split it into the removal of actions and to manipulating the signal handler, for three reasons:

But cleanup_signal already isn't safe to be called from within the signal handler, right?

We can't manipulate the actions inside a signal handler. But we may want to remove the signal handler from within the signal handler, to make 100% sure the second CTRL+C really does kill.

So people here should call cleanup_raw/cleanup::register and ignore the signal handler bit, right? Isn't this already how this works?

It is potentially useful (though I don't have a specific use case) to be able to separately remove all the actions while leaving the signal handler in place, since removing the signal handler has other consequences outside of this library.

This cannot be done from within the signal handler, but should be handled by unregister(SigId), right?

It feels like changing the behaviour of signal_hook_registry::unregister_signal to touching the signal handlers themselves is a breaking change, to a crate that's 1.0 and doesn't really have a good way to go 2.0.

This is fair, unregister_signal explicitly states that it does not manipulate the signal handler in the OS. But there wouldn't be anything stopping us from adding a new method, right? Like unregister_signal_and_then_we_reset_the_signal_handler_to_the_default? That would be perfectly fine for semver. I have no issue with trying as hard as possible to prevent a breaking change, but of course this will mean that sooner or later changes might affect the API and things like deprecations and duplication of code will occur.

That's why I'm reluctant to bundle both into the same call, at least on the side of signal_hook_registry (there's more flexibility on the signal_hook side).

There's more flexibility on the signal_hook side, but I doubt we'd want to try and modify the signal handler from the outside? So I don't think that's a great idea? This seems like something that would just integrate nicely right into signal-hook-registry while being a tacked on hack in signal-hook. Though I have only a very limited understanding of the two libraries.

vorner commented 3 years ago

There are generally two options for users:

Use something like for sig in Signals::new(...) in its own thread and call cleanup_signal the first thing if any of the termination signals come. This is not a perfect solution, as there's still a chance something will get stuck somewhere in that particular thread and it never happens, but the chance is pretty low. You're right, this call is not async-signal-safe, so it can't be called from within the signal handler.
By doing cleanup::register, the cleanup_raw happens inside the signal handler. Then the user can either decide not to care about the actions, or, as part of that thread with for signal in Signals::new(...), call signal_hook_registry::unregister_signal (the second half of cleanup_signal) manually, in a delayed fashion, to finish the task.

The idea behind unregister_signal is that's just like multiple calls to unregister, except that one doesn't necessarily need to know all the SigIDs.

But there wouldn't be anything stopping us from adding a new method, right?

That's true. But I feel like once we have one method to remove the handler (cleanup_raw), and another to clean the actions (unregister_signal), it makes sense to support users calling both in sequence and be able to reverse that. And such method as you propose could then be just these two calls (that actually is cleanup_signal).

It would feel a bit odd to have two methods that together do effectively the same as the new one with the sole difference of irrecoverably breaking the internal state, while the new method can be reversed. I feel like the right fix is to make it work with the two separate methods (both of them now existing, but we would probably need to move the cleanup_raw, with a more appropriate name, into signal-hook-registry to be able to access whatever internals as needed).

What I mean with the flexibility is that once the ground work ‒ fixing the cleanup_raw so it can be reversed ‒ is done, in some minimal fashion, the next steps can be discussing how the signal_hook::cleanup should act to the end users and that while the recovery can be explicit on the registry side, it could be made automatic on the higher level in the signal-hook.

You seem to be trying to concentrate on unregister_signal, but that one is reversible. The irreversible one is cleanup_raw, and the one that should be fixed, so whatever it does can be turned back to the previous state.

And, if it can be turned back and the recovery is explicit, it would also cover, in a way, the use case you had in #63 ‒ it is basically equivalent to the workaround, except with some support from the library.

chrisduerr commented 3 years ago

The idea behind unregister_signal is that's just like multiple calls to unregister, except that one doesn't necessarily need to know all the SigIDs.

This doesn't make sense to me. You say the point is that it's just like multiple calls to unregister, but unregister_signal is also supposed to call unregister_raw, while unregister is not?

That's true. But I feel like once we have one method to remove the handler (cleanup_raw), and another to clean the actions (unregister_signal), it makes sense to support users calling both in sequence and be able to reverse that. And such method as you propose could then be just these two calls (that actually is cleanup_signal).

So you're saying that you want the existing methods to do this, by removing the OS handler when possible? That seems commendable, but realistically that sounds like it would have to track state and just be unnecessarily complicated? Though I suppose it's fair to assume that one might want to remove one signal at a time, while multiple are registered, but I feel like usually when you just want to go back to the default handler it would be trivial to just instruct the library to do so? I'm not sure about the usecase that would justify this complication.

both of them now existing, but we would probably need to move the cleanup_raw, with a more appropriate name, into signal-hook-registry to be able to access whatever internals as needed

At that point why not support just calling cleanup_raw in signal-hook-registry to nuke everything? That seems like the easiest solution. Why do you want to move cleanup_raw into signal-hook-registry otherwise? If all signals fallback to the default handler, there's no reason to not just call signals.remove(SIGNAL) on the internal action handler, right?

The irreversible one is cleanup_raw, and the one that should be fixed, so whatever it does can be turned back to the previous state.

I don't see how it is? That's just a libc call which can be easily reversed by registring a new signal. The unregister_signal is the one that puts signal-hook-registry in a state that it cannot recover from, since it thinks incorrectly that there is still a handler attached to the OS.

And, if it can be turned back and the recovery is explicit, it would also cover, in a way, the use case you had in #63 ‒ it is basically equivalent to the workaround, except with some support from the library.

Are you suggesting to keep track of the last signal handler after explicitly requesting to delete all signal handlers? Why? It seems like you're trying to go back to the original signal handler after the user explicitly asked to destroy it? That seems backwards to me and I'd personally expect that when I register a new signal handler after clearing all signal handlers it doesn't just restore from some internal state to the previous internally preserved signal handler? Seems kinda leaky.

vorner commented 3 years ago

This doesn't make sense to me. You say the point is that it's just like multiple calls to unregister, but unregister_signal is also supposed to call unregister_raw, while unregister is not?

Where did I say that? What is even unregister_raw? That thing doesn't exist.

So you're saying that you want the existing methods to do this, by removing the OS handler when possible?

I want the existing methods to start behaving in some better way instead of adding multiple new ones and keeping the old, broken ones, around. As for being a bit more complicated, that might be true, but I'll prefer slightly more complicated code than API cluttered by broken methods.

I don't see how it is? That's just a libc call which can be easily reversed by registring a new signal.

By doing just that, you've lost the whole chain of the handlers, including the ones from other libraries. It's irreversible because there's no way to find them after the fact. Right now it is also irreversible because it nukes our signal handler, but doesn't mark it anywhere, so nobody knows it should re-instate it (that's what this bug is about).

My idea for the API is to have:

unregister/unregister_signal, which just remove the actions, either one by one or the whole lot of a single signal, but not touch a signal handler.
pause_signal_handler(signum)/resume_signal_handler(signum) that goes to SIG_DFL/back to what there was.

I feel like that API kind of makes sense/is explainable and also flexible. It supports your scenario:

pause_signal_handler(signum)
raise(signum)
resume_signal_handler(signum)

It also supports double CTRL+C:

pause_signal_handler(signum) (inside signal handler)
unregister_signal(signum) to remove the actual actions and deallocate anything they hold

That seems backwards to me and I'd personally expect that when I register a new signal handler after clearing all signal handlers it doesn't just restore from some internal state to the previous internally preserved signal handler? Seems kinda leaky.

That's why I don't want to call it „destroy them“ but „pause them“ and that's why want the „restore“ to be an explicit call.

chrisduerr commented 3 years ago

By doing just that, you've lost the whole chain of the handlers, including the ones from other libraries. It's irreversible because there's no way to find them after the fact. Right now it is also irreversible because it nukes our signal handler, but doesn't mark it anywhere, so nobody knows it should re-instate it (that's what this bug is about).

But isn't that the point? I don't see why you would want to recover to the previous state? Isn't the goal just to clean up the internal state?

Imagine this: Someone else has a signal handler registered (say stored in prev), then you cleanup everything and add a new signal handler. You don't actually want to recover and restore the old prev sigaction. What you want is everything to get cleaned up and then just add your one action back.

I'm not sure if I'm understanding you correctly, but it sounds like you want to restore to the state from before the handler was reset? I don't see why you would want that.

unregister/unregister_signal, which just remove the actions, either one by one or the whole lot of a single signal, but not touch a signal handler.

Okay, that is already the case today, right? So no touchy this, which makes sense since we don't want to break anything.

pause_signal_handler(signum)/resume_signal_handler(signum) that goes to SIG_DFL/back to what there was.

Why pause? I mean maybe we just have different goals, but it seems very unusual to pause a signal handler to me. I either add stuff or remove stuff. If I want it "paused", I probably just want to remove the signal handler?

I think this is probably the main reason for conflict here, that we seem to want two different things that are just loosely related. You want to be able to go back to the default and then recover back to the previous state. I want to be able to go back to the default and clean up the internal state completely (I don't want anything left after I've cleaned up the signal).

But at that point the library still doesn't have a way to completely clean itself up, no matter what is done it will try to stick around and stay alive. And arguably looking for pause when you want a cleanup will probably make this difficult to find for people that want to just remove something properly.

Also let's just go with pause/resume for now: How do you want to handle the case where someone else has modified the signal handler after you've reset it to the default? As far as I can see, that's the only real problem with using libc for this directly? How does your proposed solution resolve this?

Do you want to check if the action is still SIG_DFL and add the current callback to prev if it is not on resume? That could potentially add a new callback you have to execute every time you resume, right?

vorner commented 3 years ago

I'm not sure if I'm understanding you correctly, but it sounds like you want to restore to the state from before the handler was reset? I don't see why you would want that.

Yes, I want to restore the handler as it was before, with everything. For few reasons:

You seem to want this to remove the actions. But for that we have other APIs. What we are dealing with here is the low level signal handler, not the actions.
Signal hook should not be responsible to cleaning after other libraries. They should have their own ways to disable them.

So, if we have pause/resume + ability to remove the actions, we already can do a full cleanup, assuming the other (potential) libraries have a way to disable too (actually, pause already does that).

If you want to clean everything and start over, you don't need to kill signal-hooks handler, but actions.

If you want to just invoke the default handler, you actually do need a way to return all the original handlers.

There's another use case. Let's say we want to have that double-CTRL+C thing. But also during the graceful shutdown, the application can decide to abort the shutdown. So one would do pause inside the signal handler so the second one kills. But after deciding, it needs to return it back, fully.

I think this is probably the main reason for conflict here, that we seem to want two different things that are just loosely related. You want to be able to go back to the default and then recover back to the previous state. I want to be able to go back to the default and clean up the internal state completely (I don't want anything left after I've cleaned up the signal).

But there's already API for the second half.

Also, we are talking about the low-level API. We can build some general patterns (like „clean up everything, kill it with fire“ function built from these), I just want to have the universally usable building blocks

.As far as I can see, that's the only real problem with using libc for this directly? How does your proposed solution resolve this?

The problem with using libc directly is the end user needs to use unsafe, which I think most applications should not need to do directly, in the middle of business logic. And there's the need to store the previous state somewhere, in a way that can be put into a signal handler. That's not that big deal, but not something one would like to do every time.

As for what to do about these corner cases. I think:

When pausing and it is already SIG_DFL, do nothing (do not overwrite the already saved state, if there's one).
When resuming and there's nothing to resume, also do nothing.
When resuming and the signal handler got replaced, panic (contract violation, one shall not do that).

As for your wish to remove everything and start over ‒ do you have a use case where it's not well served by combination of pause/resume and unregister_signal? If there's one, even though I don't think the library should be responsible for removing handlers of others, some kind of hard_reset_handler that replaces it with just our own signal handler, but without the chain might be possible. Or maybe having some kind of ResumeMode parameter to resume (non-exhaustive enum) that would say how to restore… But I think that should still touch only the handler itself, not the actions, to keep the API orthogonal.

chrisduerr commented 3 years ago

Signal hook should not be responsible to cleaning after other libraries. They should have their own ways to disable them.

Just to be very clear here: Signal hook currently doesn't have any way to clean itself up completely. I'm aware that just resetting everything and falling back to default also cleans up other libraries, but that is still necessary to clean up signal-hook completely.

If you end up in a state where signal hook's internal state is different from when the application was launched, that's not "cleaned up". Sure the signal handling might be removed from the OS, but signal-hook is still juggling state that you must be aware of.

That's why I disagree with the notion that a pause is equivalent to a full cleanup.

If you want to clean everything and start over, you don't need to kill signal-hooks handler, but actions.

You need both. You need to reset the handler to the default and actually remove all actions. But it is not possible to remove the actions. All you can do is remove an action, but signal-hook always keeps an empty list of actions around that will interfere with you doing stuff in the future.

There's another use case. Let's say we want to have that double-CTRL+C thing. But also during the graceful shutdown, the application can decide to abort the shutdown. So one would do pause inside the signal handler so the second one kills. But after deciding, it needs to return it back, fully.

I'm not sure what you mean by this. As soon as you decide to unhook the Ctrl+C, you cannot reverse the action? Unless you mean that the second Ctrl+C isn't sent immediately but the application keeps going for some reason.

I want to be able to go back to the default and clean up the internal state completely (I don't want anything left after I've cleaned up the signal).

But there's already API for the second half.

Which API are you talking about? There's no way to clean up signal-hook's internal state completely as far as I'm aware?

The problem with using libc directly is the end user needs to use unsafe, which I think most applications should not need to do directly, in the middle of business logic. And there's the need to store the previous state somewhere, in a way that can be put into a signal handler. That's not that big deal, but not something one would like to do every time.

That generally doesn't sound like a massive deal to me. Though I suppose offering an abstraction is fine, after all abstractions are what libraries are there for.

When pausing and it is already SIG_DFL, do nothing (do not overwrite the already saved state, if there's one).

By which you mean it was already set to SIG_DFL by signal-hook, so our internal "last sigaction" storage is Some (just as an example in Rust-y terms). That would seem reasonable, just to not throw it all away unnecessarily.

On the other hand, that opens the question of what should happen when other stuff is done with signal-hook in between. What if I want to pause for a while, but then decide to handle other signals in between? Seems like this would put signal-hook in a really awkward API position where it just has this one global state thing that is basically a flag the user must be aware of at all times (even if they themselves didn't initialize it, as you've said, you want signal-hook-registry to be shared between libraries).

When resuming and there's nothing to resume, also do nothing.

So basically calling pause on SIG_DFL signal? That's fair.

When resuming and the signal handler got replaced, panic (contract violation, one shall not do that).

This I think is a really bad idea. You basically run in scenarios where racing conditions that could have okay normal behavior just cause random explosions. This is one of the reasons why putting the storage on the client side would make things much more reasonable, the user can decide if they want to force overwrite to the last action.

An API that has global state like that with multiple libraries arbitrarily interacting with each other would basically behave completely unpredictably. That's not a safe design if you ask me and would be something that discourages me from using libraries that rely on signal-hook out of fear for unreliability.

vorner commented 3 years ago

Honestly, I have doubts this discussion is going anywhere. I mean, I appreciate that you want to help and I'm glad you brought that use case up, in the other issue. But I have the impression that we really have a lot of trouble understanding each other and it is draining my energy. I don't want to give up completely, but I'm unsure what it is you're actually trying to solve (as in, what the root problem is, I suspect there might be some X-Y problem in play here). Is it something more than what you mentioned in the other issue, that you want to invoke the SIG_DFL as part of handling the signal? Maybe we should try first resolving that, maybe in some form of user stories.

Can you explain, maybe with a list of APIs and what they would do, what you have in mind?

Signal hook currently doesn't have any way to clean itself up completely.

Can you explain why you think there should be such a way? First, a lot of libraries don't have such a thing. One example might be rayon, once it starts its threadpool, there's no way to shut it down. Another such example might be the global allocator in libc. Once you do a malloc, it is often not going to return all the memory back to the OS, even if you properly free it, because it will be put into some of its thread-local caches and whereever.

Second, the user doesn't really care about internal state of the library, but about external behaviour. There are generally (for each separate signal) two states:

Signal hook is handling the signal and dispatches it to a (possibly empty) set of actions.
Signal hook is not handling the signal.

The application starts in the second state. During the first registration of an action, the transition to the first state is implicit, because the user clearly wants to have the signals handled by signal-hook. I'm proposing that future switches between the two are to be done explicitly, because the user clearly wants to have a fine-grained control over it. If you're worried about this difference (first vs. second transition), notice that I want to have a resume from empty state to be a NOP, so you can always resume+register.

That's why I disagree with the notion that a pause is equivalent to a full cleanup.

It is not. It is not meant to be a full cleanup. It is transition to the second state, with preserving enough information to return to the first state if need be in the future. But the second state is the user-observable state the application started in, regarding signal handling. It is when the SIG_DFL is run and that's probably the only reason one would want to be in that state. Otherwise, signal-hook with empty set of actions does nothing, only dispatching to other signal handlers that were there.

But it is not possible to remove the actions. All you can do is remove an action, but signal-hook always keeps an empty list of actions around that will interfere with you doing stuff in the future.

What do you mean, that it'll interfere? An empty set of actions is empty, just like at the beginning. You can remove all actions, and have an empty set.

If you mean that it will not transition to the first state implicitly any more, then that is true, but that's not because of the empty set of actions. It is because it remembers that it did set its own signal handler already. That's orthogonal to having an empty set of actions or not (semantically, it starts with an empty set, the fact the vector is not in there is just mere technical implementation detail). If you want signal-hook to „forget“ that it ever set up the signal handler, then that's a different API. I still don't see why you would need it (please, demonstrate), but we can talk about it. That's, however, probably a different problem than the one in this issue (because such call would still be irreversible, since it also throws away all the other handlers in the chain and can't put them back).

I'm not sure what you mean by this. As soon as you decide to unhook the Ctrl+C, you cannot reverse the action? Unless you mean that the second Ctrl+C isn't sent immediately but the application keeps going for some reason.

I mean, to reverse the un-hooking of the handler, I need to both preserve the list of actions in signal hook and the full chain of handlers. If I just „clean up completely“, I've lost both and can't go back.

n the other hand, that opens the question of what should happen when other stuff is done with signal-hook in between. What if I want to pause for a while, but then decide to handle other signals in between?

You resume and alter the list of actions you want to run, by adding and removing.

I think there's still a bit of misunderstanding on how I envision the library should be used.

The basic way to use signal-hook is just by adding and removing actions, either by the low-level register and unregister, or by using some of the higher-level wrappers about these. As long as you stay with these, all the actions are fully independent of each other, so parts of an application, including libraries, don't have to know about each other doing things. That was actually the only way to interact with signal-hook in the beginning. As you start or tear down plugins/areas of your application, each can register actions in a similar way as it would start its own threads or allocate memory and it would be responsible for removing them later.

Nevertheless, the real life is not as nice around signal handling and there are valid use cases that require doing weird things. Removing the signal handler, to get back to SIG_DFL belongs in this category, as well as unregister_signal. The existence of these APIs breaks the part where I don't interact with other's actions. These things are able to break unrelated parts of code. That's why they come with a warning and in general, you should never use them inside a library. They can be used in the top-level application, where the author has some knowledge of the whole global state so they can make a judgment that it's OK to just break the actions of everything by nuking them.

The pause/resume belongs to the same category. Anything that touches the raw signal handler necessarily must belong to this category, because by doing so you potentially break other libraries. If you are going to randomly add and remove and replace handlers belonging to others, your application will eventually end up in some weird broken state where parts think they are hooked to a signal, but will miss it. Such random misbehaviours are, in my opinion, even worse than well-behaved panic, because then you at least know you need to fix it.

That's not a safe design if you ask me and would be something that discourages me from using libraries that rely on signal-hook out of fear for unreliability.

Again, libraries are not supposed to pause/resume. They are supposed to only add their action and remove it, as they need. If they do pause/resume, they're doing it wrong and yes, they would be unreliable, but not more so than if they just randomly juggle signal handlers that don't belong to them on their own. If some library is messing with the signal handlers without your knowledge, in some unsynchronized way, you're already in trouble.

As for putting the storage to the user. I don't know if that's better, but it at least makes it really hard for the user to use that inside the signal handler. First, users should not need to write unsafe to use the library, unless they do something weird. Second, extracting infomation from inside a signal handler is probably arcane enough to scare 90% of users of the library.

So, to recap, please provide the actual problem you're trying to solve, because I suspect there's some deep misunderstanding somewhere.

chrisduerr commented 3 years ago

Can you explain why you think there should be such a way? First, a lot of libraries don't have such a thing. One example might be rayon, once it starts its threadpool, there's no way to shut it down. Another such example might be the global allocator in libc. Once you do a malloc, it is often not going to return all the memory back to the OS, even if you properly free it, because it will be put into some of its thread-local caches and whereever.

Other libraries having to do this doesn't mean it's a good thing or shouldn't be removed whenever possible. The problem with not having the internal state cleared up should be quite clear: I read the documentation, expected something to happen, but because the state did not get cleared up the library behaved unexpectedly for me. A library not doing what the user expects it to do is generally never a good thing.

The application starts in the second state. During the first registration of an action, the transition to the first state is implicit, because the user clearly wants to have the signals handled by signal-hook.

And by performing a "cleanup" the user doesn't "clearly want" to remove signal-hook handling? At that point why even call it cleanup when it's impossible to clean anything up?

What do you mean, that it'll interfere? An empty set of actions is empty, just like at the beginning. You can remove all actions, and have an empty set.

It's exactly not like at the beginning. That's the point. You remove everything, but you never get back to the beginning. You think you might, but you do not. You cannot get back to the beginning. Because of that, there's no way to add a signal handler again after throwing everything away, because you're not in an equivalent state to the beginning.

Even if you add the pause and unpause APIs, that wouldn't change anything about the existing APIs being footguns when it comes to actually using them. You'd have to wrap around it in signal-hook somehow to create a sane API and even then signal-hook-registry would still be unusable for anyone trying to talk to it directly.

If you mean that it will not transition to the first state implicitly any more, then that is true, but that's not because of the empty set of actions. It is because it remembers that it did set its own signal handler already. That's orthogonal to having an empty set of actions or not (semantically, it starts with an empty set, the fact the vector is not in there is just mere technical implementation detail).

It's not orthogonal at all though. Having empty actions is how signal-hook tracks the state of existing OS event handlers. The two are directly connected.

If you want signal-hook to „forget“ that it ever set up the signal handler, then that's a different API.

So at that point there's another cleanup method? At that point this library probably is a minefield more complicated to navigate than the libc API.

I still don't see why you would need it (please, demonstrate), but we can talk about it.

You don't see why someone would want to clean up the signal-hook state completely to the state it was in during the beginning, even though you've explicitly pointed out that libraries are supposed to share signal-hook-registry while the application interacts with signal hook too?

It's difficult for me to see how this couldn't lead to issues with libraries having no way to transition from their state to the application state. It would effectively be impossible for a library that might have setup a signal handler for some reason to return to the default and allow the application itself to take over. I don't see how pause/unpause could help here? I strongly doubt that you would expect the library to clean up all actions, then call pause, just for the application to require knowledge that the library did that, call unpause and setup actions again?

I think there's still a bit of misunderstanding on how I envision the library should be used.

I do not think this is the issue. I think the issue is that I disagree with your vision for the signal-hook API. Of course I can only make suggestions, it's your library after all. But I think the API you're going for is problematic.

Nevertheless, the real life is not as nice around signal handling and there are valid use cases that require doing weird things. Removing the signal handler, to get back to SIG_DFL belongs in this category, as well as unregister_signal.

Labeling something so standard a "weird thing" seems very strange to me. I'd assume you just mean this is more complicated than just registering a signal handler and you don't mean that this is something that shouldn't be focused upon, but just to be clear: If a signal handling library doesn't concern itself with these things and just makes the assumption that it's basically "some weird edge case", that library is not very helpful over doing it yourself. It ignores significant usecases that are commonly used by applications (I'd hazard a guess and say almost every reasonably sized terminal application does this). Handling these more complicated scenarios is what you (as in I) want a library for, if my application is the only thing dealing with signals and I just want to register a handler and not care about anything else, libc makes this trivially easy.

These things are able to break unrelated parts of code. That's why they come with a warning and in general, you should never use them inside a library.

So what are libraries supposed to do? Not use signal-hook? I mean if that's your suggestion, fair enough, at least that's transparent. But if libraries are only supposed to use some minimal APIs, is that even helpful anymore at that point? Or wouldn't it be easier to just say "signal-hook only focuses on applications, do not use in a library".

It just strikes me as odd that some things are done because signal-hook might be used in a library, while then at the same time saying half the things in signal-hook should never be used by a library. Having such a vague stance of course gives great flexibility, but I think it's ultimately harmful.

Such random misbehaviours are, in my opinion, even worse than well-behaved panic, because then you at least know you need to fix it.

I'm not so sure about that. As long as misbehaviors are somewhat sane, it is possible for a racing condition to exist without causing any problems. But crashing is always a problem. If you have an application that somewhat cares about reliability and staying up and running, then it just crashes because of some racing condition with a possibly unclear error caused by a dependency, that could get quite annoying. Of course this makes the assumption that all possible results of the racing condition are "sane".

Though I certainly agree that a simple and clear issue telling the developer what to fix can be much more beneficial than trying to hunt down a bug you do not understand.

Making the assumption that people read and understand the documentation completely is also quite a big ask. If the API isn't structured in a way that would prevent abuse, it will be abused. There's no documentation you can write to fix that, that's one of the great strengths of Rust after all (designing safety into APIs).

Again, libraries are not supposed to pause/resume. They are supposed to only add their action and remove it, as they need.

So basically just like libc, you have a bunch of invariants you need to ensure are met or otherwise everything might explode.

Second, extracting infomation from inside a signal handler is probably arcane enough to scare 90% of users of the library.

It seems to me like scared is exactly what users should be though. There's so many don'ts here that aren't even obvious to me after having thought about the issue, that it seems almost impossible for someone to use it without doing something stupid.

So, to recap, please provide the actual problem you're trying to solve, because I suspect there's some deep misunderstanding somewhere.

The problem I'm trying to solve is creating a sane and reasonable API that provides real benefits over the C API. I'm only making suggestions here of course, but with signal-hook's position in Rust's ecosystem I think people would benefit more from one great library instead of everyone just writing their own libraries because what exists already doesn't work.

vorner commented 3 years ago

OK, thanks for once again not providing a use case. I'm done here for now, I'm shelving this for later. You still haven't provided a good reason why you'd want to clean up and start creating actions again from scratch. What is such an event in application life time to throw everything out and construct it again?

And no, I didn't say „Don't use signal-hook in libraries“. I said „Use the composable register/unregister APIs in libraries“. For that matter, prefer them in application too.

chrisduerr commented 3 years ago

I've provided a usecase, you just don't care about it. Whenever I talk about libraries you just say that you don't want to actually properly support signal handling from libraries (no, register/unregister is not sufficient).

I'm always happy to provide feedback and suggestions, but it doesn't seem to me like you're interested in anything but finding a reason for why your solution should be the one that definitely works best.

I've moved on to just use libc completely in my library, so I don't have to rely on the janky signal-hook behavior anymore. But I just think that this is very unfortunate and I'd be happy to see that changed in the future.

If you're actually looking for feedback, please do let me know, I'm always happy to help with that. But I don't think you're interested in any alternative API suggestions at this point.

vorner commented 3 years ago

No, you haven't provided a use case. Well, you have, the one in the other issue and that is covered by the API I propose. You haven't provided a use case where register/unregister + pause/resume is not sufficient. And use case is something like „I need to listen on SIGTERM and when it comes, invoke the default action of SIGWINCH“. „Properly support signal handling from libraries“ is not a use case and it's not clear what you actually mean by properly support.

You seem to be asking for some kind of „Nuke everything and start over, from a library“ thing. For one, I don't see why you would need such a thing. For another, I don't see how that would ever be a good idea to do that, because that can nuke everything that belongs to others. That's a similar API to „kill all threads in the application except the one I live in“ and I hope you see why that might be a bad idea, because they are threads the library you're in knows nothing about. And that's why fork and multithreading doesn't really go together. I don't see how using such API would ever be possible without invoking race conditions at best.

You seem to be stating that the way libc does it is the only true way. If that's the case, then signal-hook is not for you, I'm afraid. Because signal-hook is an attempt to bring some sanity into signal handling, saying that register/unregister should, in general, be enough. That's similar to what Rust does with lifetimes and such ‒ it certainly doesn't allow you to do everything you could with raw pointers (doubly-linked linked lists are a prime example), but most of the time you don't need and want to risk it. Signal-hook is meant to not let you do all crazy things libc allows you to. The existence of the cleanup module (which is probably a bit misnomer anyway) goes a bit against that and that's what this issue is meant to fix.

If you have some actual proposal for API, then share it, but please do it in a clean way that lists what calls you would want to have, what they would do and how they would not break the idea of signal actions being composable and independent and why you want it that way ‒ what use cases does it solve. Because I seriously don't understand what you're trying to do (except redesign the API for some API with different goals which are not even clearly stated).

chrisduerr commented 3 years ago

that is covered by the API I propose.

It is not, since it is made from a library. So your API does not support this usecase at all.

Because signal-hook is an attempt to bring some sanity into signal handling

I fail to see where the sanity is in signal-hook. In fact I find the libc API to be a lot more clear about its own constraints, which in turn makes it easier to reason about and more sane. That's why I'd advocate to make signal-hook behave sanely, which it currently does not.

vorner commented 3 years ago

It is not, since it is made from a library. So your API does not support this usecase at all.

It might be my lack of imagination, but IMO disconnecting the whole application, including handlers I have nothing to do with, from the signal they registered to is never correct from a library. Not even libc gives you that ability. That's why I want to clearly say such use case is not supported, because it is impossible to support. Now, prove me wrong and show me a code that does that without hurting other parts of the program you know nothing about.

I fail to see where the sanity is in signal-hook.

In being able to add or remove signal actions without knowing about other parts of application doing the same.
Doing so without race conditions that can cause you to lose signal occurences even in a multi-threaded program.
Not having to care about async-signal-safe.

That is, as long as you do register/unregister only. The pause/resume do not come into this category, unfortunately. Same as uregister_signal. But AFAIK nor does any solution that just nukes anything. All these need to be done with proper care and from the point where one has a global knowledge of everything happening inside the program.

chrisduerr commented 3 years ago

Not even libc gives you that ability.

Yes it does. ncurses does this for example.

vorner commented 3 years ago

Can you explain how? Or, does ncurses assume that it is the only one dealing with the signals and, in effect, being the place with the global knowledge (where „application“ vs „library“ is a simplification; the distinction being if one can or can't make assumptions of what other bits of code do)? Or does it just hope for the best? What happens if I put two ncurses-like libraries into the same program, will they fight?

vorner commented 3 years ago

The whole cleanup module this issue is about is now gone in 0.3, replaced by tricks to emulate the default handlers instead of restoring them.