ros2 / launch

Tools for launching multiple processes and for writing tests involving multiple processes.
Apache License 2.0
124 stars 139 forks source link

Fix #666 handle SIGTERM #712

Open ciandonovan opened 1 year ago

ciandonovan commented 1 year ago

Fixes #666

SIGTERM

Instead of installing signal handlers with the OS directly, ROS2 Launch uses Python's signal.set_wakeup_fd to queue and forward signals to handlers registered from the async event loop.

Although one of these handlers has been set for SIGTERM, it never gets called since when that signal gets delivered, the kernel terminates the process immediately. The process never changes the default disposition of that signal. Delivery of SIGINT however, doesn't cause the kernel to terminate the process, and instead that signal get dutifully forwarded to the appropriate handler. This is despite there being no difference in the ROS2 Launch code in setting them both up.

After running an strace on a one-line Python program and seeing that a handler was still automatically installed for SIGINT, I came across this StackOverflow post. Depending on whether a Python script is run in some combination of foreground process and in an interactive shell, a handler will be installed for SIGINT for use in the KeyboardInterrupt exception. It just so happens that it also gets forwarded by signal.set_wakeup_fd, and since the handler prevents the program from being automatically terminated, it can actually do so.

This is why the handler for SIGINT by chance works, but the one for SIGTERM does not.

The fix for this was to simply install the default interrupt signal handler on any signal the async event loop requested handled, so that they would no longer cause the program to terminate and instead be forwarded to signal.set_wakeup_fd.

SIGQUIT

SIGQUIT signals are often generated by the user hitting 'CTRL+\' in an interactive shell. However, when the terminal driver receives those characters, it doesn't just send a SIGQUIT to the running process, but to the entire foreground process group.

By default, the ROS2 Client Libraries don't register a handler for SIGQUIT, as they do for SIGINT and SIGTERM, so the ROS2 node will be terminated immediately and its core dumpted on receipt. So there is little reason for ROS2 Launch to catch this signal or the purposes of forwarding it on to its children.

Also, not catching SIGQUIT means that if there is a problem with the Python signal handling, then the user still has available a shortcut on their keyboards to unilaterally kill the process.

The signal handler for SIGQUIT was removed.

Future work

This Pull Request allows ROS2 Launch to handle SIGTERMs and initiate shutdown, but sends SIGINTs instead of SIGTERMs to the child processes. This will usually be fine since the ROS2 Client Libraries install handlers for both SIGINT and SIGTERM by default, and is no worse than the current situation, but it would be more flexible to allow users to send whatever signals they want to their nodes.

To do this would require removing the due_to_sigint and send_sigint arguments of many functions and replacing them with an argument passing the name of the signal to be forwarded. Care would need to be taken of other signals besides SIGINT such as SIGQUIT and SIGTSTP that also tend to get sent to the entire process group to avoid duplicates. Currently if the child node doesn't shutdown within 5 seconds an "escalated" SIGTERM is sent - but if SIGTERM was the initial signal then should it progress straight to SIGKILL? Or wait 10 seconds?

If others feel this is worth developing let me know and I'll work on another PR, but I feel this is in good shape now and will already be of benefit to others upstream.

@wjwwood - mentioned in the relevant ToDo for this feature

ciandonovan commented 1 year ago

I had to modify test_signal_management.py to account for the fact I'm using signal.default_int_handler by catching and ignoring KeyboardInterrupt like in the main code.

Maybe some sort of global scope noop function might be more appropriate, needs to be global otherwise the assert won't pass with separate noop functions in the main and test code.

I'd argue however that the changes to the tests would be justified even without the rest of this Pull Request. The current upstream behaviour is that signal.default_int_handler is still installed implicitly, but only for SIGINT, and a KeyboardInterrupt catch is also present to stop it from exiting the program immediately. I've just extended this behaviour to cover any arbitrary signal, and explicitly installed a handler for SIGTERM too.

ciandonovan commented 4 months ago

@fujitatomoya has been a long time since I worked on this, some other regressions were introduced during the rebase also. Agree with your suggestions, but might rework the patches more in general over the next week, take a fresh look at it. Glad to see interest in getting this resolved, happy to work with you if you have further suggestions or ideas on approach. maybe aligning with that's being done in https://github.com/ros2/ros2cli/issues/895?

fujitatomoya commented 4 months ago

@ciandonovan i believe your suggestion posted on here makes sense. launch and run process does not really know how to handle the signals, so that they should forward the signals to the application logic to shutdown or finalize the application. application by here could be, storing some data using memory cache, accessing physical hardware device or sensors. in these cases, application logics need to receive the signals to shutdown or finalize it properly, otherwise it could be left unexpected state or error result.

ciandonovan commented 4 months ago

@fujitatomoya @wjwwood @clalancette is there a case for catching and handling any signals other than SIGINT and SIGTERM?

Currently only SIGINT is handled as no signal handler on the process level is ever installed for anything else, despite the current code assuming so.

My pull request as it stands catches both SIGINT and SIGTERM, and forwards SIGINT to the child processes in both cases. I wrote it to handle any signals installed to the AsyncSafeSignalManager.handle(), but this flexibility might not be required? Also, it doesn't forward any signal if launch is in "noninteractive" mode, which shouldn't be the case for SIGTERM at least as it has different semantics to SIGINT (which is commonly sent to the entire foreground process group on Ctrl-C from the terminal).

If it is the case that only those two signals are of interest, I'll adjust to handle both accordingly - both forwarded when nonintereactive, and only SIGTERM forwarded when interactive. The ROS 2 Client Libraries already handle both signals gracefully so shouldn't introduce any unexpected issues by forwarding SIGTERM.

fujitatomoya commented 4 months ago

@ciandonovan i am not sure what signals launch process should deal, because launch does and provides much more than ros2 run that is just a wrapper to execute the command from the arguments and parameters. that is the main reason that i just forward the signals to child process how they want to deal with https://github.com/ros2/ros2cli/issues/895. in general sigint and sigterm need to be handled, but there could be specific signal that launch should handle... that is something i would like to know too.

ciandonovan commented 4 months ago
    send_sigint? True : False
    +----------------+--------+---------+
    |                | SIGINT | SIGTERM |
    +----------------+--------+---------+
    | interactive    | False  | True    |
    | noninteractive | True   | True    |
    +----------------+--------+---------+

Launch either detects, through checking if there's a controlling terminal, or the value of the flag --noninteractive, to determine if launch was launched interactively or not.

If it was launched interactively, then the receipt of SIGINT is implicitly assumed to have come from the user pressing Ctrl-C and their terminal sending a SIGINT to the entire foreground process group. In such case, it is redundant to also send SIGINT from launch to the child ROS nodes, however it should be noted that even if another signal was sent there would be no ill-effects.

In all other cases, assume the child nodes were not themselves signalled, so manually send SIGINT, and later escalate to SIGTERM after timeout, to the nodes.

ciandonovan commented 4 months ago

@fujitatomoya from reading the Linux signal man page it's hard to make a case for handling any signals other than SIGINT and SIGTERM.

The signals SIGABRT, SIGALRM, SIGCHLD, SIGIO, SIGLOST, SIGPIPE, SIGPOLL, SIGURG, SIGVTALRM, are raised in response to actions taken by the program itself by design. The signals SIGEMT, SIGPROF, SIGTRAP are only relevant for in-built debugging and profiling (not relevant here). The signals SIGBUS, SIGFPE, SIGILL, SIGSEGV, SIGSYS are unlikely to be raised in a Python program. The signals SIGCONT, SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU shouldn't be handled as is their actions are normal expected behavior. The other signals are either unused, synonyms, irrelevant unhandleable, or would be delivered to all processes anyway.

I had considered SIGHUP, but then again if a user explicitly wants their remote processes to continue after a terminal disconnect, then that's already possible with the well-known nohup, so doesn't make sense to alter that default behavior.

Further, only SIGINT and SIGTERM are currently handled by the ROS 2 Client Libraries used by the nodes, so forwarding arbitrary other signals to them would just cause them to immediately terminate anyway, which is not desired.

I think there could be a bigger conversation around changing some of the existing semantics of Launch, in particular I don't think the current behavior of sending a SIGINT, then a SIGTERM after timeout makes much sense. The POSIX standard doesn't specify any order of precedence between them, and rcl handles them both identically. There is no escalation in practice, nor do the signals' semantics indicate so either. There's also code for handling SIGKILL which just isn't possible as that's an uncatchable signal. However, that should probably be addressed in a separate pull request.

As it stands, this pull request is essentially a bugfix. Its behavior is now as it seems it was designed for from reading the docs, the code's comments, and its structure. SIGTERM was never actually handled before at all, as no handler was installed, and now it works quite nicely both being run as a systemd service and in a Podman/Docker container. Also works nice as a Podman container running inside a systemd service ;)

wjwwood commented 3 months ago

We've been discussing this pr and the ros2run signal handling pr (https://github.com/ros2/ros2cli/pull/899) in our Jazzy release channels and I think the plan is to make this a bugfix shortly after the release, and after it has some time to soak in Rolling.

So we'll try to get these merged, and then backport after the release as a bugfix.