Request to modify multirun behavior regarding SIGTERM handling

guoard commented 6 months ago

Greetings! I'd like to express my appreciation for this fantastic tool.

In my container, I run Gunicorn and Celery under multirun. Celery runs as a master process that spawns worker processes according to its configuration. However, when multirun receives SIGTERM, it forwards the signal to all processes, including the workers that Celery spawned and that multirun does not manage directly.

This causes a problem during Celery's graceful shutdown. Ideally, the signal should reach only the main Celery process and not its child processes. The Celery FAQ explains this requirement in more detail: Celery FAQ.

Is there a way to disable this default behavior or tailor it for specific process groups during the startup of multirun?

nicolas-van commented 6 months ago

Hello. That is multirun's normal behavior, as explained on the main page: https://github.com/nicolas-van/multirun?tab=readme-ov-file#technical-description-of-behavior

To explain further: this behavior is similar to the default setting (control-group) of systemd's KillMode configuration parameter: https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html#KillMode=

systemd has a good reason to choose this mode as its default: it is the safest one, while still making it technically possible to manage the lifecycle of sub-children manually (by spawning them in sub control groups, exactly as all init processes do).

The only difference between systemd and multirun is that systemd provides multiple complex modes through configuration (so it's technically possible to make systemd work as Celery requires by setting its KillMode parameter to process), whereas multirun is meant to remain very simple. For that reason, I'm not willing to implement multiple modes of process management for now.
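To illustrate the point about sub-children being managed separately, here is a minimal Python sketch of the analogous mechanism at the process-group level. It is an assumption that this maps onto multirun in the same way (the thread only discusses systemd's control groups), and it is not multirun's code. The names `CHILD_SCRIPT`, `subchild_survives`, and `timeout` are hypothetical. A stand-in program starts one sub-child, optionally in its own session, and the parent then signals the stand-in's whole process group.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in for a supervised program: it starts one sub-child,
# optionally in its own session (and so its own process group), then sleeps.
CHILD_SCRIPT = """
import subprocess, sys, time
detach = sys.argv[1] == "detach"
sub = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=detach,
)
print(sub.pid, flush=True)
time.sleep(60)
"""


def subchild_survives(detach, timeout=2.0):
    """Signal the child's whole group and report whether the sub-child is still running."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SCRIPT, "detach" if detach else "same-group"],
        stdout=subprocess.PIPE,
        start_new_session=True,  # setsid(): the child leads a new process group
    )
    sub_pid = int(child.stdout.readline())  # also confirms the sub-child was started
    survived = False
    try:
        os.killpg(child.pid, signal.SIGTERM)  # group-wide signal
        child.wait(timeout=timeout)
        # The sub-child inherited the pipe's write end, so the pipe reaches
        # EOF only once both the child and the sub-child are gone.
        readable, _, _ = select.select([child.stdout], [], [], timeout)
        survived = not readable
        return survived
    finally:
        if survived:
            os.kill(sub_pid, signal.SIGKILL)  # clean up the detached sub-child
        child.kill()  # no-op if the child has already been reaped
        child.wait()
        child.stdout.close()
```

On a POSIX system, `subchild_survives(detach=True)` would be expected to report that the sub-child outlived the group-wide SIGTERM, while `subchild_survives(detach=False)` would not. Whether a given program (Celery included) can be made to start its workers this way is outside what the thread establishes.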

guoard commented 6 months ago

Thank you for your response.

I've adjusted the source code on my fork. After removing the - character at this location (https://github.com/nicolas-van/multirun/blob/master/multirun.c#L225C24-L225C25), the signal appears to be sent only to the direct child processes rather than to the entire process group.
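For illustration, the effect of that leading - can be reproduced outside multirun. The following is a minimal Python sketch, not multirun's C code: `os.kill` wraps the same POSIX `kill` call, where a negative first argument addresses a whole process group and a positive one addresses a single process. The names `CHILD_SCRIPT`, `subchild_exits`, and `timeout` are hypothetical, and using `start_new_session=True` to give the child its own group is an assumption made for the sketch.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in for a supervised program such as a Celery master:
# it starts one sub-child, reports readiness, then sleeps.
CHILD_SCRIPT = """
import subprocess, sys, time
subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("ready", flush=True)
time.sleep(60)
"""


def subchild_exits(to_group, timeout=2.0):
    """Send SIGTERM to the child or to its whole group; report whether the sub-child exited."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SCRIPT],
        stdout=subprocess.PIPE,
        start_new_session=True,  # setsid(): the child leads a new process group
    )
    pgid = child.pid  # a group leader's PGID equals its PID
    try:
        child.stdout.readline()  # block until the sub-child has been started
        # kill(-pgid, sig) targets every member of the group;
        # kill(pid, sig) targets only the direct child.
        os.kill(-pgid if to_group else child.pid, signal.SIGTERM)
        child.wait(timeout=timeout)
        # The sub-child inherited the pipe's write end, so the pipe reaches
        # EOF only once both the child and the sub-child are gone.
        readable, _, _ = select.select([child.stdout], [], [], timeout)
        return bool(readable) and child.stdout.read(1) == b""
    finally:
        try:
            os.killpg(pgid, signal.SIGKILL)  # clean up any survivor
        except OSError:
            pass  # nothing left in the group to signal
        child.wait()
        child.stdout.close()
```

On a POSIX system, `subchild_exits(to_group=True)` would be expected to report that the sub-child exited along with its parent, and `subchild_exits(to_group=False)` that it kept running after the direct child was terminated.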

Would this modification be enough for our requirements, or are there any additional changes you would recommend? If it is enough, would you like me to submit a pull request that introduces this behavior as a command-line option?

nicolas-van commented 6 months ago

Normally that's all you need if you want to fork the project.

As for the pull request proposal: thanks, but I'll decline. As explained, the current behavior is intentional.
