metwork-framework / mfserv

metwork/mfserv module
http://metwork-framework.org
BSD 3-Clause "New" or "Revised" License
12 stars 6 forks source link

[mfserv] Conflict between process termination by circus and exit handler functions coded in out plugin #636

Open corentin-bl opened 1 week ago

corentin-bl commented 1 week ago

We have an mfserv plugin that responds to HTTP requests using data fetched from a database, and we want to control database connections closures.

Normally, this plugin is configured so that circus spawns 8 processes, lasting 1 hour each. The process hierarchy is as such : circus -> signal_wrapper.py -> bjoern_wrapper.py. When max_age is reached, circus sends a signal to signal_wrapper.py, which forwards it to bjoern_wrapper.py, and eventually to our plugin. We understand that SIGTERM is first sent, and then after some timeout (3s according to the arguments given to signal_wrapper.py), SIGKILL is sent.

We have implemented a function in our plugin that is triggered upon signal reception. It simply retrieves the database connection and closes it. In practice, we see that SIGTERM is well caught and that the function starts. However, we can tell from the log that the function is never fully executed, and SIGKILL is eventually sent after 3s.

From the log messages, it is clear that the function did not exceed the timeout as the part of the function that was executed lasted ~0.4ms, so it is not clear why it does not run until the end.

@mrechte and I tried to investigate the propagation of signals through processes, but it seems that everything works normally. @matthieumarrast and I tried to dive deeper in the code of the socket_down function in signal_wrapper ; an idea would be to keep the socket busy during the execution of our exit handler, but this remains to be tested, and this would still look like a patch-up job. If anybody has a better understanding of what happens when circus terminates processes, it would be much appreciated.

Steps to reproduce (mfserv 2.2) :

matthieumarrast commented 1 week ago

There is a way to reproduce:

# create a python/flask plugin
bootstrap_plugin.py create foo
cd foo && make develop

in plugin's config.ini, set:

max_age=30

for circus to restart plugin each 30 to 60 second

Edit the main/wsgi.py script by adding a sigterm handler, and making the process lasting 30s:

import os
import mflog
import time
import signal
from flask import Flask, render_template

app = Flask(__name__)
if os.environ.get('MFSERV_CURRENT_APP_DEBUG', '0') == '1':
    app.config['PROPAGATE_EXCEPTIONS'] = True
logger = mflog.get_logger(__name__)

def handle_sigterm(signum, frame):
    logger.info(f"signal {signum} received")
    time.sleep(30)
    logger.info(f"end of signal {signum} handling")
    exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

@app.route("/foo/")
def hello_world():
    time.sleep(30)
    logger.info("This is an info message")
    return "Hello World !"

Check log:

# plugin is up and running
2024-10-16T13:54:52.204106Z    [DEBUG] (bjoern_wrapper#4068258) process start
2024-10-16T13:54:54.936047Z    [DEBUG] (signal_wrapper#4068173) socket_up on nginx
2024-10-16T13:53:31.698795Z    [DEBUG] (urllib3.connectionpool#4066145) Starting new HTTP connection (1): 127.0.0.1:18868
2024-10-16T13:53:31.700266Z    [DEBUG] (urllib3.connectionpool#4066145) http://127.0.0.1:18868 "GET /__socket_up/L2hvbWUvbWZzZXJ2L3Zhci9hcHBfZm9vX21haW5fMS5zb2NrZXQ= HTTP/1.1" 200 None
# we execute the program through url http://localhost:18868/foo/ (30s duration) in browser
# circus send a SIGTERM for restarting the plugin
2024-10-16T13:55:38.177021Z     [INFO] (signal_wrapper#4068173) received SIGTERM => smart closing...
2024-10-16T13:55:38.178173Z     [INFO] (signal_wrapper#4068173) stopping new connections and waiting for no connection left on /home/mfserv/var/app_foo_main_1.socket (timeout: 60 seconds)
2024-10-16T13:55:38.180883Z    [DEBUG] (urllib3.connectionpool#4068173) Starting new HTTP connection (1): 127.0.0.1:18868
# metwork is waiting for no connection left on socket, calling __socket_down...
# end of the program execution:
2024-10-16T13:55:54.411389Z     [INFO] (main.wsgi#4068258) This is an info message
# __socket_down now answers OK (no connection left on the socket):
2024-10-16T13:55:54.729325Z    [DEBUG] (urllib3.connectionpool#4068173) http://127.0.0.1:18868 "GET /__socket_down/L2hvbWUvbWZzZXJ2L3Zhci9hcHBfZm9vX21haW5fMS5zb2NrZXQ=?wait=1 HTTP/1.1" 200 None
# SIGTERM is send to the bjoern_wrapper child process:
2024-10-16T13:55:54.730175Z     [INFO] (signal_wrapper#4068173) sending SIGTERM to 4068258
# so the handle_sigterm functions is called:
2024-10-16T13:55:54.817808Z     [INFO] (main.wsgi#4068258) signal 15 received
# SIGKILL received by the process, the handle_sigterm function cannot ends !
2024-10-16T13:55:58.015018Z  [WARNING] (signal_wrapper#4068173) sending SIGKILL to 4068258 and its children
2024-10-16T13:55:58.050907Z     [INFO] (signal_wrapper#4068173) process: 4068258 (smart) stopped
2024-10-16T13:55:58.431029Z    [DEBUG] (bjoern_wrapper#4071232) process start

=> we notice the handle_sigterm function cannot ends as a SIGKILL is received...

do you agree with this way to reproduce @corentin-bl ?

matthieumarrast commented 1 week ago

If we look at the above log and the code of https://github.com/metwork-framework/mfserv/blob/master/adm/signal_wrapper.py we are in this part of code when the SIGKILL is triggered:

if SIG_SHUTDOWN is not None and args.timeout_after_signal > 0:
            now = datetime.datetime.now()
            delta = now - SIG_SHUTDOWN
            if delta.total_seconds() > args.timeout_after_signal:
                LOGGER.warning("sending SIGKILL to %i and its children", pid)
                kill_process_and_children(pid)

moreover we notice the 3 secondes of delay in the log: 2024-10-16T13:55:54.817808Z [INFO] (main.wsgi#4068258) signal 15 received 2024-10-16T13:55:58.015018Z [WARNING] (signal_wrapper#4068173) sending SIGKILL to 4068258 and its children

so the timeout_after_signal delay (3s) is responsible of the SIGKILL during the sigterm handling.

if we look in the circus.ini the command: args = --stdout /home/mfserv/log/app_foo_main.log --stderr STDOUT -- plugin_wrapper foo -- signal_wrapper.py --timeout=60 --signal=15 --timeout-after-signal=3 --socket-up-after=3 /home/mfserv/var/app_foo_main_$(circus.wid).socket -- bjoern_wrapper.py --timeout=60 main.wsgi:app "/home/mfserv/var/app_foo_main_$(circus.wid).socket" we notice this --timeout-after-signal=3 this is the smart_stop_delay=3 in the plugin's config.ini (cf. plugins.py)

So I set smart_stop_delay=40 then it works fine:

2024-10-16T16:34:43.876512Z     [INFO] (signal_wrapper#158942) sending SIGTERM to 158983
2024-10-16T16:34:43.917286Z     [INFO] (main.wsgi#158983) signal 15 received
2024-10-16T16:35:13.917849Z     [INFO] (main.wsgi#158983) end of signal 15 handling
corentin-bl commented 6 days ago

Hi Matthieu, thanks for looking at this so quickly! I will try changing the configuration parameter to see if it does anything. However, as mentionned in the initial post, it seems that the part of our function that was executed lasted only 0.4ms, way lower than 3s, so I don't know if increasing the timeout will help ... I'll test it today and let you know!

corentin-bl commented 6 days ago

Unfortunately, and as expected, increasing smart_stop_delay to a greater value (60s here) did not change anything:

2024-10-17T13:54:56.452578Z     [INFO] (signal_wrapper#2491393) sending SIGTERM to 2491470
2024-10-17T13:54:56.525893Z     [INFO] (sacdp#2491470) entering exit_handler function
2024-10-17T13:54:56.526036Z     [INFO] (sacdp#2491470) current_thread_name='MainThread'
2024-10-17T13:54:56.526088Z     [INFO] (sacdp#2491470) exit_handler: received signal SIGTERM
2024-10-17T13:54:56.526129Z     [INFO] (sacdp#2491470) exit_handler: getting connection variable
2024-10-17T13:54:56.526168Z     [INFO] (sacdp#2491470) after get_local
2024-10-17T13:55:56.858723Z  [WARNING] (signal_wrapper#2491393) sending SIGKILL to 2491470 and its children
corentin-bl commented 6 days ago

Regarding the way to reproduce, we did not use Flask to build our plugin. We have a script that defines the application(environ, start_response) function, which is called by bjoern_wrapper.py. I don't know if this is different from a Flask app, I would say that it doesn't matter, but I prefer to let you know.

corentin-bl commented 5 days ago

Hello @matthieumarrast, I have made some progress on the debug and I have found two things:

And a final question: now that I am able to catch the NameError, I managed to run the function until the end. The last statement is a sys.exit(0), which would imply that the SIGTERM is enough to stop the process. However, we still see in the logs that a SIGKILL is sent, is it normal?

matthieumarrast commented 2 days ago
corentin-bl commented 9 hours ago

Hi Matthieu, here are the two examples :

mfdata :

#!/usr/bin/python3
"""
Demo code for GitHub issue
"""

from acquisition import AcquisitionStep

class GitHubDemo(AcquisitionStep):
    """Demo class."""

    plugin_name = "insert_cdp"
    step_name = "github_issue_demo"

    def process(self, xaf):
        """Demo method"""
        self.info(f"process for file {xaf.filepath}")

        my_var = 1

        assert my_var == 0

        return True

    def plugin_path(self):
        """Other method"""
        path = self.get_plugin_directory_path()

        return path

if __name__ == "__main__":
    x = GitHubDemo()
    x.run()

mfserv :

"""
Demo code for GitHub issue
"""

import signal

def exit_handler(sig, frame):
    """Exit function where the error happens."""
    print("Entering exit_handler")
    my_var = 1
    assert my_var == 0
    print("After the assertion error")
    return True

signal.signal(signal.SIGTERM, exit_handler)

def application(environ, start_response):
    """Main function"""
    print("Entering application function")
    return True

Let me know if this works as is or if you need further info (e.g. from the config file for instance).

While preparing the examples, I noticed that in the case of mfserv, errors happenning in the application function were caught, but as you will see in the example, errors in the exit_handler are not ...