corentin-bl opened 1 week ago
There is a way to reproduce:

1. Create a python/flask plugin:

   ```shell
   bootstrap_plugin.py create foo
   cd foo && make develop
   ```

2. In the plugin's `config.ini`, set:

   ```
   max_age=30
   ```

   so that circus restarts the plugin every 30 to 60 seconds.

3. Edit the `main/wsgi.py` script, adding a SIGTERM handler and making the request processing last 30s:
```python
import os
import mflog
import time
import signal
from flask import Flask, render_template

app = Flask(__name__)
if os.environ.get('MFSERV_CURRENT_APP_DEBUG', '0') == '1':
    app.config['PROPAGATE_EXCEPTIONS'] = True
logger = mflog.get_logger(__name__)


def handle_sigterm(signum, frame):
    logger.info(f"signal {signum} received")
    time.sleep(30)
    logger.info(f"end of signal {signum} handling")
    exit(0)


signal.signal(signal.SIGTERM, handle_sigterm)


@app.route("/foo/")
def hello_world():
    time.sleep(30)
    logger.info("This is an info message")
    return "Hello World !"
```
Check the log:

```text
# plugin is up and running
2024-10-16T13:54:52.204106Z [DEBUG] (bjoern_wrapper#4068258) process start
2024-10-16T13:54:54.936047Z [DEBUG] (signal_wrapper#4068173) socket_up on nginx
2024-10-16T13:53:31.698795Z [DEBUG] (urllib3.connectionpool#4066145) Starting new HTTP connection (1): 127.0.0.1:18868
2024-10-16T13:53:31.700266Z [DEBUG] (urllib3.connectionpool#4066145) http://127.0.0.1:18868 "GET /__socket_up/L2hvbWUvbWZzZXJ2L3Zhci9hcHBfZm9vX21haW5fMS5zb2NrZXQ= HTTP/1.1" 200 None

# we execute the program through the url http://localhost:18868/foo/ (30s duration) in a browser
# circus sends a SIGTERM to restart the plugin
2024-10-16T13:55:38.177021Z [INFO] (signal_wrapper#4068173) received SIGTERM => smart closing...
2024-10-16T13:55:38.178173Z [INFO] (signal_wrapper#4068173) stopping new connections and waiting for no connection left on /home/mfserv/var/app_foo_main_1.socket (timeout: 60 seconds)
2024-10-16T13:55:38.180883Z [DEBUG] (urllib3.connectionpool#4068173) Starting new HTTP connection (1): 127.0.0.1:18868

# metwork is waiting for no connection left on the socket, calling __socket_down...
# end of the program execution:
2024-10-16T13:55:54.411389Z [INFO] (main.wsgi#4068258) This is an info message

# __socket_down now answers OK (no connection left on the socket):
2024-10-16T13:55:54.729325Z [DEBUG] (urllib3.connectionpool#4068173) http://127.0.0.1:18868 "GET /__socket_down/L2hvbWUvbWZzZXJ2L3Zhci9hcHBfZm9vX21haW5fMS5zb2NrZXQ=?wait=1 HTTP/1.1" 200 None

# SIGTERM is sent to the bjoern_wrapper child process:
2024-10-16T13:55:54.730175Z [INFO] (signal_wrapper#4068173) sending SIGTERM to 4068258

# so the handle_sigterm function is called:
2024-10-16T13:55:54.817808Z [INFO] (main.wsgi#4068258) signal 15 received

# SIGKILL is received by the process, so the handle_sigterm function cannot finish!
2024-10-16T13:55:58.015018Z [WARNING] (signal_wrapper#4068173) sending SIGKILL to 4068258 and its children
2024-10-16T13:55:58.050907Z [INFO] (signal_wrapper#4068173) process: 4068258 (smart) stopped
2024-10-16T13:55:58.431029Z [DEBUG] (bjoern_wrapper#4071232) process start
```

=> we notice the `handle_sigterm` function cannot finish, as a SIGKILL is received...
Do you agree with this way to reproduce, @corentin-bl?
If we look at the above log and at the code of https://github.com/metwork-framework/mfserv/blob/master/adm/signal_wrapper.py, we are in this part of the code when the SIGKILL is triggered:

```python
if SIG_SHUTDOWN is not None and args.timeout_after_signal > 0:
    now = datetime.datetime.now()
    delta = now - SIG_SHUTDOWN
    if delta.total_seconds() > args.timeout_after_signal:
        LOGGER.warning("sending SIGKILL to %i and its children", pid)
        kill_process_and_children(pid)
```
Moreover, we notice the 3 seconds of delay in the log:

```text
2024-10-16T13:55:54.817808Z [INFO] (main.wsgi#4068258) signal 15 received
2024-10-16T13:55:58.015018Z [WARNING] (signal_wrapper#4068173) sending SIGKILL to 4068258 and its children
```

So the `timeout_after_signal` delay (3s) is responsible for the SIGKILL during the SIGTERM handling.
If we look at the command in `circus.ini`:

```text
args = --stdout /home/mfserv/log/app_foo_main.log --stderr STDOUT -- plugin_wrapper foo -- signal_wrapper.py --timeout=60 --signal=15 --timeout-after-signal=3 --socket-up-after=3 /home/mfserv/var/app_foo_main_$(circus.wid).socket -- bjoern_wrapper.py --timeout=60 main.wsgi:app "/home/mfserv/var/app_foo_main_$(circus.wid).socket"
```

we notice this `--timeout-after-signal=3`: this is the `smart_stop_delay=3` from the plugin's `config.ini` (cf. `plugins.py`).
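So the knob is a plain config key. As a hedged sketch (the exact section name depends on how your plugin's `config.ini` is laid out, so adapt it; only the `smart_stop_delay` key itself comes from the discussion above):

```
# plugin's config.ini (section name is an assumption, check your generated file)
[app_main]
smart_stop_delay=40
```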
So I set `smart_stop_delay=40` and then it works fine:

```text
2024-10-16T16:34:43.876512Z [INFO] (signal_wrapper#158942) sending SIGTERM to 158983
2024-10-16T16:34:43.917286Z [INFO] (main.wsgi#158983) signal 15 received
2024-10-16T16:35:13.917849Z [INFO] (main.wsgi#158983) end of signal 15 handling
```
Hi Matthieu, thanks for looking at this so quickly! I will try changing the configuration parameter to see if it does anything. However, as mentioned in the initial post, it seems that the part of our function that was executed lasted only ~0.4 ms, way lower than 3 s, so I don't know if increasing the timeout will help... I'll test it today and let you know!
Unfortunately, and as expected, increasing `smart_stop_delay` to a greater value (60s here) did not change anything:

```text
2024-10-17T13:54:56.452578Z [INFO] (signal_wrapper#2491393) sending SIGTERM to 2491470
2024-10-17T13:54:56.525893Z [INFO] (sacdp#2491470) entering exit_handler function
2024-10-17T13:54:56.526036Z [INFO] (sacdp#2491470) current_thread_name='MainThread'
2024-10-17T13:54:56.526088Z [INFO] (sacdp#2491470) exit_handler: received signal SIGTERM
2024-10-17T13:54:56.526129Z [INFO] (sacdp#2491470) exit_handler: getting connection variable
2024-10-17T13:54:56.526168Z [INFO] (sacdp#2491470) after get_local
2024-10-17T13:55:56.858723Z [WARNING] (signal_wrapper#2491393) sending SIGKILL to 2491470 and its children
```
Regarding the way to reproduce, we did not use Flask to build our plugin. We have a script that defines the `application(environ, start_response)` function, which is called by `bjoern_wrapper.py`. I don't know if this is different from a Flask app; I would say that it doesn't matter, but I prefer to let you know.
Hello @matthieumarrast, I have made some progress on the debugging and I have found two things:

1. We call `getattr` in a custom function. When we run that custom function, `getattr` is called, and it is set up so that if the attribute is not found, `None` is returned. During my tests, I saw that, even if `getattr` returns `None`, our wrapper function does not return anything (I don't understand why so far). The variable is later reused in the function, which is fine if it exists, but causes a `NameError` if not. This seems to be a problem for me (and @mrechte, who wrote the initial code).
2. Python raises a `NameError` if we try to use a non-existing variable, but we could never see it before. I discovered it because you advised me to use `try/except` statements (which I combined with `traceback.format_exc()`), but otherwise mfserv would never have told me. I think mfdata is able to catch bubbled-up errors and to display them in the logs; is it normal that mfserv acts differently?

And a final question: now that I am able to catch the `NameError`, I managed to run the function until the end. The last statement is a `sys.exit(0)`, which would imply that the SIGTERM is enough to stop the process. However, we still see in the logs that a SIGKILL is sent; is that normal?
Can you provide here a piece of code that shows mfserv and mfdata behaving differently regarding "bubbled-up" errors?
I also noticed the `sys.exit(0)` has no effect on the loop waiting for the smart stop delay and a SIGKILL is sent anyway... but I have not analysed that point. I agree that this doesn't seem normal.
Hi Matthieu, here are the two examples:

mfdata:

```python
#!/usr/bin/python3
"""
Demo code for GitHub issue
"""
from acquisition import AcquisitionStep


class GitHubDemo(AcquisitionStep):
    """Demo class."""

    plugin_name = "insert_cdp"
    step_name = "github_issue_demo"

    def process(self, xaf):
        """Demo method"""
        self.info(f"process for file {xaf.filepath}")
        my_var = 1
        assert my_var == 0
        return True

    def plugin_path(self):
        """Other method"""
        path = self.get_plugin_directory_path()
        return path


if __name__ == "__main__":
    x = GitHubDemo()
    x.run()
```
mfserv:

```python
"""
Demo code for GitHub issue
"""
import signal


def exit_handler(sig, frame):
    """Exit function where the error happens."""
    print("Entering exit_handler")
    my_var = 1
    assert my_var == 0
    print("After the assertion error")
    return True


signal.signal(signal.SIGTERM, exit_handler)


def application(environ, start_response):
    """Main function"""
    print("Entering application function")
    return True
```
Let me know if this works as is or if you need further info (e.g. from the config file). While preparing the examples, I noticed that in the case of mfserv, errors happening in the `application` function were caught, but as you will see in the example, errors in the `exit_handler` are not...
We have an mfserv plugin that responds to HTTP requests using data fetched from a database, and we want to control the closing of database connections.

Normally, this plugin is configured so that circus spawns 8 processes, lasting 1 hour each. The process hierarchy is as such: circus -> `signal_wrapper.py` -> `bjoern_wrapper.py`. When `max_age` is reached, circus sends a signal to `signal_wrapper.py`, which forwards it to `bjoern_wrapper.py`, and eventually to our plugin. We understand that SIGTERM is sent first, and then after some timeout (3s according to the arguments given to `signal_wrapper.py`), SIGKILL is sent.

We have implemented a function in our plugin that is triggered upon signal reception. It simply retrieves the database connection and closes it. In practice, we see that SIGTERM is well caught and that the function starts. However, we can tell from the log that the function is never fully executed, and SIGKILL is eventually sent after 3s.

From the log messages, it is clear that the function did not exceed the timeout, as the part of the function that was executed lasted ~0.4 ms, so it is not clear why it does not run until the end.

@mrechte and I tried to investigate the propagation of signals through processes, but it seems that everything works normally. @matthieumarrast and I tried to dive deeper into the code of the `socket_down` function in `signal_wrapper`; an idea would be to keep the socket busy during the execution of our exit handler, but this remains to be tested, and it would still look like a patch-up job. If anybody has a better understanding of what happens when circus terminates processes, it would be much appreciated.

Steps to reproduce (mfserv 2.2):

- use `signal` from the standard library to bind a handler function to the signal; make the function long enough (e.g. using `time.sleep()`) and print messages in the log file,