pytroll / trollflow2

Next generation Trollflow. Trollflow is for batch-processing satellite data using Satpy
https://trollflow2.readthedocs.org/
GNU General Public License v3.0

Add handling for SIGTERM #217

Closed. pnuu closed this 1 month ago

pnuu commented 1 month ago

This PR adds handling for the SIGTERM signal so that Trollflow2 can be terminated gracefully, for example in container environments.
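For orientation (this is not the actual diff in the PR), the general shape of such handling is a flag that stops the message loop while letting the message currently being processed finish; the names below (`run`, `process_message`) are illustrative only:

```python
import signal


def run(msg_iterator, process_message):
    """Sketch: process messages until SIGTERM arrives, then finish the current one and exit."""
    keep_running = True

    def _handle_sigterm(signum, frame):
        # Only flip the flag; the message being processed is allowed to finish.
        nonlocal keep_running
        keep_running = False

    signal.signal(signal.SIGTERM, _handle_sigterm)

    for msg in msg_iterator:
        process_message(msg)
        if not keep_running:
            break
```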

pnuu commented 1 month ago

There are currently two tests: one runs generate_messages in a subprocess, the other does the same with Runner. In the latter case I get zmq.error.ZMQError: Invalid argument (addr='localhost:40000') even though the connection_parameters are identical to the former. This causes the exit code to be 1 instead of the expected 0.

Another problem is that I'd expect the process to stay up for 5 seconds, because the message iteration loop waits 5 seconds for new messages before timing out. Instead the process terminates immediately.
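For context, the overall shape of such a test (a sketch with a made-up target, not the actual test_sigterm_generate_messages / test_sigterm_runner code from the PR) is to start the workload in a subprocess, deliver SIGTERM, and assert on the exit code:

```python
import os
import signal
import time
from multiprocessing import Process


def _target():
    """Hypothetical stand-in for the code under test, e.g. generate_messages."""
    keep_running = True

    def _handler(signum, frame):
        nonlocal keep_running
        keep_running = False

    signal.signal(signal.SIGTERM, _handler)
    while keep_running:
        time.sleep(0.1)


def test_sigterm_exits_cleanly():
    proc = Process(target=_target)
    proc.start()
    time.sleep(1)  # give the child time to install its signal handler
    os.kill(proc.pid, signal.SIGTERM)
    proc.join()
    # Graceful shutdown should give exit code 0, not 1 and not -SIGTERM.
    assert proc.exitcode == 0
```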

pnuu commented 1 month ago

If I test with this script

#!/usr/bin/env python

import os
import signal
import time


def main():
    """Loop until SIGTERM is received, then report how long termination took."""
    print(os.getpid())
    keep_looping = True
    tic = None

    def _signal_handler(signum, frame):
        nonlocal keep_looping
        nonlocal tic
        print("signal received")
        keep_looping = False
        tic = time.time()

    signal.signal(signal.SIGTERM, _signal_handler)

    # The flag is only re-checked once the current sleep has finished.
    while keep_looping:
        time.sleep(10)

    print("terminated with", signal.SIGTERM)
    print("time after termination", time.time() - tic)


if __name__ == "__main__":
    main()

and run kill from another terminal on the printed PID, I see that the time between the signal and the actual termination depends on the time.sleep() duration. I thought these approaches should be pretty much equivalent.
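An aside not stated in the thread, but it likely explains the sleep-length dependence: since Python 3.5 (PEP 475), time.sleep() is automatically resumed for the remaining time after a signal handler returns, so keep_looping is only re-checked once the full 10 second sleep has elapsed. Polling with a short sleep bounds the reaction time accordingly; a minimal variant of the script above:

```python
import os
import signal
import time

keep_looping = True


def _signal_handler(signum, frame):
    global keep_looping
    keep_looping = False


signal.signal(signal.SIGTERM, _signal_handler)
print(os.getpid())

# Poll the flag at a short interval; with a single long sleep, PEP 475 means
# the sleep resumes after the handler returns and delays the check.
while keep_looping:
    time.sleep(0.5)
print("terminated")
```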

pnuu commented 1 month ago

Also using

    # in_queue is the listener's output queue; queue.Empty comes from the
    # standard library queue module.
    while keep_looping:
        try:
            _ = in_queue.get(True, 10)  # block for up to 10 seconds for a new message
        except queue.Empty:
            pass

works as expected and the 10 second timeout is respected.

pnuu commented 1 month ago

I'm guessing the premature termination happens because posttroll.listener.ListenerContainer runs in a separate thread that gets killed, so the queue also exits.

I'll see if I can adapt test_sigterm_runner so that there is a proper "workload" running and that it runs to completion.

pnuu commented 1 month ago

With https://github.com/pytroll/trollflow2/pull/217/commits/11e5dff668a3ecf6d66fb12c466d4b94f3fe8156 the processing is now simulated, which shows that we wait for it to finish before exiting.

I'll add a note about the SIGTERM signaling to documentation.
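For context, "simulating the processing" in a test can be as simple as swapping the real processing for a target that takes a known amount of time and leaves a marker behind, so the test can assert the work ran to completion even though SIGTERM arrived mid-run. A sketch with made-up names, not the code in the linked commit:

```python
import time
from pathlib import Path


def _fake_workload(marker_file, duration=5):
    """Stand-in for real processing: take some time, then leave proof of completion."""
    time.sleep(duration)
    Path(marker_file).write_text("done")

# In the test: send SIGTERM while _fake_workload is running, join the process,
# and assert that the marker file exists, i.e. the workload was not cut short.
```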

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 96.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 95.93%. Comparing base (6b6e6b1) to head (4aebf81). Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| trollflow2/launcher.py | 77.77% | 2 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #217      +/-   ##
==========================================
- Coverage   95.93%   95.93%   -0.01%
==========================================
  Files          12       12
  Lines        3052     3098      +46
==========================================
+ Hits         2928     2972      +44
- Misses        124      126       +2
```


pnuu commented 1 month ago

It is interesting that Codecov complains that _signal_handler is not covered: running all the tests locally with pytest -s prints out the signaling log message. I did try using caplog to catch those messages (the fixture still seems to be there in test_sigterm_runner()...), but I guess it didn't work because the logs are emitted in a separate thread/process and the messages are not propagated in my test case. I'll have another go.

pnuu commented 1 month ago

Hmm, no, I can't capture the log messages even though I get them to print on the terminal. I also tried wrapping everything from Process() up to proc.join() in the trollflow2.logging.logging_on context, but I still only see the messages on the terminal and caplog.text stays empty.
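That is consistent with how caplog works: it only sees records handled by the logging machinery of the test process itself, so records emitted in a spawned process never reach it unless they are shipped back to the parent. The standard-library pattern for that is a QueueHandler in the child and a QueueListener in the parent; a generic sketch, not Trollflow2's actual queue logging:

```python
import logging
import logging.handlers
import multiprocessing


def child(log_queue):
    # In the child: route all records into the shared queue.
    logger = logging.getLogger()
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.setLevel(logging.INFO)
    logger.info("hello from the child process")


if __name__ == "__main__":
    log_queue = multiprocessing.Queue()
    # In the parent: drain the queue and hand the records to a handler
    # living in this process.
    listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
    listener.start()
    proc = multiprocessing.Process(target=child, args=(log_queue,))
    proc.start()
    proc.join()
    listener.stop()
```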

pnuu commented 1 month ago

And this is what is printed to the terminal when using the logging_on() context:

[DEBUG: 2024-10-21 15:48:53 : trollflow2.launcher] Launching trollflow2 with subprocesses
[INFO: 2024-10-21 15:48:53 : trollflow2.launcher] New message received.
[DEBUG: 2024-10-21 15:48:53 : trollflow2.launcher] pytroll://my/topic file lahtinep@lahtinep21lk 2024-10-21T12:48:53.117334 v1.01 application/json {"filename": "foo"}
[INFO: 2024-10-21 15:48:54 : trollflow2.launcher] Caught signal to stop processing new messages and to terminate Trollflow2.
[INFO: 2024-10-21 15:48:59 : trollflow2.launcher] All 0 files produced nominally in 0:00:06.460422
mraspaud commented 1 month ago

> And this is what is printed to the terminal when using the logging_on() context:
>
> [DEBUG: 2024-10-21 15:48:53 : trollflow2.launcher] Launching trollflow2 with subprocesses
> [INFO: 2024-10-21 15:48:53 : trollflow2.launcher] New message received.
> [DEBUG: 2024-10-21 15:48:53 : trollflow2.launcher] pytroll://my/topic file lahtinep@lahtinep21lk 2024-10-21T12:48:53.117334 v1.01 application/json {"filename": "foo"}
> [INFO: 2024-10-21 15:48:54 : trollflow2.launcher] Caught signal to stop processing new messages and to terminate Trollflow2.
> [INFO: 2024-10-21 15:48:59 : trollflow2.launcher] All 0 files produced nominally in 0:00:06.460422

Yes, caplog and the queue logger sometimes ignore each other. The question is: in which thread/process does the signal handling run? That might influence where the log goes.

pnuu commented 1 month ago

> The question is: in which thread/process does the signal handling run? That might influence where the log goes.

In test_sigterm_generate_messages() the signal handling happens in the main thread; there's only the one process.

It gets more complicated in test_sigterm_runner(): generate_messages() is called in launcher.Runner._get_message_iterator() and the iterator is passed on to _run_threaded() or _run_subprocess(). The queue_logged_process is given a message from the message-receiving iterator. The message-receiving part should thus be within the main process of the Runner that is started in the test case.
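One detail worth keeping in mind here (from the CPython documentation, not specific to Trollflow2): Python signal handlers are always executed in the main thread of the main interpreter, regardless of which other threads are running, and only that thread may call signal.signal(). A quick way to confirm this on a POSIX system:

```python
import os
import signal
import threading
import time


def _handler(signum, frame):
    # Always prints "MainThread", even with worker threads running.
    print("handler ran in", threading.current_thread().name)


signal.signal(signal.SIGTERM, _handler)
threading.Thread(target=time.sleep, args=(5,), name="worker", daemon=True).start()

os.kill(os.getpid(), signal.SIGTERM)  # deliver SIGTERM to our own process
time.sleep(0.5)  # give the main thread a moment to run the handler
```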

pnuu commented 1 month ago

Looks like the messages do not end up in the listener.output_queue. If I revert https://github.com/pytroll/trollflow2/pull/217/commits/aefd2a33fd131f24065dcbafd123d968680c2406 the _fake_queue_logged_process is called again.

mraspaud commented 1 month ago

> Looks like the messages do not end up in the listener.output_queue. If I revert aefd2a3 the _fake_queue_logged_process is called again.

Ok, then I suppose the patched subscriber doesn't work with the ListenerContainer. That is another issue that we need to fix in another PR, so feel free to put the mocking back.

pnuu commented 1 month ago

Ok, I reverted from patched_subscriber_recv to mocking ListenerContainer. I also shortened the output queue wait time from 5 seconds to 1 second to make testing a bit more streamlined, and removed the unused caplog fixture from test_sigterm_runner().