joblib / loky

Robust and reusable Executor for joblib
http://loky.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
520 stars 47 forks source link

Unstable test_processes_crash_handling_after_executor_gc #388

Open ogrisel opened 1 year ago

ogrisel commented 1 year ago

Here is a failure I can trigger regularly on macOS M1 with Python 3.11.0.

Note that while the problem always happens in test_processes_crash_handling_after_executor_gc, this failure does not happen if I run this test in isolation, so it must have a dependency on some side effects of the previous tests run in TestsProcessPoolLokyShutdown.

pytest -v tests/test_process_executor_loky.py -x -k TestsProcessPoolLokyShutdown
============================================================================================== test session starts ==============================================================================================
platform darwin -- Python 3.11.0, pytest-7.2.0, pluggy-1.0.0 -- /Users/ogrisel/mambaforge/envs/dev/bin/python3.11
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/ogrisel/code/loky/.hypothesis/examples')
Using --randomly-seed=1436039263
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /Users/ogrisel/code/loky
plugins: snapshot-0.9.0, xdist-3.1.0, mock-3.10.0, clarity-1.0.1, hypothesis-6.62.0, randomly-3.12.0, profiling-1.7.0, cov-4.0.0, repeat-0.9.1, anyio-3.6.2, benchmark-4.0.0
collected 50 items / 36 deselected / 14 selected                                                                                                                                                                

tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_context_manager_shutdown PASSED                                                                                                   [  7%]
tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_interpreter_shutdown PASSED                                                                                                       [ 14%]
tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_shutdown_and_kill_workers PASSED                                                                                                  [ 21%]
tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_processes_terminate PASSED                                                                                                        [ 28%]
tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_shutdown_deadlock_pickle PASSED                                                                                                   [ 35%]
tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_processes_crash_handling_after_executor_gc [DEBUG:MainProcess:ExecutorManagerThread] terminate process LokyProcess-36, reason: broken executor
[DEBUG:MainProcess:ExecutorManagerThread] terminate process LokyProcess-35, reason: broken executor
[DEBUG:MainProcess:ExecutorManagerThread] terminate process LokyProcess-34, reason: broken executor
[DEBUG:MainProcess:ExecutorManagerThread] terminate process LokyProcess-33, reason: broken executor
[DEBUG:MainProcess:ExecutorManagerThread] found 0 processes to stop
[DEBUG:MainProcess:ExecutorManagerThread] sent 0 sentinels to the call queue
[DEBUG:MainProcess:ExecutorManagerThread] closing call_queue
[DEBUG:MainProcess:ExecutorManagerThread] telling queue thread to quit
[DEBUG:MainProcess:ExecutorManagerThread] Queue.join_thread()
[DEBUG:MainProcess:ExecutorManagerThread] closing result_queue
[DEBUG:MainProcess:ExecutorManagerThread] closing thread_wakeup
[DEBUG:MainProcess:QueueFeederThread] feeder thread got sentinel -- exiting
[DEBUG:MainProcess:ExecutorManagerThread] joining 0 processes
[DEBUG:MainProcess:ExecutorManagerThread] executor management thread clean shutdown of 0 workers
FAILED                                                                                 [ 42%]teardown class with <loky.backend.context.LokyContext object at 0x1518db310>

=================================================================================================== FAILURES ====================================================================================================
_________________________________________________________________ TestsProcessPoolLokyShutdown.test_processes_crash_handling_after_executor_gc __________________________________________________________________

self = <tests.test_process_executor_loky.TestsProcessPoolLokyShutdown object at 0x151b1ab50>

    def test_processes_crash_handling_after_executor_gc(self):
        # Start 5 easy jobs on 5 workers
        results = self.executor.map(sleep_and_return, [0.01] * 5, range(5))

        # Enqueue a job that will trigger a crash of one of the workers.
        # Make sure this crash does not happen before the non-failing jobs
        # have returned their results by using and multiprocessing Event
        # instance

        crash_result = self.executor.submit(self._wait_and_crash)
        assert len(self.executor._processes) == self.worker_count
        processes = self.executor._processes
        executor_flags = self.executor._flags

        # The following should trigger the GC and therefore shutdown of
        # workers. However the shutdown wait for all the pending jobs to
        # complete first.
        executor_reference = weakref.ref(self.executor)
        self.executor = None

        if IS_PYPY:
            # Object deletion and garbage collection can be delayed under PyPy.
            time.sleep(1.0)
            gc.collect()

        # Make sure that there is not other reference to the executor object.
        assert executor_reference() is None

        # The remaining jobs should still be processed in the background
>       for result, expected in zip(results, range(5)):

tests/_test_process_executor.py:267: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
loky/process_executor.py:962: in _chain_from_iterable_of_lists
    for element in iterable:
../../mambaforge/envs/dev/lib/python3.11/concurrent/futures/_base.py:619: in result_iterator
    yield _result_or_cancel(fs.pop())
../../mambaforge/envs/dev/lib/python3.11/concurrent/futures/_base.py:317: in _result_or_cancel
    return fut.result(timeout)
../../mambaforge/envs/dev/lib/python3.11/concurrent/futures/_base.py:456: in result
    return self.__get_result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = None

    def __get_result(self):
        if self._exception:
            try:
>               raise self._exception
E               loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
E               
E               The exit codes of the workers are {SIGSEGV(-11)}

../../mambaforge/envs/dev/lib/python3.11/concurrent/futures/_base.py:401: TerminatedWorkerError
--------------------------------------------------------------------------------------------- Captured stderr setup ---------------------------------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] created semlock with handle 19 and name "None"
[DEBUG:MainProcess:MainThread] created semlock with handle 30 and name "None"
[DEBUG:MainProcess:MainThread] created semlock with handle 31 and name "None"
[DEBUG:MainProcess:MainThread] created semlock with handle 33 and name "None"
[DEBUG:MainProcess:MainThread] created semlock with handle 36 and name "None"
[DEBUG:MainProcess:MainThread] created semlock with handle 37 and name "None"
[DEBUG:MainProcess:MainThread] ProcessPoolExecutor is setup
[DEBUG:MainProcess:MainThread] created semlock with handle 38 and name "None"
[DEBUG:MainProcess:MainThread] launched python with pid 83484 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.11', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-33', '--pipe', '41']
[DEBUG:MainProcess:MainThread] created semlock with handle 40 and name "None"
[DEBUG:MainProcess:MainThread] launched python with pid 83485 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.11', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-34', '--pipe', '23']
[DEBUG:MainProcess:MainThread] created semlock with handle 21 and name "None"
[DEBUG:MainProcess:MainThread] launched python with pid 83486 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.11', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-35', '--pipe', '25']
[DEBUG:MainProcess:MainThread] created semlock with handle 24 and name "None"
[DEBUG:MainProcess:MainThread] launched python with pid 83487 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.11', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-36', '--pipe', '29']
[DEBUG:MainProcess:MainThread] created semlock with handle 28 and name "None"
[DEBUG:MainProcess:MainThread] launched python with pid 83488 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.11', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-37', '--pipe', '42']
[DEBUG:MainProcess:MainThread] Adjusted process count to 5: [('LokyProcess-33', 83484), ('LokyProcess-34', 83485), ('LokyProcess-35', 83486), ('LokyProcess-36', 83487), ('LokyProcess-37', 83488)]
[DEBUG:MainProcess:MainThread] _start_executor_manager_thread called
[DEBUG:MainProcess:ExecutorManagerThread] Queue._start_thread()
[DEBUG:MainProcess:ExecutorManagerThread] doing self._thread.start()
[DEBUG:MainProcess:QueueFeederThread] starting thread to feed data to pipe
[DEBUG:MainProcess:ExecutorManagerThread] ... done self._thread.start()
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 30 and name "/loky-83416-b7nelueg"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 31 and name "/loky-83416-3p9ap7np"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 33 and name "/loky-83416-2k_hnmsk"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 36 and name "/loky-83416-53jw0x_k"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 37 and name "/loky-83416-5afq40qu"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 30 and name "/loky-83416-b7nelueg"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 31 and name "/loky-83416-3p9ap7np"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 33 and name "/loky-83416-2k_hnmsk"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 36 and name "/loky-83416-53jw0x_k"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 37 and name "/loky-83416-5afq40qu"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 30 and name "/loky-83416-b7nelueg"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 31 and name "/loky-83416-3p9ap7np"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 33 and name "/loky-83416-2k_hnmsk"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 36 and name "/loky-83416-53jw0x_k"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 37 and name "/loky-83416-5afq40qu"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 30 and name "/loky-83416-b7nelueg"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 31 and name "/loky-83416-3p9ap7np"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 33 and name "/loky-83416-2k_hnmsk"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 36 and name "/loky-83416-53jw0x_k"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 37 and name "/loky-83416-5afq40qu"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 30 and name "/loky-83416-b7nelueg"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 31 and name "/loky-83416-3p9ap7np"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 33 and name "/loky-83416-2k_hnmsk"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 36 and name "/loky-83416-53jw0x_k"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 37 and name "/loky-83416-5afq40qu"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 11 and name "/loky-83416-klhowfld"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 12 and name "/loky-83416-ftxbcy07"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 14 and name "/loky-83416-evpeqrdo"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 15 and name "/loky-83416-m6nkb6ze"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 16 and name "/loky-83416-xrxztfrx"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 19 and name "/loky-83416-jysyq_e1"
[DEBUG:LokyProcess-33:MainThread] recreated blocker with handle 38 and name "/loky-83416-on6nv0cb"
[INFO:LokyProcess-33:MainThread] child process calling self.run()
[DEBUG:LokyProcess-33:MainThread] Worker started with timeout=None
--------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] Executor collected: triggering callback for QueueManager wakeup
[DEBUG:LokyProcess-33:MainThread] psutil return memory size: 31064064
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 11 and name "/loky-83416-klhowfld"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 12 and name "/loky-83416-ftxbcy07"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 14 and name "/loky-83416-evpeqrdo"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 15 and name "/loky-83416-m6nkb6ze"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 16 and name "/loky-83416-xrxztfrx"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 19 and name "/loky-83416-jysyq_e1"
[DEBUG:LokyProcess-35:MainThread] recreated blocker with handle 21 and name "/loky-83416-pwxs2_bx"
[INFO:LokyProcess-35:MainThread] child process calling self.run()
[DEBUG:LokyProcess-35:MainThread] Worker started with timeout=None
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 11 and name "/loky-83416-klhowfld"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 12 and name "/loky-83416-ftxbcy07"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 14 and name "/loky-83416-evpeqrdo"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 15 and name "/loky-83416-m6nkb6ze"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 16 and name "/loky-83416-xrxztfrx"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 19 and name "/loky-83416-jysyq_e1"
[DEBUG:LokyProcess-34:MainThread] recreated blocker with handle 40 and name "/loky-83416-b3dktkgs"
[INFO:LokyProcess-34:MainThread] child process calling self.run()
[DEBUG:LokyProcess-34:MainThread] Worker started with timeout=None
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 11 and name "/loky-83416-klhowfld"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 12 and name "/loky-83416-ftxbcy07"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 14 and name "/loky-83416-evpeqrdo"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 15 and name "/loky-83416-m6nkb6ze"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 16 and name "/loky-83416-xrxztfrx"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 19 and name "/loky-83416-jysyq_e1"
[DEBUG:LokyProcess-36:MainThread] recreated blocker with handle 24 and name "/loky-83416-hwk3h8lb"
[INFO:LokyProcess-36:MainThread] child process calling self.run()
[DEBUG:LokyProcess-36:MainThread] Worker started with timeout=None
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 11 and name "/loky-83416-klhowfld"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 12 and name "/loky-83416-ftxbcy07"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 14 and name "/loky-83416-evpeqrdo"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 15 and name "/loky-83416-m6nkb6ze"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 16 and name "/loky-83416-xrxztfrx"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 19 and name "/loky-83416-jysyq_e1"
[DEBUG:LokyProcess-37:MainThread] recreated blocker with handle 28 and name "/loky-83416-htsjxz9c"
[INFO:LokyProcess-37:MainThread] child process calling self.run()
[DEBUG:LokyProcess-37:MainThread] Worker started with timeout=None
[DEBUG:MainProcess:ExecutorManagerThread] A worker unexpectedly terminated. Workers that might have caused the breakage: {'LokyProcess-33': -11}
[DEBUG:MainProcess:ExecutorManagerThread] terminate process LokyProcess-37, reason: broken executor
------------------------------------------------------------------------------------------- Captured stderr teardown --------------------------------------------------------------------------------------------
[DEBUG:LokyProcess-34:MainThread] psutil return memory size: 47251456
============================================================================================ short test summary info ============================================================================================
FAILED tests/test_process_executor_loky.py::TestsProcessPoolLokyShutdown::test_processes_crash_handling_after_executor_gc - loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessiv...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================================== 1 failed, 5 passed, 36 deselected in 10.85s ==================================================================================
[INFO:MainProcess:MainThread] process shutting down
[DEBUG:MainProcess:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:MainProcess:MainThread] running the remaining "atexit" finalizers
ogrisel commented 1 year ago

Actually, I read the error message and indeed my laptop was under heavy memory pressure when I ran this test. Once I shutdown a VM concurrently running on this laptop to debug a Windows-specific problem, the activity monitor said I am no longer under memory pressure and indeed, I can no longer reproduce the failure.

So everything is fine.

ogrisel commented 1 year ago

Actually I can still reproduce this failure quite often even when the memory pressure on my system is low...

ogrisel commented 1 year ago

I noticed that the ordering of the tests run by pytest when running this command is not deterministic. Sometimes, when this test is run among the firsts, I do not get the failure. So there is definitely a side effect from one of the other tests in this module that causes the problem.

Also, I can reproduce with tags/3.3.0 so it's not related to one of the recently merged PRs.

ogrisel commented 1 year ago

Ok I narrowed it down to the dependency between 2 tests only:

pytest -v tests/test_process_executor_loky.py -x -k "TestsProcessPoolLokyShutdown and (test_processes_crash_handling_after_executor_gc or test_shutdown_and_kill_workers)"

when test_shutdown_and_kill_workers is run first, then test_processes_crash_handling_after_executor_gc crashes but not otherwise.

ogrisel commented 1 year ago

Interestingly enough, I cannot reproduce with Python 3.10 because the ordering of those tests becomes deterministic and always in the order that works...

ogrisel commented 1 year ago

It seems that the ExecutorManagerThread of the executor of the previous test is still running the kill_workers method when the subsequent process starts.

ogrisel commented 1 year ago

It seems that the ExecutorManagerThread of the executor of the previous test is still running the kill_workers method when the subsequent process starts.

Looking at the logs, and adding more logs, it seems that the executor manager thread of the main process in the previous test has completed before starting to run the subsequent test. Furthermore, that would not have explained why we get a sigsev (segfault) in the subsequent GC test.