joblib / loky

Robust and reusable Executor for joblib
http://loky.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
520 stars 47 forks source link

TST test_error_in_nested_call_keeps_resource_tracker_silent #370

Closed ogrisel closed 1 year ago

ogrisel commented 1 year ago

Adapted reproducer for #369 to be part of the loky test suite.

Unfortunately I could never trigger the race condition locally so far.

ogrisel commented 1 year ago

Actually I spoke too quickly, I can reproduce with:

pip install pytest-repeat
pytest --count=10000 -vlxk test_error_in_nested_call_keeps_resource_tracker_silent
[...]
_______________________________________________ ERROR at teardown of TestExecutorInitializer.test_error_nested_call_keeps_resource_tracker_silent[1981-10000] ________________________________________________

self = <tests.test_reusable_executor.TestExecutorInitializer object at 0x49d56afb0>
method = <bound method TestExecutorInitializer.test_error_nested_call_keeps_resource_tracker_silent of <tests.test_reusable_executor.TestExecutorInitializer object at 0x49d56afb0>>

    def teardown_method(self, method):
        """Make sure the executor can be recovered after the tests"""
        executor = get_reusable_executor(max_workers=2)
        assert executor.submit(math.sqrt, 1).result() == 1
        # There can be less than 2 workers because of the worker timeout
>       _check_subprocesses_number(executor, expected_max_process_number=2)

executor   = <loky.reusable_executor._ReusablePoolExecutor object at 0x4bca0d1b0>
method     = <bound method TestExecutorInitializer.test_error_nested_call_keeps_resource_tracker_silent of <tests.test_reusable_executor.TestExecutorInitializer object at 0x49d56afb0>>
self       = <tests.test_reusable_executor.TestExecutorInitializer object at 0x49d56afb0>

tests/_executor_mixin.py:250: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

executor = <loky.reusable_executor._ReusablePoolExecutor object at 0x4bca0d1b0>, expected_process_number = None, expected_max_process_number = 2, patience = 100

    def _check_subprocesses_number(executor=None, expected_process_number=None,
                                   expected_max_process_number=None, patience=100):
        if not psutil:
            # psutil is not installed, we cannot check the number of subprocesses
            return

        # Wait for terminating processes to disappear
        children_cmdlines = _running_children_with_cmdline(psutil.Process())
        pids_cmdlines = [(c.pid, cmdline) for c, cmdline in children_cmdlines]
        children_pids = {pid for pid, _ in pids_cmdlines}
        if executor is not None:
            worker_pids = set(executor._processes.keys())
        else:
            # Bypass pids checks when executor has been garbage
            # collected
            worker_pids = children_pids
        if expected_process_number is not None:
            try:
                assert len(children_pids) == expected_process_number, pids_cmdlines
                assert len(worker_pids) == expected_process_number, pids_cmdlines
                assert worker_pids == children_pids, pids_cmdlines
            except AssertionError:
                if expected_process_number != 0:
                    raise
                # there is a race condition with the /proc/<pid>/ system clean up
                # and our utilization of psutil. The Process is considered alive by
                # psutil even though it have been terminated. Wait for the system
                # clean up in this case.
                for _ in range(patience):
                    if not _running_children_with_cmdline(psutil.Process()):
                        break
                    time.sleep(.1)
                else:
                    raise

        if expected_max_process_number is not None:
            assert len(children_pids) <= expected_max_process_number, pids_cmdlines
>           assert len(worker_pids) <= expected_max_process_number, pids_cmdlines
E           AssertionError: [(78570, '/Users/ogrisel/mambaforge/envs/dev/bin/python3.10 -m loky.backend.popen_loky_posix --process-name LokyProcess-1 --pipe 28')]

children_cmdlines = [(psutil.Process(pid=78570, name='python3.10', status='running', started='10:23:11'), '/Users/ogrisel/mambaforge/envs/dev/bin/python3.10 -m loky.backend.popen_loky_posix --process-name LokyProcess-1 --pipe 28')]
children_pids = {78570}
executor   = <loky.reusable_executor._ReusablePoolExecutor object at 0x4bca0d1b0>
expected_max_process_number = 2
expected_process_number = None
patience   = 100
pids_cmdlines = [(78570, '/Users/ogrisel/mambaforge/envs/dev/bin/python3.10 -m loky.backend.popen_loky_posix --process-name LokyProcess-1 --pipe 28')]
worker_pids = {78570, 80561, 80562}

tests/_executor_mixin.py:106: AssertionError
tests/_executor_mixin.py:106: AssertionError
------------------------------------------------------------------------------------------- Captured stderr setup --------------------------------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] Reusing existing executor with max_workers=2.
-------------------------------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------------------------------
[INFO:LokyProcess-4:MainThread] Shutting down worker after timeout 10.000s
[DEBUG:LokyProcess-4:MainThread] Exited cleanly
[INFO:LokyProcess-4:MainThread] process shutting down
[DEBUG:LokyProcess-4:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-4:MainThread] running the remaining "atexit" finalizers
[INFO:LokyProcess-4:MainThread] process exiting with exitcode 0
------------------------------------------------------------------------------------------ Captured stderr teardown ------------------------------------------------------------------------------------------
[DEBUG:MainProcess:MainThread] Reusing existing executor with max_workers=2.
[DEBUG:MainProcess:MainThread] created semlock with handle 29 and name "None"
[DEBUG:MainProcess:ExecutorManagerThread] created semlock with handle 27 and name "None"
[DEBUG:MainProcess:ExecutorManagerThread] launched python with pid 80561 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.10', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-5', '--pipe', '31']
[DEBUG:MainProcess:ExecutorManagerThread] Adjust process count : {78570: <LokyProcess name='LokyProcess-1' pid=78570 parent=74703 started>, 80561: <LokyProcess name='LokyProcess-5' pid=80561 parent=74703 started>}
[DEBUG:MainProcess:MainThread] launched python with pid 80562 and cmd:
['/Users/ogrisel/mambaforge/envs/dev/bin/python3.10', '-m', 'loky.backend.popen_loky_posix', '--process-name', 'LokyProcess-6', '--pipe', '32']
[DEBUG:MainProcess:MainThread] Adjust process count : {78570: <LokyProcess name='LokyProcess-1' pid=78570 parent=74703 started>, 80561: <LokyProcess name='LokyProcess-5' pid=80561 parent=74703 started>, 80562: <LokyProcess name='LokyProcess-6' pid=80562 parent=74703 started>}
============================================================================================== warnings summary ==============================================================================================
tests/test_reusable_executor.py::TestExecutorInitializer::test_error_nested_call_keeps_resource_tracker_silent[1981-10000]
  /Users/ogrisel/code/loky/loky/process_executor.py:697: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================== short test summary info ===========================================================================================
ERROR tests/test_reusable_executor.py::TestExecutorInitializer::test_error_nested_call_keeps_resource_tracker_silent[1981-10000] - AssertionError: [(78570, '/Users/ogrisel/mambaforge/envs/dev/bin/python3...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================== 1981 passed, 3070000 deselected, 1 warning, 1 error in 208.96s (0:03:28) ==================================================================
[INFO:LokyProcess-1:MainThread] Shutting down worker on sentinel
[DEBUG:LokyProcess-1:MainThread] Exited cleanly
[INFO:LokyProcess-1:MainThread] process shutting down
[DEBUG:LokyProcess-1:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-1:MainThread] running the remaining "atexit" finalizers
[INFO:LokyProcess-1:MainThread] process exiting with exitcode 0
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 17 and name "/loky-74703-amd4qe41"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 17 and name "/loky-74703-amd4qe41"
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 18 and name "/loky-74703-0v_leqrn"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 18 and name "/loky-74703-0v_leqrn"
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 19 and name "/loky-74703-ppxc0x1k"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 19 and name "/loky-74703-ppxc0x1k"
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 22 and name "/loky-74703-o75tgnqy"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 22 and name "/loky-74703-o75tgnqy"
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 23 and name "/loky-74703-hqg2l_0m"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 23 and name "/loky-74703-hqg2l_0m"
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 11 and name "/loky-74703-wwh3lnfy"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 11 and name "/loky-74703-wwh3lnfy"
[DEBUG:LokyProcess-6:MainThread] recreated blocker with handle 29 and name "/loky-74703-r4gy5e2_"
[DEBUG:LokyProcess-5:MainThread] recreated blocker with handle 27 and name "/loky-74703-rax6nky6"
[INFO:LokyProcess-6:MainThread] child process calling self.run()
[INFO:LokyProcess-5:MainThread] child process calling self.run()
[DEBUG:LokyProcess-6:MainThread] Worker started with timeout=10
[DEBUG:LokyProcess-5:MainThread] Worker started with timeout=10
[INFO:LokyProcess-6:MainThread] Shutting down worker on sentinel
[DEBUG:LokyProcess-6:MainThread] Exited cleanly
[INFO:LokyProcess-6:MainThread] process shutting down
[DEBUG:LokyProcess-6:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-6:MainThread] running the remaining "atexit" finalizers
[INFO:LokyProcess-6:MainThread] process exiting with exitcode 0
[INFO:LokyProcess-5:MainThread] Shutting down worker on sentinel
[DEBUG:LokyProcess-5:MainThread] Exited cleanly
[INFO:LokyProcess-5:MainThread] process shutting down
[DEBUG:LokyProcess-5:MainThread] running all "atexit" finalizers with priority >= 0
[DEBUG:LokyProcess-5:MainThread] running the remaining "atexit" finalizers
[INFO:LokyProcess-5:MainThread] process exiting with exitcode 0
[INFO:MainProcess:MainThread] process shutting down

will mark the test as XFAIL for now.

ogrisel commented 1 year ago

Actually this might not be the same problem...