joblib / loky

Robust and reusable Executor for joblib
http://loky.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
528 stars 45 forks source link

Nested loky calls can cause resource_tracker to detect leaked semaphores #369

Closed ogrisel closed 2 years ago

ogrisel commented 2 years ago

From the joblib test suite (rare failures, happening on linux or macos):

=================================== FAILURES ===================================
_________ test_nested_loop_error_in_grandchild_resource_tracker_silent _________

    @with_numpy
    @with_multiprocessing
    def test_nested_loop_error_in_grandchild_resource_tracker_silent():
        # Safety smoke test: test that nested parallel calls using the loky backend
        # don't yield noisy resource_tracker outputs when the grandchild errors
        # out.
        cmd = '''if 1:
            from joblib import Parallel, delayed

            def raise_error(i):
                raise ValueError

            def nested_loop(f):
                Parallel(backend="loky", n_jobs=2)(
                    delayed(f)(i) for i in range(10)
                )

            if __name__ == "__main__":
                Parallel(backend="loky", n_jobs=2)(
                    delayed(nested_loop)(func) for func in [raise_error]
                )
        '''
        p = subprocess.Popen([sys.executable, '-c', cmd],
                             stderr=subprocess.PIPE, stdout=subprocess.PIPE)
        p.wait()
        out, err = p.communicate()
        assert p.returncode == 1, out.decode()
>       assert b"resource_tracker" not in err, err.decode()
E       AssertionError: joblib.externals.loky.process_executor._RemoteTraceback: 
E         """
E         joblib.externals.loky.process_executor._RemoteTraceback: 
E         """
E         Traceback (most recent call last):
E           File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 428, in _process_worker
E             r = call_item()
E           File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 275, in __call__
E             return self.fn(*self.args, **self.kwargs)
E           File "/home/vsts/work/1/s/joblib/_parallel_backends.py", line 620, in __call__
E             return self.func(*args, **kwargs)
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 289, in __call__
E             for func, args, kwargs in self.items]
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 289, in <listcomp>
E             for func, args, kwargs in self.items]
E           File "<string>", line 6, in raise_error
E         ValueError
E         """
E         
E         The above exception was the direct cause of the following exception:
E         
E         Traceback (most recent call last):
E           File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 428, in _process_worker
E             r = call_item()
E           File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 275, in __call__
E             return self.fn(*self.args, **self.kwargs)
E           File "/home/vsts/work/1/s/joblib/_parallel_backends.py", line 620, in __call__
E             return self.func(*args, **kwargs)
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 289, in __call__
E             for func, args, kwargs in self.items]
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 289, in <listcomp>
E             for func, args, kwargs in self.items]
E           File "<string>", line 11, in nested_loop
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 1098, in __call__
E             self.retrieve()
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 975, in retrieve
E             self._output.extend(job.get(timeout=self.timeout))
E           File "/home/vsts/work/1/s/joblib/_parallel_backends.py", line 567, in wrap_future_result
E             return future.result(timeout=timeout)
E           File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 435, in result
E             return self.__get_result()
E           File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
E             raise self._exception
E         ValueError
E         """
E         
E         The above exception was the direct cause of the following exception:
E         
E         Traceback (most recent call last):
E           File "<string>", line 17, in <module>
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 1098, in __call__
E             self.retrieve()
E           File "/home/vsts/work/1/s/joblib/parallel.py", line 975, in retrieve
E             self._output.extend(job.get(timeout=self.timeout))
E           File "/home/vsts/work/1/s/joblib/_parallel_backends.py", line 567, in wrap_future_result
E             return future.result(timeout=timeout)
E           File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 435, in result
E             return self.__get_result()
E           File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
E             raise self._exception
E         ValueError
E         /home/vsts/work/1/s/joblib/externals/loky/backend/resource_tracker.py:311: UserWarning: resource_tracker: There appear to be 3 leaked semlock objects to clean up at shutdown
E           'resource_tracker: There appear to be '
E         
E       assert b'resource_tracker' not in b'joblib.externals.loky.process_executor._RemoteTraceback: \n"""\njoblib.externals.loky.process_executor._RemoteTraceback: \n"""\nTraceback (most recent call last):\n  File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 428, in _process_worker\n    r = call_item()\n  File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 275, in __call__\n    return self.fn(*self.args, **self.kwargs)\n  File "/home/vsts/work/1/s/joblib/_parallel_backends.py", line 620, in __call__\n    return self.func(*args, **kwargs)\n  File "/home/vsts/work/1/s/joblib/parallel.py", line 289, in __call__\n    for func, args, kwargs in self.items]\n  File "/home/vsts/work/1/s/joblib/parallel.py", line 289, in <listcomp>\n    for func, args, kwargs in self.items]\n  File "<string>", line 6, in raise_error\nValueError\n"""\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 428, in _process_worker\n    r = call_item()\n  File "/home/vsts/work/1/s/joblib/externals/loky/process_executor.py", line 275, in __call__\n    return self.fn(*self.a... self.__get_result()\n  File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result\n    raise self._exception\nValueError\n"""\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "<string>", line 17, in <module>\n  File "/home/vsts/work/1/s/joblib/parallel.py", line 1098, in __call__\n    self.retrieve()\n  File "/home/vsts/work/1/s/joblib/parallel.py", line 975, in retrieve\n    self._output.extend(job.get(timeout=self.timeout))\n  File "/home/vsts/work/1/s/joblib/_parallel_backends.py", line 567, in wrap_future_result\n    return future.result(timeout=timeout)\n  File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 435, in result\n    return self.__get_result()\n  File "/usr/share/miniconda/envs/testenv/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result\n    raise self._exception\nValueError\n/home/vsts/work/1/s/joblib/externals/loky/backend/resource_tracker.py:311: UserWarning: resource_tracker: There appear to be 3 leaked semlock objects to clean up at shutdown\n  \'resource_tracker: There appear to be \'\n'

cmd        = 'if 1:\n        from joblib import Parallel, delayed\n\n\n        def raise_error(i):\n            raise ValueError\n\...(backend="loky", n_jobs=2)(\n                delayed(nested_loop)(func) for func in [raise_error]\n            )\n    '
err        = b'joblib.externals.loky.process_executor._RemoteTraceback: \n"""\njoblib.externals.loky.process_executor._RemoteTraceb...er: There appear to be 3 leaked semlock objects to clean up at shutdown\n  \'resource_tracker: There appear to be \'\n'
out        = b''
p          = <subprocess.Popen object at 0x7fb3daa09b90>

joblib/test/test_memmapping.py:619: AssertionError
ogrisel commented 2 years ago

I merged #370 that introduces a new test named test_error_in_nested_call_keeps_resource_tracker_silent that should replicate the problem observed on joblib. Unfortunately I could never reproduce it locally. Let's see of the loky CI can make fail. In the mean time let's close this issue while keeping in mind that we might want to reop open it if test_error_in_nested_call_keeps_resource_tracker_silent is observed to fail from time to time.

aroyphillips commented 9 months ago

I believe I am experiencing the same issue when killing a parallel job:

[username@linux_server]$ /home/username/.conda/envs/my_env/lib/python3.11/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 12 leaked semlock objects to clean up at shutdown
  warnings.warn(
/home/username/.conda/envs/my_env/lib/python3.11/site-packages/joblib/externals/loky/backend/resource_tracker.py:310: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn(
tomMoral commented 9 months ago

When you kill a parallel job, this makes sense that you might see warnings as some jobs might be killed without being given the chance to clean up the semaphore. As the kill signal can be matched by any thread/sub process, I don't think we can ensure this is always clean without warning so I would say you can safely ignore the warning, it is actually doing its job.

If this is something more complex, please open a dedicated issue, with a reproducer if possible.