pytest-dev / pytest

The pytest framework makes it easy to write small tests, yet scales to support complex functional testing
https://pytest.org
MIT License

Test using multiprocessing only hangs when run with test suite, works alone #11174

Closed rabyj closed 8 months ago

rabyj commented 1 year ago

TL;DR: If you're experiencing a hanging issue during tests with a program using multiprocessing, try adding the following code at the beginning of your problematic test function:

def a_test():
    import multiprocessing as mp
    mp.set_start_method("spawn", force=True)
    # [... rest of test ...]
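For reference, the same workaround can also be applied suite-wide from a `conftest.py` instead of per test. This is only a sketch under the assumption that every test in the suite tolerates the slower "spawn" workers; it is not part of the original report:

```python
# conftest.py (sketch): force "spawn" once, before any test starts worker
# processes, so pool children never inherit locks or open file handles
# from the pytest process.
import multiprocessing as mp

mp.set_start_method("spawn", force=True)
```

Note that `set_start_method` only affects pools and processes started afterwards, so it must run before the first `Pool` is created anywhere in the suite.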

Description

I've encountered an odd test hanging issue with a program that uses multiprocessing. The program completes:

  • When run individually from the command line
  • As part of a test function invoked alone via pytest

The program hangs indefinitely when run as part of a full suite of tests.

I wanted to at least document what I experienced, even though I can't create a minimal reproducible example: I found a workaround that could help others, and no other issue mentions anything like this. Existing issues all concern single tests.

Environment

python: 3.8.10
pytest: 7.2.0
pytest-mock: 3.10.0
OS: Ubuntu 20.04.6 LTS

Details

The problem occurs with a test that runs a program to completion, which internally uses multiprocessing (via pool.starmap on a generator). When the test is run individually, it completes without problems within roughly 5 seconds. However, when the test is included in a full test suite and run with pytest, it hangs indefinitely (for over 6 minutes at least).

The issue persists even when using concurrent.futures in place of multiprocessing (in that case without the partial seen below).

My workaround is using mp.set_start_method("spawn", force=True) at the start of the test.

Fix in immediate context:

    func = partial(run_prediction, estimator=estimator, name=name, logdir=logdir)
    items = enumerate(ea_handler.yield_split())

    mp.set_start_method("spawn", force=True)  # --> the fix; also works when placed at the very start of the test
    with mp.Pool(nb_workers) as pool:
        pool.starmap(func, items)

While the fix makes the test run slower (about 20 seconds), at least it doesn't hang indefinitely.
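A narrower variant of the same fix is `multiprocessing.get_context`, which confines "spawn" to a single pool instead of changing the process-wide default. This is a sketch only: the worker below is a stand-in with a hypothetical signature, since the real `run_prediction` isn't shown in full.

```python
from functools import partial
import multiprocessing as mp


def run_prediction(i, split, estimator=None, name=None, logdir=None):
    """Stand-in for the real worker (hypothetical signature)."""
    return i


def run_predictions(items, nb_workers=2):
    # get_context scopes "spawn" to this pool; the global start method
    # (and any other pools in the test suite) are left untouched.
    ctx = mp.get_context("spawn")
    func = partial(run_prediction, estimator=None, name=None, logdir=None)
    with ctx.Pool(nb_workers) as pool:
        # starmap unpacks each (index, split) tuple from enumerate into
        # the worker's first two positional arguments, as in the snippet above.
        return pool.starmap(func, enumerate(items))
```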

I'm still uncertain as to why this issue occurs and why the fix works, so I would greatly appreciate any insights!

Complete pip Environment

pip list output

```text Package Version Location ----------------------------- ---------------- ------------------------------------------------------------------- absl-py 1.4.0 aiohttp 3.8.4 aiosignal 1.3.1 alabaster 0.7.13 alembic 1.11.1 anyio 3.7.0 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 astroid 2.15.5 asttokens 2.2.1 async-timeout 4.0.2 attrs 23.1.0 autopage 0.5.1 Babel 2.12.1 backcall 0.2.0 beautifulsoup4 4.12.2 black 23.1.0 bleach 6.0.0 cachetools 5.3.1 certifi 2023.5.7 cffi 1.15.1 cfgv 3.3.1 charset-normalizer 3.1.0 click 8.1.3 cliff 4.3.0 cloudpickle 2.2.1 cmaes 0.9.1 cmd2 2.4.3 colorama 0.4.6 colorlog 6.7.0 comet-ml 3.21.0 comm 0.1.3 commonmark 0.9.1 configobj 5.0.8 cycler 0.11.0 debugpy 1.6.7 decorator 5.1.1 defusedxml 0.7.1 dill 0.3.6 distlib 0.3.6 doc2dash 2.4.1 docutils 0.20.1 dulwich 0.21.5 EpiLaP 0.1 $HOME/Projects/sources/epi_ml/src/python everett 3.2.0 exceptiongroup 1.1.1 executing 1.2.0 fastjsonschema 2.17.1 filelock 3.12.0 fonttools 4.39.4 frozenlist 1.3.3 fsspec 2023.5.0 gitdb 4.0.10 GitPython 3.1.31 google-auth 2.19.0 google-auth-oauthlib 1.0.0 greenlet 2.0.2 grpcio 1.54.2 h5py 3.6.0 identify 2.5.24 idna 3.4 imagesize 1.4.1 imbalanced-learn 0.8.0 importlib-metadata 4.13.0 importlib-resources 5.12.0 iniconfig 2.0.0 ipykernel 6.23.1 ipython 8.12.2 ipython-genutils 0.2.0 ipywidgets 8.0.6 isort 5.12.0 jedi 0.18.2 Jinja2 3.1.2 joblib 1.2.0 jsonschema 4.17.3 jupyter 1.0.0 jupyter-client 8.2.0 jupyter-console 6.6.3 jupyter-core 5.3.0 jupyter-events 0.6.3 jupyter-server 2.6.0 jupyter-server-mathjax 0.2.6 jupyter-server-terminals 0.4.4 jupyterlab-pygments 0.2.2 jupyterlab-widgets 3.0.7 kaleido 0.2.1 kiwisolver 1.4.4 lazy-object-proxy 1.9.0 lightgbm 3.3.5 llvmlite 0.40.0 Mako 1.2.4 Markdown 3.4.3 MarkupSafe 2.1.2 matplotlib 3.5.3 matplotlib-inline 0.1.6 mccabe 0.7.0 mistune 2.0.5 multidict 6.0.4 mypy-extensions 1.0.0 nbclassic 0.5.3 nbclient 0.7.2 nbconvert 7.2.9 nbdime 3.1.0 nbformat 5.7.0 nest-asyncio 1.5.6 nodeenv 1.8.0 notebook 6.5.4 notebook-shim 0.2.3 
numba 0.57.0 numpy 1.22.4 nvidia-ml-py3 7.352.0 oauthlib 3.2.2 optuna 3.0.6 overrides 7.3.1 packaging 23.1 pandas 1.4.4 pandocfilters 1.5.0 parso 0.8.3 pathspec 0.11.1 pbr 5.11.1 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.5.0 pip 20.0.2 pkg-resources 0.0.0 pkgutil-resolve-name 1.3.10 platformdirs 3.5.1 plotly 5.11.0 pluggy 1.0.0 pre-commit 3.0.4 prettytable 3.7.0 prometheus-client 0.17.0 prompt-toolkit 3.0.38 protobuf 3.20.1 psutil 5.9.5 ptyprocess 0.7.0 pure-eval 0.2.2 pyaml 23.5.9 pyasn1 0.5.0 pyasn1-modules 0.3.0 pycparser 2.21 pyDeprecate 0.3.2 Pygments 2.15.1 pylint 2.16.4 pyparsing 3.0.9 pyperclip 1.8.2 pyrsistent 0.19.3 pytest 7.2.0 pytest-mock 3.10.0 pytest-timeout 2.1.0 python-dateutil 2.8.2 python-json-logger 2.0.7 pytorch-lightning 1.6.5 pytz 2023.3 PyYAML 6.0 pyzmq 25.1.0 qtconsole 5.4.3 QtPy 2.3.1 requests 2.31.0 requests-oauthlib 1.3.1 requests-toolbelt 1.0.0 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 11.2.0 rsa 4.9 scikit-learn 1.1.2 scikit-optimize 0.9.0+unofficial scipy 1.8.1 semantic-version 2.10.0 Send2Trash 1.8.2 setuptools 44.0.0 shap 0.41.0 six 1.16.0 slicer 0.0.7 smmap 5.0.0 sniffio 1.3.0 snowballstemmer 2.2.0 soupsieve 2.4.1 sphinx 7.0.1 sphinx-gallery 0.11.1 sphinxcontrib-applehelp 1.0.4 sphinxcontrib-devhelp 1.0.2 sphinxcontrib-htmlhelp 2.0.1 sphinxcontrib-jsmath 1.0.1 sphinxcontrib-qthelp 1.0.3 sphinxcontrib-serializinghtml 1.1.5 SQLAlchemy 2.0.15 stack-data 0.6.2 stevedore 5.1.0 tabulate 0.8.10 tenacity 8.2.2 tensorboard 2.13.0 tensorboard-data-server 0.7.0 terminado 0.17.1 threadpoolctl 3.1.0 tinycss2 1.2.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.11.8 torch 1.10.2 torchinfo 1.6.3 torchmetrics 0.7.2 tornado 6.3.2 tqdm 4.65.0 traitlets 5.9.0 typing-extensions 4.4.0 urllib3 2.0.2 virtualenv 20.23.0 wcwidth 0.2.6 webencodings 0.5.1 websocket-client 1.5.2 Werkzeug 2.3.4 wheel 0.40.0 widgetsnbextension 4.0.7 wrapt 1.15.0 wurlitzer 3.0.3 yarl 1.9.2 zipp 3.15.0 ```

RonnyPfannschmidt commented 1 year ago

Do the hangups also happen with pytest -s?

rabyj commented 1 year ago

Yup. The traceback is not very helpful, since it's just the timeout one, but I'll still share it for reference.

Traceback: Stack of Threads

```text
+++++++++++++++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++++++++++++++

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of Thread-6 (139742417913600) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 576, in _handle_results
    task = get()
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of Thread-5 (139742283679488) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 528, in _handle_tasks
    for taskseq, set_length in iter(taskqueue.get, None):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of Thread-4 (139739615721216) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 519, in _handle_workers
    cls._wait_for_updates(current_sentinels, change_notifier)
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 499, in _wait_for_updates
    wait(sentinels, timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

+++++++++++++++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++++++++++++++
```

FAILURES

```text
============================================== FAILURES ===============================================
_______________________________________ test_binary_classifier ________________________________________

test_dir = PosixPath('/tmp/pytest-of-rabyj/pytest-98/other_estimators1')

    @pytest.mark.timeout(120)
    def test_binary_classifier(test_dir: Path):
        """Test if binary models are properly supported."""
        # Setting up environment variables
        # import multiprocessing as mp  # pylint: disable=import-outside-toplevel
        # mp.set_start_method("spawn", force=True)
        os.environ["MIN_CLASS_SIZE"] = "1"
        os.environ["LABEL_LIST"] = '["female", "male"]'
        epi_ml.core.estimators.NFOLD_PREDICT = 2

        # Loading paths
        datasource = EpiAtlasTreatmentTestData.default_test_data(
            label_category="sex"
        ).epiatlas_dataset.datasource
        current_dir = Path(__file__).parent.resolve()
        hparams_file = current_dir.parent / "fixtures" / "other_estimators_hparams.json"

        sys.argv = [
            "other_estimators.py",
            "--predict", "sex",
            f"{datasource.hdf5_file}",
            f"{datasource.chromsize_file}",
            f"{datasource.metadata_file}",
            str(test_dir),
            "--hyperparams", f"{hparams_file}",
        ]
>       main_module()

mains/other_estimators_test.py:80:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../epi_ml/other_estimators.py:226: in main
    estimators.run_predictions(ea_handler, estimator, model_name, cli.logdir)
../epi_ml/core/estimators.py:400: in run_predictions
    pool.starmap(func, items)
/usr/lib/python3.8/multiprocessing/pool.py:372: in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
/usr/lib/python3.8/multiprocessing/pool.py:765: in get
    self.wait(timeout)
/usr/lib/python3.8/multiprocessing/pool.py:762: in wait
    self._event.wait(timeout)
/usr/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = , 0)>, timeout = None

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in
        seconds (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal
        interface of the RLock class is used, which really unlocks it even
        when it has been recursively acquired several times. Another internal
        interface is then used to restore the recursion level when the lock is
        reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
>               waiter.acquire()
E               Failed: Timeout >120.0s

/usr/lib/python3.8/threading.py:302: Failed
```

nicoddemus commented 1 year ago

I'm still uncertain as to why this issue occurs and why the fix works, so I would greatly appreciate any insights!

Not sure either, but perhaps using a different timeout method might yield different results?

rabyj commented 1 year ago

perhaps using a different timeout method might yield different results?

The timeout was there to stop that specific test from hanging forever, so the test suite could finish. I actually installed a timeout plugin (pytest-timeout 2.1.0) BECAUSE the test was hanging forever and preventing my test suite from finishing.

With this in mind, I honestly don't understand how the timeout method would be related to the bug :/

nicoddemus commented 1 year ago

With this in mind, I honestly don't understand how the timeout method would be related to the bug :/

Out of insights then, sorry.

lorinkoz commented 11 months ago

The same is happening for me with multiprocessing.pool.ThreadPool, but the fix is not working in that case. I assume that's because "spawn" has no effect in that context. Any suggestions? Thanks!

rabyj commented 11 months ago

@lorinkoz Recently, I thought the problem had come back for me, but by using pytest -s I noticed something else entirely was happening. So do try debugging with that.

Otherwise, make sure the following code runs before anything uses multiprocessing; in particular, make sure there's no way the multiprocessing module gets imported and used before your individual test sets the start method (e.g. indirectly through another import):

import multiprocessing as mp
mp.set_start_method("spawn", force=True)

Those are the only two things that come to mind, hope it helps.
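If `pytest -s` alone doesn't reveal where things stop, the stdlib `faulthandler` module can dump every thread's stack once a test overruns. A generic sketch (the 60-second threshold is an arbitrary choice, not from this thread); pytest-timeout's thread-based method produces similar dumps to the one shared earlier:

```python
import faulthandler
import sys

# Arm a watchdog: if the process is still running after 60 s, dump every
# thread's stack to stderr; repeat=True re-dumps every further 60 s.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... run the suspect test / code here ...

faulthandler.cancel_dump_traceback_later()  # disarm once it finishes cleanly
```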

peterdudfield commented 11 months ago

Thank you, this helped me.

jfecunha commented 9 months ago

TL;DR: If you're experiencing a hanging issue during tests with a program using multiprocessing, try adding the following code at the beginning of your problematic test function:

def a_test():
    import multiprocessing as mp
    mp.set_start_method("spawn", force=True)
    # [... rest of test ...]

@rabyj thanks for sharing this. I had a similar problem using the pytest-console-scripts library to test some scripts. In my case, it was happening on the GitlabCI, the test was stuck until the job reached the timeout. Using the multiprocessing spawn allowed the test to work without issues.

marvin-jens commented 8 months ago

EDIT: I found that if I make sure to explicitly close all files that are opened in a sub-process, I no longer need to set mp.set_start_method("spawn", force=True).

I ran into this problem as well. However, mp.set_start_method("spawn", force=True) causes pickle errors in my code ("object is not the same...") that do not happen with the defaults. Are there alternative ways to get around the hanging sub-process issue that avoid using spawn?
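To make the close-your-resources approach from the EDIT concrete, here is a hedged sketch that keeps the default start method and instead guarantees cleanup inside each sub-process. The worker and its file-reading task are hypothetical, purely illustrative:

```python
import multiprocessing as mp


def worker(path):
    # Hypothetical worker: a context manager guarantees the file is closed
    # in the child process, so no descriptor is left dangling when the
    # pool winds down.
    with open(path, "rb") as fh:
        return len(fh.read())


def run(paths, nb_workers=2):
    # The default start method ("fork" on Linux) is kept; the hang is
    # avoided by deterministic cleanup in the child, not by "spawn".
    with mp.Pool(nb_workers) as pool:
        return pool.map(worker, paths)
```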

nicoddemus commented 8 months ago

Closing this for now then.

For the record: pytest just runs each test function in sequence; if a test function is blocking somehow, the code being tested probably needs some adjustment. My point is that pytest does not have a "known issue" with multiprocessing at all; it depends entirely on the user code running inside the tests.

If this issue comes up again, it is probably best to work on a MWE so a proper solution can be discussed.
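For readers wondering how user code alone can produce such a hang: under the default "fork" start method on Linux, a child inherits copies of any locks held by threads in the parent at fork time, but not the threads that would release them. A deliberately simplified illustration of that mechanism (names are hypothetical, and `demo()` is intentionally never called here because it deadlocks by design):

```python
import multiprocessing as mp
import threading
import time

lock = threading.Lock()


def background():
    # A helper thread (e.g. a logging thread started by an earlier test)
    # grabs the lock and holds it for a while.
    with lock:
        time.sleep(60)


def child():
    # Under "fork", the child inherits a *copy* of `lock` in its locked
    # state, but not the thread that would release it, so it blocks forever.
    with lock:
        pass


def demo():  # do not run casually: this genuinely deadlocks under "fork"
    threading.Thread(target=background, daemon=True).start()
    time.sleep(0.1)  # let the thread acquire the lock before forking
    p = mp.get_context("fork").Process(target=child)
    p.start()
    p.join()
```

"spawn" sidesteps this by starting the child from a fresh interpreter that inherits no locked state, which is one plausible reading of why the workaround in this thread helps.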

pierrebarbera commented 2 months ago

For future visitors: I ran into this issue and the spawn-vs-fork change did not resolve it, but updating freezegun did!

rabyj commented 2 months ago

@pierrebarbera freezegun was not part of my modules; that seems like a package-specific problem.

Then again, maybe MY problem was package-specific too, but I never found the culprit.

reactive-firewall commented 3 weeks ago

I'm still uncertain as to why this issue occurs and why the fix works, so I would greatly appreciate any insights!

Then again, maybe MY problem was package specific but I never found out the culprit

TL;DR

I ran into a similar issue, but to add complication, I had both a repeatable hanging example and a repeatable non-issue example.

Insights for Future visitors

I'd need to explain a bunch of off-topic stuff about my case here, so instead I'll keep this to some insights I found while debugging that are relevant:

  • Insights on the freezegun update fix
  • Insights on using spawn instead of fork as the fix
  • Insights from my own usage of multiprocessing and pytest


:thinking: Hopefully this provides some of the insights you were looking for.

[!WARNING] This comment is only meant to provide some insights for debugging similar issues, rather than leave readers in the dark. It is not intended as an ultimate root cause; YMMV. :shrug: