pytest-dev / pytest-xdist

pytest plugin for distributed testing and loop-on-failures testing modes.
https://pytest-xdist.readthedocs.io
MIT License

Issue with pytest-xdist Handling Out of Memory Errors (IndexError) #1155

Open loveleenamar9 opened 1 week ago

loveleenamar9 commented 1 week ago

Hi, I am using pytest-xdist to run a test suite that includes subgraph tests. Sporadically, loading a large model causes a worker process to be terminated for running out of memory (OOM), and the run then fails with an IndexError. pytest-xdist handles other crashes gracefully, but it appears to struggle with crashes caused by OOM. The worker crash itself is expected, but in this case the crashed worker is not replaced properly, which leads to the IndexError.

Below is an example of the error log:

2024-10-27T21:28:18Z  tensorflow    [gw13] [ 70%] FAILED layerwise/Mistral7b/test_model_layers_0.py::test_model_layers_0 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    replacing crashed worker gw13
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> def worker_internal_error(
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         self, node: WorkerController, formatted_error: str
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     ) -> None:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         pytest_internalerror() was called on the worker.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         pytest_internalerror() arguments are an excinfo and an excrepr, which can't
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         be serialized, so we go with a poor man's solution of raising an exception
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         here ourselves using the formatted message.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         self._active_nodes.remove(node)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>         try:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> >           assert False, formatted_error
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E           AssertionError: Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 271, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 session.exitstatus = doit(config, session) or 0
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 325, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 config.hook.pytest_runtestloop(session=session)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 182, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 return outcome.get_result()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_result.py", line 100, in get_result
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 raise exc.with_traceback(exc.__traceback__)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 res = hook_impl.function(*args)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 174, in pytest_runtestloop
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 self.run_one_test()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 185, in run_one_test
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E                 item = items[self.item_index]
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E             IndexError: list index out of range
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> E           assert False
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> /root/.local/lib/python3.10/site-packages/xdist/dsession.py:232: AssertionError
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 273, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 327, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 139, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 122, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/logging.py", line 796, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     res = hook_impl.function(*args)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     self.loop_once()
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 152, in loop_once
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR>     raise RuntimeError("Unexpectedly no active workers available")
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow    INTERNALERROR> RuntimeError: Unexpectedly no active workers available

The issue can be reproduced by creating a dummy test that allocates a large amount of memory:

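```python
def test_oom():
    # Each list holds 2**28 references (about 2 GiB on 64-bit CPython),
    # so 175 iterations request roughly 350 GiB in total.
    large_memory_allocation = []
    for _ in range(175):
        large_memory_allocation.append([0] * (1024**3 // 4))
```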
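A lighter-weight variant of this reproduction might avoid exhausting the machine's memory altogether. The sketch below assumes a POSIX system where the OOM killer ends the process with SIGKILL, and emulates that by having the test kill its own worker. The `SIMULATE_OOM_KILL` environment variable and the function names are hypothetical, introduced only for illustration. Whether this triggers the same IndexError as a real OOM kill has not been verified.

```python
import os
import signal


def kill_current_process() -> None:
    """End this process with SIGKILL, as the Linux OOM killer would."""
    os.kill(os.getpid(), signal.SIGKILL)


def test_simulated_oom_kill():
    # Opt-in guard (hypothetical variable name) so the test only kills
    # its worker when explicitly requested.
    if os.environ.get("SIMULATE_OOM_KILL") != "1":
        return
    # No memory is allocated: the worker simply dies without cleanup.
    kill_current_process()
```

Running the suite with `SIMULATE_OOM_KILL=1` set and several workers (for example `-n 4`) would then hard-kill whichever worker picks up this test.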

I suspect that the worker and the master process are not synchronizing correctly, which leaves their communication incomplete.

Note: This issue is observed only with a large test suite.

Could you please help identify what is causing this IndexError and how to resolve it, so that pytest-xdist can handle OOM errors gracefully?

Thanks! Loveleen.

RonnyPfannschmidt commented 1 week ago

This does indeed look like a missed case in worker restart.

It's possibly related to the OOM hard kill preventing messages from being sent.

Most normal worker restarts get some kind of message.
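To illustrate why a hard kill can differ from an ordinary crash, here is a minimal sketch that does not use pytest-xdist at all. It assumes a POSIX system where an OOM kill arrives as SIGKILL. The child script and its "final message" are hypothetical stand-ins for a worker and whatever it would normally report before exiting.

```python
import signal
import subprocess
import sys
import textwrap

# Child script: crashes either with an exception or with SIGKILL.
CHILD = textwrap.dedent("""
    import os, signal, sys
    mode = sys.argv[1]
    try:
        if mode == "exception":
            raise RuntimeError("ordinary crash")
        os.kill(os.getpid(), signal.SIGKILL)
    finally:
        print("worker: sending final message", flush=True)
""")


def run_child(mode: str) -> tuple[int, str]:
    """Run the child in the given mode; return (exit code, captured stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print(run_child("exception"))  # cleanup code runs, message is delivered
    print(run_child("sigkill"))    # process dies at once, no message
```

With an ordinary exception the `finally` block still runs and the parent receives the message. With SIGKILL the child gets no chance to run any cleanup, so the parent sees only a negative exit code and empty output, which is consistent with the idea that the controller may be missing a message it normally relies on.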