Hi,
I am using pytest-xdist to run a test suite that includes subgraph tests. Sporadically, a worker process is terminated due to an Out of Memory (OOM) condition while loading a large model, and the run then fails with an IndexError. pytest-xdist handles other worker crashes gracefully, but it appears to struggle with crashes caused by OOM. The worker crash itself is expected; the problem is that the crashed worker does not seem to be replaced properly in this case, which leads to the IndexError.
Below is an example of the error log:
2024-10-27T21:28:18Z tensorflow [gw13] [ 70%] FAILED layerwise/Mistral7b/test_model_layers_0.py::test_model_layers_0
2024-10-27T21:28:18Z tensorflow replacing crashed worker gw13
2024-10-27T21:28:18Z tensorflow INTERNALERROR> def worker_internal_error(
2024-10-27T21:28:18Z tensorflow INTERNALERROR> self, node: WorkerController, formatted_error: str
2024-10-27T21:28:18Z tensorflow INTERNALERROR> ) -> None:
2024-10-27T21:28:18Z tensorflow INTERNALERROR> """
2024-10-27T21:28:18Z tensorflow INTERNALERROR> pytest_internalerror() was called on the worker.
2024-10-27T21:28:18Z tensorflow INTERNALERROR>
2024-10-27T21:28:18Z tensorflow INTERNALERROR> pytest_internalerror() arguments are an excinfo and an excrepr, which can't
2024-10-27T21:28:18Z tensorflow INTERNALERROR> be serialized, so we go with a poor man's solution of raising an exception
2024-10-27T21:28:18Z tensorflow INTERNALERROR> here ourselves using the formatted message.
2024-10-27T21:28:18Z tensorflow INTERNALERROR> """
2024-10-27T21:28:18Z tensorflow INTERNALERROR> self._active_nodes.remove(node)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> try:
2024-10-27T21:28:18Z tensorflow INTERNALERROR> > assert False, formatted_error
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E AssertionError: Traceback (most recent call last):
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 271, in wrap_session
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E session.exitstatus = doit(config, session) or 0
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 325, in _main
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E config.hook.pytest_runtestloop(session=session)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 182, in _multicall
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E return outcome.get_result()
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/pluggy/_result.py", line 100, in get_result
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E raise exc.with_traceback(exc.__traceback__)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E res = hook_impl.function(*args)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 174, in pytest_runtestloop
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E self.run_one_test()
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 185, in run_one_test
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E item = items[self.item_index]
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E IndexError: list index out of range
2024-10-27T21:28:18Z tensorflow INTERNALERROR> E assert False
2024-10-27T21:28:18Z tensorflow INTERNALERROR>
2024-10-27T21:28:18Z tensorflow INTERNALERROR> /root/.local/lib/python3.10/site-packages/xdist/dsession.py:232: AssertionError
2024-10-27T21:28:18Z tensorflow INTERNALERROR> Traceback (most recent call last):
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 273, in wrap_session
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 327, in _main
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
2024-10-27T21:28:18Z tensorflow INTERNALERROR> return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
2024-10-27T21:28:18Z tensorflow INTERNALERROR> return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 139, in _multicall
2024-10-27T21:28:18Z tensorflow INTERNALERROR> raise exception.with_traceback(exception.__traceback__)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 122, in _multicall
2024-10-27T21:28:18Z tensorflow INTERNALERROR> teardown.throw(exception) # type: ignore[union-attr]
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/root/.local/lib/python3.10/site-packages/_pytest/logging.py", line 796, in pytest_runtestloop
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
2024-10-27T21:28:18Z tensorflow INTERNALERROR> res = hook_impl.function(*args)
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
2024-10-27T21:28:18Z tensorflow INTERNALERROR> self.loop_once()
2024-10-27T21:28:18Z tensorflow INTERNALERROR> File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 152, in loop_once
2024-10-27T21:28:18Z tensorflow INTERNALERROR> raise RuntimeError("Unexpectedly no active workers available")
2024-10-27T21:28:18Z tensorflow INTERNALERROR> RuntimeError: Unexpectedly no active workers available
The issue can be reproduced by creating a dummy test that allocates a large amount of memory:
PYTHON
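def test_oom():
    # Each list holds 2**28 references (about 2 GiB on 64-bit CPython), so
    # 175 of them (about 350 GiB in total) are meant to exhaust memory.
    large_memory_allocation = []
    for _ in range(175):
        large_memory_allocation.append([0] * (1024**3 // 4))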
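If exhausting host memory is impractical, a lighter variant might be to have the test process send itself SIGKILL, which is the signal the Linux OOM killer uses. This is only a sketch under that assumption: the helper name is hypothetical, it is POSIX-only, and whether it triggers the same IndexError as a real OOM kill has not been confirmed.

```python
import os
import signal


def kill_self_like_oom_killer() -> None:
    """Terminate the current process abruptly with SIGKILL.

    Assumption: the OOM killer ends the pytest-xdist worker with SIGKILL,
    so the worker gets no chance to clean up or report back.
    Hypothetical helper, POSIX only.
    """
    os.kill(os.getpid(), signal.SIGKILL)


# Hypothetical usage inside a test module run with `pytest -n 2`:
#
# def test_simulated_oom():
#     kill_self_like_oom_killer()
```

Unlike the allocation loop, this does not put memory pressure on the other workers or on the controller, so it may not exercise the same code path.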
I suspect that the worker and the master (controller) process are not synchronizing correctly after the crash, so the communication between them is left incomplete.
Note: This issue is observed only with a large test suite.
Could you please help identify what is causing this IndexError and how to resolve it, so that pytest-xdist can handle OOM errors gracefully?
Thanks! Loveleen.