gustaveroussy / sopa

Technology-invariant pipeline for spatial omics analysis (Xenium / Visium HD / MERSCOPE / CosMx / PhenoCycler / MACSima / ...) that scales to millions of cells
https://gustaveroussy.github.io/sopa/
BSD 3-Clause "New" or "Revised" License

[Bug] Issue with dask on branch sopa2 #129

Open lguerard opened 4 weeks ago

lguerard commented 4 weeks ago

Description

When running the Cellpose segmentation using the dask backend, the cell crashes after a while.

Multiple workers showed the error "exceeded 95% memory budget. Restarting...". Then, after a while, it says that a task will be "marked as failed because 4 workers died while trying to run it".

Then it completely crashes with these errors:

2024-09-25 16:02:04,452 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:58018' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('array-0173291e659995cef21d3d1e6515a34d', 0)} (stimulus_id='handle-worker-cleanup-1727272924.4456077')
2024-09-25 16:02:04,458 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:57843' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'shuffle-taker-1981a4f154033ba88983f1452daf58f3', ('block-info-_map_read_frame-b518e369790450b6bf2ef0f396523719', 0, 0, 0)} (stimulus_id='handle-worker-cleanup-1727272924.4513524')
2024-09-25 16:02:07,609 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,610 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,612 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,614 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,615 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:08,612 - distributed.client - ERROR - 
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\asyncio\tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\utils.py", line 806, in wrapper
    return await func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\client.py", line 1938, in _close
    await self.cluster.close()
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\deploy\spec.py", line 448, in _close
    await self._correct_state()
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\deploy\spec.py", line 359, in _correct_state_internal
    await asyncio.gather(*tasks)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\nanny.py", line 619, in close
    await self.kill(timeout=timeout, reason=reason)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\nanny.py", line 400, in kill
    await self.process.kill(reason=reason, timeout=timeout)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\nanny.py", line 882, in kill
    await process.join(max(0, deadline - time()))
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 330, in join
    await wait_for(asyncio.shield(self._exit_future), timeout)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\utils.py", line 1926, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "S:\anaconda_envs\sopa\lib\asyncio\tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2024-09-25 16:02:08,614 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x000001478C780190>>, <Task finished name='Task-63060' coro=<SpecCluster._correct_state_internal() done, defined at S:\anaconda_envs\sopa\lib\site-packages\distributed\deploy\spec.py:346> exception=TimeoutError()>)
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\asyncio\tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\tornado\ioloop.py", line 750, in _run_callback
    ret = callback()
  File "S:\anaconda_envs\sopa\lib\site-packages\tornado\ioloop.py", line 774, in _discard_future_result
    future.result()
asyncio.exceptions.TimeoutError
Future exception was never retrieved
future: <Future finished exception=PermissionError(13, 'Access is denied', None, 5, None)>
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\multiprocessing\process.py", line 140, in kill
    self._popen.kill()
  File "S:\anaconda_envs\sopa\lib\multiprocessing\popen_spawn_win32.py", line 123, in terminate
    _winapi.TerminateProcess(int(self._handle), TERMINATE)
PermissionError: [WinError 5] Access is denied
Future exception was never retrieved
future: <Future finished exception=PermissionError(13, 'Access is denied', None, 5, None)>
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\multiprocessing\process.py", line 140, in kill
    self._popen.kill()
  File "S:\anaconda_envs\sopa\lib\multiprocessing\popen_spawn_win32.py", line 123, in terminate
    _winapi.TerminateProcess(int(self._handle), TERMINATE)
PermissionError: [WinError 5] Access is denied
Future exception was never retrieved
future: <Future finished exception=PermissionError(13, 'Access is denied', None, 5, None)>
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\multiprocessing\process.py", line 140, in kill
    self._popen.kill()
  File "S:\anaconda_envs\sopa\lib\multiprocessing\popen_spawn_win32.py", line 123, in terminate
    _winapi.TerminateProcess(int(self._handle), TERMINATE)
PermissionError: [WinError 5] Access is denied

Expected behavior

Cellpose patches created and processed

System

quentinblampey commented 4 weeks ago

Thanks @lguerard for detailing the issue. How much RAM and how many CPU cores do you have?

lguerard commented 4 weeks ago

I updated the post with the RAM amount.

As for the CPU, we're running in a virtualized environment that shares resources between different VMs, but each VM should have somewhere between 48 and 64 cores.

quentinblampey commented 4 weeks ago

Alright, thanks for the details. I'm still experimenting with the dask Client, so I'll try to improve it over time to have a stable release in sopa 2.0.0.
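In the meantime, a possible workaround for the "exceeded 95% memory budget" restarts is to create the dask Client yourself with fewer workers, so that each worker gets a larger share of the RAM. The sketch below is only an illustration, not sopa's own setup: `plan_workers` and `start_client` are hypothetical helpers, the 8 GB-per-worker default is a guess to tune per dataset, and whether sopa reuses a Client created this way is an assumption. Only `n_workers`, `threads_per_worker` and `memory_limit` are actual `dask.distributed.Client` / `LocalCluster` arguments.

```python
import math


def plan_workers(total_ram_gb, n_cores, ram_per_worker_gb=8.0):
    """Return keyword arguments for dask.distributed.Client.

    Caps the worker count so that each worker gets roughly
    ram_per_worker_gb of memory (hypothetical default, tune per dataset).
    """
    # Number of workers the RAM can support, at least one
    by_memory = max(1, math.floor(total_ram_gb / ram_per_worker_gb))
    # Never start more workers than there are cores
    n_workers = max(1, min(n_cores, by_memory))
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,
        # memory_limit applies per worker, not to the whole cluster
        "memory_limit": f"{total_ram_gb / n_workers:.1f}GB",
    }


def start_client(total_ram_gb, n_cores):
    """Start a local cluster with the planned limits (needs dask[distributed])."""
    from dask.distributed import Client  # imported lazily on purpose

    return Client(**plan_workers(total_ram_gb, n_cores))
```

For example, with hypothetical values of 64 GB of RAM and 48 cores, `plan_workers(64, 48)` plans 8 workers with a `memory_limit` of `"8.0GB"` each, instead of one worker per core sharing the same memory.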