theislab / cellrank

CellRank: dynamics from multi-view single-cell data
https://cellrank.org
BSD 3-Clause "New" or "Revised" License

PETSc fails when computing absorption probabilities #473

Closed Marius1311 closed 3 years ago

Marius1311 commented 3 years ago

When computing absorption probabilities on the lung data using g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=8), my kernel dies and I get the following error message in the terminal:

[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
application called MPI_Abort(MPI_COMM_WORLD, 50162059) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=50162059
:

Versions:

cellrank==1.1.0+gb36eac8 scanpy==1.6.0 anndata==0.7.4 numpy==1.19.5 numba==0.52.0 scipy==1.5.3 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0 python-igraph==0.8.3 scvelo==0.2.2 pygam==0.8.0 matplotlib==3.2.2 seaborn==0.11.0

...

Update 1: Running just g_fwd.compute_absorption_probabilities(n_jobs=8) works fine.

Update 2: Using just a single core, i.e. g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=1) also works fine.
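For reference, a minimal sketch of the three call variants compared above (assuming g_fwd is a GPCCA estimator that has already been set up on the forward kernel; data loading and kernel construction are not shown):

# Crashes with the PETSc "Broken Pipe" error (8 worker processes):
g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=8)

# Works: default solver, still parallelised over 8 jobs.
g_fwd.compute_absorption_probabilities(n_jobs=8)

# Works: PETSc/GMRES, but restricted to a single job.
g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=1)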

michalk8 commented 3 years ago

Which multiprocessing backend are you using (the default is loky)? Could you try with backend='threading'? Could you also try with e.g. n_jobs=2?
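For example, the suggested variations would look like this (same call as in the original report, only the parallelisation parameters change):

# Thread-based backend instead of the default process-based loky backend:
g_fwd.compute_absorption_probabilities(
    use_petsc=True, solver='gmres', n_jobs=8, backend='threading'
)

# Fewer worker processes, to check whether the crash depends on the number of jobs:
g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=2)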

michalk8 commented 3 years ago

Also, do you have a longer error log? The one above does not really help.

Marius1311 commented 3 years ago

Updates:

Running with 6 jobs gives:

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
<ipython-input-31-fb9aa4c52d9e> in <module>
----> 1 g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=6)

~/Projects/cellrank/cellrank/tl/estimators/_base_estimator.py in compute_absorption_probabilities(self, keys, check_irred, solver, use_petsc, time_to_absorption, n_jobs, backend, show_progress_bar, tol, preconditioner)
    479 
    480         # solve the linear system of equations
--> 481         mat_x = _solve_lin_system(
    482             q,
    483             s,

~/Projects/cellrank/cellrank/tl/_linear_solver.py in _solve_lin_system(mat_a, mat_b, solver, use_petsc, preconditioner, n_jobs, backend, tol, use_eye, show_progress_bar)
    463 
    464         # can't pass PETSc matrix - not pickleable
--> 465         mat_x, n_converged = parallelize(
    466             _solve_many_sparse_problems_petsc,
    467             mat_b,

~/Projects/cellrank/cellrank/ul/_parallelize.py in wrapper(*args, **kwargs)
     99             pbar, queue, thread = None, None, None
    100 
--> 101         res = jl.Parallel(n_jobs=n_jobs, backend=backend)(
    102             jl.delayed(callback)(
    103                 *((i, cs) if use_ixs else (cs,)),

~/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

~/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~/miniconda3/envs/cellrank_revision/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    437                 raise CancelledError()
    438             elif self._state == FINISHED:
--> 439                 return self.__get_result()
    440             else:
    441                 raise TimeoutError()

~/miniconda3/envs/cellrank_revision/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386     def __get_result(self):
    387         if self._exception:
--> 388             raise self._exception
    389         else:
    390             return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGABRT(-6), SIGABRT(-6), SIGABRT(-6), SIGABRT(-6), SIGABRT(-6)}

Marius1311 commented 3 years ago

Oh, after that error, running with 2 jobs also fails. Let me restart my kernel and try again.

Marius1311 commented 3 years ago

Okay, after restarting the kernel, it does work with 6 jobs.

Marius1311 commented 3 years ago

For 8 jobs, it still fails; longer error log below:

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Couldn't close file
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 13 Broken Pipe: Likely while reading or writing to a socket
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
application called MPI_Abort(MPI_COMM_WORLD, 50162059) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=50162059
:
system msg for write_line failure : Bad file descriptor
/Users/marius/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 2 leaked file objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/marius/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 8 leaked semlock objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/marius/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 2 leaked folder objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/marius/miniconda3/envs/cellrank_revision/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:333: UserWarning: resource_tracker: /var/folders/mx/0hyv8t2s26jdj79f55kvc_b80000gn/T/joblib_memmapping_folder_5593_40c65181fc744a9ca57ec0230f7941dd_2713869d33424586bc1f571723ca1820: FileNotFoundError(2, 'No such file or directory')
  warnings.warn('resource_tracker: %s: %r' % (name, e))
[I 10:11:57.137 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
WARNING:root:kernel 04d152c9-f0a9-4298-8f39-1a8425757e3b restarted

Marius1311 commented 3 years ago

Using backend='threading' with 8 jobs, as in g_fwd.compute_absorption_probabilities(use_petsc=True, solver='gmres', n_jobs=8, backend='threading'), works!
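For context, a minimal joblib sketch (independent of CellRank, with a placeholder worker function) illustrating the difference: the default loky backend runs tasks in separate worker processes, whose arguments must be pickled, whereas backend='threading' runs them in threads of the same process; this avoids the pickling that PETSc matrices do not support (see the "can't pass PETSc matrix - not pickleable" comment in the traceback above).

import joblib

def solve_chunk(chunk):
    # Placeholder for the per-chunk work (e.g. a batch of linear solves).
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]

# Default: process-based workers (loky); arguments are pickled to the workers.
res_processes = joblib.Parallel(n_jobs=2)(
    joblib.delayed(solve_chunk)(c) for c in chunks
)

# Thread-based workers in the current process; no pickling, shared memory.
res_threads = joblib.Parallel(n_jobs=2, backend='threading')(
    joblib.delayed(solve_chunk)(c) for c in chunks
)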

Marius1311 commented 3 years ago

I'm closing this; I think this problem is specific to my machine.

BioFalcon commented 3 years ago

Hello, I've been using CellRank recently and have been running into this same error, so far only when running cr.tl.lineages() with the parameter backward=True, as well as with cr.tl.initial_states(). I'm using the most recent version of CellRank (v1.1.0).

Marius1311 commented 3 years ago

Hi @BioFalcon, the most recent version is 1.2, could you update and try again? If that doesn't help, could you try passing backend='threading'? The above is a parallelisation issue; if nothing else helps, you can probably resolve it by turning off parallelisation, i.e. setting n_jobs=1.
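Concretely, a sketch of these fallbacks for the high-level API (assuming adata is your AnnData object, that the backward flag is spelled backward in the 1.x API, and that cr.tl.lineages forwards n_jobs and backend to the absorption-probability computation, as the estimator method does):

import cellrank as cr

# 1) Try the thread-based backend first:
cr.tl.lineages(adata, backward=True, backend='threading')

# 2) If that still crashes, disable parallelisation altogether:
cr.tl.lineages(adata, backward=True, n_jobs=1)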

BioFalcon commented 3 years ago

Hi, I tried all of the above and am still getting the error. Somehow, output is still being written into the adata object, but I was wondering whether this might impact downstream analyses.

Marius1311 commented 3 years ago

Hi @BioFalcon, I will reopen this issue for you and @michalk8 will look into this with you; however, that will take until next week as it's exam season at the moment, sorry about that!

Marius1311 commented 3 years ago

@BioFalcon, can you please post the error you are getting when you run the function without parallelisation, i.e. cr.tl.lineages(n_jobs=1)?

Marius1311 commented 3 years ago

I'm assuming that this issue has been solved.

alefrol638 commented 1 year ago

Hi,

from cellrank.tl.kernels import VelocityKernel, ConnectivityKernel
from cellrank.tl.estimators import GPCCA

vk = VelocityKernel(adata)
vk.compute_transition_matrix()

ck = ConnectivityKernel(adata).compute_transition_matrix()

combined_kernel = 0.8 * vk + 0.2 * ck

g = GPCCA(combined_kernel)
print(g)

g.compute_schur(n_components=5)
g.plot_spectrum()

This always crashes at the step where the Schur decomposition is computed.
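If the crash happens in the sparse (SLEPc/PETSc-based) part of the Schur decomposition, one workaround worth trying, assuming the installed version exposes a method argument with a dense 'brandts' option as in recent GPCCA estimators, is:

# Hypothetical workaround: fall back to the dense, SciPy-based Schur decomposition
# instead of the Krylov (SLEPc/PETSc) one; only feasible for a moderate number of cells.
g.compute_schur(n_components=5, method='brandts')
g.plot_spectrum()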

- This is the output of cellrank.logging.print_versions(): 

cellrank==1.5.1 scanpy==1.9.1 anndata==0.8.0 numpy==1.21.2 numba==0.56.3 scipy==1.9.3 pandas==1.5.1 pygpcca==1.0.4 scikit-learn==1.1.3 statsmodels==0.13.5 python-igraph==0.10.2 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.6.2 seaborn==0.11.2



Thanks!

alefrol638 commented 1 year ago

Never mind, reinstalling the environment seemed to help.