theislab / cellrank

CellRank: dynamics from multi-view single-cell data
https://cellrank.org
BSD 3-Clause "New" or "Revised" License

The kernel appears to have died. It will restart automatically #399

Closed roofya closed 4 years ago

roofya commented 4 years ago

Hi, I'm running the Pancreas Basics tutorial and it works fine up to the velocity part, but for cr.tl.terminal_states(adata, cluster_key='clusters', weight_connectivities=0.2) I get:

Computing transition matrix based on velocity correlations using 'deterministic' mode
Estimating softmax_scale using 'deterministic' mode
100% 2531/2531 [00:03<00:00, 714.91cell/s]
Setting softmax_scale=3.7951
100% 2531/2531 [00:01<00:00, 1420.24cell/s]
    Finish (0:00:03)
Using a connectivity kernel with weight 0.2
Computing transition matrix based on connectivities
    Finish (0:00:00)
Computing eigendecomposition of the transition matrix
Adding .eigendecomposition
       adata.uns['eig_fwd']
    Finish (0:00:00)
Computing Schur decomposition

and then suddenly the kernel appears to have died and gets restarted. I would really appreciate your help with fixing this problem.

Thank you

michalk8 commented 4 years ago

Hi @roofya , this sounds very strange. I assume you're using the SLEPc/PETSc libraries from cellrank-krylov. If so, can you please post the output of python -c "import slepc4py; import petsc4py; print(slepc4py.__version__, petsc4py.__version__)"? Currently, the only thing that comes to my mind is this line, which densifies the matrix (https://github.com/msmdev/msmtools/blob/krylov_schur/msmtools/util/sorted_schur.py#L283) if SLEPc/PETSc is NOT installed (however, after testing this locally, my notebook doesn't crash).
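
If it helps, a minimal check along the lines of the sketch below (just illustrative, not CellRank API) should tell us whether the crash already happens when SLEPc/PETSc initialize, rather than inside CellRank:

# Minimal in-notebook check (a sketch): if the kernel dies while this cell runs,
# the crash happens during SLEPc/PETSc (and hence MPI) initialization,
# not inside CellRank itself.
import sys

import petsc4py
import slepc4py

print(slepc4py.__version__, petsc4py.__version__)

slepc4py.init(sys.argv)       # initializes SLEPc, PETSc and MPI in this process
from petsc4py import PETSc    # noqa: F401 -- importing these is the actual check
from slepc4py import SLEPc    # noqa: F401

print("SLEPc/PETSc initialized without crashing the kernel")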

As the next thing: could you please start your notebook as jupyter notebook --debug > log.txt 2>&1 and post the log.txt here (ideally as an attachment, it might get big)? Finally, what's your Python version and OS? I've tested this in a fresh conda environment with cellrank-krylov (Python 3.8.5, Debian bullseye) and no crash happened.

Apart from the above, maybe this thread can help to solve the issue: https://github.com/jupyter/notebook/issues/1892

roofya commented 4 years ago

@michalk8 Thank you so much, it seems that the problem was related to SLEPc/PETSc. I have uninstalled and reinstalled them and now it works fine.

Marius1311 commented 4 years ago

Awesome! Thanks @michalk8 for fixing this so quickly! @roofya, great that you're checking out CellRank, let us know via issues in case you encounter any other problems - we're happy to help.

ccruizm commented 4 years ago

Good day,

I am having the same problem. When running cr.tl.terminal_states the kernel dies. I decided to create a new env, installing all packages through conda, but I still have the same issue. In the command-line output while executing the notebook, there is the following error message:

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[n0078.compute.hpc:71628] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

When installing through conda in the new env, I noticed the following message:

For Linux 64, Open MPI is built with CUDA awareness but this support is disabled by default.
To enable it, please set the environmental variable OMPI_MCA_opal_cuda_support=true before launching your MPI processes. Equivalently, you can set the MCA parameter in the command line: mpiexec --mca opal_cuda_support 1 ...

I tried to export OMPI_MCA_opal_cuda_support=true before launching the Jupyter notebook, but the kernel still dies. I also tried to install SLEPc/PETSc via pip but I always get an error and the packages are not installed.
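
(An equivalent in-notebook variant would be something like the sketch below; this assumes the variable only needs to be present in the kernel's environment before petsc4py/slepc4py initialize MPI.)

# Sketch: set the Open MPI option from inside the notebook, before anything that
# pulls in petsc4py/slepc4py (and therefore MPI) gets imported.
import os

os.environ["OMPI_MCA_opal_cuda_support"] = "true"

import cellrank as cr  # noqa: E402 -- import only after the variable is set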

Answering the questions you asked the other person with the same problem:

python -c "import slepc4py; import petsc4py; print(slepc4py.__version__, petsc4py.__version__)"
3.13.0 3.13.0

What do you think the problem might be? Thanks in advance for your help!

michalk8 commented 4 years ago

I did a little digging; both of the links below mention that this can happen when Open MPI was already present on the system, and that it can be solved by reinstalling it:

Hope this helps.

Marius1311 commented 4 years ago

@ccruizm , did this solve your issue?

ccruizm commented 4 years ago

Unfortunately, it did not. I have access to another server, so I created a new conda env with cellrank there and it ran with no issues. I do not understand why this happens on my other HPC. I had the 'same' issue before with scVelo (https://github.com/theislab/scvelo/issues/198). I could not find out why it kept killing the kernel, but it ran on the other server. However, after they updated the package (from v.0.2.0 to v.0.2.2) the problem was solved, so I do not know why this keeps happening. Any thoughts? Thanks!

Doris-Fu commented 2 years ago

Hi @michalk8 @Marius1311 I'm experiencing the same issue as @roofya. My kernel always dies on Computing Schur decomposition when I run cr.tl.initial_states. I'm using my own dataset, which has 23k cells. I have removed and reinstalled the dependencies you mentioned, but the problem persists. I also tried to downsample my data to 6k cells, but that didn't help.

I ran python -c "import slepc4py; import petsc4py; print(slepc4py.__version__, petsc4py.__version__)" and got 3.16.1 3.16.1.

I have also attached the log.txt.

Do you have any suggestions for this problem? Thank you!

Marius1311 commented 2 years ago

Hi @Doris-Fu, could you please check whether this works if you don't use SLEPc/PETSc? You can do that easily by running via the low-level mode (check the kernels and estimators tutorial) and passing method="brandts" in estimator.compute_schur(), see https://cellrank.readthedocs.io/en/stable/api/cellrank.tl.estimators.GPCCA.compute_schur.html#cellrank.tl.estimators.GPCCA.compute_schur
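
Roughly, that would look like the sketch below (the 0.2 connectivity weight mirrors the high-level call; n_components and n_states are just example values, adjust them to your data):

# Sketch of the low-level workflow with a SciPy-based Schur decomposition
# (CellRank 1.x API); no SLEPc/PETSc is involved in compute_schur here.
from cellrank.tl.kernels import VelocityKernel, ConnectivityKernel
from cellrank.tl.estimators import GPCCA

# adata: your AnnData object with velocities already computed
vk = VelocityKernel(adata).compute_transition_matrix()
ck = ConnectivityKernel(adata).compute_transition_matrix()
combined = 0.8 * vk + 0.2 * ck   # same as weight_connectivities=0.2

g = GPCCA(combined)
g.compute_schur(n_components=20, method="brandts")   # SciPy instead of SLEPc
g.compute_macrostates(n_states=3, cluster_key="clusters")

From there you can continue as in the tutorial; only the backend of the Schur decomposition changes.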

Doris-Fu commented 2 years ago

@Marius1311 Yes, this worked for me! Thanks a lot!

lengfei5 commented 1 year ago

Hi there,

I would also like to report that I am having the same issue as @Doris-Fu , namely that the kernel dies immediately after Computing Schur decomposition. The slepc4py and petsc4py versions are as follows:

python -c "import slepc4py; import petsc4py; print(slepc4py.__version__, petsc4py.__version__)"
3.17.2 3.17.4

One observation is that my slepc and petsc are all installed via conda from conda-forge. I have tried to reinstall them with pip and got an error message.

lengfei5 commented 1 year ago

I have tried to run the low-level mode with method='brandts' in estimator.compute_schur() and g.compute_absorption_probabilities(use_petsc=False) to get around SLEPc/PETSc.

I have ~50k cells; will this still work in terms of running time? And will the analysis differ at all with/without SLEPc/PETSc?

Thanks.

Marius1311 commented 1 year ago

Hi @lengfei5, the results will be equivalent, just the compute time will be much longer if you use brandts.
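
In other words, both places where SLEPc/PETSc is used can be switched off, roughly like this (a sketch; g is the GPCCA estimator from the low-level mode, parameter values are placeholders):

# Sketch (CellRank 1.x GPCCA estimator): the two steps where SLEPc/PETSc can be avoided.
# The results should be the same, only the runtime changes.
g.compute_schur(n_components=20, method="brandts")   # SciPy Schur instead of SLEPc

# ... compute macrostates / terminal states as usual ...

g.compute_absorption_probabilities(
    use_petsc=False,   # SciPy sparse solvers instead of PETSc
    solver="gmres",    # iterative solver, shown here only for clarity
)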

josegarciamanteiga commented 1 year ago

Hi all, I ran into the same error. I thought it could be something about Jupyter, so I launched it in the terminal. I am using a cluster through SLURM, with srun into a node. The error was far more informative:

"[srcn02:06565] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112

The application appears to have been direct launched using "srun", but OMPI was not built with SLURM's PMI support and therefore cannot execute. There are several options for building PMI support under SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or PMI-2 support. SLURM builds PMI-1 by default, or you can manually install PMI-2. You must then build Open MPI using --with-pmi pointing to the SLURM PMI library location.

Please configure as appropriate and try again.

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[srcn02:06565] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!"

jnmaciuch commented 9 months ago

Hello,

I'm experiencing the same issue as above. I was able to work around the error for compute_schur() by specifying method="brandts", however I am now receiving the same error message when trying to run compute_fate_probabilities(). I am also on an HPC using SLURM.

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[qnode2038:59792] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Marius1311 commented 9 months ago

Do you have any idea how this could be resolved, @michalk8?

josegarciamanteiga commented 9 months ago

Hi, my workaround was to launch JupyterLab with sbatch rather than with an interactive session using srun. Apparently, PMI support (--with-pmi) is automatically loaded with sbatch but not with srun. Everything worked after that. You just need to check which node your sbatch job is running on and then open the tunnels, or whatever you need, to open your Jupyter session in the browser. HTH, Jose


jnmaciuch commented 9 months ago

@josegarciamanteiga thank you, that seemed to solve the issue! I am just getting the following warning when I run compute_fate_probabilities():

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   qnode0465
  Local device: mlx5_0
--------------------------------------------------------------------------

However, the function seems to complete without errors, so I'm guessing this is okay to ignore.

jaredjeya commented 7 months ago

So I'm using PETSc/SLEPc for a completely different application, but this is one of the only results on Google for someone getting the same obscure error as me. Running on a Sun Grid Engine-managed cluster, I get that error and only a single CPU core is used (despite requesting three), even though when I run on a regular computer it parallelises very efficiently. I'm not using any kind of MPI; I only specify #$ -pe smp 3. So I wonder if the lack of parallelism and that error are related.