directed paga_pie shows unknown grey color

zhangjiaobxy commented 3 years ago

Dear CellRank Team,

Thanks for your effort to make CellRank so user-friendly.

I was playing around with the pancreas data datasets/endocrinogenesis_day15.5.h5ad. After I run g.compute_macrostates, I randomly picked 4 clusters and set them as terminal states using the command set_terminal_states_from_macrostates. When I run this code cr.pl.cluster_fates to plot directed PAGA using the paga_pie mode, the directed PAGA pie figure shows an unknown grey color. However, I think if there are 4 terminal states, the paga_pie should show no more than 4 colors. Now with the grey color, there are 5 colors in total. Here is the code to reproduce the scene:

import scvelo as scv
import cellrank as cr
import scanpy as sc
import numpy as np

scv.settings.verbosity = 3
scv.settings.presenter_view = True
cr.settings.verbosity = 4

adata = cr.datasets.pancreas()
scv.pl.proportions(adata)

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)

scv.tl.recover_dynamics(adata, n_jobs=4)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

k = cr.tl.transition_matrix(adata)
g = cr.tl.estimators.GPCCA(k)
g.compute_schur(n_components=20, method='krylov')
g.compute_macrostates(cluster_key='clusters', n_states=11)
g.plot_macrostates()

g.set_terminal_states_from_macrostates(names=['Delta', 'Ngn3 high EP_1', 'Ngn3 high EP_4', 'Ngn3 high EP_5'])

cr.tl.lineages(adata)

scv.tl.recover_latent_time(adata, root_key='initial_states_probs', end_key='terminal_states_probs')
scv.tl.paga(adata, groups='clusters', root_key='initial_states_probs', end_key='terminal_states_probs', use_time_prior='velocity_pseudotime')

cr.pl.cluster_fates(adata, mode="paga_pie", cluster_key="clusters", basis='umap',
                    legend_kwargs={'loc': 'top right out'}, legend_loc='top left out',
                    node_size_scale=5, edge_width_scale=1, max_edge_width=4, title='directed PAGA')

Versions:

cr.logging.print_versions()

cellrank==1.3.1 scanpy==1.7.1 anndata==0.7.5 numpy==1.20.2 numba==0.53.1 scipy==1.6.2 pandas==1.2.3 pygpcca==1.0.2 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.9.1 scvelo==0.2.3 pygam==0.8.0 matplotlib==3.3.4 seaborn==0.11.1

zhangjiaobxy commented 3 years ago

Sorry, just following the above code. I saved the adata on my laptop, when I read in the adata and plot directed paga, I got an error as below. I was wondering how could I save the adata, so next time when I read in the adata, I can plot directed paga without running cr.tl.lineages(), scv.tl.recover_latent_time(), scv.tl.paga() again?

outDir = './datasets/'
adata.write(os.path.join(outDir, 'pan_adata_paga.h5ad'))

adata_readIn = scv.read(os.path.join(outDir, 'pan_adata_paga.h5ad'))

cr.pl.cluster_fates(adata_readIn, mode="paga_pie", cluster_key="clusters", basis='umap',
                    legend_kwargs={'loc': 'top right out'}, legend_loc='top left out',
                    node_size_scale=5, edge_width_scale=1, max_edge_width=4, title='directed PAGA')

Here is the error information:

Traceback (most recent call last):
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-20-a804d7b4d247>", line 1, in <module>
    cr.pl.cluster_fates(adata_readIn, mode="paga_pie", cluster_key="clusters", basis='umap',
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/pl/_cluster_fates.py", line 475, in cluster_fates
    lin_names = list(adata.obsm[lk].names)
AttributeError: 'numpy.ndarray' object has no attribute 'names'

Sorry for so many questions. I'd highly appreciate any suggestions on this.

michalk8 commented 3 years ago

Hi @zhangjiaobxy ,

thanks again for the nice reproducible code! Re your first question, why is there grey: grey signifies "missing" values in the pie chart i.e. abs. probabilities not summing to 1. However, this should not happen - they should sum to 1. I've checked whether the abs. probabilities sum to 1, unfortunately, they do not - I am not sure why we're not checking this in this function (we really should..., sorry for that).

Increasing the tolerance to tol=1e-12 (from tol=1e-6) seems to work (at least for use_petsc=True, for use_petsc=False, I don't have the patience to run it - it will most likely take ages). Alternative is to specify solver='direct' - both PETSc/scipy produce correct (i.e. summing to 1 results) or use preconditioner='lu'/preconditioner='ilu' (both of these seem to have worked).

re your 2nd question: It's currently impossible to serialize custom classes to AnnData, so just the array+names+colors are saved (separately). You can either use

# just the extra Lineage class reconstruction
adata = cr.read(os.path.join(outDir, 'pan_adata_paga.h5ad'))  #  which just wraps `scv.read` by default

or do

adata.obsm['to_terminal_states'] = cr.tl.Lineage(
    adata.obsm['to_terminal_states'],
    names=adata.uns['terminal_states_names'],
    colors=adata.uns['terminal_states_colors']
)

ping @Marius1311:

[ ] why does gmres fail (the min. mass for 1 cell is 0.859912)? the cond. number of Q is 3.355078e+12 (if I set more "reasoanble term. states, such as ['Delta', 'Beta', 'Alpha'], it's ~1 order of magnitude lower, 8.591531e+11
[ ] any idea why PETSc/scipy does not return converged=False? Should we increase the tolerance?
[ ] I just noticed, the preconditioner=None' is the default (though it was 'ilu', which def. helps in this case, it almost sums to 1 [min is 0.995908], for 'lu' is 1.0), was it always None or did we change it? Should we enable it?

Marius1311 commented 3 years ago

Hi @michalk8, in @zhangjiaobxy's original code he posted above, he run cr.tl.lineages in place of g.compute_absorption_probabilities, i.e. switching from low to high level mode halfway though - did you check whether this could have caused the issue?

Re the tolerance values, if you change it from 1e-6 to 1e-12 than you make it smaller, so you decrease it, right? This confused me a bit, because above, you said you increased it.

So let me understand this - on these ill-conditioned problems, the direct solvers work but gmres fails in the sense that abs. probs don't sum to one. If you change the preconditioner for gmres or if you decrease the tolerance, then the issue is solved, is that correct?

michalk8 commented 3 years ago

Hi @michalk8, in @zhangjiaobxy's original code he posted above, he run cr.tl.lineages in place of g.compute_absorption_probabilities, i.e. switching from low to high level mode halfway though - did you check whether this could have caused the issue?

No, this is solver-related issue (I've used the low-level API all the way).

Re the tolerance values, if you change it from 1e-6 to 1e-12 than you make it smaller, so you decrease it, right? This confused me a bit, because above, you said you increased it.

Yes, I've decreased the tol, sorry for confusion, it was late.

So let me understand this - on these ill-conditioned problems, the direct solvers work but gmres fails in the sense that abs. probs don't sum to one. If you change the preconditioner for gmres or if you decrease the tolerance, then the issue is solved, is that correct?

Yes.

Marius1311 commented 3 years ago

Ok, I think this could be numerical issues - if you look at the problem, than for the later cells, all provided terminal states are extremely unlikely, you're multiplying many very small probabilities together, so numerical underflow can become an issue. So this doesn't completely surprise me. What does surprise me is that the solvers don't complain, i.e. raise some sort of flag.

We should check whether abs. probs sum to one after computations and raise a warning if they don't. Re pre-conditioning, we have to try this on a few more example before we make this change. However, let's keep in mind for the future that preconditioning with lu worked best in this case. Is that correct?

Marius1311 commented 3 years ago

BTW, @zhangjiaobxy thanks for doing such rigorous testing with CellRank! We're unearthing some real edge cases here, which is great!

zhangjiaobxy commented 3 years ago

Thanks @michalk8 and @Marius1311 for investigating this!

Hi @michalk8, your suggestions for my 2nd question solved my problem perfectly. Thank you so much!

michalk8 commented 3 years ago

Hi @michalk8, your suggestions for my 2nd question solved my problem perfectly. Thank you so much!

Glad that it worked. On the dev branch, there's a check if the abs. probs. do not sum to 1. I'm closing this, the discussion whether to enable preconditioners/change tolerance defaults/etc. should be done in the estimator refactoring issue #566 .

closed via #567

theislab / cellrank

directed paga_pie shows unknown grey color #565

Versions: