g.set_terminal_states doesn't change terminal_states_colors and terminal_states_names

zhangjiaobxy commented 3 years ago

Dear CellRank Team,

Thanks for developing this amazing tool! I am following the Kernels and estimators tutorial, and I want to manually set the terminal states for my dataset.

After I run g.set_terminal_states, I can see adata.obs['terminal_states'], adata.uns['terminal_states_colors'], and adata.uns['terminal_states_names'] are all set to my manually defined terminal states. The problem is that when I plot the terminal states cr.pl.terminal_states(adata, discrete=True), the adata.obs['terminal_states'], adata.uns['terminal_states_colors'], and adata.uns['terminal_states_names'] are setting back to the originally computed terminal states.

I tested the same code using the pancreas dataset (datasets/endocrinogenesis_day15.5.h5ad), the error is reproducible. Here is the code:

import scvelo as scv
import cellrank as cr
import scanpy as sc
import numpy as np

adata = cr.datasets.pancreas()
scv.pl.proportions(adata)
adata

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)

scv.tl.recover_dynamics(adata, n_jobs=4)

scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

scv.pl.velocity_embedding_stream(adata, basis='umap', legend_fontsize=12, title='', smooth=.8, min_mass=4)

k = cr.tl.transition_matrix(adata)
g = cr.tl.estimators.GPCCA(k)
g.compute_schur(n_components=20, method='krylov')
g.compute_macrostates(cluster_key='clusters')
g.compute_terminal_states()

Then I check the automatically computed terminal states using the following code:

adata.obs['terminal_states']
adata.uns['terminal_states_colors']
adata.uns['terminal_states_names']

Here is the output:

adata.obs['terminal_states']
Out[24]: 
index
AAACCTGAGAGGGATA-1-3    NaN
AAACCTGAGGCAATTA-1-3    NaN
AAACCTGGTAAGTGGC-1-3    NaN
AAACCTGTCCCTCTTT-1-3    NaN
AAACGGGAGTAGCGGT-1-3    NaN
                       ... 
TTTGGTTTCCTTTCGG-1-3    NaN
TTTGTCAAGAATGTGT-1-3    NaN
TTTGTCAAGTGACATA-1-3    NaN
TTTGTCATCGAATGCT-1-3    NaN
TTTGTCATCTGTTTGT-1-3    NaN
Name: terminal_states, Length: 2531, dtype: category
Categories (3, object): ['Epsilon', 'Alpha', 'Beta']
adata.uns['terminal_states_colors']
Out[25]: array(['#cab2d6', '#1f78b4', '#b2df8a'], dtype='<U7')
adata.uns['terminal_states_names']
Out[26]: array(['Epsilon', 'Alpha', 'Beta'], dtype=object)

Next, manually define/set terminal states:

AlphaCell = (adata[adata.obs['clusters'] == 'Alpha']).obs_names[:30]
BetaCell = (adata[adata.obs['clusters'] == 'Beta']).obs_names[:30]
EpsilonCell = (adata[adata.obs['clusters'] == 'Epsilon']).obs_names[:30]
DeltaCell = (adata[adata.obs['clusters'] == 'Delta']).obs_names[:30]
tLable = pd.DataFrame({
    'Alpha': AlphaCell,
    'Beta': BetaCell,
    'Epsilon': EpsilonCell,
    'Delta': DeltaCell
})
tLable = tLable.to_dict(orient='list')
g.set_terminal_states(labels=tLable, cluster_key='clusters', add_to_existing=False)

Now, using the following code to check if the terminal states are manually set:

adata.obs['terminal_states']
adata.uns['terminal_states_colors']
adata.uns['terminal_states_names']

Here is the output:

adata.obs['terminal_states']
Out[33]: 
index
AAACCTGAGAGGGATA-1-3      NaN
AAACCTGAGGCAATTA-1-3    Alpha
AAACCTGGTAAGTGGC-1-3      NaN
AAACCTGTCCCTCTTT-1-3    Alpha
AAACGGGAGTAGCGGT-1-3    Delta
                        ...  
TTTGGTTTCCTTTCGG-1-3      NaN
TTTGTCAAGAATGTGT-1-3      NaN
TTTGTCAAGTGACATA-1-3      NaN
TTTGTCATCGAATGCT-1-3      NaN
TTTGTCATCTGTTTGT-1-3      NaN
Name: terminal_states, Length: 2531, dtype: category
Categories (4, object): ['Alpha', 'Beta', 'Delta', 'Epsilon']
adata.uns['terminal_states_colors']
Out[34]: ['#1f78b4', '#b2df8a', '#6a3d9a', '#cab2d6']
adata.uns['terminal_states_names']
Out[35]: array(['Alpha', 'Beta', 'Delta', 'Epsilon'], dtype=object)

Everything looks perfect so far, however, if I run this line of code cr.pl.terminal_states(adata, discrete=True). I can get an expected figure.

But if I check the terminal states again:

adata.obs['terminal_states']
adata.uns['terminal_states_colors']
adata.uns['terminal_states_names']

The terminal states are setting back to the original terminates states:

adata.obs['terminal_states']
Out[37]: 
index
AAACCTGAGAGGGATA-1-3      NaN
AAACCTGAGGCAATTA-1-3    Alpha
AAACCTGGTAAGTGGC-1-3      NaN
AAACCTGTCCCTCTTT-1-3    Alpha
AAACGGGAGTAGCGGT-1-3    Delta
                        ...  
TTTGGTTTCCTTTCGG-1-3      NaN
TTTGTCAAGAATGTGT-1-3      NaN
TTTGTCAAGTGACATA-1-3      NaN
TTTGTCATCGAATGCT-1-3      NaN
TTTGTCATCTGTTTGT-1-3      NaN
Name: terminal_states, Length: 2531, dtype: category
Categories (4, object): ['Alpha', 'Beta', 'Delta', 'Epsilon']
adata.uns['terminal_states_colors']
Out[38]: ['#1f77b4', '#ff7f0e', '#279e68']
adata.uns['terminal_states_names']
Out[39]: ['Lineage 0', 'Lineage 1', 'Lineage 2']

At this time, if I run this line of code cr.pl.terminal_states(adata, discrete=True), I got the following error:

Traceback (most recent call last):
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-40-e2210f3bd42d>", line 1, in <module>
    cr.pl.terminal_states(adata, discrete=True)
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/pl/_init_term_states.py", line 150, in terminal_states
    return _initial_terminal(
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/pl/_init_term_states.py", line 92, in _initial_terminal
    mc.plot_terminal_states(
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/tl/estimators/_utils.py", line 148, in delegate
    return getattr(instance, attr)(
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/tl/estimators/_utils.py", line 62, in _method
    return method.__get__(obj, cls)(*args, **kwargs)
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/tl/estimators/_property.py", line 738, in _
    self._plot_discrete(data, prop, **kwargs)
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/tl/estimators/_property.py", line 481, in _plot_discrete
    self.adata.uns[f"{key}_colors"] = [
  File "/Users/zj/anaconda3/envs/1_scMFvenv/lib/python3.8/site-packages/cellrank/tl/estimators/_property.py", line 482, in <listcomp>
    colors[c] for c in data.cat.categories
KeyError: 'Epsilon'

Versions:

cr.logging.print_versions()

cellrank==1.2.0 scanpy==1.7.1 anndata==0.7.5 numpy==1.20.2 numba==0.53.1 scipy==1.6.2 pandas==1.2.3 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.9.1 scvelo==0.2.3 pygam==0.8.0 matplotlib==3.3.4 seaborn==0.11.1

Please feel free to correct me if I do anything wrong. Thanks!

michalk8 commented 3 years ago

Hi @zhangjiaobxy , thanks for a really nice bug-report, I am able to reproduce this in 1.2.0! If you set cr.settings.verbosity = 4 before cr.pl.terminal_states(adata, discrete=True), you should see something like:

DEBUG: Unable to set attribute `._absorption_probabilities`, skipping
DEBUG: Unable to set attribute `._diff_potential`, skipping
DEBUG: Expected lineage names to be of length `3`, found `4`. Creating new names  # this is the problem
DEBUG: Expected lineage colors to be of length `3`, found `3`. Creating new colors  # there was a typo, should say: found `4`

We've fixed a part of this problem recently in 1.3.1, specifically in https://github.com/theislab/cellrank/pull/556 , it determined the number of terminal states based on adata.obsm['macrostates_fwd'], which is still 3. This however fixes only the 1st error (overwriting) and it will warn wrong #colors/names is found.

The 2nd error seems to still be present (calling cr.pl.terminal_states in succession) because of the inconsistency (4 categories in adata.obs['terminal_states'], but still only 3 columns in adata.obsm['macrostates_fwd'] and only the latter is being checked. So it happily creates just 3 names/colors (cr.pl.terminal_states(adata, discrete=False) works and shows only these 3 macrostates), but when plotting these as discrete, the created color mapper only maps 3 names to colors: https://github.com/theislab/cellrank/blob/master/cellrank/tl/estimators/_property.py#L441 (it silently fails because Epsilon is cut out).

I've fixed this in #563 (still need to write tests though).

@Marius1311 we should refactor the estimator (by removing _property.py + defining a state machine) so that it's easier to handle (fix bugs, extend, ...). I will create an issue.

zhangjiaobxy commented 3 years ago

Thanks @michalk8 for your quick reply. I am glad to hear that you found the causes so fast. Please feel free to let me know when you fix it. I am happy to test the code. :)

Marius1311 commented 3 years ago

Hi @michalk8, thanks for taking the initiative here - feel free to draft a refactoring of the estimators and we can discuss it!

Marius1311 commented 3 years ago

Hi @zhangjiaobxy, I believe the bug you reported is fixed in #563, if you don't mind, would be great if you could test this!

zhangjiaobxy commented 3 years ago

Thanks @Marius1311. I can have a test. Should I test it under the developmental version? pip install git+https://github.com/theislab/cellrank@dev

michalk8 commented 3 years ago

@zhangjiaobxy This is still not yet on dev, pip install git+https://github.com/theislab/cellrank@fix_inconsistent_state should do the trick. It's not perfect, but should solve the issue you've had. Proper fix can/wlll be issues when dealing with #564 (without refactoring, it's currently not possible).

zhangjiaobxy commented 3 years ago

@zhangjiaobxy This is still not yet on dev, pip install git+https://github.com/theislab/cellrank@fix_inconsistent_state should do the trick. It's not perfect, but should solve the issue you've had. Proper fix can/wlll be issues when dealing with #564 (without refactoring, it's currently not possible).

Thanks @michalk8 @Marius1311 ! I just tested this version. It works well. I guess the problem has been fixed. Thank you so much for your help!

michalk8 commented 3 years ago

closed via #563

michalk8 commented 3 years ago

Thanks a lot @zhangjiaobxy for testing the changes.

theislab / cellrank

g.set_terminal_states doesn't change terminal_states_colors and terminal_states_names #562

Versions: