Closed hyjforesight closed 2 years ago
What causes this inconsistency?
In the basics
tutorial, there are 2 calls to cr.tl.initial_states
and cr.tl.terminal_states
, which under the hood do something along these lines:
def run(compute_initial_states: bool):
backward = True if compute_initial_states else False
vk = cr.tl.kernels.Velocity(adata, backward=backward).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=backward).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g = cr.tl.estimators.GPCCA(k)
...
g.compute_terminal_states()
return g
g_fwd = run(compute_initial_states=False) # will save "terminal_states_probs" etc.
g_bwd = run(compute_initial_states=True) # will save "initial_states_probs" etc.
...
whereas in your latter case you call g._compute_initial_states()
- GPCCA gives you macrostates, which can sometimes also correspond to initial states, though we don't use this function often (that's why it's private).
In short, to get both initial and terminal states, you have to initialize 2 types of kernels (backward and forward, respectively) and analyze them with the estimator in order to correctly mimic the high-level API behavior.
Hello @michalk8 Thanks for the response. My previous understanding of the workflow is as below. By following this workflow, I can only calculate lineage genes for terminal states, unavailable for initial states.
import kernel
kernel.compute_transition_matrix
import estimator
estimator.compute_schur()
estimator.compute_macrostates()
# run 1 of these 3 to automatically or manully set terminal states from macrostates
estimator.compute_terminal_states()
estimator.set_terminal_states_from_macrostates()
estimator.set_terminal_states()
# run 1 of these 2 to automatically or manully set initial states from macrostates
estimator._compute_initial_states()
estimator._set_initial_states_from_macrostates()
# draw fate map
estimator.compute_absorption_probabilities()
# calculate lineage genes for terminal states or initial states, but it fails for initial states
estimator.compute_lineage_drivers(lineages=' ')
Based on your explanation, my current understanding is that I should call the kernel again for calculating initial states. Am I right?
import kernel
kernel.compute_transition_matrix
import estimator
estimator.compute_schur()
estimator.compute_macrostates()
# run 1 of these 3 to automatically or manully set terminal states from macrostates
estimator.compute_terminal_states()
estimator.set_terminal_states_from_macrostates()
estimator.set_terminal_states()
# draw fate map
estimator.compute_absorption_probabilities()
# calculate lineage genes for terminal states
estimator.compute_lineage_drivers(lineages=' ')
import kernel
kernel.compute_transition_matrix
import estimator
estimator.compute_schur()
estimator.compute_macrostates()
# run 1 of these 2 to automatically or manully set initial states from macrostates
estimator._compute_initial_states()
estimator._set_initial_states_from_macrostates()
# draw fate map
estimator.compute_absorption_probabilities()
# calculate lineage genes for initial states
estimator.compute_lineage_drivers(lineages=' ')
Thanks!
Based on your explanation, my current understanding is that I should call the kernel again for calculating initial states. Am I right?
Yes, but with the inverted directionality, you shouldn't call (_compute_initial_states()
or _set_initial_states_from_macrostates()
in the 2nd case, since it will not work - the functions do update the estimator's attributes needed in compute_absorption_probabilities
.
When backward=True
, compute_terminal_states
actually computes the initial states - we couldn't find a better name for it so we overloaded this term (reasoning: the terminal states of the backward process correspond to the initial states of the forward process). To summarize, if you want to compute both initial and terminal states (and later use them e.g. in directed PAGA), you'd have to run:
# compute initial states (note the `backward=True`) and run the full estimator analysis
vk = cr.tl.kernels.VelocityKernel(adata, backward=True).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=True).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_bwd = cr.tl.estimators.GPCCA(k)
g_bwd.compute_schur()
g_bwd.compute_macrostates()
g_bwd.compute_terminal_states() # or g_bwd.set_terminal_states_from_macrostates() or g_bwd.set_terminal_states()
g_bwd.compute_absorption_probabilities()
g_bwd.compute_lineage_drivers()
# compute terminal states, same as above, but with `backward=False`
vk = cr.tl.kernels.VelocityKernel(adata, backward=False).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=False).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_fwd = cr.tl.estimators.GPCCA(k)
g_fwd.compute_schur()
g_fwd.compute_macrostates()
g_fwd.compute_terminal_states() # or g_fwd.set_terminal_states_from_macrostates() or g_fwd.set_terminal_states()
g_fwd.compute_absorption_probabilities()
g_fwd.compute_lineage_drivers()
@Marius1311 @WeilerP and I will be working on a set of updated tutorials where we better properly explained to avoid any confusion.
Hello @michalk8
It's so kind of you to give such a detailed explanation. Now I'm clear about the workflow for computing the initial states.
But, sorry for asking a lot, I'm just a little curious about the functions estimator._compute_initial_states()
and estimator._set_initial_states_from_macrostates()
. Does it mean that we should never use these 2 functions because you mentioned them in https://github.com/theislab/cellrank/issues/787 but we abort them now?
Thanks!
Best,
YJ
Hello @michalk8
I ran the codes as you recommended, but it got errors RuntimeError: No stationary distribution found in .eigendecomposition['stationary_dist']
when ran g_bwd.compute_lineage_drivers(lineages="Ngn3 low EP", return_drivers=True)
. Could you please help me to check this?
Thanks!
Best,
YJ
vk = cr.tl.kernels.VelocityKernel(adata, backward=True).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=True).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_bwd = cr.tl.estimators.GPCCA(k)
g_bwd.compute_schur(n_components=20, method='brandts')
g_bwd.compute_macrostates(n_states=3, n_cells=30, cluster_key="clusters")
g_bwd.compute_terminal_states()
g_bwd.compute_absorption_probabilities(n_jobs=16)
adata.obs['initial_states']
index
AAACCTGAGAGGGATA-1-3 NaN
AAACCTGAGGCAATTA-1-3 NaN
AAACCTGGTAAGTGGC-1-3 NaN
AAACCTGTCCCTCTTT-1-3 NaN
AAACGGGAGTAGCGGT-1-3 NaN
...
TTTGGTTTCCTTTCGG-1-3 NaN
TTTGTCAAGAATGTGT-1-3 NaN
TTTGTCAAGTGACATA-1-3 NaN
TTTGTCATCGAATGCT-1-3 NaN
TTTGTCATCTGTTTGT-1-3 NaN
Name: initial_states, Length: 2531, dtype: category
Categories (1, object): ['Ngn3 low EP']
g_bwd.compute_lineage_drivers(lineages="Ngn3 low EP", return_drivers=True)
WARNING: There is only 1 lineage present. Using stationary distribution instead
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_22804/1505291402.py in <module>
1 # 11. Find lineage driver genes
2 # In advance, terminal states should be identified, and estimator.compute_absorption_probabilities should be run. Then we can run estimator.compute_lineage_drivers(), which only calculate lineage driver genes for terminal states.
----> 3 g_bwd.compute_lineage_drivers(lineages="Ngn3 low EP", return_drivers=True) # if use kernal and estimator, calculate lineage driver genes by estimator.compute_lineage_drivers(), not cr.tl.lineage_drivers()
~\anaconda3\envs\HYJ_py38\lib\site-packages\cellrank\tl\estimators\mixins\_lineage_drivers.py in compute_lineage_drivers(self, lineages, method, cluster_key, clusters, layer, use_raw, confidence_level, n_perms, seed, **kwargs)
128 stat_dist = self.eigendecomposition.get("stationary_dist", None)
129 if stat_dist is None:
--> 130 raise RuntimeError(
131 "No stationary distribution found in `.eigendecomposition['stationary_dist']`."
132 )
RuntimeError: No stationary distribution found in `.eigendecomposition['stationary_dist']`.
But, sorry for asking a lot, I'm just a little curious about the functions estimator._compute_initial_states() and estimator._set_initial_states_from_macrostates(). Does it mean that we should never use these 2 functions because you mentioned them in https://github.com/theislab/cellrank/issues/787 but we abort them now?
They can be used, but we didn't anticipate people using them, so on the estimator level, there's now nice way of computing absorption probabilities towards them (since it doesn't set the correct attributes). However, this can be changed if you want, just open a new issue for this this and we can discuss with others as well.
I ran the codes as you recommended, but it got errors RuntimeError: No stationary distribution found in .eigendecomposition['stationary_dist'] when ran g_bwd.compute_lineage_drivers(lineages="Ngn3 low EP", return_drivers=True). Could you please help me to check this?
In the above examples, since you only have 1 initial state Ngn3 low EP
, the absorption probabilities towards this state would be a vector of all 1s
. To calculate the potential drivers, we correlate the absorption probabilities and gene expression, but for a constant vector, it's undefined.
What we do instead for 1 initial/terminal state
is that we correlate stationary distribution
with gene expression. The compute_schur
should also compute it (you can try accessing g_bwd._gpcca.stationary_probability
), but as far as I know we don't set it anywhere else (g_bwd.compute_eigendecomposition()
sets it correctly, so for now, running this after g_bwd.compute_schur()
should make the example runnable). Will fix this so that .compute_schur()
also properly sets this in g_bwd.eigendecomposition['stationary_dist']
.
@michalk8, in case there's only one terminal/initial state with all-1 fate probs, it simply does not make sense to ask for driver genes. In that case, I would just raise an exception, rather than using the stationary distribution. It's unexpected behaviour.
A nice error message is more informative in this case I think.
Hello @michalk8 @Marius1311 Thanks for all the guidance! Now, I can reproduce the results of high-level API by using low-level API. Based on my practice, because CytoTRACE works better than scVelo in our case, we use low-level API more often than the high-level. The most beautiful thing is that, by using low-level API, we can also create our own kernel (Monocle pseudotime, DPT pseudotime) to draw figures. I think it will be fantastic if the new tutorial could introduce both the ways to set initial states and terminal states (automatically and manually) to all the audience. I appreciate all the help! Thank you again! Best, YJ
Hello @michalk8 @Marius1311 Sorry, I reopened this issue because I cannot draw the gene trends of initial states.
adata = cr.datasets.pancreas_preprocessed()
# compute initial states and lineage genes
vk = cr.tl.kernels.VelocityKernel(adata, backward=True).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=True).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_bwd = cr.tl.estimators.GPCCA(k)
g_bwd.compute_schur(n_components=20, method='brandts')
g_bwd.compute_eigendecomposition()
g_bwd.compute_macrostates(n_states=3, n_cells=30, cluster_key="clusters")
g_bwd.compute_terminal_states()
g_bwd.compute_absorption_probabilities(n_jobs=16)
adata.obs['initial_states']
g_bwd.compute_lineage_drivers(lineages="Ngn3 low EP", return_drivers=True)
model_initial = cr.ul.models.GAM(adata) # save for drawing lineage genes of initial states
# It's a right way to run cr.pl.gene_trends() and cr.pl.heatmap() here for lineage genes of initial states,
# but these 2 functions requires latent_time for time_key argument, which can only be recovered after computing terminal states.
# compute terminal states and lineage genes
vk = cr.tl.kernels.VelocityKernel(adata, backward=False).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=False).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_fwd = cr.tl.estimators.GPCCA(k)
g_fwd.compute_schur(n_components=20, method='brandts')
g_fwd.compute_macrostates(n_states=3, n_cells=30, cluster_key="clusters")
g_fwd.compute_terminal_states()
g_fwd.compute_absorption_probabilities(n_jobs=16)
adata.obs['terminal_states']
g_fwd.compute_lineage_drivers(lineages="Alpha", return_drivers=True)
model_terminal = cr.ul.models.GAM(adata) # save for drawing lineage genes of terminal states
# recover latent time with initial_states_probs and terminal_states_probs
scv.tl.recover_latent_time(adata, vkey='velocity', root_key="initial_states_probs", end_key="terminal_states_probs")
# after recovering latent time, I can run cr.pl.gene_trends() and cr.pl.heatmap(), but they only work for g_fwd-proceeded adata.
However, when I input model=model_initial
in cr.pl.gene_trends()
, it still uses the adata proceeded by g_fwd
. I know cr.pl.gene_trends()
and cr.pl.heatmap()
for initial states should work if I run them right after model_initial = cr.ul.models.GAM(adata)
, but these 2 functions needs the argument time_key="latent_time"
, which should be computed after computing both initial states and terminal states.
cr.pl.gene_trends(adata, model=model_initial, data_key="X", genes=["Pak3"], time_key="latent_time", same_plot=True, hide_cells=True, n_test_points=200)
cr.pl.gene_trends(adata, model=model_terminal, data_key="X", genes=["Pak3"], time_key="latent_time", same_plot=True, hide_cells=True, n_test_points=200)
cr.pl.heatmap(adata, model=model_initial, genes=adata.varm['initial_lineage_drivers']["Ngn3 low EP_corr"].sort_values(ascending=False).index[:50], lineages="Ngn3 low EP", time_key="latent_time", show_absorption_probabilities=True, show_all_genes=True, n_jobs=1, backend="loky")
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21448/3821477692.py in <module>
1 # 14. visualize the lineage drivers in a flow heatmap
2 # use Alpha terminal lineage for example, we smooth gene expression for the putative Alpha-drivers in pseudotime, using cell-level weights the Alpha fate probabilities
----> 3 cr.pl.heatmap(adata, model=model_initial, genes=adata.varm['initial_lineage_drivers']["Ngn3 low EP_corr"].sort_values(ascending=False).index[:50], lineages="Ngn3 low EP",
4 time_key="latent_time", show_absorption_probabilities=True, show_all_genes=True, n_jobs=1, backend="loky") # too many n_jobs kills this step
~\anaconda3\envs\HYJ_py38\lib\site-packages\cellrank\ul\_utils.py in _genesymbols(wrapped, instance, args, kwargs)
284 use_raw=kwargs.get("use_raw", False),
285 ):
--> 286 return wrapped(*args, **kwargs)
~\anaconda3\envs\HYJ_py38\lib\site-packages\cellrank\pl\_heatmap.py in heatmap(adata, model, genes, lineages, backward, mode, time_key, time_range, callback, cluster_key, show_absorption_probabilities, cluster_genes, keep_gene_order, scale, n_convolve, show_all_genes, cbar, lineage_height, fontsize, xlabel, cmap, dendrogram, return_genes, return_models, n_jobs, backend, show_progress_bar, figsize, dpi, save, **kwargs)
520 lineages = _unique_order_preserving(lineages)
521
--> 522 _ = adata.obsm[lineage_key][lineages]
523
524 if cluster_key is not None:
~\anaconda3\envs\HYJ_py38\lib\site-packages\cellrank\tl\_lineage.py in __getitem__(self, item)
365 item = item[::-1]
366
--> 367 obj = self.__getitem(item)
368
369 return obj.T if was_transposed else obj
~\anaconda3\envs\HYJ_py38\lib\site-packages\cellrank\tl\_lineage.py in __getitem(self, item)
435 return self._mixer(slice(None, None, None), item)
436 elif any(map(lambda i: isinstance(i, str), item)):
--> 437 item = (slice(None, None, None), self._maybe_convert_names(item))
438 col = item[1]
439
~\anaconda3\envs\HYJ_py38\lib\site-packages\cellrank\tl\_lineage.py in _maybe_convert_names(self, names, is_singleton, default, make_unique)
576 name = default
577 else:
--> 578 raise KeyError(
579 f"Invalid lineage name `{name}`. "
580 f"Valid names are: `{list(self.names)}`."
KeyError: "Invalid lineage name `Ngn3 low EP`. Valid names are: `['Epsilon', 'Alpha', 'Beta']`."
Hello @Marius1311 @michalk8
I found that there is a backward=True
argument in cr.pl.gene_trends()
and cr.pl.heatmap()
for ploting inistial states.
However, for computing terminal state heatmap, the result is not consistent with the CellRank basics
tutorial.
adata = cr.datasets.pancreas_preprocessed()
# compute initial states and lineage genes
vk = cr.tl.kernels.VelocityKernel(adata, backward=True).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=True).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_bwd = cr.tl.estimators.GPCCA(k)
g_bwd.compute_schur(n_components=20, method='brandts')
g_bwd.compute_eigendecomposition()
g_bwd.compute_macrostates(n_states=3, n_cells=30, cluster_key="clusters")
g_bwd.compute_terminal_states()
g_bwd.compute_absorption_probabilities(n_jobs=16)
adata.obs['initial_states']
g_bwd.compute_lineage_drivers(lineages="Ngn3 low EP", return_drivers=True)
# compute terminal states and lineage genes
vk = cr.tl.kernels.VelocityKernel(adata, backward=False).compute_transition_matrix()
ck = cr.tl.kernels.ConnectivityKernel(adata, backward=False).compute_transition_matrix()
k = (0.8 * vk + 0.2 * ck).compute_transition_matrix()
g_fwd = cr.tl.estimators.GPCCA(k)
g_fwd.compute_schur(n_components=20, method='brandts')
g_fwd.compute_macrostates(n_states=3, n_cells=30, cluster_key="clusters")
g_fwd.compute_terminal_states()
g_fwd.compute_absorption_probabilities(n_jobs=16)
adata.obs['terminal_states']
g_fwd.compute_lineage_drivers(lineages="Alpha", return_drivers=True)
# recover latent time with initial_states_probs and terminal_states_probs
scv.tl.recover_latent_time(adata, vkey='velocity', root_key="initial_states_probs", end_key="terminal_states_probs")
model = cr.ul.models.GAM(adata)
cr.pl.gene_trends(adata, model, data_key="X", genes=["Pak3", "Neurog3", "Ghrl"], time_key="latent_time", same_plot=True, hide_cells=True, n_test_points=200, backward=True) # for initial state, it works.
cr.pl.gene_trends(adata, model, data_key="X", genes=["Pak3", "Neurog3", "Ghrl"], time_key="latent_time", same_plot=True, hide_cells=True, n_test_points=200, backward=False) # result is not consistent with the 'CellRank basics` tutorial
cr.pl.heatmap(adata, model, genes=adata.varm['terminal_lineage_drivers']["Alpha_corr"].sort_values(ascending=False).index[:100], lineages="Alpha", time_key="latent_time", show_absorption_probabilities=True, show_all_genes=False, n_jobs=1, backend="loky", backward=False) # result is not consistent with the 'CellRank basics` tutorial
Whereas, in the 'CellRank basics` tutorial, it looks like
Hi @hyjforesight ,
this can be because in the basic tutorial, we are using the raw, unprocessed data and in the Kernels and estimators
, we use the saved data.
Although they should be same, maybe we either changed some parameters in the notebook so his no longer applied or it's because of the updates to scvelo/scanpy
.
Alternatively, there might be a different default values between the high/low level functions, but as far as I checked, they are the same.
Hi @hyjforesight, thanks for making us aware of this. As we're planning to deprecate the high-level API in version 2.0 and re-work most of the tutorials, I'll close this issue, but add it to my list of issues that we need to address when re-doing the tutorials.
hello @michalk8 @Marius1311 Sorry, I've been working on some wet experiments. Thanks for the response. Looking forward to CellRank v2.0! Awesome!
Hi @michalk8 and @Marius1311
I found that the heat map inconsistency is caused by n_states=3
of g_bwd.compute_macrostates(n_states=3, n_cells=30, cluster_key="clusters")
, which means I expected that there will be 3 initial states. When I changed it to n_states=1
, problem was solved.
Hello CellRank, I drew the directed PAGA plots on the results of tutorial [CellRank basics] and tutorial [Kernels and estimators]. Although their absorption probabilities plots are the same, their directed PAGA results are inconsistent. For detailed codes. Please check my jupyter notebooks here (https://github.com/hyjforesight/CellRank)
The same absorption probabilities plots between 2 tutorials:
Codes for tutorial [CellRank basics]
Codes for tutorial [Kernels and estimators]. To draw PAGA for this tuturial, I have to run
g.compute_terminal_states()
andg._compute_initial_states()
firstWhat causes this inconsistency? Thanks! Best, YJ