theislab / cellrank

CellRank: dynamics from multi-view single-cell data
https://cellrank.org
BSD 3-Clause "New" or "Revised" License

Heatmap expression cascade enriched for genes peaking in most terminal state of the lineage #1127

Closed: reneemoerkens closed this issue 1 year ago

reneemoerkens commented 1 year ago

Hello and thanks for this great package and the very helpful tutorials!

I was generating heatmap expression cascade plots to visualize temporal activation of genes along trajectories. I noticed that for quite a few of my lineages, the plot is enriched for genes that peak in the terminal state of the lineage, instead of identifying genes that peak at the different stages. Like the plot here: [heatmap screenshot]

By accident, I plotted a driver gene list belonging to Lineage A on Lineage B in this plot, and it actually looked more informative than the correct plot above. [heatmap screenshot]

This made me wonder: are my lineages assigned in an incorrect manner? In short, I computed macrostates (g.compute_macrostates), then set terminal states (g.set_terminal_states); I did not explicitly set initial states, and then computed fate probabilities (g.compute_fate_probabilities). I didn't use data_key="magic_imputed_data" when plotting the heatmap, as was done in the tutorial. And how can I verify which cells are in each lineage?

Perhaps you encountered something like this before and have some feedback on how to improve it.

Thank you!

Marius1311 commented 1 year ago

Hi @reneemoerkens, this depends on the genes you selected to plot your heatmap. Please post the code snippet where you select genes and plot the actual heatmap.

As to your question of how to verify which cells are in each lineage: you can simply look at the fate probabilities you computed; these determine the weight given to each cell when plotting lineage-specific expression trends, e.g. in heatmaps. We don't threshold and explicitly assign cells to lineages. Rather, we assume gradual lineage commitment of cells to fates and use the fate probabilities as cell-level weights when fitting GAMs to visualize expression trends. You can read more about this in the methods section of the CellRank 1 paper: https://www.nature.com/articles/s41592-021-01346-6
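
This soft weighting can be sketched on toy data. Everything below is made-up illustration, not the CellRank API: a weighted polynomial fit stands in for CellRank's GAMs, and the fate probabilities are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 cells with a pseudotime and one gene whose expression
# follows a smooth trend plus noise. All values are made up.
n_cells = 200
pseudotime = rng.uniform(0, 1, n_cells)
expression = np.sin(np.pi * pseudotime) + rng.normal(0, 0.1, n_cells)

# Hypothetical fate probabilities toward one lineage: cells later in
# pseudotime are more committed, but no cell is excluded outright.
fate_prob = np.clip(pseudotime + rng.normal(0, 0.1, n_cells), 0.0, 1.0)

# Soft weighting: every cell contributes to the fit, scaled by its
# fate probability -- there is no hard assignment of cells to lineages.
coeffs = np.polyfit(pseudotime, expression, deg=3, w=fate_prob)
trend = np.polyval(coeffs, np.sort(pseudotime))
```

Cells with low fate probability still influence the trend, just weakly, which is the "gradual commitment" assumption in miniature.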

reneemoerkens commented 1 year ago

Hey @Marius1311, thank you for your answer. Here is the code snippet I used:

driver_genes = g.compute_lineage_drivers(
    lineages=["Enterocyte type 1"], cluster_key="annotation_res0.34_new"
)

model = cr.models.GAM(adata, n_knots=6)

cr.pl.heatmap(
    adata,
    model=model,  # use the model from before
    lineages="Enterocyte type 1",
    cluster_key="annotation_res0.34_new",
    show_fate_probabilities=True,
    genes=driver_genes.head(60).index,
    time_key="latent_time",
    figsize=(12, 10),
    show_all_genes=True,
    weight_threshold=(1e-3, 1e-3),
)

The genes I selected are the first 60 driver genes from the list (sorted on the _corr column, highest first) returned by the g.compute_lineage_drivers function.

Thanks for referring to the paper for more information on the lineages, I will have a look. If I understand correctly, the fate probabilities are eventually used to select a subset of cells that are displayed in the heatmap expression cascade plots, as depicted above. Right?

Marius1311 commented 1 year ago

Hi @reneemoerkens, I'll go through this step by step:

  1. Thanks for posting the code snippet. If you're interested in genes peaking earlier (before the terminal cell population), try excluding the terminal cell cluster when computing driver genes. This can be done by explicitly passing the clusters to be used via the clusters argument of g.compute_lineage_drivers, see https://cellrank.readthedocs.io/en/latest/api/_autosummary/estimators/cellrank.estimators.GPCCA.html#cellrank.estimators.GPCCA.compute_lineage_drivers
  2. I think it would be good to verify that the gene fits you get actually look good. The noise model we use when fitting is Gaussian, i.e. we assume that you pass imputed data, as we show in the tutorials. If you pass non-imputed data, your noise distribution might be very far from Gaussian, and the gene fits might actually look quite bad. In our gene trend plotting tutorial, we show how individual gene expression trends can be inspected: https://cellrank.readthedocs.io/en/latest/notebooks/tutorials/estimators/800_gene_trends.html. I would also encourage you to impute your data before plotting gene expression trends (e.g. using MAGIC or simple k-NN imputation such as scVelo moments), or to use another tool like tradeSeq (https://www.bioconductor.org/packages/release/bioc/html/tradeSeq.html), which supports negative binomial noise distributions. That won't be directly compatible with CellRank as it stands, though, and it will be a bit difficult to get it to run based on CellRank fate probabilities.
  3. We don't use fate probabilities to select discrete sets of cells. As I explained in my previous response, we use a soft weighting scheme, which we believe is biologically more plausible given the gradual nature of cellular fate commitment. Please check the paper.

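
The effect of excluding the terminal cluster (point 1 above) can be sketched on toy data. All names and numbers below are made up; CellRank's actual driver computation correlates expression with fate probabilities, which is what this imitates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 300

# Toy setup: 200 "progenitor" cells and 100 "terminal" cells, with
# hypothetical fate probabilities toward the terminal state.
clusters = np.array(["progenitor"] * 200 + ["terminal"] * 100)
fate_prob = np.concatenate(
    [rng.uniform(0.0, 0.6, 200), rng.uniform(0.8, 1.0, 100)]
)

# gene_early rises along the progenitor trajectory;
# gene_late switches on only in the terminal cells.
gene_early = np.where(clusters == "progenitor", fate_prob, 0.1)
gene_early = gene_early + rng.normal(0, 0.05, n_cells)
gene_late = np.where(clusters == "terminal", 1.0, 0.0)
gene_late = gene_late + rng.normal(0, 0.05, n_cells)

def driver_corr(gene, mask):
    """Correlate expression with fate probability over the masked cells."""
    return np.corrcoef(gene[mask], fate_prob[mask])[0, 1]

all_cells = np.ones(n_cells, dtype=bool)
no_terminal = clusters != "terminal"

# With every cell included, the terminal-peaking gene dominates the
# ranking; excluding the terminal cluster recovers the earlier gene.
print(driver_corr(gene_late, all_cells), driver_corr(gene_late, no_terminal))
print(driver_corr(gene_early, all_cells), driver_corr(gene_early, no_terminal))
```

Restricting the correlation to non-terminal cells is the toy analogue of passing a subset of clusters to compute_lineage_drivers.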
reneemoerkens commented 1 year ago

Hey @Marius1311,

Thank you again for your quick response.

  1. Good suggestion, I will try this in my data!
  2. Very good point, I indeed didn't use data_key="magic_imputed_data", as was done in the tutorial. In my workflow, I already ran the scv.pp.moments function. Could I somehow use this as the data_key input to generate the plot? I could also try running MAGIC, but it seems to require a Palantir pseudotime as well (https://github.com/theislab/cellrank_reproducibility/blob/master/notebooks/fig_5_benchmarking/palantir/ML_2021-10-26_palantir.ipynb). Would there be a preference for using scVelo moments and latent_time (from dynamical modeling in scVelo), versus Palantir pseudotime and MAGIC imputation?
  3. Thanks for the further explanation.

Marius1311 commented 1 year ago

Hi @reneemoerkens, Re your second point:

If you already used moments imputation, this should be stored as Ms in adata.layers (this corresponds to imputed spliced counts; Mu stores imputed unspliced counts). You should be able to use this data by passing data_key='Ms'. However, I encourage you to inspect and validate this imputation beforehand, e.g. by plotting a few genes from the imputed modality in a low-dimensional representation and checking whether you see the expected patterns, or by looking at marker genes for your clusters in the imputed modality. Basically, just make sure you're happy with your imputed data before you proceed! MAGIC does not require a Palantir pseudotime; the two are just run in the same notebook (requiring it also would not make sense, as MAGIC was introduced well before Palantir!)
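
What moments-style k-NN imputation does can be sketched on toy 1-D data. This is only an illustration of the idea: scv.pp.moments averages over a k-NN graph built in PCA space, whereas here the neighbors are simply the nearest cells along a simulated trajectory.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 100 cells ordered along a 1-D trajectory; raw expression
# is a noisy sample of a smooth underlying trend.
n_cells, k = 100, 15
position = np.sort(rng.uniform(0, 1, n_cells))
true_trend = np.exp(-((position - 0.5) ** 2) / 0.02)
raw = true_trend + rng.normal(0, 0.3, n_cells)

# k-NN smoothing: replace each cell's value by the mean over its
# k nearest neighbors along the trajectory (self included).
dists = np.abs(position[:, None] - position[None, :])
neighbors = np.argsort(dists, axis=1)[:, :k]
imputed = raw[neighbors].mean(axis=1)

# The imputed profile tracks the underlying trend more closely than
# the raw values, which is why Gaussian GAM fits look better on it.
print(np.abs(raw - true_trend).mean(), np.abs(imputed - true_trend).mean())
```

Validating the real Ms layer amounts to the same check: the smoothed values should follow the patterns you expect for known genes.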

Re choosing between scVelo latent time & moments vs. Palantir pseudotime & MAGIC-imputed data: you can mix and match these, e.g. use scVelo latent time with MAGIC-imputed data. Generally, I would recommend just comparing them, using whatever prior knowledge you have about your data, and then making a choice. They're all good methods and work well in general; however, they do of course have specific assumptions that may or may not be met in your data. I recommend checking e.g. the assumptions behind RNA velocity and carefully making sure they hold for your data.

reneemoerkens commented 1 year ago

Thank you so much for your extensive explanation, I will apply your suggestions and inspect the imputation in my dataset. This will definitely help me further in the analysis.

Marius1311 commented 1 year ago

Great, thanks for the feedback, happy to help!