theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License
415 stars 102 forks source link

Understanding the phase portrait of genes better #210

Closed amunzur closed 3 years ago

amunzur commented 4 years ago

Hi Volker,

Thank you for a great package!

My question is regarding the phase portraits of genes, I attached a plot below. I often get plots like this where the spliced counts are zero for the gene, hence the vertical line in the plot. Sometimes the same thing happens for the unspliced counts as well. How should I interpret this? Is this just an artifact of my data? Also, this plot looks odd in general with all the lines coming out of it.

What do you think?

Thank you for your help!

image

VolkerBergen commented 4 years ago

This along with the tutorials might help to better understand how to interpret gene dynamics via phase portraits. Let me know, if you have any specific questions.

The phase portraits you've sent is not very informative b/c of low count numbers. Genes that have no/low count number in either spliced or unspliced mRNA don't contribute to the projected velocities - these are generally filtered out. That appears in genes whose splicing kinetics cannot interpreted with the available data - e.g. b/c it is not observed sufficiently along the developmental timecourse.

The 'stripes' are an artifact of imputation/moments when count numbers are very low. You can display the raw counts by setting use_raw=True in pl.scatter.

mimi3421 commented 4 years ago

@VolkerBergen Dear author, I have a similar question about this. As dropout effect is common in single cell data, I think there may be some information lost simply ignoring the zeros. While method like MNN in expression data infer the real data from neighbor non-zeros, can these zeros in un/s data be predicted by their neighbors in expression or un/s space and then calculate velocity basing on the imputated data? Thanks,

amunzur commented 4 years ago

I understand how genes with no spliced or no unspliced counts wouldn't contribute to projected velocities since we need both. However, the gene that I sent you was one of the top ranked velocity genes. I ran scv.tl.rank_velocity_genes first and then visualized the results. Why do you think this gene would appear as a top ranked gene even though it has such low counts? Maybe do you think applying more strict quality control measures would help filter out these genes?

VolkerBergen commented 4 years ago

@mimi3421 It internally does exactly what you just described. Imputing based on knn and then computing velocity on imputed data. This corresponds to the first-order moments in pp.moments.

@amunzur: scv.tl.rank_velocity_genes runs a differential test on velocity expression, thus it can also yield such low-count genes (in your example the blue cluster would be differentially expressed in velocity compared to the rest). Another approach is running scv.tl.rank_dynamical_genes, which order genes by their likelihood (obtained from the dynamical model) in particular clusters/regimes. On another note, have you filtered genes upfront with e.g. scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)?

amunzur commented 4 years ago

So then I guess scv.tl.rank_velocity_genes only finds genes with different dynamics, regardless of whether they contribute to overall velocity or not. However the following code chunk should give the genes that contribute most to the velocity fields, right?

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False)

I still get some genes with no spliced or unspliced counts when I run the chunk above, or use scv.tl.rank_dynamical_genes, is that normal? Also, instead of using scv.pp.filter_and_normalize, I was doing all my filtering and normalization using Scanpy, maybe this also causes some confusion.

WeilerP commented 3 years ago

So then I guess scv.tl.rank_velocity_genes only finds genes with different dynamics, regardless of whether they contribute to overall velocity or not. However the following code chunk should give the genes that contribute most to the velocity fields, right?

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False)

Yes, that is correct.

I still get some genes with no spliced or unspliced counts when I run the chunk above, or use scv.tl.rank_dynamical_genes, is that normal?

@amunzur, you should filter genes prior to velocity analysis, i.e. use

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)

Doing so, top likelihood genes should always include unspliced/spliced counts.

Also, instead of using scv.pp.filter_and_normalize, I was doing all my filtering and normalization using Scanpy, maybe this also causes some confusion.

scv.pp.filter_and_normalize combines the typical preprocessing steps in a single function. In general, this matches a concatenation of functions from scanpy up to some nuances.