theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License
409 stars 102 forks source link

scvelo invert direction issue #112

Closed mgood2 closed 3 years ago

mgood2 commented 4 years ago

After using scvelo on our data, I got the reverse direction trajectory compared to the other trajectory methods. Does it make sense to use -1 * (speed vector)?

VolkerBergen commented 4 years ago

That usually traces back to something went wrong in one or the other pipeline. While models are conceptually different, they should not result in fundamentally different directionality. What exactly did you compare? Dynamical vs. steady state? or scvelo vs. velocyto? Would be keen to get some figures + details here or via email (feedback@scvelo.org).

jenzopr commented 4 years ago

I have the same issue. My (biologist) collaborator notified me that directions are inverted, compared to what he expects from his expert knowledge..

What kind of details do you need? The adata object, details on the preprocessing, etc.?

VolkerBergen commented 4 years ago

@jenzopr If you get these result with all models

scv.tl.velocity(adata, mode='steady_state')
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity(adata, mode='dynamical')

the directionality seems to be clearly supported by the splicing kinetics. Then you would want to find out how strong the splicing information is and whether the directionality is supported by marker genes; obtained by looking into phase portraits scv.pl.scatter(adata, basis=[list_of_genes]).

If you can share your file, that would speedup finding what triggers the directionality obtained in your data.

jenzopr commented 4 years ago

Nice, I will check all models and come back to you. Thanks @VolkerBergen

reginawongg commented 4 years ago

I'm also facing the same issue, where the directions are reversed based on what we expect from knowing the biology.

KaiyangZ commented 4 years ago

I also got the same issue, in some cases, I got the reverse direction in dynamic mode compared to stochastic mode:

scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)
scv.tl.umap(adata)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['sample'])

H190_umap_dynamic_stoc

and

scv.tl.recover_dynamics(adata, max_iter=200, return_model=True)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['sample'])

H190_umap_dynamic

VolkerBergen commented 4 years ago

Could you let me know the number of velocity_genes being used (which could be problem with a rather low number of cells), and show the phase portrait of the top likelihood genes

print(adata.var['velocity_genes'].sum(), adata.n_vars)
top_genes = adata.var_names[adata.var.fit_likelihood.argsort()[::-1]]
scv.pl.scatter(adata, basis=top_genes[:10], ncols=5)
KaiyangZ commented 4 years ago

Hi,

Thanks for your prompt reply! there are 417 velocity_genes and only 96 cells, is the number of cells too low for the analysis? print(adata.var['velocity_genes'].sum(), adata.n_vars, adata.n_obs) 417 3000 96 The phase portrait of the top 10 likelihood genes: H190_top10_scatter

jenzopr commented 4 years ago

@jenzopr If you get these result with all models

scv.tl.velocity(adata, mode='steady_state')
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity(adata, mode='dynamical')

the directionality seems to be clearly supported by the splicing kinetics.

Hi Volker, all models result in the same directionality. When looking into phase portraits, I noticed that there are only 6 'velocity_genes'. Here are their phase portraits:

image

Is the number of velocity genes too low? What do you think?

VolkerBergen commented 4 years ago

@KaiyangZ Thanks for sharing. It's good to eventually see a use case where the dyn. model does not appear to be feasible. In most of your genes the dyn. model induces some state assignment (apparently mostly assigning to repression phases), which is not confidently assigned as it could equally well be in induction phase. That is due to the rather low cell numbers. I would have to incorporate that measure of 'state-confidence' (which is not directly reflected in the likelihoods) in the model or couple gene fits. For now, I must admit that you would have to apply the stochastic model which is insensitive to cell numbers and the mentioned drawbacks. I'll keep you up to date when we update the dyn. model to also account for some kind of confidence of overall state-assignments. [update for other readers: the steady-state ratios obtained from the stoch. and dyn. model should not deviate from each other as much and as systematically as in this case. That can indeed happen, if cell numbers are very low. If that is the case, please have a highly critical view on the resulting velocities and be cautious with interpretations.]

@jenzopr Only having 6 genes left is definitely too low. I'll add some exception to throw an error in that case. Gene are selected by an R-squared (or likelihood) threshold. You can adjust the threshold, e.g. scv.tl.velocity(adata, mode='steady_state', min_r2=1e-3). But I'd definitely have a closer look why you get such low R-squared values. After all, the splicing kinetics doesn't seem to be very informative in your case. Do you expect any dynamics at all?

jenzopr commented 4 years ago

@VolkerBergen thanks for your input. A warning message might be a good idea.

Yes we expect a lot of dynamics, but they might be hidden since we sampled cells from multiple timepoints. How would scvelo deal with cell populations that expand in the beginning (e.g. inflammatory processes), but return to normal at later stages? Splicing kinetics would be pointing in the opposite direction or form a circle.. Along this line, I tried to use the groups and groupby arguments to estimate kinetics for subsets of cells only. Is this intended usage?

KaiyangZ commented 4 years ago

Thanks a lot @VolkerBergen, I will use stoch. model, but how about the latent time then? if I understand correctly, it is estimated from the dyn. model, and what is the difference between velocity_pseudotime and latent time? Some explanations are really appreciate!

VolkerBergen commented 4 years ago

@KaiyangZ latent time is only available through the dyn. model. I'll provide explanations on latent time vs velocity pseudotime etc. in the docs within the next few days.

@jenzopr yes, groups and groupby arguments are intended to be used to estimate kinetics within subset of cells, e.g. per batch. I'll get back to you for the other question.

jonathan-f commented 4 years ago

Hi @VolkerBergen , I'm also facing an inversion of the direction of the velocities in the different models: I ran steady state, stochastic and dynamical model (with default parameters) on the same data and there seems to be a general agreement between the steady state and the dynamical, while the stochastic has the velocities inverted for many cells.

Screenshot 2019-11-28 at 18 09 18

I checked the velocity genes and they are exactly the same for steady state and stochastic (~400 genes) and they are a subset of the velocity genes that I get with the dynamical model.

I reasoned that the problem could reside in the steady state ratio estimate and, actually, looking at the scatter plots of the top 10 dynamics-driving genes (obtained from the dynamical model) in the 3 models (I'm showing one example below), the stochastic model gives a different estimate of the steady state ratio.

Screenshot 2019-11-28 at 18 15 23

Is there a way to assess the reliability of the steady state ratio computation (e.g., checking the computation of second order moments for the stochastic model)? Maybe the number of neighbours used in the computation of the moments is critical in this case in which the number of cells is pretty small (~100)?

Sorry for the long message and thanks!

Hsu-Che-Wei commented 4 years ago

Hello @VolkerBergen, thanks for the cool tool. I also found the RNA velocity in my model system is perfectly inverted in every model (steady, stochastic, and dynamic), and I am sure that the number of cells are high enough, each sample I have has at least 2K cells. In combined, I have over 40K. Below, I show you the u vs s plots of some genes, the blue cells are actually "stem cells/meristematic cells", the yellow and orange ones are "differentiated" cells. Do you have any explanation for why the latent time has to start from the bottom left corner? Why can't it be started at the up right corner? Cause in my case, it seems that the development is driven by degradation of spliced transcripts or repression. image

Also, do you know how to do the analysis in the case that the user have already the normalized counts value? You just skip the scv.pp.filter_and_normalize() step?

Thank you very much!

VolkerBergen commented 4 years ago

Thanks sharing these!

Latent time is rooted by the Markov diffusion process that is obtained from the velocity vector field; which for many genes is indeed in the origin; but it is not assumed to always start in there. In your cases it'd be rooted in the yellow clusters because it assigns the yellow-to-blue as induction phase. Here, it's hard for the dynamical model to pick up the right state assignment as there's hardly any curvature it can learn from. Maybe you could try decrease the number of neighbors in scv.pp.moments(adata, n_neighbors=10).

Overall it looks exactly like what you've already pointed out. It's mostly governed by degradation. Check out Suppl. Fig. 8b right phase portrait in the manuscript (https://doi.org/10.1101/820936); that nicely resembles the kinetics in your data. I'll see if I can quickly add some user-defined prior to account for predominant degradation.

VolkerBergen commented 4 years ago

scv.pp.filter_and_normalize does not touch your adata.X if it is preprocessed already. It then only goes after spliced and unspliced counts. If you set scv.settings.verbosity=3 (default) you should get some warning messages that it recognizes .X as being already preprocessed.

VolkerBergen commented 4 years ago

@jonathan-f Pardon, I've completely forgotten you. From only 100 cells is pretty hard to pick up any signal. However, if others genes display similar patterns like that one, splicing kinetics seems to support directionality from blue-to-orange.

Could you check how the stoch. model behaves as you decrease the number of neighbors in scv.pp.moments(adata, n_neighbors=10). I think 30 neighbors on 100 cells for second-oder moments is way too much. That default value was selected when testing against datasets of 3k - 35k cells.

jranek commented 4 years ago

Hi @VolkerBergen , thank you for a great set of tools! I'd love to join in on the conversation, for I'm facing a similar issue.

We differentiated stem cells into definitive endoderm, collecting three time points, (blue = d0, orange = d1, green = d2). Contrary to what we'd expect, the vector field is showing d2 cells primarily moving towards d1 and then d0. In addition, the vector field generated through the dynamical model is quite different from the steady-state approach.

The dimensionality after preprocessing was 4555 cells x 2000 genes.

**edit I've decided to remove my previous plots to give a different visual of the same issue. Below are two plots of the top 10 likelihood genes, one colored according to sample and the other colored according to latent time. You'll notice that the d2 cells (green) have a latent time of ~0, whereas the d0 cells (blue) have a latent time of ~1.

image

image

MagpiePKU commented 4 years ago

Hi,

I would like to raise a similar issue in which the directionalities of velocity_stream and PAGA were inversed:

image image

A very strange one.

grimwoo commented 3 years ago

@VolkerBergen Hi, I am new to python and scVelo. And I have the same issue: all three modes in scv.tl.velocity(adata, mode=...) gave same "wrong" direction (from differentiated to stem cell). What else could I do? Could scVelo simply "reverse/invert" the direction of arrows? Look forword to your reply. Thanks!

zhanglab2008 commented 3 years ago

Hi team, thanks for developing this awesome tool. But I also experienced the same inverted direction issue. From my study, I am pretty sure what the correct direction is because I collected time points data from healthy to disease. I tried all the three models steady, stochastic and dynamics but the directions are consistently wrong. I really hope you can fix the issue. Thanks!

Ivalyne commented 3 years ago

Thanks for the great toolkit.

As @VolkerBergen mentioned last year, we could get the top likelihood genes by using the code as below.

print(adata.var['velocity_genes'].sum(), adata.n_vars)
top_genes = adata.var_names[adata.var.fit_likelihood.argsort()[::-1]]
scv.pl.scatter(adata, basis=top_genes[:10], ncols=5)

However, in the latest version, there has some changes about this. Nowadays, when we want to get the top likelihood genes, you can get the genes by using this code.

top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index

As you know, this code is written in the tutorials.

This comment is just for somebody who read this issue and stack at the step of retrieving the top genes, like me.

yunkaizhang commented 3 years ago

Hi,

I would like to raise a similar issue in which the directionalities of velocity_stream and PAGA were inversed:

image image

A very strange one.

Hi there, Wondering if you have fix this problem? Can you also look into the latent_time or velocity_pseudotime for your dataset? Wondering these may be similar to PAGA (and inverted to velocity_stream) since PAGA are based on these.

WeilerP commented 3 years ago

@jranek, your phase portraits do not include any curvature s.t. the model cannot be fitted confidently. See e.g. #462 or #216.

WeilerP commented 3 years ago

@MagpiePKU, @yunkaizhang, @WT215 #456 was recently opened. Let's discuss the issue there since it is not related to the original issuer here.

paulitikka commented 2 years ago

Hi,

I have solved this. Use the Scvelo calculated UMAP and PCA values, and not the ones from coming from Seurat. Like: ...

...after the import of data and the libraries:

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000) scv.pp.log1p(adata)

this below line is important!! no old umaps and pcas

sc.pl.umap(adata, frameon=False, legend_loc='on data', title='', save='a10222_celltypes_ok.pdf') sc.tl.pca(adata) sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30) scv.pp.moments(adata, n_pcs=None, n_neighbors=None)

And then...

scv.tl.recover_dynamics(adata, n_jobs=8) #separately... (!!) takes around 4 m: 15:15-18 scv.tl.velocity(adata, mode="dynamical") #separately as a oneliner, not in the 'cell'... scv.tl.velocity_graph(adata) #separately... and the note in the next line the basis as 'umap' and not 'X_umap' scv.pl.velocity_embedding_stream(adata, basis="umap", legend_fontsize=12, title="", smooth=0.8, min_mass=4, save='dyn_rankaas_tikka10222_vN.png')

B.r., Pauli :)

stefanpeidli commented 2 years ago

This preprint from 2 days ago by Pachterlab mentions the issue of inverted velocities: https://www.biorxiv.org/content/10.1101/2022.02.12.480214v1

mayurdoke6 commented 1 year ago

Hi,

I have solved this. Use the Scvelo calculated UMAP and PCA values, and not the ones from coming from Seurat. Like: ...

...after the import of data and the libraries: scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000) scv.pp.log1p(adata) #this below line is important!! no old umaps and pcas sc.pl.umap(adata, frameon=False, legend_loc='on data', title='', save='a10222_celltypes_ok.pdf') sc.tl.pca(adata) sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30) scv.pp.moments(adata, n_pcs=None, n_neighbors=None)

And then... scv.tl.recover_dynamics(adata, n_jobs=8) #separately... (!!) takes around 4 m: 15:15-18 scv.tl.velocity(adata, mode="dynamical") #separately as a oneliner, not in the 'cell'... scv.tl.velocity_graph(adata) #separately... and the note in the next line the basis as 'umap' and not 'X_umap' scv.pl.velocity_embedding_stream(adata, basis="umap", legend_fontsize=12, title="", smooth=0.8, min_mass=4, save='dyn_rankaas_tikka10222_vN.png')

B.r., Pauli :)

Hello, @paulitikka Thank you for your input on this issue. I have tried to use your code in my analysis. It did not help me rectify my issue with the velocity embedding graph generated with dynamical scvelo mode analysis. It shows the opposite of the stochastic mode of scvelo. I find stochastic mode is better at analyzing RNA velocity than the dynamical, especially in my data. I may be wrong but I am not sure. I am a little bit perplexed to explain this issue to reviewers for publication. please let me know your view on this.

Regards,

MAD.

Matthew1309 commented 1 year ago

I just found a paper where the authors ran into the issue of inverted pseudotime directionality, identified culprit genes, and removed them in order to resolve the inversion. Hopefully this is relevant/helpful. Paper: Coordinated changes in gene expression kinetics underlie both mouse and human erythroid maturation.

WeilerP commented 1 year ago

I just found a paper where the authors ran into the issue of inverted pseudotime directionality, identified culprit genes, and removed them in order to resolve the inversion. Hopefully this is relevant/helpful. Paper: Coordinated changes in gene expression kinetics underlie both mouse and human erythroid maturation.

@Matthew1309, the issue is a conceptual one. As discussed in the paper, the genes exhibiting a transcriptional burst are highly relevant for the overall process. Thus, you cannot simply ignore them. The goal is not to make arrows look good on a UMAP but recover correct Biology, IMO. See also here.