theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License
411 stars 102 forks source link

pseudotime explanation #89

Closed ccshao closed 4 years ago

ccshao commented 5 years ago

Thanks for the very helpful package.

I am confused at the part of the pseudotime analysis in the tutorial of DG. What do the root cells and end cells mean exactly? And how to to deal with branches in complex datatsets? Some explanations are really appreciate!

VolkerBergen commented 5 years ago

The velocity graph determines a Markov diffusion process with transition probabilities. Terminal states (root and end points) are identified as source and sinks of the diffusion.

Velocity pseudotime is computed in analogy to DPT, with the distinction that transition probabilities are not obtained solely from similarities (neighbor graph), but from velocity-inferred directionality (velocity graph).

ccshao commented 5 years ago

Thanks very much for the quick reply! We are quite interested in the pseudotime and branches analysis from RNA velocity and would like to understand the results better.

root_cells, end points vlaue

In the tutorial, the pseudotime is obtained via

lineage = ['nIPC', 'Neuroblast', 'Granule immature', 'Granule mature', 'Cck-Tox', 'Mossy']
adata.obs['lineages'] = ['granule_lineage' if c in lineage else c for c in adata.obs.clusters]
scv.tl.terminal_states(adata, groupby='lineages')
scv.tl.velocity_pseudotime(adata, groupby='lineages')

which provides the running information

computing terminal states identified 1 root cells and 1 end points (Astrocytes) identified 1 root cells and 1 end points (Cajal Retzius) identified 1 root cells and 1 end points (Endothelial) identified 1 root cells and 1 end points (GABA) identified 1 root cells and 1 end points (Microglia) identified 1 root cells and 1 end points (OL) identified 1 root cells and 1 end points (OPC) identified 1 root cells and 1 end points (granule_lineage) finished (0:00:00) --> added 'root_cells', root cells of Markov diffusion process (adata.obs) 'end_points', end points of Markov diffusion process (adata.obs)

later we read the adata.obs to R


       lineages root_cells   end_points velocity_pseudotime
 1:      Astrocytes          1 1.432102e-03         0.000000000
 2:      Astrocytes          1 0.000000e+00         0.008993809
 3:      Astrocytes          1 7.233411e-03         0.019765943
 4:      Astrocytes          1 2.981058e-03         0.008488726
 5:   Cajal Retzius          1 7.247612e-02         0.560040236
 6:     Endothelial          1 6.794614e-03         0.149551302
 7:     Endothelial          1 6.794614e-03         0.149551287
 8:     Endothelial          1 0.000000e+00         0.000000000
 9:     Endothelial          1 6.794614e-03         0.149551108
10:            GABA          1 2.184830e-01         0.870551229
11:            GABA          1 3.368101e-01         0.850218058
12:       Microglia          1 0.000000e+00         0.167048752
13:       Microglia          1 2.200418e-02         0.247164428
14:              OL          1 0.000000e+00         0.000000000
15:             OPC          1 3.419406e-02         0.300515056
16:             OPC          1 0.000000e+00         0.000000000
17: granule_lineage          1 1.255349e-02         0.957588732
18: granule_lineage          1 1.091659e-08         0.068479121
19: granule_lineage          1 3.387960e-03         0.986178398

We are wondering, do the column of root_cells and end_points mean probility that they are true start point cells and end ponits cells, respectively?

Althought, it outputs

identified 1 root cells and 1 end points (Astrocytes)

Actually there are 4 cells with 1 in the column of root_cells, but only one cell with 0 in the end_points (cell number 2 in the above output), so cell number 2 is considered as the true root cell? Similar rules apply to end points?

And interestingly, the root cell does not have a 0 velocity_pseudotime, but the cell number 1 has. It is ture for the granule_lineage, a cell with root_cell == 1 and end_points == 6.679865e-10 has the 0 pseudotime.

branches from RNA velocity data

We got errors when playing with the n_branching parameter, (n_branchings=2 also failed) scv.tl.velocity_pseudotime(adata, groupby='lineages', n_branchings=1)

WARNING: detected group with only [0] cells WARNING: detected group with only [] cells


ValueError Traceback (most recent call last)

in ----> 1 scv.tl.velocity_pseudotime(adata, groupby='lineages', n_branchings=1) ~/miniconda3/lib/python3.7/site-packages/scvelo/tools/velocity_pseudotime.py in velocity_pseudotime(adata, vkey, groupby, groups, root, end, n_dcs, n_branchings, min_group_size, allow_kendall_tau_shift, use_velocity_field, save_diffmap, return_model) 147 vpt.pseudotime[np.isnan(dpt_root) & np.isnan(dpt_end)] = np.nan 148 --> 149 if n_branchings > 0: vpt.branchings_segments() 150 else: vpt.indices = vpt.pseudotime.argsort() 151 ~/miniconda3/lib/python3.7/site-packages/scanpy/tools/_dpt.py in branchings_segments(self) 187 for each segment. 188 """ --> 189 self.detect_branchings() 190 self.postprocess_segments() 191 self.set_segs_names() ~/miniconda3/lib/python3.7/site-packages/scanpy/tools/_dpt.py in detect_branchings(self) 262 segs_connects, 263 segs_undecided, --> 264 segs_adjacency, iseg, tips3) 265 # store as class members 266 self.segs = segs ~/miniconda3/lib/python3.7/site-packages/scanpy/tools/_dpt.py in detect_branching(self, segs, segs_tips, segs_connects, segs_undecided, segs_adjacency, iseg, tips3) 476 # branching on the segment, return the list ssegs of segments that 477 # are defined by splitting this segment --> 478 result = self._detect_branching(Dseg, tips3, seg) 479 ssegs, ssegs_tips, ssegs_adjacency, ssegs_connects, trunk = result 480 # map back to global indices ~/miniconda3/lib/python3.7/site-packages/scanpy/tools/_dpt.py in _detect_branching(self, Dseg, tips, seg_reference) 646 if len(np.flatnonzero(newseg)) <= 1: 647 logg.warning(f'detected group with only {np.flatnonzero(newseg)} cells') --> 648 secondtip = newseg[np.argmax(Dseg[tips[inewseg]][newseg])] 649 ssegs_tips.append([tips[inewseg], secondtip]) 650 undecided_cells = np.arange(Dseg.shape[0], dtype=int)[nonunique] <__array_function__ internals> in argmax(*args, **kwargs) ~/miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in argmax(a, axis, out) 1151 1152 """ -> 1153 return _wrapfunc(a, 'argmax', axis=axis, out=out) 1154 1155 ~/miniconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds) 59 60 try: ---> 61 return bound(*args, **kwds) 62 except TypeError: 63 # A TypeError occurs if the object does have such a method in its ValueError: attempt to get argmax of an empty sequence

What is the proper way to use the n_branching?

From the page of paga, it is said

We recommend, however, to only use dpt() for computing pseudotime (n_branchings=0) and to detect branchings via paga().

How to apply paga to the rna velocity data to infer branch? We do not have much experience with paga, and a minimal example is really appreciated for us to get started.

VolkerBergen commented 4 years ago

hi @ccshao apologies for taking eternity to come back to your issue; totally forgot about it.. For clarity, I will soon add detailed explanations in the API.

Going straight into your Qs: The terminal states consist of regions / set of cells (the 'identified 1 root cells' is indeed a set of root cells in a particular region), which corresponds to stationary states of the Markov chain. As you mentioned, you can be interpret a value as probability of being one of the root/end cells. The n_branchings is indeed a bit misleading, this should always be 0 because the way it computes diffusion pseudotime from the velocity graph already accounts for multiple potential end points. This parameter will be removed in the future.

you can compute directed paga with

scv.tl.velocity_graph(adata)

scv.tl.paga(adata)
scv.tl.paga(adata, use_rna_velocity=True)

scv.pl.paga(adata, transitions='transitions_confidence')

An improved implementation of directed paga for better robustness is work in progress.

KaiyangZ commented 4 years ago

Hi,

My question is also related to pseudotime, what's the difference between the velocity_pseudotime and the latent time? and could you provide some more information about the parameter min_corr in function scv.tl.recover_latent_time?

Thank you very much in advance!

VolkerBergen commented 4 years ago

Just updated the API-docs with more detailed explanations; hope that helps.

https://scvelo.readthedocs.io/scvelo.tl.velocity_pseudotime.html Velocity-pseudotime is similar to diffusion pseudotime. However, instead of having a prior user-defined root cell, it infers a distributions over root cells from the velocity-inferred transition matrix and then computes distances from diffusion random walks on the velocity graph (instead of similarity based diffusion kernel as in DPT).

https://scvelo.readthedocs.io/scvelo.tl.latent_time.html Latent time is based only on its transcriptional dynamics and obtained from coupling gene-specific timepoints from the dynamical model to a universal gene-shared latent time. Herein, min_corr (not used per default) is used to constraint the set of genes to those that correlate with velocity_pseudotime, just to remove the very noisy genes.

ccshao commented 4 years ago

Thanks a lot for all the updates @VolkerBergen! I have been playing with the scanpy the last few weeks and is impressed by the great results and versatile settings. We will try to cooperate and interpret the rna veloctiy analysis in our results.

wangjiawen2013 commented 3 years ago

Are the velocity_pseudotime(https://scvelo.readthedocs.io/scvelo.tl.velocity_pseudotime.html) and pseudotime (https://scvelo.readthedocs.io/scvelo.tl.latent_time.html) the same thing ? which one should I use to depict the development trajectory ?