satijalab / seurat-wrappers

Community-provided extensions to Seurat
GNU General Public License v3.0

RunVelocity() uses "data" rather than "counts" matrix #27

Open krejciadam opened 4 years ago

krejciadam commented 4 years ago

Hi! I've been playing with this Seurat wrapper for a while and I found something that confuses me a little.

velocyto.R::gene.relative.velocity.estimates() expects the input matrices to be raw count matrices (well, at least I think so, but it certainly looks that way, both from the docs and from the fact that it starts by adding a pseudocount and log-transforming the matrices).
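To illustrate why this implies raw counts: the first step is along these lines (a toy sketch only; the real logic lives inside gene.relative.velocity.estimates and these variable names are mine):

```r
# Toy raw count matrix: 2 genes x 3 cells
counts <- matrix(c(0, 1, 4, 9, 0, 3), nrow = 2)

# velocyto-style first step: pseudocount, then log-transform.
# Feeding already-normalized values through this would effectively
# double-transform the data.
pseudocount <- 1
log_counts <- log(counts + pseudocount)

log_counts[1, 2]  # log(4 + 1)
```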

Now, RunVelocity() uses GetAssayData() to pull matrices from the "spliced" and "unspliced" assays. GetAssayData() returns the "data" matrix (i.e. cells@assays$spliced@data), which is identical to the "counts" matrix (i.e. cells@assays$spliced@counts) only as long as the user has not run any normalization procedure.
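A quick way to check whether this still holds in your own object (a sketch assuming a Seurat v3/v4-style object named `cells`; in Seurat v5 the argument is `layer =` rather than `slot =`):

```r
library(Seurat)

# Compare the two slots of the "spliced" assay before calling RunVelocity().
spliced_data   <- GetAssayData(cells, assay = "spliced", slot = "data")
spliced_counts <- GetAssayData(cells, assay = "spliced", slot = "counts")

identical(spliced_data, spliced_counts)  # TRUE only if no normalization was run
```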

If the user runs some normalization, then RunVelocity() will use the normalized matrix as input for the velocity calculation, which I think is incorrect. Things can get even wilder in a quite plausible scenario where the user normalizes and scales the "spliced" assay first, say to cluster the cells, but does not touch the "unspliced" assay, and then runs RunVelocity(). In that case, RunVelocity() will feed velocyto.R::gene.relative.velocity.estimates() a normalized "spliced" matrix and a raw-counts "unspliced" matrix.
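A defensive check could catch that mixed scenario before it happens. This is a hypothetical helper, not part of seurat-wrappers, and it assumes the v3/v4-style `slot =` argument:

```r
library(Seurat)

# Hypothetical guard: refuse to proceed if either assay's "data" slot has
# diverged from its raw "counts" slot (i.e. the assay was normalized).
check_raw_counts <- function(object, assays = c("spliced", "unspliced")) {
  for (a in assays) {
    d <- GetAssayData(object, assay = a, slot = "data")
    k <- GetAssayData(object, assay = a, slot = "counts")
    if (!identical(d, k)) {
      stop("Assay '", a, "' appears normalized; RunVelocity() would not ",
           "receive raw counts for it.")
    }
  }
  invisible(TRUE)
}
```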

This is not so much of a problem when users use Seurat::SCTransform(), as in the vignette. SCTransform creates a new assay, and the untouched old assays are then fed into RunVelocity(). But if one uses a different normalization procedure, the "data" matrix gets changed, which can lead to weird results, even though the untouched "counts" matrix is still there and could have been used.
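For reference, the SCTransform route described here looks roughly like this (a sketch following the seurat-wrappers velocity vignette; assumes an object `cells` with "spliced"/"unspliced" assays):

```r
library(Seurat)
library(SeuratWrappers)

# SCTransform writes its results into a new "SCT" assay, leaving the
# "spliced" and "unspliced" assays untouched for RunVelocity().
cells <- SCTransform(cells, assay = "spliced")
cells <- RunVelocity(cells, deltaT = 1, kCells = 25, fit.quantile = 0.02)
```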

This got me confused for a while, so I was thinking that using the "counts" matrix, or adding a note to the documentation, could prevent such confusion. Or is it actually meaningful to use the normalized matrix instead of raw counts? Thank you.
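In the meantime, one possible workaround along the lines suggested above is to bypass RunVelocity()'s use of the "data" slot and hand the "counts" matrices to velocyto.R directly (a sketch; assumes Seurat and velocyto.R are installed and `cells` holds both assays):

```r
library(Seurat)
library(velocyto.R)

# Pull the raw count slots explicitly instead of relying on GetAssayData()'s
# default "data" slot.
emat <- GetAssayData(cells, assay = "spliced",   slot = "counts")
nmat <- GetAssayData(cells, assay = "unspliced", slot = "counts")

# Call velocyto.R directly with the raw counts.
vel <- gene.relative.velocity.estimates(emat, nmat,
                                        deltaT = 1, kCells = 25,
                                        fit.quantile = 0.02)
```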

YichaoOU commented 3 years ago

New to velocity analysis. Thanks for this note! I'm wondering, in terms of the correctness, should we use the raw counts or some normalized values for RunVelocity()?

Thanks, Yichao

ljouneau commented 3 years ago

Hi! @krejciadam, if you execute the code contained in RunVelocity line by line, you will see that before calling velocyto.R::gene.relative.velocity.estimates, the function creates a list containing the data (spliced as emat, unspliced as nmat) and the arguments to pass to velocyto.R (deltaT, ...):

```r
> names(args)
 [1] "emat"                      "nmat"
 [3] "deltaT"                    "steady.state.cells"
 [5] "kCells"                    "kGenes"
 [7] "old.fit"                   "mult"
 [9] "min.nmat.smat.correlation" "min.nmat.emat.correlation"
[11] "min.nmat.emat.slope"       "zero.offset"
[13] "deltaT2"                   "fit.quantile"
[15] "diagonal.quantiles"        "show.gene"
[17] "do.par"                    "cell.dist"
[19] "emat.size"                 "nmat.size"
[21] "cell.emb"                  "cell.colors"
[23] "expression.gradient"       "residual.gradient"
[25] "n.cores"                   "verbose"
```

If you look at the distribution of values in emat and nmat, you will see that they are identical to the distributions of values in your loom files:

```r
> summary(args[["emat"]][, 1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   3.929   0.000 529.000
> summary(args[["nmat"]][, 1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   1.268   0.000 265.000
> summary(loom$exon[, 1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   2.509   0.000 529.000
> summary(loom$intron[, 1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  0.0000  0.8518  0.0000 265.0000
```

Therefore, it seems pretty clear to me that RunVelocity provides raw counts to velocyto.R.

Best regards