velocyto-team / velocyto.R

RNA velocity estimation in R
http://velocyto.org

cell trajectory modelling - running out of memory #31

Open smaslova opened 6 years ago

smaslova commented 6 years ago

I've been trying to use the show.velocity.on.embedding.eu method on my scRNA-seq data, but it won't finish running. It throws an out-of-memory error at the "constructing path graph..." step, even with 125G of memory available.

Do you have any ideas for how I can get it to work? Are there any parameters I could change that would significantly reduce the memory requirements?

ekernf01 commented 6 years ago

I am having a similar problem. If my reading of the code is correct, the bottleneck (the cc matrix, per the discussion in #15) is a huge dense correlation matrix: it is N by N, where N is the number of cells. I wonder if it would be possible to work around this. It seems to be used only once, to form the matrix tp:

tp <- exp(cc/corr.sigma) * emb.knn
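For a sense of scale, here is my own back-of-the-envelope estimate of what a single dense cc costs (not code from the package; the cell count is just an example):

```r
# Rough memory estimate for a dense N x N double-precision matrix like cc.
# 18,000 cells is used here only as an illustrative figure.
n.cells <- 18000
bytes <- n.cells^2 * 8                       # 8 bytes per double
sprintf("~%.1f GB for one dense matrix", bytes / 1024^3)
# [1] "~2.4 GB for one dense matrix"
# Several objects of this size (cc, tp, temporaries created while evaluating
# exp(cc/corr.sigma) * emb.knn) can multiply that footprint quickly.
```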

Unless there is already a fix in the works, I'd like to experiment with replacing cc with something sparser or smaller. First, let me make sure I understand what it is doing and what speed-up strategies might make sense.

In the supplemental info page 8, the velocyto team writes "We calculated a transition probability matrix P by applying an exponential kernel on the Pearson correlation coefficient between the velocity vector and cell state difference vectors." P is tp in the code, and cc is the Pearson correlations. As for how these are used downstream, the arrow for each cell is a weighted average of unit vectors pointing towards other cells, and the corresponding column of P gives the weights. Is that right?
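To check that understanding, here is a minimal sketch of how I read the arrow construction for a single cell. The variable names are mine, and this paraphrases the supplement rather than reproducing the package's actual code:

```r
# emb: n x 2 embedding coordinates; tp: n x n transition probabilities (my reading).
arrow.for.cell <- function(i, emb, tp) {
  diffs <- sweep(emb, 2, emb[i, ])               # vectors from cell i to every cell
  lens  <- sqrt(rowSums(diffs^2)); lens[i] <- 1  # avoid 0/0 for cell i itself
  units <- diffs / lens                          # unit vectors towards the other cells
  w <- tp[, i]; w[i] <- 0                        # weights for cell i (column of tp, if I read it right)
  colSums(units * (w / sum(w)))                  # weighted average direction in the embedding
}
```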

To make cc and tp sparser or lower-dimensional, one could consider the k_cc nearest neighbors of each cell, a subset of randomly chosen cells, or a set of metacells or grid points. I considered various ways of using nearest neighbors, but I am afraid the fast ones introduce too much bias and the slow ones are slow. Randomly chosen cells would be less biased, but they neglect available information (and make the plot stochastic). Metacells are awkward because they require a suitably chosen partitioning of the cells. Grid points are awkward because those with few nearby cells will be noisier, causing a downward bias in any naive correlation estimate. Velocyto devs, do you have thoughts on what might work best here? (A rough sketch of the nearest-neighbor option is below.)
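To make the nearest-neighbor option concrete, here is a hypothetical sketch. expr, velo, emb, and k.cc are placeholders rather than the package's API, and it inherits exactly the bias concern above, since candidates are restricted to embedding neighbors:

```r
library(Matrix)
library(FNN)
# expr: genes x n expression, velo: genes x n velocity, emb: n x 2 embedding.
sparse.cc <- function(expr, velo, emb, k.cc = 100) {
  n  <- ncol(expr)
  nn <- FNN::get.knn(emb, k = k.cc)$nn.index          # k.cc neighbours per cell
  ij <- cbind(rep(seq_len(n), times = k.cc), as.vector(nn))
  vals <- vapply(seq_len(nrow(ij)), function(r) {
    i <- ij[r, 1]; j <- ij[r, 2]
    cor(velo[, i], expr[, j] - expr[, i])             # velocity vs. cell-state difference vector
  }, numeric(1))
  sparseMatrix(i = ij[, 1], j = ij[, 2], x = vals, dims = c(n, n))
}
```

The resulting sparse matrix has n * k.cc entries instead of n^2, so the memory cost scales linearly in the number of cells, at the price of only ever pointing arrows towards embedding neighbors.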

Figure 3 of the paper has ~18k cells; was any special adaptation made to get that working?

ekernf01 commented 6 years ago

UMAP supports projection of new points, so for those using UMAP, another strategy would be to take the "future" position of each cell as predicted by velocyto and project it onto the existing UMAP. But this would be awkward to implement in R: you would need access to the fitted Python umap-learn model and its transform method, and some packages -- Seurat is the one I am familiar with -- only store the embeddings, not the model.
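Here is a rough sketch of what that plumbing could look like via reticulate, assuming you can refit (or still have) the umap-learn model yourself. expr and velo are placeholder genes-by-cells matrices, and "expr + velo" is only a schematic stand-in for the velocyto-predicted future state:

```r
library(reticulate)
umap <- import("umap")                                  # python umap-learn
cells.x.genes <- t(as.matrix(expr))                     # umap-learn expects observations in rows
fit <- umap$UMAP(n_neighbors = 30L)$fit(cells.x.genes)
emb.now    <- fit$embedding_                            # current positions
emb.future <- fit$transform(t(as.matrix(expr + velo)))  # projected "future" cells
# arrows would then run from emb.now to emb.future for each cell
```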

I tried positioning each "future" cell on top of its nearest neighbor in the embedding and drawing the arrow to that spot, but it yields pretty messy-looking estimates. Maybe some variant of this strategy would be more reliable.