theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License

Compatibility with STARsolo #525

Open YitengDang opened 3 years ago

YitengDang commented 3 years ago

Hi,

I've successfully used scvelo in the past on 10x data by using a similar pipeline as in the tutorial, i.e. by separately reading the (scanpy pre-processed) 10x data and the loom file generated by velocyto, and then combining them. However, this pipeline relies on the output of velocyto, which unfortunately has not been updated for ~3 years now. I'm currently unable to get velocyto to work, and as a result have been unable to use scvelo altogether.

Hence, I wondered whether there are alternative ways to read the spliced/unspliced data, potentially not in the form of a loom file and generated by a tool other than velocyto? I know two other tools to generate the splicing data, which are still actively maintained:

I have successfully generated the STARsolo output, but don't know how to combine it into the AnnData format to generate an object that is compatible with scvelo. It would be great if (1) there is a tool for using STARsolo output with scvelo, or (2) a tutorial explaining how to integrate different data types (aside from loom) for the splicing data into an AnnData object.

Thanks!

WeilerP commented 3 years ago

@YitengDang, theislab/scanpy/issues/1860 should help?

YitengDang commented 3 years ago

Yes, this is exactly what I was looking for, thanks! With the shared code under https://github.com/alexdobin/STAR/issues/774#issuecomment-850477636, together with some manual tweaking, I've been able to run scvelo on STARsolo aligned data. It would be great to integrate this into scvelo (or scanpy), but that's just an outsider suggestion :).

rbpatt2019 commented 3 years ago

I'm just starting to implement a pipeline using STARsolo output with scVelo and want to clarify some of the points here. In the code snippet at alexdobin/STAR#774, the raw count matrix is put into adata.X while the spliced, unspliced, and ambiguous data are stored in adata.layers. The discussion at theislab/scanpy#1860 puts the spliced counts in both adata.X and adata.layers, adds the unspliced counts to adata.layers, and (I think) doesn't use the raw count matrix. Am I correct in understanding that the best practice is thus to put spliced in adata.X, spliced and unspliced in adata.layers, and to not put the raw counts in adata for scVelo?

WeilerP commented 3 years ago

@rbpatt2019, that's correct. The current workflow works with the spliced counts in adata.X and expects the layers 'unspliced' and 'spliced'. You can, of course, store the raw count matrix in adata.layers as well if you need it. I quickly skimmed the scvelo code base and believe you could also store the raw data in adata.X without breaking anything, though I can't give any guarantees on this at this point. From what I saw, the only things that would change are the neighbor graph (this will affect the imputation by moments), the dimension reduction (i.e. PCA) and the latent space embedding (e.g. UMAP), since these are calculated on adata.X. I did not investigate how this would change the pancreas and dentate gyrus analyses/results.

rbpatt2019 commented 3 years ago

@WeilerP thanks for the quick response! I'm hoping to have some time for pipeline development this week. If I do, I'll poke around and see if having one or the other at adata.X breaks anything.

WeilerP commented 3 years ago

@rbpatt2019, did you manage to develop the pipeline to use STARSolo output in scVelo?