scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.82k stars 586 forks source link

STARsolo Matrix with Velocyto --> Anndata function #1860

Open JBreunig opened 3 years ago

JBreunig commented 3 years ago

The scanpy.read_10X_mtx works well for reading in the STARsolo output matrices, which are based on the CellRanger Outputs.

However, it would be nice to have a function or modification of the read_10X_mtx function (e.g. a boolean for STARsolo velocyto) to automate inputting the velocyto matrices that STARsolo outputs and placing them in the appropriate layers. A boolean switch for filtered versus raw matrices would be a good addition as well.

JBreunig commented 3 years ago

Example code: https://github.com/alexdobin/STAR/issues/774#issuecomment-850477636

ivirshup commented 3 years ago

Thanks for the suggestion.

@WeilerP, do you think this would be more appropriate in scvelo? (Side note, I have thought that tutorial of going from BAMs through scvelo would be quite useful)

WeilerP commented 3 years ago

@WeilerP, do you think this would be more appropriate in scvelo? (Side note, I have thought that tutorial of going from BAMs through scvelo would be quite useful)

Hm, not sure if the functionality would match the expectation. In scvelo, we'd store only unspliced and spliced counts (spliced both in adata.X and adata.layers). Based on the proposed code snippet in alexdobin/STAR#774 (comment), the expected output would be to read all of the available information?

ivirshup commented 3 years ago

It doesn't look to me like it's reading in much other than those. Just the obs and var that you'd get from any cell ranger experiment and an ambiguous layer.

What additional things are you concerned about?

WeilerP commented 3 years ago

Yes, agreed, ambiguous isn't really a problem (could be read in optionally via a keyword argument). I was more concerned about adata.X being the original count matrix - unless I am misunderstanding the code snippet ...

ivirshup commented 3 years ago

I was more concerned about adata.X

Ah, yeah I see what you mean. But isn't that fine for scvelo? I guess it wasn't obvious to me from the documentation that X wasn't supposed to be "total" counts.

Also it looks like the output formats may not be documented or finalized. It might be worth reaching out to the devs to see what's up before anything is implemented.

JBreunig commented 3 years ago

Just from an end user perspective, I hadn't recognized spliced is put into adata.X and it seems that the other STARsolo users in the original thread didn't either. For my edification, would it cause any problems for me to modify the code to put spiced into adata.X or am I missing something else?

WeilerP commented 3 years ago

@ivirshup, @JBreunig using the absolute counts isn't a problem per se. It's simply that the scvelo paper used the spliced counts in adata.X based on which the highly variable genes are selected and PCA, neighbor graph and UMAP embedding are calculated. @JBreunig, shouldn't be a problem to put spliced into adata.X as the dimensions of spliced and total counts are the same.