Discrepancy between exonic/intronic reads statistic and spliced/unspliced transcripts.

GrigoriiNos commented 3 years ago

Dear scVelo developers,

First off, thanks a lot for your amazing work and this cool software!

I encountered an issue while working with scVelo. I'm trying to do velocity analysis for CD4 t cells. In all my samples, according to web_summaries from cell ranger, an average percent of exonic reads is around 70% and intronic is 8%, which would suggest that I should have way more spliced transcripts than unspliced,

Screenshot 2021-08-09 at 14 00 10

whereas when running scv.pl.proportions() function, I get these statistics. Screenshot 2021-08-09 at 13 54 19

Could you please tell me what could possibly go wrong?

Thank you!

Grigorii

WeilerP commented 3 years ago

@GrigoriiNos, are you running this as the very first step of your pipeline, i.e. on the raw data immediately after reading the AnnData object?

GrigoriiNos commented 3 years ago

@WeilerP Thanks for the fast response I'm running this right after AnnData.concatenatee all loom files into one AnnData object

Per batch ratios are the same Screenshot 2021-08-09 at 15 25 00

WeilerP commented 3 years ago

@GrigoriiNos could you please provide a code snippet leading to the plots? I presume it is something along the lines of

import scanpy as sc
import scvelo as scv

adata_1 = sc.read(PATH_TO_DATA)
# Some concatenation
scv.pl.proportions(adata)

And these counts are raw, i.e. the output of e.g. velocyto s.t. your AnnData objects contain the layers 'unspliced', 'spliced', 'ambiguous'? There is a typo here s.t. the ambiguous layer is not picked up should it exist. If it exists and is called "ambiguous", could you please also provide the plots for scv.pl.proportions(adata=adata, layers=['unspliced', 'spliced', 'ambiguous'])?

GrigoriiNos commented 3 years ago

I've done basic scRNA-seq analysis in Seurat and transferred metadata to AnnData manually

### some metadata from Seurat analysis
meta = pd.read_csv('metadata.tsv', sep='\t')

### read looms
ldata_batch1 = scv.read('batch1_loom', cache=False)
ldata_batch1.var_names_make_unique()

ldata_batch2 = scv.read('batch2_loom', cache=False)
ldata_batch2.var_names_make_unique()

### (the same for all other batches)

### replace prefix with batch names in barcodes
cells_batch1 = pd.Series(ldata_batch1.obs.index).str.replace('possorted_genome_bam_.*\\:', 'batch1_')
ldata_batch1.obs.index = cells_batch1
ldata_batch1=ldata_batch1[ldata_batch1.obs.index.isin(meta.index)]

cells_batch2 = pd.Series(ldata_batch2.obs.index).str.replace('possorted_genome_bam_.*\\:', 'batch2_')
ldata_batch2.obs.index = cells_batch2
ldata_batch2=ldata_batch1[ldata_batch2.obs.index.isin(meta.index)]

### (the same for all other batches)

ldata = ldata_batch1.concatenate(ldata_batch2, ....# other batches
)

#### and now plot it
scv.pl.proportions(ldata)

################

###This is what I get with 

scv.pl.proportions(adata=ldata, layers=['unspliced', 'spliced', 'ambiguous'])

Screenshot 2021-08-09 at 15 51 39

WeilerP commented 3 years ago

I've done basic scRNA-seq analysis in Seurat and transferred metadata to AnnData manually

@GrigoriiNos, just to clarify: does this mean the counts are not raw (e.g. already normalized and filtered) or that you simply transferred e.g. cluster labels to the AnnData object? Also, did you verify manually (i.e. w/o using scVelo) that the proportions are as you'd expect?

GrigoriiNos commented 3 years ago

@WeilerP I didn't even transfer count matrix, I only assigned my Seurat @meta.data to anndata.obs and umap coordinates obsm['X_umap'], I was only going to use velocity in python

Yes all the cells are subset according to what I filtered analysing the data in R

How can I check this proportions manually?

WeilerP commented 3 years ago

Yes all the cells are subset according to what I filtered analysing the data in R

If you already filtered cells/genes in R, the proportions will/can change, right?

How can I check this proportions manually?

To check proportions manually, you'll need to calculate them yourself in Python or R.

GrigoriiNos commented 3 years ago

I don't filter genes, I don't think I need to

I'd be shocked if filtering out bad QC cells would change this proportion that drastically,

Here are raw proportions, without any filtering Screenshot 2021-08-10 at 11 33 25

WeilerP commented 3 years ago

@GrigoriiNos, you need to check the proportions yourself to verify that this is indeed a problem caused by scVelo (i.e. we are calculating them wrong) and not already present in your data. I cannot possible infer the problem by simply looking at plots.

WeilerP commented 3 years ago

@GrigoriiNos, I am closing this issue for now. Feel free to reopen once you checked that proportions are as expected in your AnnData object.

theislab / scvelo

Discrepancy between exonic/intronic reads statistic and spliced/unspliced transcripts. #627