velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License
160 stars 83 forks source link

Poor results from a Drop-Seq bam file with "velocyto run" #27

Closed yeuyeuh closed 6 years ago

yeuyeuh commented 6 years ago

Hi,

Thanks for Velocyto, it's very interesting.

I try to launch "velocyto run" on our data. Our scRNA-seq protocol is inspired by the STRT-Seq protocol. The results are in UMI counts and the reads are from the 5 prime end of the mRNA (not like 10x where there are from the 3 prime end). I used the Drop-Seq pipeline (and STAR) to obtain a mapped bam file where the cell barcode and UMI barcode are contained in the XC and XM tags. I noticed that counter.py takes care of these specific tags from Drop-Seq.

Unfortunately, I have got very poor results when I launch "Velocyto run" on my bam which contains 96 cells. Only a few genes are detected and we do not have a lot of spliced, unspliced and ambiguous values.

Also, it may be part of this issue: In R, when I look at the rownames(ldat$spliced), there are a lot of duplicated genes...

Did I miss something? Normally (without velocyto), I have more than 1000 genes detected per cell.

Thank you

gioelelm commented 6 years ago

TL;DR velocyto.py right now does not support STRT. using velocyto.R SmartSeq2 pipeline might perform better than what you are reporting but it would still be an abuse of the pipeline logic.

velocyto.py logic is designed for 3' scRNA-Seq chemistry like 10X and DropSeq. It cannot be used "as is" to your protocol since it is STRT inspired. The "poor" results are expected for different reasons. For example:

We might support STRT in the future (after appropriate testing of a STRT specific logic) but right now I am afraid there is no good solution to make this work for your use case.

yeuyeuh commented 6 years ago

Thank you very much for your answer. In fact, in our protocol the transcript is read from the body towards the 5' end of the mRNA. Our reads can start at different locations per gene. More importantly, it is the reverse complement of the mRNA sequence, so the issue may come from the fact that in the bam our starting point (POS field) may correspond to the ending point of the true part of the transcript. In "velocyto run", do you only use the POS field from the bam and the length of the sequence to know the mapping interval of the read?

Also thanks for your second point, I will merge my different bam files from all the plates into one big bam, in this way I will have more cells.

Thanks