Poor results from a Drop-Seq bam file with "velocyto run"

yeuyeuh commented 6 years ago

Hi,

Thanks for Velocyto, it's very interesting.

I try to launch "velocyto run" on our data. Our scRNA-seq protocol is inspired by the STRT-Seq protocol. The results are in UMI counts and the reads are from the 5 prime end of the mRNA (not like 10x where there are from the 3 prime end). I used the Drop-Seq pipeline (and STAR) to obtain a mapped bam file where the cell barcode and UMI barcode are contained in the XC and XM tags. I noticed that counter.py takes care of these specific tags from Drop-Seq.

Unfortunately, I have got very poor results when I launch "Velocyto run" on my bam which contains 96 cells. Only a few genes are detected and we do not have a lot of spliced, unspliced and ambiguous values.

Also, it may be part of this issue: In R, when I look at the rownames(ldat$spliced), there are a lot of duplicated genes...

Did I miss something? Normally (without velocyto), I have more than 1000 genes detected per cell.

Thank you

gioelelm commented 6 years ago

TL;DR velocyto.py right now does not support STRT. using velocyto.R SmartSeq2 pipeline might perform better than what you are reporting but it would still be an abuse of the pipeline logic.

velocyto.py logic is designed for 3' scRNA-Seq chemistry like 10X and DropSeq. It cannot be used "as is" to your protocol since it is STRT inspired. The "poor" results are expected for different reasons. For example:

With 5' protocols you are getting one genomic mapping position per cell barcode, even if the mapping is intronic, you have only one evidence of mapping per molecule. Notice that in 3' protocols the transcript is read from the body (5') towards the poly-A tail (3') and the position that you are reading depends from the fragmentation step, therefore you have the chance to read transcripts at different locations and to combine this information you can classify them as spliced or unspliced. This means that STRT protocol should be analysed more like SmartSeq2 rather than 10X/Drop-Seq, just by counting UMI instead of reads.
analysing few cells will also cause trouble since velocyto.py is conservative on calling unspliced molecules and require at least one evidence of intron-exon spanning read in the whole dataset to count intron mapping. So few small cells might mean fewer evidence of this spanning evidence and therefore a lot of reads won't be counted. You could try to set --introns permissive to release this check but then you will be making your data more noisy.

We might support STRT in the future (after appropriate testing of a STRT specific logic) but right now I am afraid there is no good solution to make this work for your use case.

yeuyeuh commented 6 years ago

Thank you very much for your answer. In fact, in our protocol the transcript is read from the body towards the 5' end of the mRNA. Our reads can start at different locations per gene. More importantly, it is the reverse complement of the mRNA sequence, so the issue may come from the fact that in the bam our starting point (POS field) may correspond to the ending point of the true part of the transcript. In "velocyto run", do you only use the POS field from the bam and the length of the sequence to know the mapping interval of the read?

Also thanks for your second point, I will merge my different bam files from all the plates into one big bam, in this way I will have more cells.

Thanks

velocyto-team / velocyto.py

Poor results from a Drop-Seq bam file with "velocyto run" #27