velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License
159 stars 83 forks source link

Some genes have much lower counts in Velocyto than 10x cellranger output #68

Closed yueqiw closed 6 years ago

yueqiw commented 6 years ago

Hi,

I'm using Velocyto on data generated from 10x genomics cellranger pipeline, and project velocity onto embeddings produced from Seurat and scanpy. I found that some of my marker genes are barely detected in Velocyto pipeline, but are detected in hundreds of cells from the cellranger output, .

For example:

'GSX-2' 
cellranger count (total 10k cells):
nonzero_cells     898.0
total_count      1233.0

Velocyto spliced count:
nonzero_cells    1.0
total_count      1.0

Velocyto un-spliced count:
nonzero_cells    0.0
total_count      0.0

Here is a plot of counts for ~20,000 genes: image

It looks like many genes are detected with far less counts in the Velocyto pipeline.

Below is the CLI code I used to produce the loom file: velocyto run10x -m ~/Data1/data/hrch38_rmsk.gtf ./14741X1 ~/Data1/data/refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf

The bam files were generated using cellranger 2.0.2 using the same reference and sorted during velocyto run10x. I tried run10x without repeat mask and got similar result.

Do you know what's the reason for the much lower gene detection in Velocyto?

Thanks! Yueqi

gioelelm commented 6 years ago

Hi,

First of all thank you for using velocyto and a double thank you for opening this issue, users feedback is really appreciated.

I have some hypothesis on why this might be happening including differences in how multiple mappings, chimeric molecules and how umi are treated. I think I deal with many corner cases in a rather conservative way that maybe can be relaxed.

However, I think that to properly answer your question one needs to understand the details of both the pipelines. I am worried that without a detailed knowledge of the cellranger source code (that I unfortunately do not have yet) whatever I say at this point about the differences is going to be a little hand-wavy. It will take some time to do this comparison and will maybe require to have the access to the specific files from the users observing any unexpected behaviour (notice that is the first time I observe such such a difference so it might be sample specific somehow).

Notice also that we provide a way to inspect what the counting pipeline is doing by outputing a molecular counting report using the -d 1 (hd5 file detailed summary of each read mapping annotation) or -d p1(complete python pickle object with all the mapping decision) options. But again to get an appropriate comparative of all the corner cases is its own little project.

I am sorry if I cannot be more helpful than this at this point. I can try to come back to you in a week or so with some more suggestions. But the final extensive comparison will require a little extra time... maybe a discussion with cellranger core developer will speed up the process.

yueqiw commented 6 years ago

Hi Gioele,

Thanks for the fast reply! I wasn't sure if I was using velocyto properly, so I wanted to check if this issue has been observed before. Since it's never been reported, I have several more questions:

(1) If you look into the 10x datasets from your lab, do you observe similar issues? What about data from similar techniques, such as Dropseq?

(2) I looked at a separate experiment analyzed using the same cellranger and velocyto pipeline, and got very similar results. So the issue is not specific to that particular sample. Interestingly, the sets of genes that are under-represented in velocyto are very similar between two independent experiments (~600 genes are shared among 1000 under-represented genes from experiment 1 and 1000 under-represented genes from experiment 2. The samples are the same species, but quite different in terms of age and cell type composition). I guess the issue is specific to how a certain subset of genes are counted.

It will take some time to do this comparison and will maybe require to have the access to the specific files from the users observing any unexpected behaviour.

If it helps resolve the unexpected behavior and it's specific to our datasets, we'd be happy to provide the files.

Thanks!

gioelelm commented 6 years ago

(1) If you look into the 10x datasets from your lab, do you observe similar issues? What about data from similar techniques, such as Dropseq?

Of course I checked, but last time the comparison looked definitely better than your plots. However the codebase changed a little in the meanwhile. I will give a second check asap.

(2) I looked at a separate experiment analyzed using the same cellranger and velocyto pipeline, and got very similar results. So the issue is not specific to that particular sample. Interestingly, the sets of genes that are under-represented in velocyto are very similar between two independent experiments (~600 genes are shared among 1000 under-represented genes from experiment 1 and 1000 under-represented genes from experiment 2. The samples are the same species, but quite different in terms of age and cell type composition). I guess the issue is specific to how a certain subset of genes are counted.

Ok interesting. Given this evidence, I can think of only two effects that can give rise to this kind of consistency. 1) Annotation of introns that span over exons/introns of other genes. Even if this sounds like an unrealistic situation in animal genomes. I have seen at least couple of cases like this. Some of them were extreme, probably an annotation error, where two exons of the same transcript model where annotated one upstream and one downstream of a chromosomic region containing dozens of genes. This means that the intron between them was overlapping with hundreds of features (introns and exons). This in practice was causing velocyto to call chimeric molecules. I fixed that bug long ago by removing this unrealistic introns but maybe there are some more of modest length. If the problem is of this nature one needs to figure out a non trivial logic ( or a good heuristic ) to assign that read. Notice that this kind of weird annotations, would not cause any troubles to cellranger that, I guess, is not considering he introns of transcript models.

2) Multiple mappings. If cellranger counts the first match of multiple mappings (or somehow) then depending how the aligner is producing the output those would be systematically assigned to the same genes. I think it is good practice from the hold bulk-sequencing days to remove multiple mappings from the analysis and that is what I am doing in the current version of velocyto.

If it helps resolve the unexpected behaviour and it's specific to our datasets, we'd be happy to provide the files.

That would be great, please contact me over email and we can discuss how to share the data, however I will start first by repeating some diagnostics on my data.

I would like to close this issue for now since what is reported here is not an explicit bug but let's talk about it over email (gioele.la.manno@ki.se). I will reopen, and fix it immediately, if there is evidence that this was an unexpected behaviour and not a effect of further stringency necessary when considering introns of transcript models.

yueqiw commented 6 years ago

Hi Gioele,

Thanks for the detailed explanation. I'll approach you via email soon.

In the meantime when you run diagnostics, another factor I could think about is that we use human genome and the GRCh38-1.2.0 gtf provided by cellranger. So it's also possible that the effect is somehow related to human genome annotation.

jvikkula commented 5 years ago

Hi @gioelelm and @yueqiw,

Did you manage to solve this issue? Since I'm facing rather similar issue, but for me most of the genes and not just some genes are detected with significantly lower counts in Velocyto pipeline. I calculated an average initial cell size from the matrix that cellranger produces and it is ~4126, whereas the average initial cell size after Velocyto pipeline is ~171 for spliced, ~378 for unspliced and ~17 for ambiguous. Also it feels somewhat suspicious that there are a lot more unspliced reads than spliced reads. I have all together 14 samples and the results are similar with all the samples.

On the other hand, some genes, which are not detected at all by cellranger, are very highly expressed based on the Velocyto pipeline. I also started wondering if this is an issue related to the annotation file.

I am also running Velocyto for data generated from 10x genomics cellranger pipeline: velocyto run10x -m repeat_msk.gtf 10x_sample_folder refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf

I tried running without the repeat mask file, but that did not have an effect.

I would appreciate it a lot if you were able to help me with this issue. Thanks!

-Johanna

LikaiTan commented 5 years ago

Hi @gioelelm and @yueqiw,

Did you manage to solve this issue? Since I'm facing rather similar issue, but for me most of the genes and not just some genes are detected with significantly lower counts in Velocyto pipeline. I calculated an average initial cell size from the matrix that cellranger produces and it is ~4126, whereas the average initial cell size after Velocyto pipeline is ~171 for spliced, ~378 for unspliced and ~17 for ambiguous. Also it feels somewhat suspicious that there are a lot more unspliced reads than spliced reads. I have all together 14 samples and the results are similar with all the samples.

On the other hand, some genes, which are not detected at all by cellranger, are very highly expressed based on the Velocyto pipeline. I also started wondering if this is an issue related to the annotation file.

I am also running Velocyto for data generated from 10x genomics cellranger pipeline: velocyto run10x -m repeat_msk.gtf 10x_sample_folder refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf

I tried running without the repeat mask file, but that did not have an effect.

I would appreciate it a lot if you were able to help me with this issue. Thanks!

-Johanna

Hi We have exactly the same problem. I'm using 10x 5' chemicals for both TCR and transcriptome, don't know if it is the problem.

Likai

yueqiw commented 5 years ago

@LikaiTan @jvikkula I didn't find a solution. Based on the email correspondence, the author @gioelelm wasn't able to invest much time on this issue.

He did note in the email that

I have noticed that doing the counting with different gtfs can might have a rather noticeable effect on the results.

Since this issue has been closed, I'd suggest opening a new issue describing your problem and refer to this one.

Hope it helps!

jvikkula commented 5 years ago

@yueqiw Thanks for your reply. I spent still some time trying to solve this but wasn't able to do that, but I found out that you can also use for example dropEst to do the counting and that is working rather well for me.

yueqiw commented 5 years ago

@jvikkula Does dropEst count both spliced and unspliced transcripts? And were you able to use the counting result for RNA velocity? Thanks!

jvikkula commented 5 years ago

@yueqiw dropEst counts exonic, intronic and exon/intron spanning reads (https://dropest.readthedocs.io/en/latest/dropest.html#velocyto-integration). I'm assuming that exonic reads are same as spiliced ones and unspliced transcripts you get when you sum intronic and exon/intron spanning reads.

yueqiw commented 5 years ago

@jvikkula Thanks! That's very helpful!

LikaiTan commented 5 years ago

@yueqiw dropEst counts exonic, intronic and exon/intron spanning reads (https://dropest.readthedocs.io/en/latest/dropest.html#velocyto-integration). I'm assuming that exonic reads are same as spiliced ones and unspliced transcripts you get when you sum intronic and exon/intron spanning reads.

Hi @jvikkula , I have a stupid question. I tried to run the dropest pipline, however I don't know how to get the config.xml. I tried 10x.xml but didn't work out. Run: 07/22/2019 17:04:15. Can't open file with barcodes: './../data/barcodes/10x_aug_2016_split'

Many thanks in advance.

jvikkula commented 5 years ago

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

LikaiTan commented 5 years ago

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

Problem solved, now I get comparable counts as that from cellranger. Thanks very much!

Zifeng-L commented 4 years ago

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

Problem solved, now I get comparable counts as that from cellranger. Thanks very much!

Excuse me,can you tell me how to take the dropEst without config.xml?

LikaiTan commented 4 years ago

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

Problem solved, now I get comparable counts as that from cellranger. Thanks very much!

Excuse me,can you tell me how to take the dropEst without config.xml?

Hi I think you need a config.xml. It's too long ago and I forgot many details. for me it's like this:

【path to where you install dropest】/data/barcodes/10x_aug_2016_split const 0.2 2 1 100 20 1e-5 1e-7
Zifeng-L commented 4 years ago

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

Problem solved, now I get comparable counts as that from cellranger. Thanks very much!

Excuse me,can you tell me how to take the dropEst without config.xml?

Hi I think you need a config.xml. It's too long ago and I forgot many details. for me it's like this:

【path to where you install dropest】/data/barcodes/10x_aug_2016_split const 0.2 2 1 100 20
    <PreciseMerge>
        <max_merge_prob>1e-5</max_merge_prob>
        <max_real_merge_prob>1e-7</max_real_merge_prob>
    </PreciseMerge>
</Estimation>

Thanks a lot!!!! So it seems that the 5' data can be processed the same as the 3' data, no matter where the barcodes and UMI are?

LikaiTan commented 4 years ago

Yes, though it didn't gave me informative results 😂On 23 Jun 2020 13:47, Ann Li notifications@github.com wrote:

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

Problem solved, now I get comparable counts as that from cellranger.

Thanks very much!

Excuse me,can you tell me how to take the dropEst without config.xml?

Hi I think you need a config.xml. It's too long ago and I forgot many details. for me it's like this:

【path to where you install dropest】/data/barcodes/10x_aug_2016_split const 0.2 2 1 100 20
<PreciseMerge>

    <max_merge_prob>1e-5</max_merge_prob>

    <max_real_merge_prob>1e-7</max_real_merge_prob>

</PreciseMerge>

Thanks a lot!!!! So it seems that the 5' data can be processed the same as the 3' data, no matter where the barcodes and UMI are?

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or unsubscribe.

Zifeng-L commented 4 years ago

Yes, though it didn't gave me informative results 😂On 23 Jun 2020 13:47, Ann Li notifications@github.com wrote: @LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst. Problem solved, now I get comparable counts as that from cellranger. Thanks very much! Excuse me,can you tell me how to take the dropEst without config.xml? Hi I think you need a config.xml. It's too long ago and I forgot many details. for me it's like this: 【path to where you install dropest】/data/barcodes/10x_aug_2016_split const 0.2 2 1 100 20 1e-5 1e-7 Thanks a lot!!!! So it seems that the 5' data can be processed the same as the 3' data, no matter where the barcodes and UMI are? —You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or unsubscribe.

Let me try! Thank you very much!

hyjforesight commented 2 years ago

Same issue here. Any suggestions?

sandrav-CGEN commented 2 years ago

Same issue here. I will try the dropEst approach now...

hyjforesight commented 2 years ago

@sandrav-CGEN check this https://github.com/theislab/scvelo/discussions/813