statgen / popscle

A suite of population-scale analysis tools for single-cell genomics data, including implementations of the Demuxlet / Freemuxlet methods and auxiliary tools
https://github.com/statgen/popscle/wiki
Apache License 2.0

Popscle demuxlet vs freemuxlet output stability. #41

Open xmignot opened 3 years ago

xmignot commented 3 years ago

Hi, I'm trying to demultiplex the sequencing results of a series of 10x experiments (both 3' and 5' chemistry). I started with demuxlet (we have GWAS data available for the samples), but also ran freemuxlet using a 1000 Genomes VCF, filtered as described in the tutorial, as a reference. We additionally have multiseq results (a more involved demultiplexing protocol that I'm treating as ground truth) for the 3' data only. I'm a little concerned about the freemuxlet results, as they appear to map very noisily to the demuxlet/multiseq sample ids.
I built a mapping of consensus SNG barcodes between each protocol. While demuxlet maps very cleanly to the multiseq labels in the 3' data, the freemuxlet clusters are distributed across lots of sample ids for both the 3' and the 5' data.
As an example, here are some rows from each mapping:

[demuxlet to multiseq]
109D12: ['109D12: 0.9238', '61C07: 0.0092', '119A02: 0.0074', '119A04: 0.0067', '113E02: 0.006']
...
[freemuxlet to multiseq]
6: ['61C04: 0.2754', '119A03: 0.2748', '119A02: 0.1642', '61D08: 0.1314', '119B12: 0.0609']
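
For readers following along, rows like the ones above could be produced roughly as follows. This is a minimal sketch, not the script actually used in this thread: the function name `label_mapping` is hypothetical, and it assumes the singlet calls from each protocol have already been loaded into plain dicts of barcode to label.

```python
from collections import Counter, defaultdict


def label_mapping(labels_a, labels_b, top_n=5):
    """For each label in labels_a, list the top_n labels from labels_b.

    labels_a, labels_b: dicts mapping barcode -> label (singlets only).
    Returns {label_a: [(label_b, fraction_of_shared_barcodes), ...]}.
    """
    # Only barcodes called as singlets by both protocols can be compared.
    shared = labels_a.keys() & labels_b.keys()

    counts = defaultdict(Counter)
    for barcode in shared:
        counts[labels_a[barcode]][labels_b[barcode]] += 1

    mapping = {}
    for label_a, counter in counts.items():
        total = sum(counter.values())
        mapping[label_a] = [
            (label_b, round(n / total, 4))
            for label_b, n in counter.most_common(top_n)
        ]
    return mapping
```

A clean mapping puts most of each row's mass on one label (as in the demuxlet row above); a row spread over several labels means that cluster mixes barcodes from several samples.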

Do you have any advice on how to debug this, or insights into what could be going on? I haven't tried passing the GWAS variant positions used in demuxlet to freemuxlet as a reference, but I imagine this would give more consistent results. However, I want to be able to use the 1000 Genomes variants, as this seems like another way to independently validate the demultiplexed barcodes. I've also been advised that they are probably more effective for freemuxlet.
Thanks!

hyunminkang commented 3 years ago

Have you compared the VCFs generated by freemuxlet with the original genotypes? Is it possible that the data contain a lot of ambient mRNA, so that the ambient RNA is represented as one cluster? freemuxlet tries to avoid such cases, but it may be imperfect.

Thanks, Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu


xmignot commented 3 years ago

How would you recommend comparing those generated VCF files to our genotype data? If one of the clusters is ambient mRNA, wouldn't you expect to see just one cluster mapping very noisily and all of the others mapping fairly well to particular sample ids? Or maybe a much higher fraction of DBL assignments?
It's possible that this is the case, but I'm using the barcodes from the 10x filtered_feature_matrix output, so there should already be some degree of QC. Since this noisiness showed up in both the 5' and 3' data, I'm wondering whether the problem is more likely to be related to the reference VCF file?
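
One quick way to look at the DBL question is to tabulate droplet types per run. The sketch below assumes a tab-delimited per-barcode table with a `DROPLET.TYPE` column holding values such as SNG, DBL and AMB. That layout is an assumption about the demuxlet/freemuxlet per-barcode output, so check the column name against your own files. The function name `droplet_type_fractions` is hypothetical.

```python
import csv
import io
from collections import Counter


def droplet_type_fractions(handle, column="DROPLET.TYPE"):
    """Return {droplet_type: fraction of barcodes} from a tab-delimited table.

    handle: an open text file (or any iterable of lines) with a header row.
    column: name of the droplet-type column (assumed, see note above).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {droplet_type: n / total for droplet_type, n in counts.items()}


# Toy input with made-up barcodes, standing in for a decompressed output file.
toy = io.StringIO("BARCODE\tDROPLET.TYPE\nA\tSNG\nB\tSNG\nC\tDBL\nD\tAMB\n")
fractions = droplet_type_fractions(toy)
```

Comparing these fractions between the demuxlet and freemuxlet runs on the same barcodes would show whether freemuxlet is calling noticeably more doublets or ambiguous droplets.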
Thanks for the prompt reply! I appreciate the help - Xavier

hyunminkang commented 3 years ago

I would check the genotype concordance on overlapping variants first. It is a bit tricky to achieve, though. It is hard to figure out what the problem is without knowing the nature of the data, the populations, the degree of multiplexing, etc.
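
As a rough illustration of what a concordance check on overlapping variants could look like, here is a minimal pure-Python sketch. It is not part of popscle. It assumes both inputs are uncompressed VCF text with hard `GT` calls, the same genome build, and the same chromosome naming (`1` vs `chr1`); it keeps only biallelic sites and skips missing genotypes. The function names `read_genotypes` and `concordance` are hypothetical, and a real analysis would more likely use a dedicated VCF tool.

```python
def read_genotypes(lines):
    """Parse VCF text lines into {(chrom, pos, ref, alt): {sample: alt_count}}."""
    samples, table = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample (or cluster) names
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        fmt = fields[8].split(":")
        if "GT" not in fmt or "," in alt:
            continue  # no genotype field, or multi-allelic site
        gt_idx = fmt.index("GT")
        calls = {}
        for sample, entry in zip(samples, fields[9:]):
            alleles = entry.split(":")[gt_idx].replace("|", "/").split("/")
            if all(a in ("0", "1") for a in alleles):  # skips "./." calls
                calls[sample] = sum(int(a) for a in alleles)
        table[(chrom, pos, ref, alt)] = calls
    return table


def concordance(cluster_gt, sample_gt):
    """Return {(cluster, sample): (n_matching, n_compared)} on shared variants."""
    result = {}
    for key in cluster_gt.keys() & sample_gt.keys():
        for cluster, c_call in cluster_gt[key].items():
            for sample, s_call in sample_gt[key].items():
                matched, compared = result.get((cluster, sample), (0, 0))
                result[(cluster, sample)] = (matched + (c_call == s_call), compared + 1)
    return result
```

If freemuxlet recovered real individuals, each cluster should show one sample with clearly higher `n_matching / n_compared` than the rest. A cluster with no standout match would be a candidate for ambient RNA or a merged cluster.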

Thanks, Hyun.


xmignot commented 3 years ago

Alright, I will start with that! Thanks for the pointers.