statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0

Samples (almost) missing from results #58

Open ocqub opened 4 years ago

ocqub commented 4 years ago

I've got some strange results when running demuxlet on a dataset of 4 pooled samples; of the ~3500 singlets detected, around 3450 are coming from two samples (split fairly evenly), leaving the two other samples practically absent from the results (indeed, one of these samples only has 2 barcodes assigned to it). In other words, while I would expect the 3500 singlet barcodes to be split somewhat evenly between the 4 samples, my results are saying that the vast majority are from 2 samples and that the other 2 are basically missing.

I also ran the same dataset through freemuxlet, and in that case the cells did split more evenly between the 4 samples, in line with expectations.

I would appreciate any suggestions as to what might be going wrong here.
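A quick way to quantify the imbalance described above is to tally singlet calls per sample from the demuxlet output table. The sketch below is illustrative only: it assumes a tab-separated `.best` file with `BEST` and `SNG.1ST` columns and a `BEST` value that starts with `SNG`, `DBL` or `AMB`. The helper name `singlets_per_sample` and the toy rows are hypothetical, so check the column names against your own output.

```python
import csv
import io
from collections import Counter


def singlets_per_sample(best_file):
    """Count singlet barcodes per sample in a demuxlet-style .best table."""
    counts = Counter()
    for row in csv.DictReader(best_file, delimiter="\t"):
        # Assumed convention: BEST begins with SNG, DBL or AMB.
        if row["BEST"].startswith("SNG"):
            counts[row["SNG.1ST"]] += 1
    return counts


# Toy table standing in for a real .best file (all values are made up).
toy = io.StringIO(
    "BARCODE\tBEST\tSNG.1ST\n"
    "AAAC-1\tSNG-N1\tN1\n"
    "AAAG-1\tSNG-N2\tN2\n"
    "AAAT-1\tDBL-G1-G2-0.500\tG1\n"
    "AACA-1\tSNG-N1\tN1\n"
)
counts = singlets_per_sample(toy)
for sample, n in counts.most_common():
    print(sample, n)
```

With a real file, replace the `io.StringIO` object with `open("your_output.best", newline="")`.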

hyunminkang commented 4 years ago

Is it possible that the genotypes are incorrect? It would be useful to try freemuxlet in the popscle package to see whether you indeed see 4 clusters when genotype data is not used.

Thanks, Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

ocqub commented 4 years ago

@hyunminkang Upon closer inspection, my 'missing' samples are mostly contained within barcodes determined to be doublets. I'm getting far more doublets than would be expected (about 1/3 of cells). I thought this might have been because I didn't specify --alpha 0 --alpha 0.5 when running, so I re-ran demuxlet using these parameters, but I'm getting the same outcome.

Running the data through freemuxlet gives results closer to our expectations, i.e. the singlet cells are split between the 4 samples (however with freemuxlet we can't determine which of our 4 samples each 'individual' corresponds to, hence the use of demuxlet - or am I wrong here?).
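The observation that the "missing" samples sit inside doublet calls can be checked by summarising droplet types and counting how often each sample appears as a doublet partner. This is a rough sketch under the same assumptions about the output table (tab-separated, with `BEST`, `DBL.1ST` and `DBL.2ND` columns); the function name `droplet_summary` and the toy data are hypothetical.

```python
import csv
import io
from collections import Counter


def droplet_summary(best_file):
    """Return droplet-type counts, per-sample doublet membership, and doublet fraction."""
    types = Counter()
    in_doublets = Counter()
    for row in csv.DictReader(best_file, delimiter="\t"):
        kind = row["BEST"][:3]  # assumed prefixes: SNG, DBL, AMB
        types[kind] += 1
        if kind == "DBL":
            # Count both partners of the best-guess doublet pair.
            in_doublets[row["DBL.1ST"]] += 1
            in_doublets[row["DBL.2ND"]] += 1
    total = sum(types.values())
    doublet_fraction = types["DBL"] / total if total else 0.0
    return types, in_doublets, doublet_fraction


# Toy table (made-up values) with two singlets and two doublets.
toy = io.StringIO(
    "BARCODE\tBEST\tDBL.1ST\tDBL.2ND\n"
    "AAAC-1\tSNG-N1\tN1\tG1\n"
    "AAAG-1\tDBL-G1-G2-0.500\tG1\tG2\n"
    "AAAT-1\tDBL-G1-N2-0.500\tG1\tN2\n"
    "AACA-1\tSNG-N2\tN2\tG2\n"
)
types, in_doublets, doublet_fraction = droplet_summary(toy)
print(dict(types), dict(in_doublets), doublet_fraction)
```

If samples that are rare among singlets dominate `in_doublets`, that points towards a genotype problem for those samples rather than true doublets.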

hyunminkang commented 4 years ago

I have a script to compare the identity of genotypes between the VCF and freemuxlet output. You may want to use this to verify whether one sample indeed has incorrect genotypes. How is your genotyped VCF formatted? I'll probably include this script in the popscle package.

Hyun.
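The script mentioned above is not shown in this thread. As a rough illustration of the idea only (not the author's implementation), the sketch below matches each freemuxlet cluster to the VCF sample whose genotypes agree at the most shared sites. Genotypes are coded as alternate-allele counts (0, 1, 2, or `None` for missing); the function names, site keys, and toy genotypes are all hypothetical.

```python
def concordance(a, b):
    """Fraction of shared, non-missing sites where two genotype maps agree."""
    shared = [s for s in a if s in b and a[s] is not None and b[s] is not None]
    if not shared:
        return 0.0
    return sum(a[s] == b[s] for s in shared) / len(shared)


def best_match(clusters, samples):
    """Map each cluster to the sample with the highest genotype concordance."""
    return {
        cluster: max(samples, key=lambda name: concordance(geno, samples[name]))
        for cluster, geno in clusters.items()
    }


# Toy genotypes keyed by site; values are alternate-allele counts.
samples = {
    "N1": {"chr1:100": 0, "chr1:200": 1, "chr1:300": 2},
    "N2": {"chr1:100": 2, "chr1:200": 1, "chr1:300": 0},
}
clusters = {
    "CLUST0": {"chr1:100": 2, "chr1:200": 1, "chr1:300": 0},
    "CLUST1": {"chr1:100": 0, "chr1:200": 1, "chr1:300": None},
}
matches = best_match(clusters, samples)
print(matches)
```

A sample whose best concordance with every cluster is low would be a candidate for having incorrect or incomplete genotypes in the VCF.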

ocqub commented 4 years ago

> I have a script to compare the identity of genotypes between VCF and freemuxlet output.

Thanks, that would be interesting.

I used bcftools call to generate the VCF. As for its formatting, uploading .vcf files doesn't seem to be supported here, so I converted a short section of my VCF into a .txt file; hopefully that gives an idea of what it looks like. Perhaps I should have used something like GATK instead? Attachment: vcf_example.txt https://github.com/statgen/demuxlet/files/4058362/vcf_example.txt

hyunminkang commented 4 years ago

The input VCF for demuxlet should be a multi-sample VCF, but this seems to be single-sample. Are you providing a multi-sample VCF in which all three possible genotypes are available?

Hyun.

ocqub commented 4 years ago

I ran bcftools merge with 4 single-sample VCFs to get this multi-sample VCF; the last 4 columns correspond to the genotype for each of the 4 samples, N1, N2, G1 and G2 (as I understand it, at least). But perhaps I'm missing something?

Edit: I should add that demuxlet is definitely recognising that the VCF contains 4 samples; I get the following line in the log:

Finished identifying 4 samples to load from VCF/BCF

hyunminkang commented 4 years ago

I see, I misunderstood. But the VCF still does not report homref (homozygous reference) genotypes, is that correct?

ocqub commented 4 years ago

That seems to be the case, though I am far from an expert on VCF/variant calling. Presumably bcftools call does not report homref genotype in the output. If it's necessary for demuxlet, perhaps I would have to create my VCF in GATK instead.

hyunminkang commented 4 years ago

It would depend on the settings. I do not recall which option does what, but you could look into your VCF to confirm whether that is the case. You will need genotypes, or genotype likelihoods, for all possible genotypes. We usually assume that you have genotypes from GWAS, but it seems that you have genotypes from another externally sequenced dataset.

Hyun.
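One way to "look into your VCF" as suggested above is to count genotype classes per sample and see whether homozygous-reference calls appear at all or are replaced by missing calls. The following is a minimal sketch that reads uncompressed VCF text and assumes a `GT` entry in the FORMAT column; the function name `genotype_counts` and the toy records are hypothetical and are not taken from the attached file.

```python
import io
import re
from collections import Counter


def genotype_counts(vcf_lines):
    """Count hom_ref / het / hom_alt / missing genotypes for each sample."""
    samples = []
    counts = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            counts = {s: Counter() for s in samples}
            continue
        gt_index = fields[8].split(":").index("GT")
        for sample, entry in zip(samples, fields[9:]):
            alleles = re.split(r"[/|]", entry.split(":")[gt_index])
            if "." in alleles:
                counts[sample]["missing"] += 1
            elif all(a == "0" for a in alleles):
                counts[sample]["hom_ref"] += 1
            elif len(set(alleles)) == 1:
                counts[sample]["hom_alt"] += 1
            else:
                counts[sample]["het"] += 1
    return counts


# Toy multi-sample VCF (made-up records) with no homozygous-reference calls.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tN1\tN2\tG1\tG2\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:PL\t0/1:30,0,40\t./.:.\t1/1:60,9,0\t./.:.\n"
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:PL\t./.:.\t1/1:70,8,0\t./.:.\t0/1:25,0,33\n"
)
counts = genotype_counts(toy_vcf)
for sample, c in counts.items():
    print(sample, dict(c))
```

If every sample shows zero `hom_ref` calls and many `missing` calls, the VCF does not carry all three genotypes per site, which matches the concern raised in this thread.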
