statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0
116 stars 25 forks source link

Test Run on Seqwell Datasets #76

Open angelussong opened 3 years ago

angelussong commented 3 years ago

Hi,

Thanks for making this very useful package!

Our lab recently purchased a 10x controller and we would like to use demuxlet to lower the cost. I am trying to have a test run of demuxlet on some previously sequenced datasets to see how demuxlet performs.

What we have are two Seqwell-based scRNA-seq datasets from two different patients aligned with STAR aligner. And I used gatk haplotypecaller on the aligned bam files to get the germline variants.

Here are some reads from the merged bam files:

A00111:421:H37HMDSXY:1:1240:23384:6292 16 1 11765 0 50M * 0 0 GTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF XC:Z:ACCACCGGAATC MD:Z:50 XF:Z:INTERGENIC NH:i:6 NM:i:0 XM:Z:CTCAGCTA UQ:i:0 AS:i:49 CY:Z::FFFFFFFF:FF UY:Z:FFFFFFFF RG:Z:PA_PR5249_N1_S1_L001 PG:Z:STAR

A00111:421:H37HMDSXY:1:1426:17173:17080 0 1 11772 0 50M * 0 0 GAATTACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGT :FFFFFFFFFFFF,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:F, XC:Z:TGGCGTGATGAA MD:Z:4G45 XF:Z:INTERGENIC NH:i:6 NM:i:1 XM:Z:GCGGGGGA UQ:i:37 AS:i:47 CY:Z::FFFFFFFFFFF UY:Z:FFFF:FFF RG:Z:PA_PR5249_N1_S1_L001 PG:Z:STAR

And here are some germline mutations from haplotypecaller:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001

1 53177 . A T 79.29 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=NaN;QD=25.36;SOR=2.303 GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6:90,6,0 1 53182 . G T 79.29 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=NaN;QD=28.73;SOR=2.303 GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6:90,6,0

In these data, PA_PR**** are the sample names. I'm trying to follow the demuxlet syntax to run this but I don't seem to have a CB or UB tag in the bam files. In our aligning pipeline, I think XC is supposed to be the barcode tag and XM the UMI. When I ran the demuxlet on this, I got some really large files for results such as this one:

BARCODE RD.TOTL RD.PASS RD.UNIQ N.SNP BEST SNG.1ST SNG.LLK1 SNG.2ND SNG.LLK2 SNG.LLK0 DBL.1ST DBL.2ND ALPHA LLK12 LLK1 LLK2 LLK10 LLK20 LLK00 PRB.DBL PRB.SNG1 AAAAAAAAAAAA 277 17 17 17 SNG-PA_PR5251_T1_S7_L001 PA_PR5251_T1_S7_L001 -7.9723 PA_PR5249_N1_S1_L001 -16.6879 -8.1799 PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 0.200 -7.6759 -7.9723 -16.6879 -7.6759 -16.6879 -8.1799 0.577 1

However, I don't see this barcode in my bam file. I'm wondering if I made any mistakes or there are some other parameters that I should alter?

Thanks!

hyunminkang commented 3 years ago

Please use --tag-group XC and --tag-umi XM as options.

Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

On Mon, Nov 23, 2020 at 4:10 PM Hanbing Song notifications@github.com wrote:

Hi,

Thanks for making this very useful package!

Our lab recently purchased a 10x controller and we would like to use demuxlet to lower the cost. I am trying to have a test run of demuxlet on some previously sequenced datasets to see how demuxlet performs.

What we have are two Seqwell-based scRNA-seq datasets from two different patients aligned with STAR aligner. And I used gatk haplotypecaller on the aligned bam files to get the germline variants.

Here are some reads from the merged bam files:

A00111:421:H37HMDSXY:1:1240:23384:6292 16 1 11765 0 50M * 0 0 GTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF XC:Z:ACCACCGGAATC MD:Z:50 XF:Z:INTERGENIC NH:i:6 NM:i:0 XM:Z:CTCAGCTA UQ:i:0 AS:i:49 CY:Z::FFFFFFFF:FF UY:Z:FFFFFFFF RG:Z:PA_PR5249_N1_S1_L001 PG:Z:STAR

A00111:421:H37HMDSXY:1:1426:17173:17080 0 1 11772 0 50M * 0 0 GAATTACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGT :FFFFFFFFFFFF,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:F, XC:Z:TGGCGTGATGAA MD:Z:4G45 XF:Z:INTERGENIC NH:i:6 NM:i:1 XM:Z:GCGGGGGA UQ:i:37 AS:i:47 CY:Z::FFFFFFFFFFF UY:Z:FFFF:FFF RG:Z:PA_PR5249_N1_S1_L001 PG:Z:STAR

And here are some germline mutations from haplotypecaller:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PA_PR5249_N1_S1_L001

PA_PR5251_T1_S7_L001 1 53177 . A T 79.29 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=NaN;QD=25.36;SOR=2.303 GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6:90,6,0 1 53182 . G T 79.29 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=NaN;QD=28.73;SOR=2.303 GT:AD:DP:GQ:PL ./. 1/1:0,2:2:6:90,6,0

In these data, PA_PR**** are the sample names. I'm trying to follow the demuxlet syntax to run this but I don't seem to have a CB or UB tag in the bam files. In our aligning pipeline, I think XC is supposed to be the barcode tag and XM the UMI. When I ran the demuxlet on this, I got some really large files for results such as this one:

BARCODE RD.TOTL RD.PASS RD.UNIQ N.SNP BEST SNG.1ST SNG.LLK1 SNG.2ND SNG.LLK2 SNG.LLK0 DBL.1ST DBL.2ND ALPHA LLK12 LLK1 LLK2 LLK10 LLK20 LLK00 PRB.DBL PRB.SNG1 AAAAAAAAAAAA 277 17 17 17 SNG-PA_PR5251_T1_S7_L001 PA_PR5251_T1_S7_L001 -7.9723 PA_PR5249_N1_S1_L001 -16.6879 -8.1799 PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 0.200 -7.6759 -7.9723 -16.6879 -7.6759 -16.6879 -8.1799 0.577 1

However, I don't see this barcode in my bam file. I'm wondering if I made any mistakes or there are some other parameters that I should alter?

Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/statgen/demuxlet/issues/76, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPY5OI57HRO3TEREATVHEDSRLFVVANCNFSM4UAAFSOA .

angelussong commented 3 years ago

Hi Hyun,

Thanks for your reply! I tested with these options and got several result files for outputs. These are the options I used for this run: "--tag-group XC --tag-UMI XM --field GT".

However, when I checked the .best file, here are the first few lines: BARCODE RD.TOTL RD.PASS RD.UNIQ N.SNP BEST SNG.1ST SNG.LLK1 SNG.2ND SNG.LLK2 SNG.LLK0 DBL.1ST DBL.2ND ALPHA LLK12 LLK1 LLK2 LLK10 LLK20 LLK00 PRB.DBL PRB.SNG1 AAAAAAAAAAAA 277 17 17 17 DBL-PA_PR5251_T1_S7_L001-PA_PR5249_N1_S1_L001-0.300 PA_PR5251_T1_S7_L001 -11.0482 PA_PR5249_N1_S1_L001 -16.7324 -8.4518 PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 0.300 -8.1135 -11.0482 -16.7324 -8.1135 -16.7324 -8.4518 0.949 0.997 AAAAAAAAAAAC 30 2 2 2 AMB-PA_PR5249_N1_S1_L001-PA_PR5251_T1_S7_L001-PA_PR5251_T1_S7_L001/PA_PR5249_N1_S1_L001 PA_PR5249_N1_S1_L001 -0.6996 PA_PR5251_T1_S7_L001 -0.7006 -0.5850 PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 0.500 -0.5850 -0.7006 -0.6996 -0.5850 -0.6996 -0.5850 0.495 0.5 AAAAAAAAAAAG 15 1 1 1 SNG-PA_PR5249_N1_S1_L001 PA_PR5249_N1_S1_L001 -0.6931 PA_PR5251_T1_S7_L001 -4.8840 -1.3712 PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 0.100 -0.7968 -0.6931 -4.8840 -0.6931 -2.8680 -1.3712 0.474 0.985 AAAAAAAAAAAT 26 3 3 3 SNG-PA_PR5249_N1_S1_L001 PA_PR5249_N1_S1_L001 -0.0226 PA_PR5251_T1_S7_L001 -2.0792 -0.8780 PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 0.100 -0.1753 -0.0226 -2.0792 -0.0226 -1.7974 -0.8780 0.424 0.887 AAAAAAAAAACT 16 1 1 1 AMB-PA_PR5251_T1_S7_L001-PA_PR5249_N1_S1_L001-PA_PR5251_T1_S7_L001/PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 -0.0075 PA_PR5249_N1_S1_L001 -0.6931 -0.2927 PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 0.100 -0.0584 -0.0075 -0.6931 -0.0584 -0.6931 -0.2927 0.474 0.665 AAAAAAAAAAGC 8 3 3 3 AMB-PA_PR5249_N1_S1_L001-PA_PR5251_T1_S7_L001-PA_PR5249_N1_S1_L001/PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 -0.7081 PA_PR5251_T1_S7_L001 -1.3937 -0.8780 PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 0.100 -0.7160 -0.7081 -1.3937 -0.7081 -1.2567 -0.8780 0.495 0.665 AAAAAAAAAAGT 3 2 1 1 AMB-PA_PR5249_N1_S1_L001-PA_PR5251_T1_S7_L001-PA_PR5249_N1_S1_L001/PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 -0.0075 PA_PR5251_T1_S7_L001 -0.6931 -0.2927 PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 0.100 -0.0584 -0.0075 -0.6931 -0.0075 -0.5991 -0.2927 0.474 0.665 AAAAAAAAAATA 28 2 2 2 AMB-PA_PR5251_T1_S7_L001-PA_PR5249_N1_S1_L001-PA_PR5249_N1_S1_L001/PA_PR5251_T1_S7_L001 PA_PR5251_T1_S7_L001 -0.7006 PA_PR5249_N1_S1_L001 -0.7006 -0.5853 PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 0.500 -0.5853 -0.7006 -0.7006 -0.7006 -0.5853 -0.5853 0.495 0.5 AAAAAAAAAATC 11 1 1 1 AMB-PA_PR5249_N1_S1_L001-PA_PR5251_T1_S7_L001-PA_PR5249_N1_S1_L001/PA_PR5251_T1_S7_L001 PA_PR5249_N1_S1_L001 -0.0075 PA_PR5251_T1_S7_L001 -0.6931 -0.2927 PA_PR5249_N1_S1_L001 PA_PR5251_T1_S7_L001 0.100 -0.0584 -0.0075 -0.6931 -0.0075 -0.5991 -0.2927 0.474 0.665

There are >5 million different barcodes according to this file but that does not agree with what I had originally in the bam file. I believe I only captured ~3000 cells from these two samples and the barcode "AAAAAAAAAAAA" wasn't in the barcode list. Do you happen to know what might cause this and how to solve this problem?

I'm wondering if it's possible to also use a barcode list as part of the input so demuxlet could potentially filter the results and filter out all barcodes not present in the list?

Thank you so much for your help!