statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0
117 stars 25 forks source link

Likelihoods not caclulated (10x Chromium) #20

Closed harmbrugge closed 5 years ago

harmbrugge commented 6 years ago

I'm trying to run demuxlet on reads generated by the 10x chromium pipeline. The bam files are generated using cellranger. On each lane we've pooled 6 individuals, so the input files (bam, vcf) contain the information of 6 indivuals. For the group-list argument we use the corrected barcode output file yielded by cellranger.

The complete command I'm using:

demuxlet \
      --sam possorted_genome_bam.bam \
      --tag-group CB \
      --tag-UMI UB \
      --field GP \
      --vcf sorted.vcf.gz \
      --out output_filtered_barcodes \
      --group-list barcodes.tsv

The software runs but the likelihoods in the generated output files are all set to 0. output_filtered_barcodes.best:

BARCODE RD.TOTL RD.PASS RD.UNIQ N.SNP   BEST    SNG.1ST SNG.LLK1    SNG.2ND SNG.LLK2    SNG.LLK0    DBL.1ST DBL.2ND ALPHA   LLK12   LLK1    LLK2    LLK10   LLK20   LLK00   PRB.DBL PRB.SNG1
AAACCTGGTGCCTTGG-1  19039   3240    1126    1016    AMB-sample1-sample1-sample1/sample1 sample1 0.0000  sample1 0.0000  nan sample1sample1  0.000   0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  nan nan
AAACGGGAGAGAACAG-1  12617   2276    721 669 AMB-sample1-sample1-sample1/sample1 sample1 0.0000  sample1 0.0000  nan sample1sample1  0.000   0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  nan nan

Error log:

NOTICE [2018/01/25 16:49:38] - Finished loading 4309 droplet/cell barcodes to consider
NOTICE [2018/01/25 16:49:38] - Finished identifying 6 samples to load from VCF/BCF
NOTICE [2018/01/25 16:49:38] - Reading 0 reads at 1:1 and skipping 0
NOTICE [2018/01/25 16:49:38] - WARNING: Cannot find Droplet/Cell tag UB from 748-th read K00296:61:HHVN5BBXX:5:1222:13007:17913 at 1:14187-566351. Treating all of them as a single group
...
NOTICE [2018/01/25 16:49:38] - WARNING: Cannot find Droplet/Cell tag UB from 15981-th read K00296:61:HHVN5BBXX:4:2224:6076:40367 at 1:134996-135133. Treating all of them as a single group
NOTICE [2018/01/25 16:49:38] - WARNING: Suppressing 10+ missing Droplet/Cell tag warnings...
NOTICE [2018/01/25 16:49:39] - WARNING: Cannot find UMI tag UB from 44270-th read K00296:61:HHVN5BBXX:6:2109:32491:25615 at 1:564485-564623. Treating all of them as a single UMI
[W::vcf_parse] FILTER 'GENOTYPED_ONLY' is not defined in the header
[W::vcf_parse] FILTER 'GENOTYPED' is not defined in the header
NOTICE [2018/01/25 16:49:44] - WARNING: Cannot find UMI tag UB from 88246-th read K00296:61:HHVN5BBXX:7:1111:22313:30943 at 1:565645-565783. Treating all of them as a single UMI
...
NOTICE [2018/01/25 16:49:59] - WARNING: Cannot find UMI tag UB from 108070-th read K00296:61:HHVN5BBXX:1:1117:10439:43040 at 1:565692-565830. Treating all of them as a single UMI
NOTICE [2018/01/25 16:49:59] - WARNING: Suppressing 10+ UMI warnings...
NOTICE [2018/01/25 16:50:33] - Reading 10000 variants at 1:1229225, Skipping 9210, Missing 0.
...
NOTICE [2018/01/26 13:59:12] - Reading 39230000 variants at 9:140680332, Skipping 33823488, Missing 0.
NOTICE [2018/01/26 13:59:36] - Reading 275000000 reads at MT:1987 and skipping 154598269
NOTICE [2018/01/26 13:59:47] - Reading 278000000 reads at MT:8002 and skipping 156520112
NOTICE [2018/01/26 14:00:15] - Reading 286000000 reads at X:48835761 and skipping 162389102
NOTICE [2018/01/26 14:00:18] - Reading 287000000 reads at X:71492517 and skipping 162877106
NOTICE [2018/01/26 14:00:22] - Reading 288000000 reads at X:71736339 and skipping 163745120
NOTICE [2018/01/26 14:01:51] - WARNING: Suppressed a total of 38387 UMI warnings...
NOTICE [2018/01/26 14:01:51] - WARNING: Suppressed a total of 4386863 droplet/cell barcode warnings...
NOTICE [2018/01/26 14:01:51] - Finished reading 5406968 markers from the VCF file
NOTICE [2018/01/26 14:01:51] - Total number input reads : 314454500
NOTICE [2018/01/26 14:01:51] - Total number of read-QC-passed reads : 126383207 
NOTICE [2018/01/26 14:01:51] - Total number of skipped reads with ignored barcodes : 20439398
NOTICE [2018/01/26 14:01:51] - Total number of non-skipped reads with considered barcodes : 105882246
NOTICE [2018/01/26 14:01:51] - Total number of gapped/noninformative reads : 89224135
NOTICE [2018/01/26 14:01:51] - Total number of base-QC-failed reads : 1489872
NOTICE [2018/01/26 14:01:51] - Total number of redundant reads : 9605337
NOTICE [2018/01/26 14:01:51] - Total number of pass-filtered reads : 5562902
NOTICE [2018/01/26 14:01:51] - Total number of pass-filtered reads overlapping with multiple SNPs : 630819
NOTICE [2018/01/26 14:01:51] - Starting to prune out cells with too few reads...
NOTICE [2018/01/26 14:01:51] - Finishing pruning out 0 cells with too few reads...
NOTICE [2018/01/26 14:01:51] - Starting to identify best matching individual IDs
NOTICE [2018/01/26 14:01:51] - Processing 10000 markers...
...
NOTICE [2018/01/26 14:01:55] - Processing 5400000 markers...
NOTICE [2018/01/26 14:01:55] - Identifying best-matching individual..
NOTICE [2018/01/26 14:01:55] - Processing 1000 droplets...
NOTICE [2018/01/26 14:01:55] - Processing 2000 droplets...
NOTICE [2018/01/26 14:01:55] - Processing 3000 droplets...
NOTICE [2018/01/26 14:01:55] - Processing 4000 droplets...
NOTICE [2018/01/26 14:01:55] - Finished processing 4309 droplets total
NOTICE [2018/01/26 14:02:09] - Processing 0 cells..
...
NOTICE [2018/01/26 14:04:16] - Processing 4300 cells..
NOTICE [2018/01/26 14:04:16] - Finished writing output files

Header and first 2 lines of the vcf file (this file is correctly sorted):

##fileformat=VCFv4.2
##filedate=2017.5.20
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##filedate=2016.5.11
##source=Minimac3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
##INFO=<ID=SF,Number=.,Type=String,Description="Source File (index to sourceFiles, f when filtered)">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=1,length=249250621,assembly=b37>
##contig=<ID=10,length=135534747,assembly=b37>
##contig=<ID=11,length=135006516,assembly=b37>
##contig=<ID=12,length=133851895,assembly=b37>
##contig=<ID=13,length=115169878,assembly=b37>
##contig=<ID=14,length=107349540,assembly=b37>
##contig=<ID=15,length=102531392,assembly=b37>
##contig=<ID=16,length=90354753,assembly=b37>
##contig=<ID=17,length=81195210,assembly=b37>
##contig=<ID=18,length=78077248,assembly=b37>
##contig=<ID=19,length=59128983,assembly=b37>
##contig=<ID=2,length=243199373,assembly=b37>
##contig=<ID=20,length=63025520,assembly=b37>
##contig=<ID=21,length=48129895,assembly=b37>
##contig=<ID=22,length=51304566,assembly=b37>
##contig=<ID=3,length=198022430,assembly=b37>
##contig=<ID=4,length=191154276,assembly=b37>
##contig=<ID=5,length=180915260,assembly=b37>
##contig=<ID=6,length=171115067,assembly=b37>
##contig=<ID=7,length=159138663,assembly=b37>
##contig=<ID=8,length=146364022,assembly=b37>
##contig=<ID=9,length=141213431,assembly=b37>
##contig=<ID=X,length=155270560,assembly=b37>
##contig=<ID=Y,length=59373566,assembly=b37>
##contig=<ID=MT,length=16569,assembly=b37>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1 sample2 sample3 sample4 sample5 sample6
1   13380   1:13380 C   G   .   PASS    AC=0;AF=0.00003;AN=94;MAF=0.00003;R2=0.00000;SF=0,1 GT:DS:GP    0|0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000
1   16071   1:16071 G   A   .   PASS    AC=0;AF=0.00006;AN=94;MAF=0.00006;R2=0.00004;SF=0,1 GT:DS:GP    0|0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000 0/0:0.000:1.000,0.000,0.000

If i subset the VCF and BAM file for only chromosome 1 the likelihoods do seem to be correctly calculated:

BARCODE RD.TOTL RD.PASS RD.UNIQ N.SNP   BEST    SNG.1ST SNG.LLK1    SNG.2ND SNG.LLK2    SNG.LLK0    DBL.1ST DBL.2ND ALPHA   LLK12   LLK1    LLK2    LLK10   LLK20   LLK00   PRB.DBL PRB.SNG1
AAACCTGGTGCCTTGG-1  1985    417 121 118 DBL-sample3-1_LLDeep_0938-0.100 sample3 -32.9460    sample1 -91.1793    -53.7561    sample3 sample4 0.100   -30.8391    -32.9460    -104.3739   -32.9155    -79.9862    -53.7624    0.345   1
AAACGGGAGAGAACAG-1  1127    204 67  66  SNG-sample2 sample2 -28.5922    LLDeep_1619 -87.9795    -44.3673    sample2 sample1 0.100   -27.3026    -28.5922    -87.9795    -27.3026    -87.9806    -44.3754    0.228   1

Do you have any idea what might cause this issue

hyunminkang commented 6 years ago

Not sure why it happens. Is the chromosome order in VCF header correct? Is it consistent to that of BAM?

harmbrugge commented 6 years ago

Yes I checked this thoroughly. The order of the header of the VCF file corresponds to ordering of the file and the BAM file is ordered in the same manner.

hyunminkang commented 6 years ago

If the order of chromosomes in VCF header also consistent with the order of chromosomes in the VCF records?

On Sat, Apr 7, 2018 at 5:06 PM Harm Brugge notifications@github.com wrote:

Yes I checked this thoroughly. The order of the header of the VCF file corresponds to ordering of the file and the BAM file is ordered in the same manner.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/statgen/demuxlet/issues/20#issuecomment-379447710, or mute the thread https://github.com/notifications/unsubscribe-auth/AF-OuYqTEr5J4JVVoNv10t3zAQ2uRYk9ks5tmHN9gaJpZM4TGqqu .

hyunminkang commented 6 years ago

Sorry for the typo -- I mean to start with "Is", not "If"

On Mon, Apr 9, 2018 at 5:19 AM Hyun Min Kang hmkang@umich.edu wrote:

If the order of chromosomes in VCF header also consistent with the order of chromosomes in the VCF records?

On Sat, Apr 7, 2018 at 5:06 PM Harm Brugge notifications@github.com wrote:

Yes I checked this thoroughly. The order of the header of the VCF file corresponds to ordering of the file and the BAM file is ordered in the same manner.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/statgen/demuxlet/issues/20#issuecomment-379447710, or mute the thread https://github.com/notifications/unsubscribe-auth/AF-OuYqTEr5J4JVVoNv10t3zAQ2uRYk9ks5tmHN9gaJpZM4TGqqu .

harmbrugge commented 6 years ago

Yes the header of the VCF file is consistent with the order of the records: awk 'NR>17 {print $1}' sorted.vcf | uniq

##contig=<ID=1,length=249250621,assembly=b37>
##contig=<ID=10,length=135534747,assembly=b37>
##contig=<ID=11,length=135006516,assembly=b37>
##contig=<ID=12,length=133851895,assembly=b37>
##contig=<ID=13,length=115169878,assembly=b37>
##contig=<ID=14,length=107349540,assembly=b37>
##contig=<ID=15,length=102531392,assembly=b37>
##contig=<ID=16,length=90354753,assembly=b37>
##contig=<ID=17,length=81195210,assembly=b37>
##contig=<ID=18,length=78077248,assembly=b37>
##contig=<ID=19,length=59128983,assembly=b37>
##contig=<ID=2,length=243199373,assembly=b37>
##contig=<ID=20,length=63025520,assembly=b37>
##contig=<ID=21,length=48129895,assembly=b37>
##contig=<ID=22,length=51304566,assembly=b37>
##contig=<ID=3,length=198022430,assembly=b37>
##contig=<ID=4,length=191154276,assembly=b37>
##contig=<ID=5,length=180915260,assembly=b37>
##contig=<ID=6,length=171115067,assembly=b37>
##contig=<ID=7,length=159138663,assembly=b37>
##contig=<ID=8,length=146364022,assembly=b37>
##contig=<ID=9,length=141213431,assembly=b37>
##contig=<ID=X,length=155270560,assembly=b37>
##contig=<ID=Y,length=59373566,assembly=b37>
##contig=<ID=MT,length=16569,assembly=b37>
#CHROM
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9
harmbrugge commented 6 years ago

Do you have any other suggestions?

hyunminkang commented 6 years ago

I can't think of a reason why the likelihood would not be calculated when using all chromosomes while it works for chr1. I need to see some example file to be able to debug. I'm assuming both VCF and BAM files are indexed using tabix and samtools?

Thanks, Hyun.

On Wed, Apr 11, 2018 at 4:30 PM Harm Brugge notifications@github.com wrote:

Do you have any other suggestions?

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/statgen/demuxlet/issues/20#issuecomment-380355565, or mute the thread https://github.com/notifications/unsubscribe-auth/AF-OudT6vMGtK7vnSKrSauXEk_SBpOhyks5tnbD8gaJpZM4TGqqu .