samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
673 stars 240 forks source link

unexpected behavior of bcftools merge --gvcf in certain situations #931

Closed gmagoon closed 5 years ago

gmagoon commented 5 years ago

Here are my input files:

$ zcat SampleA.bcftools.bug.g.vcf.gz 
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##reference=file://hs38DH.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrM,length=16569>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SampleA
chr20   6281864 .   C   <*> .   .   END=6281909;MinDP=1;QS=1,0  PL:DP   0,3,27:1
chr20   6283999 .   C   <*> .   .   END=6284344;MinDP=1;QS=1,0  PL:DP   0,3,4:1
$ zcat SampleB.bcftools.bug.g.vcf.gz 
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##reference=file://hs38DH.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrM,length=16569>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=MinDP,Number=1,Type=Integer,Description="Minimum per-sample depth in this gVCF block">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SampleB
chr20   6281909 .   T   <*> 0   .   DP=2;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0  PL:DP:SP    0,0,0:0:0
chr20   6281910 .   A   <*> .   .   END=6281930;MinDP=1;QS=1,0  PL:DP   0,3,25:1
chr20   6283321 .   T   <*> .   .   END=6283435;MinDP=4;QS=1,0  PL:DP   0,12,14:4
chr20   6283436 .   G   A,<*>   0   .   DP=4;I16=0,1,2,1,37,1369,87,2619,3,9,9,27,9,81,71,1691;QS=0.25,0.75,0;VDB=0.154358;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0    PL:DP:SP    6,5,0,9,9,7:4:0
chr20   6283437 .   A   <*> .   .   END=6283744;MinDP=2;QS=1,0  PL:DP   0,6,22:2

Now I run bcftools merge --gvcf and get the following output:

$ $HOME/bcftools-1.9/bcftools merge --gvcf $HOME/hs38DH.fa SampleA.bcftools.bug.g.vcf.gz SampleB.bcftools.bug.g.vcf.gz | grep -v "^#"
chr20   6281864 .   C   <*> .   .   END=6281908;MinDP=1;QS=1,0  PL:DP   0,3,27:1    .:.
chr20   6281909 .   T   <*> 0   .   MQ0F=0;DP=2;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;MinDP=1;QS=1,0  PL:DP:SP    0,3,27:1:.  0,0,0:0:0
chr20   6283436 .   G   <*>,A   0   .   VDB=0.154358;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;DP=4;I16=0,1,2,1,37,1369,87,2619,3,9,9,27,9,81,71,1691;MinDP=1;QS=1.25,0,0.75    PL:DP:SP    0,3,27,.,.,.:1:.    6,9,7,5,9,0:4:0

The main issue I wanted to point out is that I would expect that the SampleA result at 6283436 is .:.:. (and this is indeed the case if the 6281909 entry in SampleB vcf is omitted).

However, in putting this example together, I also noticed that there are number of apparently missing entries in the merged output, including 6281910, 6283321, 6283437, 6283999.

pd3 commented 5 years ago

Thank you for reporting the problem and the test case! I believe the issue should be fixed by 958180e. Please let me know if you spot any other problems.

gmagoon commented 5 years ago

great, thanks Petr!