samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

bcftools norm not working as expected for Number=G fields #2114

Closed SteampunkIslande closed 6 months ago

SteampunkIslande commented 7 months ago

Hi, I'm having a trouble understanding Number=G in format fields, from what I understand for each allele there should be 3 fields (one for homozygous ref, one for heterozygous, and one for homozygous alt). So far so good.

However, I noticed that for a tri-allelic site:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Phased Genotype">
##FORMAT=<ID=PL,Number=G,Type=String,Description="Phred Score for genotype likelihood">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1
chr1    200 .   C   G,GG,GT 60  PASS    DP=15   PL  1,2,3,4,5,6,7,8,9,10

then the following

bcftools norm -m -any test.vcf 2>/dev/null | bcftools norm -m +any - 2>/dev/null

should be a no-op (-m -any, pipe the result to -m +any).

But it doesn't get me back to where I started:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Phased Genotype">
##FORMAT=<ID=PL,Number=G,Type=String,Description="Phred Score for genotype likelihood">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##bcftools_normVersion=1.19+htslib-1.19.1
##bcftools_normCommand=norm -m -any vcf2parquet-py/tests/test.vcf; Date=Mon Mar  4 12:25:10 2024
##bcftools_normCommand=norm -m +any -; Date=Mon Mar  4 12:25:10 2024
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
chr1    200     .       C       G,GG,GT 60      PASS    DP=15   PL      1,2,3,4,.,6,7,.,.,10

By the way, I don't understand why there should be 10 fields for a tri-allelic site, where for a bi-allelic site it's six.

Any help would be very much appreciated !

Also, I think it might be related to #1246 but I don't quite get it.

Have a nice day

pd3 commented 6 months ago

With Number=G tags, each combination of alleles has its own value. For diploid genotypes, one has for two alleles A,B three genotypes AA,AB,BB. For three alleles A,B,C, there are six genotypes AA,AB,BB,AC,BC,CC. For four alleles A,B,C,D there are 10 combinations

  A  B  C  D
A aa
B ab bb
C ac bc cc
D ad bd cd dd

When a multiallelic site is split, some of these values get permanently lost, and are not known when the alleles are merged back.

For the reference, this is described in the 'Genotype Ordering' section of the VCF specification https://samtools.github.io/hts-specs/VCFv4.3.pdf