SteampunkIslande closed this 6 months ago
With Number=G tags, each combination of alleles has its own value. For diploid genotypes, two alleles A,B give three genotypes AA,AB,BB. Three alleles A,B,C give six genotypes AA,AB,BB,AC,BC,CC. In general, n alleles give n(n+1)/2 diploid genotypes, so four alleles A,B,C,D give 10 combinations:
     A    B    C    D
A    aa
B    ab   bb
C    ac   bc   cc
D    ad   bd   cd   dd
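The counts and the row-by-row order of the table above can be reproduced with a small sketch. The function names `diploid_genotypes` and `genotype_count` are hypothetical helpers made up for illustration; they are not part of bcftools or any VCF library.

```python
def diploid_genotypes(alleles):
    """List unordered diploid genotypes row by row, as in the triangle above."""
    # For each second allele k, pair it with every first allele j <= k.
    return [(alleles[j], alleles[k])
            for k in range(len(alleles))
            for j in range(k + 1)]


def genotype_count(n_alleles):
    """Number of unordered diploid genotypes for n alleles: n(n+1)/2."""
    return n_alleles * (n_alleles + 1) // 2


for alleles in ("AB", "ABC", "ABCD"):
    genotypes = ["".join(g) for g in diploid_genotypes(alleles)]
    print(len(alleles), genotype_count(len(alleles)), ",".join(genotypes))
```

Each printed line gives the number of alleles, the number of genotypes, and the genotypes themselves, so the 3, 6 and 10 quoted above can be checked directly.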
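When a multiallelic site is split, some of these values are permanently lost and cannot be recovered when the alleles are merged back. For example, if A is the reference allele, the value for a genotype made of two different alternate alleles (such as bc in the table above) has no place in any of the split biallelic records.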
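To see which entries are affected, here is a minimal sketch. It assumes that each split biallelic record keeps only the three values for ref/ref, ref/alt and alt/alt, and that genotype j/k (with j <= k) sits at index k(k+1)/2 + j of the Number=G vector. The helpers `g_index` and `kept_after_split` are hypothetical and only illustrate the bookkeeping; this is not how bcftools is implemented.

```python
def g_index(j, k):
    """Index of diploid genotype j/k in a Number=G vector (0 = reference allele)."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def kept_after_split(n_alt):
    """Indices of the original vector that appear in at least one biallelic record.

    Assumption: the record for alt allele a keeps only 0/0, 0/a and a/a.
    """
    kept = set()
    for a in range(1, n_alt + 1):
        kept.update({g_index(0, 0), g_index(0, a), g_index(a, a)})
    return kept


n_alt = 3                                # three alternate alleles plus the reference
total = (n_alt + 1) * (n_alt + 2) // 2   # 10 values in the original vector
lost = sorted(set(range(total)) - kept_after_split(n_alt))
print(lost)
```

With three alternate alleles, the indices that go missing are those of the 1/2, 1/3 and 2/3 genotypes, which is why merging the split records cannot restore the original vector.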
For reference, this is described in the 'Genotype Ordering' section of the VCF specification: https://samtools.github.io/hts-specs/VCFv4.3.pdf
Hi, I'm having trouble understanding Number=G in FORMAT fields. From what I understand, for each allele there should be 3 fields (one for homozygous ref, one for heterozygous, and one for homozygous alt). So far so good. However, I noticed that for a tri-allelic site:
then the following
should be a no-op (-m -any, pipe the result to -m +any).
But it doesn't get me back to where I started:
By the way, I don't understand why there should be 10 fields for a tri-allelic site, whereas for a bi-allelic site it's six.
Any help would be very much appreciated!
Also, I think it might be related to #1246 but I don't quite get it.
Have a nice day