odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License

Phasing multiallelic variants #25

Open alimanfoo opened 4 years ago

alimanfoo commented 4 years ago

Hi there, does shapeit4 currently support phasing multiallelic variants (2 or more ALT alleles) or are only biallelic variants supported?

Many thanks.

odelaneau commented 4 years ago

Hi,

SHAPEIT4 only supports bi-allelic variants. Multi-allelic variants that have not first been decomposed into bi-allelic ones are automatically skipped when the input files are read.
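For anyone who wants to check in advance which records would fall into the skipped category, a multi-allelic VCF record can be recognised by a comma in its ALT column. The snippet below is a minimal sketch of that check, not SHAPEIT4's own code; the function name `is_multiallelic` and the two example records are made up for illustration.

```python
def is_multiallelic(vcf_line: str) -> bool:
    """Return True if a VCF data line lists two or more ALT alleles."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]  # VCF column order: CHROM POS ID REF ALT ...
    return "," in alt


# Hypothetical example records (tab-separated VCF data lines).
records = [
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",    # bi-allelic
    "chr1\t200\t.\tC\tT,G\t.\tPASS\t.\tGT\t1/2",  # tri-allelic
]

multiallelic = [r for r in records if is_multiallelic(r)]
print(f"{len(multiallelic)} of {len(records)} records are multi-allelic")
```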

Best

alimanfoo commented 4 years ago

Thanks @odelaneau.

Do you think it's reasonable to decompose multiallelic variants into multiple separate biallelic variants? (I always wondered what should happen there regarding the coding of alleles. E.g., if you have a triallelic variant and three samples with genotypes 0/1, 0/2 and 1/2, what do those genotypes become if you decompose into two biallelic variants?)

odelaneau commented 4 years ago

Not entirely sure, but I guess each alternative allele at a multi-allelic site is encoded using its own bi-allelic variant. So 0/1, 0/2 and 1/2 would become [0/1+0/0], [0/0+0/1] and [0/1+0/1], respectively.
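To make that encoding concrete, here is a minimal sketch of the decomposition as described in this comment: one bi-allelic genotype per ALT allele, with every other allele recoded as 0. The function name `decompose_ref_fill` is hypothetical, and this is only an illustration of the guess above, not a description of what any particular tool does.

```python
def decompose_ref_fill(genotype: str, n_alt: int) -> list[str]:
    """Split one multi-allelic genotype into n_alt bi-allelic genotypes.

    Assumption: any allele other than the ALT being encoded is written
    as 0 (reference); missing alleles (".") are passed through.
    """
    sep = "|" if "|" in genotype else "/"
    alleles = genotype.split(sep)
    result = []
    for alt in range(1, n_alt + 1):
        coded = []
        for a in alleles:
            if a == ".":
                coded.append(".")
            elif a == str(alt):
                coded.append("1")
            else:
                coded.append("0")
        if sep == "/":
            coded.sort()  # unphased: allele order carries no meaning
        result.append(sep.join(coded))
    return result


for gt in ["0/1", "0/2", "1/2"]:
    print(gt, "->", decompose_ref_fill(gt, n_alt=2))
```

Under these assumptions, the three genotypes come out as the pairs listed above, and the 1/2 sample ends up looking heterozygous for the reference allele at both decomposed sites.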

alimanfoo commented 4 years ago

> Not entirely sure, but I guess each alternative allele at a multi-allelic site is encoded using its own bi-allelic variant. So 0/1, 0/2 and 1/2 would become [0/1+0/0], [0/0+0/1] and [0/1+0/1], respectively.

Thanks, that's what I suspected might happen, but it seems wrong to me because it replaces a non-reference allele with a reference allele. That is, it becomes ambiguous what "0" means: in some cases the reference allele was really observed in the individual, and in others it stands in for some other non-reference allele. This ambiguity seems bad for phasing.

I.e., if you have to decompose, it would seem better if 0/1, 0/2 and 1/2 became [0/1 + 0/?], [0/? + 0/1] and [1/? + ?/1], where "?" stands for some other non-reference allele.
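A minimal sketch of this alternative, for comparison: each ALT allele still gets its own bi-allelic variant, but a different non-reference allele is written as a placeholder rather than 0, so 0 always means the reference allele was observed. The function name `decompose_other_fill` and the `other` parameter are hypothetical; "?" is the placeholder used in this comment, not VCF syntax.

```python
def decompose_other_fill(genotype: str, n_alt: int, other: str = "?") -> list[str]:
    """Split one multi-allelic genotype into n_alt bi-allelic genotypes.

    Assumption: an allele that is neither reference nor the ALT being
    encoded is written as `other`, so 0 is reserved for true reference
    observations. Allele order is preserved.
    """
    sep = "|" if "|" in genotype else "/"
    alleles = genotype.split(sep)
    result = []
    for alt in range(1, n_alt + 1):
        coded = []
        for a in alleles:
            if a == "0":
                coded.append("0")    # reference really observed
            elif a == str(alt):
                coded.append("1")    # the ALT encoded at this site
            else:
                coded.append(other)  # some other non-reference allele
        result.append(sep.join(coded))
    return result


for gt in ["0/1", "0/2", "1/2"]:
    print(gt, "->", decompose_other_fill(gt, n_alt=2))
```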

mbyrska commented 1 year ago

@odelaneau I have a related question. If I split multiallelics using "bcftools norm --atomize --atom-overlaps .", which essentially breaks 1/2 genotypes down into "1/." and "./1" (as opposed to "1/0" and "0/1", as is done by default) and is more biologically correct, will SHAPEIT4 phase such sites? Or can it only handle genotypes in a biallelic representation where both alleles are non-missing?