Open freeseek opened 3 years ago
Hi,
Thanks for pointing this out. I have no idea what is going on. I'll look at it when I have some time and get back to you. It would be good to have some real data illustrating this issue, though. A few words on how I tested the "imputation module" in SHAPEIT4: I took real data and introduced increasing amounts of missing genotypes (for which I know the truth), from a few percent up to 50% missing. For each level, I computed switch error and imputation error rates. In this setting, I could not find any problem or weird behavior. Let me know if you have any news on this.
Best,
Olivier.
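For readers who want to reproduce this kind of check, here is a minimal sketch of the masking test described above. It is not the code used to validate SHAPEIT4: the function names, the genotype encoding (alternate allele counts 0, 1, 2, with `None` for missing) and the uniform random masking are all assumptions made for illustration, and it covers only the imputation error rate, not switch error.

```python
import random


def mask_genotypes(genotypes, rate, seed=0):
    """Return a copy with roughly a fraction `rate` of entries set to None,
    plus the list of (variant, sample) positions that were masked."""
    rng = random.Random(seed)
    masked = [row[:] for row in genotypes]
    positions = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < rate:
                row[j] = None  # hypothetical encoding for a missing genotype
                positions.append((i, j))
    return masked, positions


def imputation_error_rate(truth, imputed, positions):
    """Fraction of masked entries whose imputed genotype differs from the truth."""
    if not positions:
        return 0.0
    wrong = sum(1 for i, j in positions if truth[i][j] != imputed[i][j])
    return wrong / len(positions)
```

In such a setup, `mask_genotypes` would be applied at several rates (a few percent up to 50%), the masked data phased and imputed, and `imputation_error_rate` computed against the original genotypes for each rate.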
I discovered this issue when phasing a cohort in which, by mistake, a significant number of variants (significant in absolute terms, though very small in relative terms) had very high levels of missingness, possibly greater than 50%. When I ran a PCA together with genotypes from the 1000 Genomes project, we realized that all phased samples were shifted compared to the 1000 Genomes project samples. Upon inspection, the PCA shift was mostly driven by variants whose genotypes were mostly missing before phasing and for which the minor allele was the reference allele. It then became clear to me that SHAPEIT4 was filling these missing genotypes with reference alleles, despite this being the minor allele in 1000 Genomes samples. As a result, while 1000 Genomes samples had the reference allele as the minor allele, the phased samples had the alternate allele as the minor allele. This behavior across many variants across the genome caused a significant shift observable in one of the main PCs. I have not observed this problem for variants with low levels of missingness, but I thought it was weird behavior for SHAPEIT4, and the example I provided seems to recapitulate the root issue exactly.
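A rough sketch of how the minor-allele flip described above could be flagged per variant, by comparing alternate allele frequencies in a reference panel and in the phased cohort. The function names and the input format (lists of phased genotype strings such as `0|1`) are assumptions for illustration, not part of SHAPEIT4 or of my actual pipeline.

```python
def alt_allele_frequency(genotypes):
    """Alternate allele frequency from phased genotype strings such as '0|1'."""
    alleles = [a for gt in genotypes for a in gt.split("|")]
    return alleles.count("1") / len(alleles)


def minor_allele_flipped(reference_gts, cohort_gts):
    """True if the minor allele differs between the reference panel and the cohort.

    The alternate allele is the minor allele when its frequency is below 0.5,
    so a flip means the two frequencies lie on opposite sides of 0.5.
    """
    ref_af = alt_allele_frequency(reference_gts)
    cohort_af = alt_allele_frequency(cohort_gts)
    return (ref_af - 0.5) * (cohort_af - 0.5) < 0
```

For example, a variant where the panel is mostly `1|1` (reference allele is minor) but the phased cohort is mostly `0|0` would be flagged, which is the pattern that drove the PCA shift.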
I have noticed that SHAPEIT4.2.1, when dealing with variants with a high missingness rate, will sometimes fill the missing genotypes with very unlikely genotypes, sometimes even completely flipping which allele is the minor allele if enough genotypes are missing.
I have come up with a way to generate an example of this behavior:
This generates a cohort with two SNPs only. I could generate more, but maybe this is enough to expose the odd behavior. To look at the haplotypes:
Now, after phasing the cohort with this command:
I obtain:
It does not seem to make any sense that some missing genotypes were filled in with `0|0` genotypes while they should have all been filled with `1|1`.
Notice that if I flip reference and alternate homozygote genotypes:
I obtain the sort of symmetrical cohort:
Now, after phasing the cohort with this command:
I obtain:
And now all the missing genotypes are correctly filled with `0|0` and no missing genotypes were filled with `1|1`. This seems to cause some very puzzling artifacts in some corner cases. It might be better to even error out when too many missing genotypes are present if this is the current behavior.
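As a sketch of what such a guard could look like as a check run before phasing (this is not an existing SHAPEIT4 option; the function names, the input format and the 0.5 threshold are all hypothetical):

```python
def missing_rate(genotypes):
    """Fraction of genotype calls that are missing (e.g. './.' or '.|.')."""
    missing = sum(1 for gt in genotypes if "." in gt)
    return missing / len(genotypes)


def check_missingness(variants, max_missing=0.5):
    """Raise ValueError naming every variant whose missing rate exceeds max_missing.

    `variants` maps a variant ID to its list of genotype strings.
    The 0.5 default is an arbitrary illustrative threshold.
    """
    bad = [vid for vid, gts in variants.items() if missing_rate(gts) > max_missing]
    if bad:
        raise ValueError("variants exceed missingness threshold: " + ", ".join(bad))
```

Failing early like this would at least make the user aware of highly missing variants instead of silently filling them in.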