odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License
90 stars 18 forks source link

Weird and incorrect behavior when filling missing genotypes #61

Open freeseek opened 3 years ago

freeseek commented 3 years ago

I have noticed that SHAPEIT4.2.1, when dealing with variants with high missingness rate, will sometimes fill the variants with very unlikely genotypes, sometimes even completely flipping which one is the minor allele if enough genotypes are missing.

I have come up with a way to generate an example of this behavior:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Phased genotypes\">"
echo -en "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
for i in {1..100}; do echo -en "\tSM$i"; done
echo
echo -en "chr1\t10000\t.\tC\tT\t.\t.\t.\tGT"
for i in {1..10}; do echo -en "\t0/1"; done
for i in {11..100}; do echo -en "\t0/0"; done
echo
echo -en "chr1\t20000\t.\tA\tG\t.\t.\t.\tGT"
for i in {1..10}; do echo -en "\t0/1"; done
for i in {11..88}; do echo -en "\t1/1"; done
for i in {89..100}; do echo -en "\t./."; done
echo) | bcftools view --no-version -Ob -o input.bcf && bcftools index -f input.bcf

This generates a cohort with two SNPs only. I could generate more, but maybe this is enough to expose the odd behavior. To look at the haplotypes:

$ paste <(bcftools query -f "[%GT\n]" -r chr1:10000-10000 input.bcf) <(bcftools query -f "[%GT\n]" -r chr1:20000-20000 input.bcf)
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 1/1
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.
0/0 ./.

Now, after phasing the cohort with this command:

$ shapeit4 --input input.bcf --region chr1 --output output.bcf && bcftools index -f output.bcf

I obtain:

$ paste <(bcftools query -f "[%GT\n]" -r chr1:10000-10000 output.bcf) <(bcftools query -f "[%GT\n]" -r chr1:20000-20000 output.bcf)
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
1|0 0|1
0|1 0|1
0|1 0|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1
0|0 0|0
0|0 0|0
0|0 0|0
0|0 0|0
0|0 0|0
0|0 1|1
0|0 1|1
0|0 1|1
0|0 1|1

It does not seem to make any sense that some missing genotypes were filled in with 0|0 genotypes while they should have all been filled with 1|1.

Notice that if I flip reference and alternate homozygote genotypes:

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Phased genotypes\">"
echo -en "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
for i in {1..100}; do echo -en "\tSM$i"; done
echo
echo -en "chr1\t10000\t.\tC\tT\t.\t.\t.\tGT"
for i in {1..10}; do echo -en "\t0/1"; done
for i in {11..100}; do echo -en "\t1/1"; done
echo
echo -en "chr1\t20000\t.\tA\tG\t.\t.\t.\tGT"
for i in {1..10}; do echo -en "\t0/1"; done
for i in {11..88}; do echo -en "\t0/0"; done
for i in {89..100}; do echo -en "\t./."; done
echo) | bcftools view --no-version -Ob -o input2.bcf && bcftools index -f input2.bcf

I obtain the sort of symmetrical cohort:

$ paste <(bcftools query -f "[%GT\n]" -r chr1:10000-10000 input2.bcf) <(bcftools query -f "[%GT\n]" -r chr1:20000-20000 input2.bcf)
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
0/1 0/1
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 0/0
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.
1/1 ./.

Now, after phasing the cohort with this command:

$ shapeit4 --input input2.bcf --region chr1 --output output2.bcf && bcftools index -f output2.bcf

I obtain:

$ paste <(bcftools query -f "[%GT\n]" -r chr1:10000-10000 output2.bcf) <(bcftools query -f "[%GT\n]" -r chr1:20000-20000 output2.bcf)
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
0|1 0|1
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0
1|1 0|0

And now all the missing genotypes are correctly filled with 0|0 and no missing genotypes were filled with 1|1. This seems to cause some very puzzling artifacts in some corner cases. It might be better to even error out when too many missing genotypes are present if this is the current behavior.

odelaneau commented 3 years ago

Hi,

Thanks for pointing this out. No idea of what is going on. I'll look at it when I have some time and come back to you. Would be good to have some real data illustrating this issue though. Few words on how I tested the "imputation module" in SHAPEIT4. I used real data into which I introduced an increased amounts of missing genotypes (for which I know the truth), from a few percent to 50% missing. For each, I computed switch error and imputation error rates. In this setting, I could not find any problem nor weird behavior. Let me know if you have any news on this.

Best,

Olivier.

freeseek commented 3 years ago

The way I have discovered this issue was when phasing a cohort where by mistake a significant number of variants (significant in absolute terms while very small in relative terms) had very high levels of missingness, possibly greater than 50%. When I run a PCA together with genotypes from the 1000 Genomes project, we realized all phased samples were shifted compared to 1000 Genomes project samples. Upon inspection, the PCA shift was mostly driven by variants with mostly missing genotypes before phasing and for which the minor allele was the reference allele. It then became clear to me that SHAPEIT4 was filling these missing genotypes with reference alleles, despite this being the minor allele in 1000 Genomes samples, and in the end while 1000 Genomes samples had the reference allele as the minor allele, the phased samples had the alternate allele as the minor allele. This behavior across many variants across the genome caused a significant shift observable in one of the main PCs. I have not observed this problem for variants with small levels of missingness, but I thought it was a weird SHAPEIT4 behavior and the example I provided seems to exactly recapitulate the root issue.