rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
73 stars 19 forks source link

Missing genotypes even after imputation #58

Open EveTC opened 2 years ago

EveTC commented 2 years ago

Hello,

I have successfully imputed the genotypes of interest which has greatly improved the missinigness. However, after imputation there is still some missingness in the data - see plots below. Is there a way to force STITCH not to leave any missingness? I am hoping to apply a method where no missingness is allowed.

Missingness prior to imputation (site level) Missingness post imputation (site level)
Eng_MissPreImpute_siteLevel Eng_MissPostImpute_siteLevel

Thank you for your help in advance

rwdavies commented 2 years ago

If you use the dosages, there is no missing data. Similarly, you have the genotype posterior probabilities for each sample for each of the three genotypes, and from this, you can reset the GT value to always be 0/0, 0/1 or 1/1 based on whichever has the highest genotype posterior. You can do this either manually in the VCF (using e.g. R or Python etc), or I imagine bcftools or perhaps the GATK has this functionality

EveTC commented 2 years ago

Hi @rwdavies - thank you so much for your speedy response!

My apologies but I am very new to imputation - what do you mean by "dosages"?

Thank you :)

rwdavies commented 2 years ago

Sure, happy to explain.

So practically these are in the DS entry in the per-sample, per-SNP entry in the VCF. e.g.

GT:GP:DS        0/1:0.025,0.975,0.000:0.975     1/1:0.000,0.062,0.938:1.938

in the header of the VCF file it says

##FORMAT=<ID=GT,Number=1,Type=String,Description="Best Guessed Genotype with posterior probability threshold of 0.9">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Dosage">

Now as to what they represent. First start with the GP entries the genotype posterior probabilities for the genotypes 0/0, 0/1 and 1/1. These are, under the model, P(Genotype = 0/0 | Data, parameters), for each of 0/0, 0/1 and 1/1. As there are only three genotypes under consideration, these probabilities will sum to 1.

Now the "dosage" DS entry is the expected number of alternate alleles. It is dosage = Eexpected # of alternate alleles = sum genotype * P(genotype | data, parameters) This means it is 0 times the genotype posterior probability for a genotype of 0/0, 1 times the GP for a genotype of 0/1, and 2 times the GP for a genotype of 1/1. It is therefore a value scaled between 0 and 2.

Finally the genotype entry GT, in STITCH, is the most likely genotype, but scaled to missing if the GP is less than 0.90.

So e.g. GT:GP:DS 0/1:0.025,0.975,0.000:0.975 is a sample that has genotye posterior probabilities of 0.025 for hom ref (0/0), 0.975 for het (1/1), and 0.000 for hom alt (1/1). This gives the corresponding dosage (0 0.025 + 0.975 1 + 0 * 0), and similarly, as the most likely genotype posterior is for the het genotype (0/1), and is above 0.90 (0.975 > 0.90), the GT entry is 0/1

Hope that helps, let me know if you have more questions

EveTC commented 2 years ago

Oh my goodness - this makes so much more sense now! Fantastically well explained - thank you so much.

So I need to find a way to either 1) change the "but scaled to missing if the GP is less than 0.90" so that it is more lenient? or 2) use the posterior probabilities (or dosage) to update the ./. to the correct GT?

Any ideas to point me in the right direction? I cant seem to find anything in bcftools

Thanks again

rwdavies commented 2 years ago

I didn't find anything obvious in bcftools myself, though maybe I didn't look hard enough.

I thought my Python was good enough to write this in 5 minutes but took more like 20. Oh well! Below should work with something like the following. It does the thing I just described. I hope it works, please do some sanity checking on your real data.

gunzip -c ~/proj/STITCH/test-results/mouse_tests/whole_region/stitch.chr19.vcf.gz | python temp.py - 
import sys

for line in sys.stdin:
    if line[0] == "#":
        print(line)
    else:
        x = line.split("\t")
        print(len(x))
        for i in range(9, len(x)):
            x2 = x[i].split(":")
            gp = [float(a) for a in x2[1].split(",")]
            max_index = gp.index(max(gp))
            if max_index == 0:
                x2[0] = "0/0"
            elif max_index == 1:
                x2[0] = "0/1"
            elif max_index == 2:
                x2[0] = "1/1"
            x[i] = x2[0] + ":" + x2[1] + ":"  + x2[2]
        print("\t".join(x).replace('\n',''))
EveTC commented 2 years ago

This is incredible - thank you so much. What took you 20 minutes would have likely taken me weeks! I will have a go with it today and check that it works - thank you so much again for all your help.

EveTC commented 2 years ago

It works! 👍 One thing I have noticed is that it inserts a number (for me its 543) before each SNP record. I can just remove this with something like sed but thought I would let you know.

I have just ran this line of code to remove any line that contain 543 and any empy lines as well: sed -e '/^543$/d' infilled.vcf | sed '/^[[:space:]]*$/d' > infilled_ed.vcf

Thank you again for your help, you have saved me alot of time and headaches!