rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
76 stars 17 forks source link

genfile meaning and r^2 #87

Open SelinaKlees opened 11 months ago

SelinaKlees commented 11 months ago

Dear Robert,

I have a question regarding the genfile. Actually, it is not totally clear to me if the genotypes of high coverage seq data (i.e. the genfile) is actually used for the impuation itself (as some kind of reference panel) or if it is just used to obtain a quality coefficient (r^2). Further, in my results plots, I found the r^2 plot and the value but just for the "goodonly", as it is called. How can I obtain the r^2 value for all SNPs?

Thank you for your help, Selina

rwdavies commented 11 months ago

Hi,

The high coverage genotypes (the genfile) is not used for imputation, it is only used to provide a measure of accuracy during imputation. The reason to include it during imputation is to get an estimate of the accuracy at each iteration, to guide parameter choices, e.g. the number of iterations.

The other plot you mention should be a plot of of the "real" allele frequency (from the pileup) vs the estimated allele frequency (from the imputation). It makes the most sense for SNPs that pass QC.

Nonetheless if you want to plot all SNPs you can take a look at the function

            plotEstimatedAgainstReal(
                outputdir = outputdir,alleleCount=alleleCount,
                estimatedAlleleFrequency=estimatedAlleleFrequency,
                which=passQC,chr=chr,regionName=regionName
            )

and the code for this function here https://github.com/rwdavies/STITCH/blob/ecba5f771a86313c74465a0c54fefbf96688fceb/STITCH/R/functions.R#L4994

you should be able to load the results (RData/EM.a.ll.* as appropriate) and by sourcing the STITCH/R/functions.R file then you can call something like

            plotEstimatedAgainstReal(
                outputdir = outputdir,alleleCount=alleleCount,
                estimatedAlleleFrequency=estimatedAlleleFrequency,
                which=rep(TRUE, length(estimatedAlleleFrequency),chr=chr,regionName=regionName
            )

and it should work (though would override your original file)

Hope that helps, Robbie