rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

A question about info score #75

Open Antennaria opened 1 year ago

Antennaria commented 1 year ago

Hi! I'm now running STITCH on plants, and I wasn't able to get a good distribution of info scores -- there are a lot of SNPs with info scores between 0.2 and 1. I see that you set 0.4 as the info score threshold for the allele frequency plots, so I used 0.4 as a threshold as well, but I'm not sure how to interpret this score. Can you please share some ideas about what the info score reflects and how to choose a reasonable threshold?

rwdavies commented 1 year ago

Hi,

Sorry for my slow reply, I've been involved with undergrad interviews here at Oxford the last few days, which has been all-encompassing.

Feel free to let me know a bit more about your project, and the parameters you used to do the imputation, so that I can comment if some changes might be beneficial and potentially increase the average INFO score.

The INFO score used here is a standard one used in imputation, which you can read about, for instance, here: https://www.well.ox.ac.uk/~gav/snptest/#info_measures Informally, it is closer to 1 if the imputation process is confident, and closer to 0 if it is less confident. In slightly more detail, confidence comes from the distribution of genotype posteriors. If the genotype posteriors are fully confident, i.e. always 0 or 1, then the INFO score should be close to 1. If the genotype posteriors are not confident, i.e. close to 1/3, the INFO score should be close to 0.
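To make the "confidence from genotype posteriors" idea concrete, here is a small sketch of the IMPUTE-style info measure described at the linked page, for a single diploid variant. The function name and array layout are my own choices for illustration, not part of STITCH's API:

```python
import numpy as np

def info_score(gp):
    """IMPUTE-style info measure for one variant.

    gp: (N, 3) array of genotype posteriors P(g=0), P(g=1), P(g=2)
        for N diploid samples; each row should sum to 1.
    """
    gp = np.asarray(gp, dtype=float)
    n = gp.shape[0]
    e = gp[:, 1] + 2.0 * gp[:, 2]   # expected allele dosage per sample
    f = gp[:, 1] + 4.0 * gp[:, 2]   # expected squared dosage per sample
    theta = e.sum() / (2.0 * n)     # estimated alternate allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0                  # monomorphic: defined as 1 by convention
    # Ratio of posterior dosage variance to the variance expected under
    # Hardy-Weinberg at frequency theta; confident posteriors give ~1.
    return 1.0 - (f - e ** 2).sum() / (2.0 * n * theta * (1.0 - theta))
```

With fully confident posteriors (every row is 0/1) the per-sample variance term `f - e**2` vanishes and the score is 1; with maximally uncertain posteriors (all rows near 1/3) the score drops toward or below 0.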

Now, generally, STITCH is very well calibrated, so the INFO score at a variant should be monotonically related to the expected imputation accuracy. Ideally you'd have some truth data set that would allow you to compare how an INFO score threshold correlates with accuracy. In the past, I and others have found 0.4 to be a reasonable threshold, which is why I suggest it. If you have some other way to measure accuracy, or your own truth data, you might find a different threshold more reasonable.
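If you do have truth genotypes for some samples, the comparison above can be sketched as: compute per-variant concordance against truth, then summarize it within INFO score bins to see where accuracy falls off. The function name, array shapes, and bin edges below are illustrative assumptions, not STITCH functionality:

```python
import numpy as np

def concordance_by_info_bin(info, truth, called,
                            bins=(0.0, 0.2, 0.4, 0.6, 0.8, 1.01)):
    """Summarize imputation accuracy as a function of INFO score.

    info:   (M,) per-variant INFO scores
    truth:  (M, N) true genotypes (0/1/2) from a truth data set
    called: (M, N) imputed best-guess genotypes (0/1/2)
    Returns a list of (bin_low, bin_high, n_variants, mean_concordance).
    """
    info = np.asarray(info, dtype=float)
    # Fraction of samples whose called genotype matches truth, per variant.
    match = (np.asarray(truth) == np.asarray(called)).mean(axis=1)
    out = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        sel = (info >= lo) & (info < hi)
        if sel.any():
            out.append((lo, hi, int(sel.sum()), float(match[sel].mean())))
    return out
```

A threshold is then just the lowest bin edge above which concordance stays acceptable for your application.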

Best, Robbie